Cai Zhixi ; Ghosh Shreya ; Stefanov Kalin ; Dhall Abhinav ; Cai Jianfei ; Rezatofighi Hamid ; Haffari Reza ; Hayat Munawar

Abstract
This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our code and models are available at https://github.com/ControlNet/MARLIN .
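The masked-reconstruction objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the MARLIN implementation: it swaps the facial-region-guided masking for plain uniform random masking, and the clip size (16 frames of 224x224), the 2x16x16 patch shape, the 0.9 mask ratio, and the function names `patchify`, `random_mask`, and `masked_mse` are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(video, tube=2, patch=16):
    """Split a (T, H, W, C) clip into flattened spatio-temporal patches."""
    t, h, w, c = video.shape
    assert t % tube == 0 and h % patch == 0 and w % patch == 0
    x = video.reshape(t // tube, tube, h // patch, patch, w // patch, patch, c)
    # Bring the three grid axes to the front, keep patch contents together.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, tube * patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask; True marks a patch hidden from the encoder.

    Assumption: uniform random masking, not region-guided masking.
    """
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Reconstruction loss computed only over the masked patches."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
video = rng.random((16, 224, 224, 3), dtype=np.float32)  # stand-in for a face clip
patches = patchify(video)                 # one row per spatio-temporal patch
mask = random_mask(len(patches), 0.9, rng)
visible = patches[~mask]                  # the only patches an encoder would see
```

In a full masked autoencoder, an encoder would embed `visible`, a decoder would predict every patch, and `masked_mse` would score the predictions on the hidden patches only.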
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-celebv-hq | MARLIN | AUC: 0.9406; Accuracy: 95.48 |
| deepfake-detection-on-faceforensics-1 | MARLIN (ViT-B) | AUC: 0.9305 |
| deepfake-detection-on-faceforensics-1 | MARLIN (ViT-L) | AUC: 0.9377 |
| deepfake-detection-on-faceforensics-1 | MARLIN (ViT-S) | AUC: 0.8863 |
| emotion-classification-on-cmu-mosei | MARLIN (ViT-S) | Accuracy: 80.38 |
| emotion-classification-on-cmu-mosei | MARLIN (ViT-B) | Accuracy: 80.6 |
| emotion-classification-on-cmu-mosei | MARLIN (ViT-L) | Accuracy: 80.63 |
| facial-attribute-classification-on-celebv-hq | MARLIN | AUC: 0.9561; Accuracy: 93.9 |
| lip-sync-on-lrs2 | Wav2Lip + ViT + MARLIN | FID: 3.452; LSE-C: 5.528; LSE-D: 7.127 |
| multimodal-sentiment-analysis-on-cmu-mosei-1 | MARLIN (ViT-B) | Accuracy: 73.7 |
| multimodal-sentiment-analysis-on-cmu-mosei-1 | MARLIN (ViT-S) | Accuracy: 72.69 |
| multimodal-sentiment-analysis-on-cmu-mosei-1 | MARLIN (ViT-L) | Accuracy: 74.83 |