Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

Abstract
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this specific task: the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio-motion modalities, whereas the latter offers the ability to generalize to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
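The abstract describes the biased causal self-attention and the periodic positional encoding only at a high level. The sketch below is an illustrative PyTorch approximation of those two ideas, assuming a period-p sinusoidal encoding and an ALiBi-style, period-aware causal bias; the function names, the default period, and the exact bias formula are assumptions rather than the paper's reference implementation.

```python
import math
import torch

def periodic_positional_encoding(max_len: int, d_model: int, period: int = 25) -> torch.Tensor:
    """Sinusoidal positional encoding that repeats every `period` frames (illustrative).
    Assumes d_model is even. Returns a (max_len, d_model) tensor."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len).float() % period            # wrap positions with the period
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position.unsqueeze(1) * div_term)
    pe[:, 1::2] = torch.cos(position.unsqueeze(1) * div_term)
    return pe

def biased_causal_mask(seq_len: int, period: int = 25) -> torch.Tensor:
    """Causal attention mask with a period-aware bias on the logits (illustrative).
    Recent periods get less negative bias; future frames are fully masked."""
    i = torch.arange(seq_len).unsqueeze(1)                        # query frame index
    j = torch.arange(seq_len).unsqueeze(0)                        # key frame index
    bias = -torch.div(i - j, period, rounding_mode='floor').float()
    bias = bias.masked_fill(j > i, float('-inf'))                 # causal: no attention to the future
    return bias                                                   # (seq_len, seq_len), added to logits
```

The returned bias matrix could, for example, be passed as a float `attn_mask` to `torch.nn.MultiheadAttention`, which adds it to the attention logits before the softmax.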
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-face-animation-on-beat2 | FaceFormer | MSE: 7.787 |
| 3d-face-animation-on-biwi-3d-audiovisual | FaceFormer | FDD: 4.6408; Lip Vertex Error: 5.3077 |
| 3d-face-animation-on-vocaset | FaceFormer | Lip Vertex Error: 5.3742 |
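The table reports a Lip Vertex Error metric. As a point of reference, the sketch below shows one common way such a metric is computed (maximal per-frame L2 error over a set of lip vertices, averaged over frames); the lip vertex index set, units, and any scaling factor are dataset-specific assumptions and may differ from the exact protocol behind these numbers.

```python
import torch

def lip_vertex_error(pred: torch.Tensor, gt: torch.Tensor, lip_idx: torch.Tensor) -> float:
    """
    pred, gt: (T, V, 3) predicted / ground-truth vertex positions over T frames.
    lip_idx:  long tensor of lip-region vertex indices (dataset-specific assumption).
    Returns one common definition: the maximal per-frame L2 lip error, averaged over frames.
    """
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]     # (T, L, 3) per-vertex displacement
    per_vertex = diff.norm(dim=-1)                     # (T, L) L2 distance per lip vertex
    per_frame = per_vertex.max(dim=1).values           # worst lip vertex in each frame
    return per_frame.mean().item()                     # average over frames
```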