Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

Abstract
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this specific task: the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio-motion modalities, whereas the latter offers the ability to generalize to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
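The abstract describes the biased causal self-attention and the periodic positional encoding only at a high level. The sketch below is an illustrative PyTorch approximation of those two ideas, assuming a period-p sinusoidal encoding and an ALiBi-style, period-aware causal bias; the function names, the default period, and the exact bias formula are assumptions rather than the paper's reference implementation.

```python
import math
import torch

def periodic_positional_encoding(max_len: int, d_model: int, period: int = 25) -> torch.Tensor:
    """Sinusoidal positional encoding that repeats every `period` frames (illustrative).
    Assumes d_model is even. Returns a (max_len, d_model) tensor."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len).float() % period            # wrap positions with the period
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position.unsqueeze(1) * div_term)
    pe[:, 1::2] = torch.cos(position.unsqueeze(1) * div_term)
    return pe

def biased_causal_mask(seq_len: int, period: int = 25) -> torch.Tensor:
    """Causal attention mask with a period-aware bias on the logits (illustrative).
    Recent periods get less negative bias; future frames are fully masked."""
    i = torch.arange(seq_len).unsqueeze(1)                        # query frame index
    j = torch.arange(seq_len).unsqueeze(0)                        # key frame index
    bias = -torch.div(i - j, period, rounding_mode='floor').float()
    bias = bias.masked_fill(j > i, float('-inf'))                 # causal: no attention to the future
    return bias                                                   # (seq_len, seq_len), added to logits
```

The returned bias matrix could, for example, be passed as a float `attn_mask` to `torch.nn.MultiheadAttention`, which adds it to the attention logits before the softmax.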
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-face-animation-on-beat2 | FaceFormer | MSE: 7.787 |
| 3d-face-animation-on-biwi-3d-audiovisual | FaceFormer | FDD: 4.6408; Lip Vertex Error: 5.3077 |
| 3d-face-animation-on-vocaset | FaceFormer | Lip Vertex Error: 5.3742 |
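The table reports a Lip Vertex Error metric. As a point of reference, the sketch below shows one common way such a metric is computed (maximal per-frame L2 error over a set of lip vertices, averaged over frames); the lip vertex index set, units, and any scaling factor are dataset-specific assumptions and may differ from the exact protocol behind these numbers.

```python
import torch

def lip_vertex_error(pred: torch.Tensor, gt: torch.Tensor, lip_idx: torch.Tensor) -> float:
    """
    pred, gt: (T, V, 3) predicted / ground-truth vertex positions over T frames.
    lip_idx:  long tensor of lip-region vertex indices (dataset-specific assumption).
    Returns one common definition: the maximal per-frame L2 lip error, averaged over frames.
    """
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]     # (T, L, 3) per-vertex displacement
    per_vertex = diff.norm(dim=-1)                     # (T, L) L2 distance per lip vertex
    per_frame = per_vertex.max(dim=1).values           # worst lip vertex in each frame
    return per_frame.mean().item()                     # average over frames
```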