GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu

Abstract
Generating full-body human gestures from speech signals remains challenging in terms of both quality and speed. Existing approaches model different body regions such as the body, legs, and hands separately, which fails to capture the spatial interactions between them and results in unnatural and disjointed movements. Additionally, their autoregressive or diffusion-based pipelines generate slowly because they require dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention, generating coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta-distribution timestep sampling during training, which enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with the improved flow-matching framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project page: https://andypinxinliu.github.io/GestureLSM
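To make the training recipe concrete, below is a minimal sketch of a flow-matching training step that draws timesteps from a Beta distribution instead of Uniform(0, 1), as the abstract describes. The model interface, latent dimensionality, and the Beta parameters (`alpha`, `beta`) are illustrative assumptions, not the authors' implementation; the shortcut-learning and spatial-temporal attention components are omitted for brevity.

```python
# Hedged sketch: flow-matching training with Beta-distributed timesteps.
# All hyperparameters and the network are placeholders, not GestureLSM's code.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Toy stand-in for a latent velocity model (assumed interface)."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by simple concatenation (illustrative).
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))


def flow_matching_step(model, x1, alpha=2.0, beta=1.0):
    """One training step: regress the straight-line velocity x1 - x0.

    t ~ Beta(alpha, beta) rather than Uniform(0, 1); the concrete
    (alpha, beta) values here are assumptions for illustration.
    """
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.distributions.Beta(alpha, beta).sample((x1.shape[0],))
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1  # linear interpolant
    target_v = x1 - x0                             # ground-truth velocity
    pred_v = model(x_t, t)
    return ((pred_v - target_v) ** 2).mean()


# Usage: one optimization step on a dummy batch of gesture latents.
model = VelocityField()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
opt.zero_grad()
loss = flow_matching_step(model, torch.randn(32, 256))
loss.backward()
opt.step()
```

Skewing the timestep distribution concentrates training signal on the interpolation regime where the velocity field is hardest to learn; the uniform-sampling baseline is recovered with `alpha = beta = 1.0`.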
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| gesture-generation-on-beat2 | GestureLSM | FGD (Fréchet Gesture Distance): 0.4040 |