Haizhou Li, Mike Zheng Shou, Xinyuan Qian, Rohan Kumar Das, Zexu Pan, Ruijie Tao

Abstract
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5% and 3.0% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker validation and test sets, respectively. We will release the code, models, and data logs.
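To make the architecture concrete, below is a minimal PyTorch sketch of the two attention stages the abstract describes: cross-attention so each modality attends to the other, followed by self-attention over the fused features for long-term speaking evidence. This is not the authors' released code; the dimensions, layer counts, class names (`CrossAttention`, `TalkNetSketch`), and the two-class per-frame output are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """One modality (query) attends to the other (key/value)."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)  # residual connection


class TalkNetSketch(nn.Module):
    """Cross-attention fusion followed by long-term self-attention.

    A hypothetical sketch of the pipeline described in the abstract;
    hyperparameters are placeholders, not the paper's values.
    """

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.a2v = CrossAttention(dim, heads)  # audio attends to video
        self.v2a = CrossAttention(dim, heads)  # video attends to audio
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * dim, nhead=heads, batch_first=True
        )
        self.self_attn = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.classifier = nn.Linear(2 * dim, 2)  # speaking / not speaking

    def forward(self, audio_feats, visual_feats):
        # Inputs: (batch, time, dim) sequences, assumed to come from the
        # short-term audio and visual temporal encoders (not shown here).
        a = self.a2v(audio_feats, visual_feats)
        v = self.v2a(visual_feats, audio_feats)
        fused = torch.cat([a, v], dim=-1)   # (batch, time, 2*dim)
        fused = self.self_attn(fused)       # long-term speaking evidence
        return self.classifier(fused)       # per-frame logits


# Usage with random tensors standing in for real encoder outputs:
model = TalkNetSketch()
audio = torch.randn(2, 100, 128)   # 2 clips, 100 frames
video = torch.randn(2, 100, 128)
logits = model(audio, video)       # (2, 100, 2) per-frame decisions
```

The key design point carried over from the abstract is that the per-frame decision is made only after the self-attention layer has seen the whole fused sequence, rather than instantaneously from short-term features.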
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-visual-active-speaker-detection-on-ava | TalkNet | validation mean average precision: 92.3% |