Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

Abstract

Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
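
The graph construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_graph_edges`, the `(frame_index, person_id)` detection format, and the `max_frame_gap` parameter are all assumptions made for illustration. It shows only the edge rules stated above: one node per person per frame, edges between nodes of the same person in nearby frames, and edges between nodes that share a frame.

```python
from itertools import combinations
from typing import List, Set, Tuple

# One node per detection: (frame_index, person_id). Hypothetical format.
Detection = Tuple[int, str]


def build_graph_edges(
    detections: List[Detection], max_frame_gap: int = 1
) -> Set[Tuple[int, int]]:
    """Return undirected edges (i, j), i < j, over detection indices.

    max_frame_gap is an illustrative parameter, not a value from the paper.
    """
    edges: Set[Tuple[int, int]] = set()
    for i, j in combinations(range(len(detections)), 2):
        frame_i, person_i = detections[i]
        frame_j, person_j = detections[j]
        # Spatial edge: two different detections in the same frame.
        same_frame = frame_i == frame_j
        # Temporal edge: the same person in frames that are close in time.
        same_person_nearby = (
            person_i == person_j and abs(frame_i - frame_j) <= max_frame_gap
        )
        if same_frame or same_person_nearby:
            edges.add((i, j))
    return edges


detections = [(0, "A"), (0, "B"), (1, "A"), (1, "B"), (2, "A")]
edges = build_graph_edges(detections, max_frame_gap=1)
print(sorted(edges))  # [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4)]
```

With five nodes, a fully connected graph would have ten edges, while these rules produce five. Under these assumptions, each node would then be labeled speaking or not speaking by a node classifier operating on the resulting graph.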

Code Repositories

sra2/spell (official, PyTorch)
kylemin/SPELL (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-visual-active-speaker-detection-on-ava | SPELL | validation mean average precision: 94.2% |
| audio-visual-active-speaker-detection-on-ava | SPELL+ | validation mean average precision: 94.9% |
| node-classification-on-ava | UniCon [zhang2021unicon] | mAP: 92 |
| node-classification-on-ava | MAAS-TAN [MAAS2021] | mAP: 88.8 |
| node-classification-on-ava | ASDNet [ASDNet_ICCV2021] | mAP: 93.5 |
| node-classification-on-ava | TalkNet [tao2021someone] | mAP: 92.3 |
