MAAS: Multi-modal Assignation for Active Speaker Detection

Juan León-Alcázar; Fabian Caba Heilbron; Ali Thabet; Bernard Ghanem

Abstract

Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherently multi-modal nature, current methods still focus on modeling and fusing short-term audiovisual features for individual speakers, often at the frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.
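
The assignment idea described in the abstract can be illustrated with a small PyTorch sketch (the official code is at fuankarion/maas; the layer sizes, star-shaped adjacency, and scoring head below are simplifying assumptions, not the paper's exact architecture): a single audio node representing the detected speech event is connected to one visual node per candidate speaker in the frame, one round of message passing mixes the two modalities, and each visual node is then scored as speaking or not.

```python
# Minimal sketch (not the official MAAS implementation): a single-frame
# audio-visual graph where one audio node is connected to every candidate
# speaker's visual node, and each visual node is scored as speaking / not
# speaking. Feature dimension, hidden width, and the scoring head are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAssignmentGraph(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        # Shared projection for audio and visual node features.
        self.node_proj = nn.Linear(feat_dim, hidden_dim)
        # One round of message passing: a node is updated from the mean of
        # its neighbours concatenated with its own state.
        self.msg = nn.Linear(2 * hidden_dim, hidden_dim)
        # Per-visual-node active-speaker score.
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feat: torch.Tensor, visual_feats: torch.Tensor):
        """
        audio_feat:   (feat_dim,)               embedding of the speech event
        visual_feats: (num_speakers, feat_dim)  one embedding per face crop
        returns:      (num_speakers,)           speaking probability per face
        """
        # Node set: index 0 is the audio node, the rest are visual nodes.
        nodes = torch.cat([audio_feat.unsqueeze(0), visual_feats], dim=0)
        h = F.relu(self.node_proj(nodes))

        # Star-shaped adjacency: audio <-> every visual node.
        n = h.size(0)
        adj = torch.zeros(n, n)
        adj[0, 1:] = 1.0
        adj[1:, 0] = 1.0

        # Mean aggregation over neighbours, then node update.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neighbour_mean = adj @ h / deg
        h = F.relu(self.msg(torch.cat([h, neighbour_mean], dim=1)))

        # Score only the visual nodes (indices 1..n-1).
        return torch.sigmoid(self.classifier(h[1:])).squeeze(-1)


if __name__ == "__main__":
    model = LocalAssignmentGraph()
    audio = torch.randn(128)        # speech-event embedding for one frame
    faces = torch.randn(3, 128)     # three candidate speakers in the scene
    print(model(audio, faces))      # three speaking probabilities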
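```

Per the abstract, the single-frame graph corresponds to the local model (MAAS-LAN), while MAAS-TAN extends it temporally; a natural reading is that per-frame graphs over a short window are linked by edges between nodes of the same speaker, though the exact temporal connectivity is defined in the paper and official repository rather than in this sketch.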

Code Repositories

fuankarion/maas (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metric
audio-visual-active-speaker-detection-on-ava | MAAS-TAN | validation mean average precision: 88.8%
audio-visual-active-speaker-detection-on-ava | MAAS-LAN | validation mean average precision: 85.1%
