8 months ago

Video Understanding

Multimodal Representation

Computer Vision

Juan León Alcázar, Fabian Caba Heilbron, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbeláez, Bernard Ghanem

Abstract

Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies that verify that this result is a direct consequence of our long-term multi-speaker analysis.
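To make the idea of a structured ensemble more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the padding rule for missing context speakers, the dot-product softmax used for pairwise weighting, and the simple mean over time are all assumptions chosen for illustration. It only shows how per-speaker, per-timestep features might be arranged with a reference speaker first, related pairwise, and then summarized across time.

```python
import math

def build_ensemble(features, reference, num_context):
    """Arrange features as [time][speaker][dim], reference speaker first.

    `features` maps a (hypothetical) speaker id to a list of per-timestep
    feature vectors; all speakers are assumed to share the same length.
    """
    others = [s for s in sorted(features) if s != reference]
    # Assumption: when too few context speakers exist, pad with the reference.
    chosen = (others + [reference] * num_context)[:num_context]
    order = [reference] + chosen
    num_steps = len(features[reference])
    return [[features[s][t] for s in order] for t in range(num_steps)]

def pairwise_weights(step):
    """Softmax over scaled dot products between the reference and each speaker."""
    ref = step[0]
    scale = math.sqrt(len(ref))
    scores = [sum(a * b for a, b in zip(ref, vec)) / scale for vec in step]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vectors(ensemble):
    """Weighted sum of speaker features at each time step."""
    out = []
    for step in ensemble:
        weights = pairwise_weights(step)
        dim = len(step[0])
        out.append([
            sum(weights[i] * step[i][d] for i in range(len(step)))
            for d in range(dim)
        ])
    return out

def temporal_mean(vectors):
    """Crude long-horizon summary: average the context vectors over time."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy audio-visual features for two speakers over three time steps.
feats = {
    "spk_a": [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],
    "spk_b": [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
}
ensemble = build_ensemble(feats, "spk_a", num_context=2)
summary = temporal_mean(context_vectors(ensemble))
print(summary)
```

In a learned model, the fixed softmax weighting and the temporal mean would presumably be replaced by trainable pairwise and temporal modules; the sketch only illustrates the data layout and the two kinds of relations the abstract names.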

Active Speakers in Context