3 months ago

LASER: Lip Landmark Assisted Speaker Detection for Robustness

Le Thien Phuc Nguyen Zhuoran Yu Yong Jae Lee

Abstract

Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at https://github.com/plnguyen2908/LASER_ASD.
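
The two mechanisms named in the abstract, encoding lip landmark coordinates into dense feature maps and aligning lip-aware and face-only predictions with a consistency loss, can be illustrated with a minimal NumPy sketch. The abstract does not specify the encoding scheme or the form of the loss, so everything below is an assumption for illustration: landmarks are taken to be normalized to [0, 1] and rasterized into one binary map each, and the consistency term is taken to be a KL divergence between the two softmax outputs. The function names `landmarks_to_maps` and `consistency_loss` are hypothetical and are not taken from the official repository.

```python
import numpy as np


def landmarks_to_maps(landmarks, height, width):
    """Rasterize normalized (x, y) landmarks into one binary map per landmark.

    Assumption: coordinates lie in [0, 1]; the paper's actual encoding may differ.
    """
    pts = np.asarray(landmarks, dtype=np.float64).reshape(-1, 2)
    maps = np.zeros((len(pts), height, width), dtype=np.float32)
    # Scale normalized coordinates to pixel indices and clip them to the grid.
    cols = np.clip(np.round(pts[:, 0] * (width - 1)).astype(int), 0, width - 1)
    rows = np.clip(np.round(pts[:, 1] * (height - 1)).astype(int), 0, height - 1)
    # Mark each landmark's pixel in its own channel.
    maps[np.arange(len(pts)), rows, cols] = 1.0
    return maps


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def consistency_loss(lip_aware_logits, face_only_logits, eps=1e-8):
    """Mean KL(p_lip_aware || p_face_only) over frames.

    Assumption: KL divergence stands in for the paper's consistency loss.
    """
    p = softmax(np.asarray(lip_aware_logits, dtype=np.float64))
    q = softmax(np.asarray(face_only_logits, dtype=np.float64))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

In this sketch, `landmarks_to_maps(points, 112, 112)` would yield a `(num_landmarks, 112, 112)` array that could be stacked with a face crop as extra channels, and `consistency_loss` would be added to the main classification loss so that the face-only branch stays usable when the landmark detector fails.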

Code Repositories

plnguyen2908/laser_asd
Official
pytorch
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-active-speaker-detection-on-ava | LoCoNet + Laser | validation mean average precision: 95.3%
