8 months ago

Computer Vision

Multimodal Representation

Min Kyle ; Roy Sourya ; Tripathi Subarna ; Guha Tanaya ; Majumdar Somdeb

Abstract

Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
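
The graph construction described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation (see their repository for that). The function name `build_spatiotemporal_edges`, the `(frame_index, person_id)` input format, and the `max_frame_gap` window are all assumptions introduced here for illustration. Audiovisual node features and the graph neural network that classifies each node as speaking or not speaking are left out.

```python
from collections import defaultdict
from itertools import combinations


def build_spatiotemporal_edges(detections, max_frame_gap=1):
    """Return a sorted, undirected edge list over per-frame person nodes.

    detections: list of (frame_index, person_id) tuples; the position of
        each tuple in the list is used as its node index.
    max_frame_gap: hypothetical parameter (not from the abstract) limiting
        how many frames apart two nodes of one person may be linked.
    """
    by_frame = defaultdict(list)
    by_person = defaultdict(list)
    for node, (frame, person) in enumerate(detections):
        by_frame[frame].append(node)
        by_person[person].append(node)

    edges = set()

    # Spatial edges: connect all nodes that share a frame
    # (inter-person relationships).
    for nodes in by_frame.values():
        for a, b in combinations(nodes, 2):
            edges.add((min(a, b), max(a, b)))

    # Temporal edges: connect nodes of the same person across nearby
    # frames (temporal dynamics).
    for nodes in by_person.values():
        for a, b in combinations(nodes, 2):
            gap = abs(detections[a][0] - detections[b][0])
            if gap <= max_frame_gap:
                edges.add((min(a, b), max(a, b)))

    return sorted(edges)


# Two people ("A", "B") over three frames; "B" is absent in frame 2.
detections = [(0, "A"), (0, "B"), (1, "A"), (1, "B"), (2, "A")]
edges = build_spatiotemporal_edges(detections, max_frame_gap=1)
print(edges)  # [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4)]
```

In this toy example the five nodes receive 5 edges, whereas a fully connected graph over the same nodes would have 10. The saving grows with the number of frames, which is the intuition behind avoiding a fully connected graph over a long temporal window.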



Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection