Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang


Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
