Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Abstract
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. In addition, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss that enforces consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C3VG, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
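To make the idea of an explicit bidirectional box-mask consistency constraint concrete, the sketch below shows one plausible way such a loss could look in PyTorch. It is a minimal illustration under stated assumptions: the helper names (mask_to_box, consistency_loss), the tensor layouts, and the specific L1 and outside-box penalty terms are hypothetical and are not the paper's actual formulation.

```python
# Minimal sketch (PyTorch) of a bidirectional box-mask consistency loss.
# NOTE: illustrative assumption only; not the exact loss used in C3VG.
import torch
import torch.nn.functional as F


def mask_to_box(mask: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Derive a normalized (x1, y1, x2, y2) box from a soft mask of shape (H, W)."""
    h, w = mask.shape
    ys, xs = torch.nonzero(mask > thresh, as_tuple=True)
    if ys.numel() == 0:  # empty mask -> degenerate box
        return mask.new_zeros(4)
    return torch.stack([xs.min() / w, ys.min() / h,
                        (xs.max() + 1) / w, (ys.max() + 1) / h]).float()


def consistency_loss(pred_box: torch.Tensor,
                     pred_mask: torch.Tensor) -> torch.Tensor:
    """Encourage the REC box and the RIS mask to describe the same region.

    pred_box:  (4,) normalized (x1, y1, x2, y2) from the detection branch.
    pred_mask: (H, W) probabilities from the segmentation branch.
    """
    h, w = pred_mask.shape

    # Mask -> box direction: the box implied by the mask should match the predicted box.
    box_from_mask = mask_to_box(pred_mask.detach())
    loss_m2b = F.l1_loss(pred_box, box_from_mask)

    # Box -> mask direction: mask probability outside the predicted box should be small.
    x1, y1, x2, y2 = (pred_box.detach() * pred_box.new_tensor([w, h, w, h])).round().long()
    outside = torch.ones_like(pred_mask)
    outside[y1.clamp(0, h):y2.clamp(0, h), x1.clamp(0, w):x2.clamp(0, w)] = 0.0
    loss_b2m = (pred_mask * outside).mean()

    return loss_m2b + loss_b2m
```

In this sketch, the two directions are symmetric: the segmentation output constrains the detection output through the box it implies, and the detection output constrains the segmentation output by penalizing mask mass outside the predicted region, which mirrors the abstract's notion of enforcing consistency between the two task predictions.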