Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Abstract
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. In addition, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss that enforces consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C3VG, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
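To make the idea of an explicit bidirectional box-mask consistency constraint concrete, the sketch below shows one plausible way such a loss could look in PyTorch. It is a minimal illustration under stated assumptions: the helper names (mask_to_box, consistency_loss), the tensor layouts, and the specific L1 and outside-box penalty terms are hypothetical and are not the paper's actual formulation.

```python
# Minimal sketch (PyTorch) of a bidirectional box-mask consistency loss.
# NOTE: illustrative assumption only; not the exact loss used in C3VG.
import torch
import torch.nn.functional as F


def mask_to_box(mask: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Derive a normalized (x1, y1, x2, y2) box from a soft mask of shape (H, W)."""
    h, w = mask.shape
    ys, xs = torch.nonzero(mask > thresh, as_tuple=True)
    if ys.numel() == 0:  # empty mask -> degenerate box
        return mask.new_zeros(4)
    return torch.stack([xs.min() / w, ys.min() / h,
                        (xs.max() + 1) / w, (ys.max() + 1) / h]).float()


def consistency_loss(pred_box: torch.Tensor,
                     pred_mask: torch.Tensor) -> torch.Tensor:
    """Encourage the REC box and the RIS mask to describe the same region.

    pred_box:  (4,) normalized (x1, y1, x2, y2) from the detection branch.
    pred_mask: (H, W) probabilities from the segmentation branch.
    """
    h, w = pred_mask.shape

    # Mask -> box direction: the box implied by the mask should match the predicted box.
    box_from_mask = mask_to_box(pred_mask.detach())
    loss_m2b = F.l1_loss(pred_box, box_from_mask)

    # Box -> mask direction: mask probability outside the predicted box should be small.
    x1, y1, x2, y2 = (pred_box.detach() * pred_box.new_tensor([w, h, w, h])).round().long()
    outside = torch.ones_like(pred_mask)
    outside[y1.clamp(0, h):y2.clamp(0, h), x1.clamp(0, w):x2.clamp(0, w)] = 0.0
    loss_b2m = (pred_mask * outside).mean()

    return loss_m2b + loss_b2m
```

In this sketch, the two directions are symmetric: the segmentation output constrains the detection output through the box it implies, and the detection output constrains the segmentation output by penalizing mask mass outside the predicted region, which mirrors the abstract's notion of enforcing consistency between the two task predictions.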