
Abstract

Text-to-image person re-identification (ReID) aims to retrieve images of a person based on a given textual description. The key challenge is to learn the relations between detailed information from the visual and textual modalities. Existing works focus on learning a latent space to narrow the modality gap and further build local correspondences between the two modalities. However, these methods assume that image-to-text and text-to-image associations are modality-agnostic, resulting in suboptimal associations. In this work, we show the discrepancy between image-to-text association and text-to-image association and propose CADA: Cross-Modal Adaptive Dual Association, which finely builds bidirectional image-text detailed associations. Our approach features a decoder-based adaptive dual association module that enables full interaction between visual and textual modalities, allowing for bidirectional and adaptive cross-modal correspondence associations. Specifically, we propose a bidirectional association mechanism: Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA). We adaptively model the ATP based on the fact that aggregating cross-modal features based on mistaken associations will lead to feature distortion. For modeling the ARA, since the attributes are typically the first distinguishing cues of a person, we propose to explore the attribute-level association by predicting the masked text phrase using the related image region. Finally, we learn the dual associations between texts and images, and the experimental results demonstrate the superiority of our dual formulation. Code will be made publicly available.
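
The claim that the two association directions differ can be illustrated with a small, generic attention example. This is a minimal sketch and not the CADA module itself: it assumes plain scaled dot-product similarity between token and patch features, and the names `dual_association`, `text_tokens`, and `image_patches` are hypothetical. Normalizing the same similarity matrix over patches (text-to-image) or over tokens (image-to-text) gives two different weight matrices, so one is not simply the transpose of the other.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dual_association(text_tokens, image_patches):
    """Return (text_to_image, image_to_text) attention weights.

    text_tokens: T feature vectors, image_patches: P feature vectors,
    all of the same dimension d. Illustrative only (assumed formulation).
    """
    d = len(text_tokens[0])
    # Scaled dot-product similarity, shape T x P.
    sim = [
        [sum(a * b for a, b in zip(t, p)) / math.sqrt(d) for p in image_patches]
        for t in text_tokens
    ]
    sim_transposed = [list(col) for col in zip(*sim)]  # shape P x T
    # Each text token distributes its weight over image patches.
    text_to_image = [softmax(row) for row in sim]
    # Each image patch distributes its weight over text tokens.
    image_to_text = [softmax(row) for row in sim_transposed]
    return text_to_image, image_to_text


random.seed(0)
text_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
image_patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
t2i, i2t = dual_association(text_tokens, image_patches)

# Transpose image-to-text weights to T x P so they can be compared entrywise.
i2t_transposed = [list(col) for col in zip(*i2t)]
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(t2i, i2t_transposed)
    for a, b in zip(row_a, row_b)
)
print(f"largest entrywise gap between the two directions: {max_gap:.3f}")
```

With 4 tokens and 6 patches, the text-to-image weights sum to 4 in total while the image-to-text weights sum to 6, so the two matrices cannot coincide. This is the kind of directional asymmetry that a dual formulation is meant to model separately.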
