Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Jiang Ding ; Ye Mang

Abstract
Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.
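The Similarity Distribution Matching objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the temperature value, the epsilon smoothing term, and the direction of the KL divergence (predicted distribution relative to the label distribution) are all assumptions made for illustration. The abstract only states that a KL divergence between image-text similarity distributions and normalized label matching distributions is minimized.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row):
    """Numerically stable softmax over one row of scaled similarities."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_labels(scaled_sim, labels, eps):
    """Mean over rows of KL(p_i || q_i), where p_i is the softmax of the
    similarities and q_i is the row-normalized label matching vector."""
    loss = 0.0
    for sim_row, label_row in zip(scaled_sim, labels):
        p = softmax(sim_row)
        label_sum = sum(label_row)  # at least 1: each sample matches itself
        q = [y / label_sum for y in label_row]
        # eps (an assumed smoothing constant) avoids division by zero
        # where the label distribution is zero.
        loss += sum(pj * math.log(pj / (qj + eps))
                    for pj, qj in zip(p, q) if pj > 0.0)
    return loss / len(scaled_sim)


def sdm_loss(image_feats, text_feats, person_ids, temperature=0.02, eps=1e-8):
    """Hypothetical SDM-style loss: image-to-text plus text-to-image KL terms.
    The temperature and eps defaults are illustrative assumptions."""
    n = len(person_ids)
    # labels[i][j] = 1 when sample i and sample j show the same person.
    labels = [[1.0 if person_ids[i] == person_ids[j] else 0.0
               for j in range(n)] for i in range(n)]
    i2t = [[cosine(f, g) / temperature for g in text_feats]
           for f in image_feats]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for text-to-image
    return kl_to_labels(i2t, labels, eps) + kl_to_labels(t2i, labels, eps)
```

As a usage example, calling `sdm_loss` on a batch where each image feature points in the same direction as its paired text feature should give a loss close to zero, while swapping the text features so that pairs are mismatched should give a much larger value.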
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| nlp-based-person-retrival-on-cuhk-pedes | IRRA | R@1: 73.38 R@5: 89.93 R@10: 93.71 mAP: 66.13 mINP: 50.24 |
| text-based-person-retrieval-on-icfg-pedes | IRRA | R@1: 63.46 R@5: 80.25 R@10: 85.82 mAP: 38.06 mINP: 7.93 |
| text-based-person-retrieval-on-rstpreid-1 | IRRA | R@1: 60.20 R@5: 81.30 R@10: 88.20 |
| text-based-person-retrieval-with-noisy | IRRA | R@1: 69.74 R@5: 87.09 R@10: 92.20 mAP: 62.28 mINP: 45.84 |
| text-based-person-retrieval-with-noisy-1 | IRRA | R@1: 60.76 R@5: 78.26 R@10: 84.01 mAP: 35.87 mINP: 6.80 |
| text-based-person-retrieval-with-noisy-2 | IRRA | R@1: 58.75 R@5: 81.90 R@10: 88.25 mAP: 46.38 mINP: 24.78 |