5 months ago

See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Shu Xiujun ; Wen Wei ; Wu Haoqian ; Chen Keyu ; Song Yiran ; Qiao Ruizhi ; Ren Bo ; Wang Xiao

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
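The retrieval step implied by a "common latent space" can be illustrated with a minimal sketch: once a text query and the gallery images are embedded in the same space, the gallery is ranked by similarity to the query. The code below is not the IVT implementation. The function names, the choice of cosine similarity, and the three-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(text_embedding, image_embeddings):
    """Return gallery indices sorted from most to least similar to the text query."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical embeddings; a real system would obtain these from a trained encoder.
text_query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],  # image 0: mostly unrelated to the query
    [1.0, 0.0, 0.0],  # image 1: closest to the query
    [0.5, 0.5, 0.0],  # image 2: partial match
]

ranking = rank_gallery(text_query, gallery)
print(ranking)
```

In this toy setup the ranking places image 1 first, then image 2, then image 0, which is the order a person-retrieval system would return to the user.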

Code Repositories

tencentyouturesearch/personretrieval-ivt
Official
pytorch
Mentioned in GitHub

Benchmarks

Benchmark                                  Methodology  Rank-1  Rank-5  Rank-10  mAP    mINP
text-based-person-retrieval-with-noisy     IVT          58.59   78.51   85.61    57.19  45.78
text-based-person-retrieval-with-noisy-1   IVT          50.21   69.14   76.18    34.72  8.77
text-based-person-retrieval-with-noisy-2   IVT          43.65   66.50   75.70    37.22  20.47
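For readers unfamiliar with the metrics above, the following sketch shows how Rank-k, average precision (AP), and inverse negative penalty (INP) are commonly computed for a single query in retrieval and re-identification work; mAP and mINP are the means of AP and INP over all queries. The exact evaluation protocol behind these leaderboard numbers is not stated on this page, so the definitions and function names here are assumptions, and the identity lists are invented toy data.

```python
def rank_k(ranked_ids, query_id, k):
    """1.0 if any of the top-k gallery items has the query identity, else 0.0."""
    return 1.0 if query_id in ranked_ids[:k] else 0.0


def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precisions = []
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def inverse_negative_penalty(ranked_ids, query_id):
    """Number of correct matches divided by the rank of the last (hardest) one."""
    positions = [
        position
        for position, gallery_id in enumerate(ranked_ids, start=1)
        if gallery_id == query_id
    ]
    if not positions:
        return 0.0
    return len(positions) / positions[-1]


# Toy ranked gallery: identity labels ordered from most to least similar.
ranked_ids = [7, 3, 7, 5, 7]

print(rank_k(ranked_ids, 7, 1))
print(average_precision(ranked_ids, 7))
print(inverse_negative_penalty(ranked_ids, 7))
```

INP penalizes a ranking whose last correct match sits deep in the list, which is why a method can post a reasonable Rank-1 while its mINP stays low, as in the second benchmark row.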
