See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Xiujun Shu Wei Wen Haoqian Wu Keyu Chen Yiran Song Ruizhi Qiao Bo Ren Xiao Wang

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
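
The retrieval step that a common latent space makes possible can be illustrated with a minimal sketch. This is not the IVT model or the authors' code: the embeddings below are made-up toy vectors, and the names `cosine`, `rank_gallery`, `text_embedding`, and `gallery` are hypothetical. It only assumes that a text query and each gallery image have already been mapped into the same vector space, and that candidates are ranked by cosine similarity, which is a common choice but not one the abstract states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_embedding, gallery):
    """Return (image_id, score) pairs sorted from best to worst match."""
    scored = [
        (image_id, cosine(text_embedding, image_embedding))
        for image_id, image_embedding in gallery.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration): one text query and three gallery images,
# all already embedded in the same 3-dimensional space.
text_embedding = [0.9, 0.1, 0.0]
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}

ranking = rank_gallery(text_embedding, gallery)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")
```

With these toy vectors the query ranks `img_a` first, then `img_c`, then `img_b`. In a real system the vectors would come from the trained encoder rather than being written by hand.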

