Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval

Qian Wu, Jianting Zhang, Yafei Lv, Zaidao Wen, Gang Hu

Abstract

Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging task that aims to retrieve target remote-sensing (RS) images based on textual descriptions. However, the modality gap between texts and RS images poses a significant challenge. RS images comprise multiple targets and complex backgrounds, necessitating the mining of both global and local information for effective CMRSITR. Existing approaches primarily focus on local image features while disregarding the local features of the text and their correspondence. These methods typically fuse global and local image features and align them with global text features. However, they struggle to eliminate the influence of cluttered backgrounds and may overlook crucial targets. To address these limitations, we propose a novel framework for CMRSITR based on a transformer architecture, which leverages global–local information soft alignment (GLISA) to enhance retrieval performance. Our framework incorporates a global image extraction module, which captures the global semantic features of image–text pairs and effectively represents the relationships among multiple targets in RS images. In addition, we introduce an adaptive local information extraction (ALIE) module that adaptively mines discriminative local clues from both RS images and texts, aligning the corresponding fine-grained information. To mitigate semantic ambiguities during the alignment of local features, we design a local information soft-alignment (LISA) module. In comparative evaluations on two public CMRSITR datasets, our proposed method achieves state-of-the-art results, surpassing not only traditional cross-modal retrieval methods by a substantial margin but also other contrastive language–image pretraining (CLIP)-based methods.
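To give a flavor of the soft-alignment idea, the sketch below shows one common way such cross-modal local alignment is computed: each image region attends over all text tokens, and the attended text vector is compared back to the region. This is an illustrative NumPy version under assumed inputs (L2-normalized region and token features, an arbitrary temperature), not the authors' exact LISA formulation.

```python
import numpy as np

def soft_align_score(img_regions, txt_tokens, temperature=9.0):
    """Illustrative soft alignment between image region features and
    text token features (not the paper's exact LISA module).

    img_regions: (m, d) array, rows L2-normalized.
    txt_tokens:  (n, d) array, rows L2-normalized.
    Returns a scalar image-text similarity in [-1, 1].
    """
    # cosine similarity between every region and every token
    sim = img_regions @ txt_tokens.T                      # (m, n)
    # soft attention weights: each region attends over all tokens
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)
    # attended text vector per region, renormalized
    attended = attn @ txt_tokens                          # (m, d)
    attended /= np.linalg.norm(attended, axis=1, keepdims=True)
    # region-level cosine scores, averaged into one global score
    region_scores = (img_regions * attended).sum(axis=1)  # (m,)
    return float(region_scores.mean())
```

Because the alignment is soft (a weighted average over all tokens) rather than a hard one-to-one matching, a region with no clearly matching token still receives a smooth, low score instead of an arbitrary hard assignment, which is the kind of semantic ambiguity a soft-alignment design is meant to absorb.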

Benchmarks

Benchmark: cross-modal-retrieval-on-rsicd (Methodology: GLISA)
  Image-to-text R@1: 20.68%
  Text-to-image R@1: 14.73%
  Mean Recall: 37.69%

Benchmark: cross-modal-retrieval-on-rsitmd (Methodology: GLISA)
  Image-to-text R@1: 32.08%
  Text-to-image R@1: 23.36%
  Mean Recall: 50.69%
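R@1 and Mean Recall are standard retrieval metrics: R@k is the fraction of queries whose ground-truth match appears in the top-k ranked results, and Mean Recall is typically the average of R@1/R@5/R@10 over both retrieval directions. A minimal sketch, assuming one ground-truth match per query at the same index (the RSICD/RSITMD datasets actually pair each image with several captions, so the real protocol differs slightly):

```python
import numpy as np

def recall_at_k(sim, k):
    """R@k given a query-by-gallery similarity matrix; ground truth is
    assumed to be gallery item i for query i. Returns a percentage."""
    # rank gallery items for each query by descending similarity
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return 100.0 * hits.mean()

def mean_recall(sim_i2t, sim_t2i):
    """Average of R@1/R@5/R@10 over image-to-text and text-to-image."""
    scores = [recall_at_k(s, k)
              for s in (sim_i2t, sim_t2i)
              for k in (1, 5, 10)]
    return sum(scores) / len(scores)
```

For example, a perfect diagonal similarity matrix gives R@1 = 100%, while each query whose best-scoring gallery item is wrong lowers R@1 but may still count toward R@5 or R@10.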
