Han Haochen; Miao Kaiyao; Zheng Qinghua; Luo Minnan

Abstract
Despite the success of multimodal learning in cross-modal retrieval tasks, this remarkable progress relies on correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy-correspondence datasets degrades performance because cross-modal retrieval methods can wrongly force mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process that encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses meta-data as prior knowledge to remove noisy samples. Extensive experiments demonstrate the strengths of our method under both synthetic and real-world noise, on Flickr30K, MS-COCO, and Conceptual Captions.
Benchmarks
| Benchmark | Methodology | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | R-Sum |
|---|---|---|---|---|---|---|---|---|
| cross-modal-retrieval-with-noisy-1 | MSCN | 40.1 | 65.7 | 76.6 | 40.6 | 67.4 | 76.3 | 366.7 |
| cross-modal-retrieval-with-noisy-2 | MSCN | 77.4 | 94.9 | 97.6 | 59.6 | 83.2 | 89.2 | 501.9 |
| cross-modal-retrieval-with-noisy-3 | MSCN | 78.1 | 97.2 | 98.8 | 64.3 | 90.4 | 95.8 | 524.6 |
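
Each R-Sum value in the table equals the sum of the six recall figures in its row (image-to-text and text-to-image R@1, R@5, R@10). The sketch below illustrates how such metrics are commonly computed. It is not the authors' evaluation code: the function names `recall_at_k` and `r_sum` are hypothetical, and the sketch assumes a single ground-truth match per query, whereas caption datasets typically pair several captions with each image.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    sim: list of rows, sim[q][j] = similarity of query q to candidate j.
    gt:  gt[q] = index of the single correct candidate for query q
         (a simplifying assumption made for this sketch).
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if gt[q] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(i2t, t2i):
    """Sum of the six recall values (R@1, R@5, R@10 in both directions)."""
    return round(sum(i2t) + sum(t2i), 1)


# Recall values copied from the three benchmark rows above.
rows = {
    "cross-modal-retrieval-with-noisy-1": ((40.1, 65.7, 76.6), (40.6, 67.4, 76.3)),
    "cross-modal-retrieval-with-noisy-2": ((77.4, 94.9, 97.6), (59.6, 83.2, 89.2)),
    "cross-modal-retrieval-with-noisy-3": ((78.1, 97.2, 98.8), (64.3, 90.4, 95.8)),
}
for name, (i2t, t2i) in rows.items():
    print(name, r_sum(i2t, t2i))
```

Because R-Sum aggregates both retrieval directions, it gives a single number for comparing rows, at the cost of hiding which direction or cutoff drives a difference.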