Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao

Abstract
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine whether they are the same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
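To make the construction concrete, the following is a minimal Python sketch of the triplet building and rule-based reward described in the abstract. The specific augmentations, the "Answer: same/different" output format, and all function names (build_triplet, make_training_pairs, comparison_reward) are illustrative assumptions, not details specified by the paper.

```python
# Sketch of the self-supervised comparison task: two augmented views of one
# image form a "same" pair; a similar but distinct image yields a "different"
# pair. Augmentation choices here are assumptions for illustration.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

def build_triplet(image, similar_image):
    """Return (view_a, view_b, distractor).

    view_a / view_b: two augmented views of the same image.
    distractor: an augmented view of a similar but distinct image.
    """
    return augment(image), augment(image), augment(similar_image)

def make_training_pairs(image, similar_image):
    """Pair up triplet members with their ground-truth comparison labels."""
    view_a, view_b, distractor = build_triplet(image, similar_image)
    return [((view_a, view_b), "same"),
            ((view_a, distractor), "different")]

def comparison_reward(model_output: str, label: str) -> float:
    """Rule-based reward: 1.0 if the final verdict matches the label.

    Assumes the model ends its chain-of-thought with a line such as
    "Answer: same" or "Answer: different" (a format assumption).
    """
    verdict = model_output.strip().lower().rsplit("answer:", 1)[-1].strip()
    return 1.0 if verdict.startswith(label) else 0.0
```

Because the reward checks only the final verdict, the model is free to produce any intermediate reasoning, which is what allows the rule-based signal to shape the CoT process without human-written annotations.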