
Abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and from language shortcuts, where they skip the visual evidence and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning through reinforcement learning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce a self-contained visual perception that is sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and the outcome is used to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
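To make the two-stage reward concrete, the following is a minimal sketch of how the self-reward described above could be computed, assuming a generic `generate(prompt, image)` sampling interface for the policy VLM. The function names, the "Answer:" delimiter, the prompt wording, and the weighting `alpha` are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, Optional, Tuple


def parse_perception_and_answer(output: str) -> Tuple[str, str]:
    """Split the policy's output into (perception, final answer).

    Assumes the prompt asks the model to describe its visual evidence first
    and give its final answer after an 'Answer:' marker; the exact delimiter
    is an assumption made for this sketch.
    """
    if "Answer:" in output:
        perception, answer = output.rsplit("Answer:", 1)
        return perception.strip(), answer.strip()
    return output.strip(), ""


def vision_sr1_reward(
    generate: Callable[[str, Optional[object]], str],
    image: object,
    question: str,
    gold_answer: str,
    alpha: float = 0.5,
) -> float:
    """Combine a self-rewarded perception score with final-answer matching.

    `generate(prompt, image)` stands in for the policy VLM's sampling call;
    both this interface and the weighting `alpha` are hypothetical.
    """
    # Stage 1: visual perception + language reasoning, conditioned on the image.
    stage1 = generate(
        f"{question}\nDescribe the relevant visual evidence, "
        "then give the final answer after 'Answer:'.",
        image,
    )
    perception, answer = parse_perception_and_answer(stage1)

    # Final-output reward: simple verifiable answer matching against the reference.
    r_answer = float(answer == gold_answer.strip())

    # Stage 2: re-prompt the SAME model with only the generated perception
    # (no image). If that text alone suffices to recover the correct answer,
    # the perception is judged self-contained and earns the self-reward.
    stage2 = generate(
        f"{perception}\n\n{question}\nGive the final answer after 'Answer:'.",
        None,
    )
    _, answer_from_text = parse_perception_and_answer(stage2)
    r_perception = float(answer_from_text == gold_answer.strip())

    # Balanced training signal combining visual perception and language reasoning.
    return alpha * r_perception + (1.0 - alpha) * r_answer
```

In this sketch the combined scalar would serve as the per-sample reward in the reinforcement-learning update; how the two terms are actually weighted and optimized is specified in the paper itself.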