
Abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and from language shortcuts, where they skip the visual evidence and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning through reinforcement learning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce a self-contained visual perception that is sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and the outcome is used to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
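To make the two-stage reward concrete, the following is a minimal sketch of how the self-reward described above could be computed, assuming a generic `generate(prompt, image)` sampling interface for the policy VLM. The function names, the "Answer:" delimiter, the prompt wording, and the weighting `alpha` are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, Optional, Tuple


def parse_perception_and_answer(output: str) -> Tuple[str, str]:
    """Split the policy's output into (perception, final answer).

    Assumes the prompt asks the model to describe its visual evidence first
    and give its final answer after an 'Answer:' marker; the exact delimiter
    is an assumption made for this sketch.
    """
    if "Answer:" in output:
        perception, answer = output.rsplit("Answer:", 1)
        return perception.strip(), answer.strip()
    return output.strip(), ""


def vision_sr1_reward(
    generate: Callable[[str, Optional[object]], str],
    image: object,
    question: str,
    gold_answer: str,
    alpha: float = 0.5,
) -> float:
    """Combine a self-rewarded perception score with final-answer matching.

    `generate(prompt, image)` stands in for the policy VLM's sampling call;
    both this interface and the weighting `alpha` are hypothetical.
    """
    # Stage 1: visual perception + language reasoning, conditioned on the image.
    stage1 = generate(
        f"{question}\nDescribe the relevant visual evidence, "
        "then give the final answer after 'Answer:'.",
        image,
    )
    perception, answer = parse_perception_and_answer(stage1)

    # Final-output reward: simple verifiable answer matching against the reference.
    r_answer = float(answer == gold_answer.strip())

    # Stage 2: re-prompt the SAME model with only the generated perception
    # (no image). If that text alone suffices to recover the correct answer,
    # the perception is judged self-contained and earns the self-reward.
    stage2 = generate(
        f"{perception}\n\n{question}\nGive the final answer after 'Answer:'.",
        None,
    )
    _, answer_from_text = parse_perception_and_answer(stage2)
    r_perception = float(answer_from_text == gold_answer.strip())

    # Balanced training signal combining visual perception and language reasoning.
    return alpha * r_perception + (1.0 - alpha) * r_answer
```

In this sketch the combined scalar would serve as the per-sample reward in the reinforcement-learning update; how the two terms are actually weighted and optimized is specified in the paper itself.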