4 months ago

Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze Li Pengyi Matvey Skripkin Andrey Galichin Anton Gusarov Konstantin Sobolev Andrey Kuznetsov Ivan Oseledets

Abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
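
The abstract does not give the exact reward formula, so the following is only a minimal sketch of how a listener confidence score could shape a GRPO-style reward. The additive combination, the `weight` parameter, and the function names `shaped_reward` and `group_relative_advantages` are assumptions made for illustration, not the authors' implementation. The second function shows the standard group-relative normalization that gives GRPO its name: each sampled response is scored against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def shaped_reward(is_correct: bool, listener_conf: float, weight: float = 0.5) -> float:
    """Combine a sparse correctness reward with a dense listener score.

    listener_conf is assumed to be the probability (in [0, 1]) that a frozen
    listener model assigns to the correct image after reading the reasoner's
    chain-of-thought. The additive form and `weight` are hypothetical.
    """
    if not 0.0 <= listener_conf <= 1.0:
        raise ValueError("listener_conf must lie in [0, 1]")
    return float(is_correct) + weight * listener_conf


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are ranked relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same image pair: (answer correct?, listener confidence).
group = [(True, 0.9), (True, 0.4), (False, 0.6), (False, 0.1)]
rewards = [shaped_reward(ok, conf) for ok, conf in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two correct answers no longer receive identical rewards: the one whose explanation convinces the listener more strongly gets the larger advantage, which is the behavior the abstract describes as rewarding persuasive explanations.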
