Abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but also to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
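
As an illustration of the listener-shaped reward described above, the following is a minimal sketch and not the implementation used in this work. It assumes that the shaped reward is a binary correctness term plus a weighted listener confidence, and that advantages are standardized within each group of sampled responses, as is usual for GRPO. The function names, the additive form, and the `weight` parameter are hypothetical.

```python
from statistics import mean, pstdev


def shaped_reward(is_correct: bool, listener_conf: float, weight: float = 0.5) -> float:
    """Combine a binary correctness reward with a listener confidence score.

    Assumption: listener_conf is a value in [0, 1] that the frozen listener
    produces after reading the reasoner's chain-of-thought. The additive
    combination and the default weight are illustrative choices only.
    """
    if not 0.0 <= listener_conf <= 1.0:
        raise ValueError("listener_conf must lie in [0, 1]")
    correctness = 1.0 if is_correct else 0.0
    return correctness + weight * listener_conf


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses: (answer correct?, listener confidence).
group = [(True, 0.9), (True, 0.2), (False, 0.6), (False, 0.1)]
rewards = [shaped_reward(ok, conf) for ok, conf in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two responses with the same answer receive different advantages when the listener finds one explanation more convincing than the other, which is the dense signal the listener is meant to supply.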
