Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Abstract
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods that use pointwise reward models (RM) to score generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open- and closed-source T2I models and validates the effectiveness of Pref-GRPO.
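To make the pairwise win-rate reward concrete, the sketch below shows one plausible way to turn group-wise pairwise comparisons into rewards and GRPO-style advantages. It is a minimal illustration, not the paper's implementation: the `preference_rm` callable (returning 1 if the first image is preferred, or a soft preference probability), the `images` list, and the group size are all hypothetical placeholders.

```python
import itertools
import numpy as np

def pref_grpo_rewards(images, prompt, preference_rm):
    """Compute win-rate rewards for a group of images generated from one prompt.

    `preference_rm(prompt, img_a, img_b)` is a hypothetical callable that
    returns 1 if img_a is preferred over img_b (or a preference probability).
    """
    n = len(images)
    wins = np.zeros(n)
    # Compare every ordered pair within the group and accumulate wins.
    for i, j in itertools.permutations(range(n), 2):
        wins[i] += preference_rm(prompt, images[i], images[j])
    # Each image's reward is its win rate against the other n - 1 images,
    # replacing the pointwise score used in standard GRPO for T2I.
    win_rate = wins / (n - 1)
    # GRPO-style group normalization turns win rates into advantages.
    advantages = (win_rate - win_rate.mean()) / (win_rate.std() + 1e-8)
    return win_rate, advantages
```

Because the reward is a relative win rate rather than an absolute score, tiny score gaps between near-identical images no longer get inflated into large normalized advantages, which is the failure mode the abstract attributes to pointwise RMs.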