DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Abstract
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET over 756 directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
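To make the abstract's mechanism concrete, below is a minimal, hypothetical sketch of the annotation-free reward idea: withhold an "unknown" part of the primal input, ask the model to solve the primal task, then ask the same model to reconstruct the withheld part from the primal output plus the known part, and score that reconstruction. The `llm_generate` stub, the prompt templates, and the string-similarity metric are illustrative assumptions, not the paper's implementation; the actual reconstruction metric would be task-specific.

```python
# Hypothetical sketch of DuPO-style self-supervised reward computation.
# Assumptions: `llm_generate` stands in for any instruction-following LLM call
# (a single model instantiates both the primal and the dual task); prompts and
# the similarity metric are placeholders, not the authors' exact design.

from difflib import SequenceMatcher


def llm_generate(prompt: str) -> str:
    """Placeholder for a model call; plug in your own LLM client here."""
    raise NotImplementedError


def dupo_reward(known: str, unknown: str) -> float:
    # Primal task: produce an output from the full input (known + unknown parts).
    primal_output = llm_generate(
        f"Solve the following problem.\nContext: {known}\nGiven value: {unknown}"
    )

    # Dual task: reconstruct the withheld (unknown) component from the primal
    # output and the known component only.
    reconstruction = llm_generate(
        f"Here is a solution:\n{primal_output}\n"
        f"And the known context:\n{known}\n"
        "Recover the value that was withheld from the original input."
    )

    # Reconstruction quality acts as an annotation-free reward for the primal
    # output; string similarity is a crude stand-in for a task-specific check.
    return SequenceMatcher(None, reconstruction.strip(), unknown.strip()).ratio()
```

In a preference-optimization setting, such rewards could be computed for several candidate primal outputs and used to rank them (or, at inference time, to rerank candidates), which mirrors the trade of computation for accuracy described above.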