DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Abstract
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET over 756 directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
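To make the abstract's mechanism concrete, below is a minimal, hypothetical sketch of the annotation-free reward idea: withhold an "unknown" part of the primal input, ask the model to solve the primal task, then ask the same model to reconstruct the withheld part from the primal output plus the known part, and score that reconstruction. The `llm_generate` stub, the prompt templates, and the string-similarity metric are illustrative assumptions, not the paper's implementation; the actual reconstruction metric would be task-specific.

```python
# Hypothetical sketch of DuPO-style self-supervised reward computation.
# Assumptions: `llm_generate` stands in for any instruction-following LLM call
# (a single model instantiates both the primal and the dual task); prompts and
# the similarity metric are placeholders, not the authors' exact design.

from difflib import SequenceMatcher


def llm_generate(prompt: str) -> str:
    """Placeholder for a model call; plug in your own LLM client here."""
    raise NotImplementedError


def dupo_reward(known: str, unknown: str) -> float:
    # Primal task: produce an output from the full input (known + unknown parts).
    primal_output = llm_generate(
        f"Solve the following problem.\nContext: {known}\nGiven value: {unknown}"
    )

    # Dual task: reconstruct the withheld (unknown) component from the primal
    # output and the known component only.
    reconstruction = llm_generate(
        f"Here is a solution:\n{primal_output}\n"
        f"And the known context:\n{known}\n"
        "Recover the value that was withheld from the original input."
    )

    # Reconstruction quality acts as an annotation-free reward for the primal
    # output; string similarity is a crude stand-in for a task-specific check.
    return SequenceMatcher(None, reconstruction.strip(), unknown.strip()).ratio()
```

In a preference-optimization setting, such rewards could be computed for several candidate primal outputs and used to rank them (or, at inference time, to rerank candidates), which mirrors the trade of computation for accuracy described above.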