2 months ago

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
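
The reconstruction-based reward described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`dual_reward`, `toy_dual_solver`, `rerank`), the exact-match 0/1 reward, and the arithmetic task are all assumptions chosen for illustration. In DuPO both the primal and dual tasks would be carried out by a single LLM; here plain Python functions stand in for the model.

```python
from typing import Callable, Sequence

# Hypothetical toy setup: the primal task is "compute known + unknown".
# The dual task tries to recover the hidden `unknown` component from
# the primal output and the known component.

def toy_dual_solver(known: int, primal_output: int) -> int:
    """Stand-in for the dual task: reconstruct the unknown input part."""
    return primal_output - known

def dual_reward(known: int, unknown: int, primal_output: int,
                dual_solver: Callable[[int, int], int]) -> float:
    """Assumed reward: 1.0 if the dual task recovers the unknown part, else 0.0."""
    reconstructed = dual_solver(known, primal_output)
    return 1.0 if reconstructed == unknown else 0.0

def rerank(known: int, unknown: int, candidates: Sequence[int],
           dual_solver: Callable[[int, int], int]) -> int:
    """Pick the candidate primal output with the highest reconstruction reward."""
    return max(candidates,
               key=lambda c: dual_reward(known, unknown, c, dual_solver))

if __name__ == "__main__":
    known, unknown = 3, 4
    candidates = [6, 7, 8]  # sampled primal outputs; only 7 is correct
    print([dual_reward(known, unknown, c, toy_dual_solver) for c in candidates])
    print(rerank(known, unknown, candidates, toy_dual_solver))
```

No reference label for the primal answer is needed: the reward depends only on whether the hidden part of the input can be recovered. The same score can serve either as a training signal or, as in `rerank`, to select among sampled candidates at inference time.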
