

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia


Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
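To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a step-wise DPO objective: the preferred and dispreferred items are individual reasoning steps, conditioned on the prompt and the preceding correct steps, rather than whole answers. The function name, tensor shapes, and the `beta` default are illustrative assumptions and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.5):
    """Step-wise DPO loss over a batch of step-level preference pairs.

    Each input is the summed log-probability (shape: [batch]) of a single
    reasoning step, given the prompt plus the preceding correct steps,
    under either the trained policy or the frozen reference model.
    `beta` is the usual DPO temperature; the default here is illustrative.
    """
    # Policy-vs-reference log-ratios for the preferred and dispreferred step.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO objective, but scoped to one reasoning step
    # instead of a complete answer.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up step log-probabilities.
loss = step_dpo_loss(
    torch.tensor([-5.0]), torch.tensor([-9.0]),
    torch.tensor([-6.0]), torch.tensor([-8.0]),
)
```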

Code Repositories

dvlab-research/step-dpo (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metric
arithmetic-reasoning-on-gsm8k | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Accuracy: 94.0
math-word-problem-solving-on-math | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | Accuracy: 70.8
