

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia


Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
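To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a step-wise DPO objective: the preferred and dispreferred items are individual reasoning steps, conditioned on the prompt and the preceding correct steps, rather than whole answers. The function name, tensor shapes, and the `beta` default are illustrative assumptions and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.5):
    """Step-wise DPO loss over a batch of step-level preference pairs.

    Each input is the summed log-probability (shape: [batch]) of a single
    reasoning step, given the prompt plus the preceding correct steps,
    under either the trained policy or the frozen reference model.
    `beta` is the usual DPO temperature; the default here is illustrative.
    """
    # Policy-vs-reference log-ratios for the preferred and dispreferred step.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO objective, but scoped to one reasoning step
    # instead of a complete answer.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up step log-probabilities.
loss = step_dpo_loss(
    torch.tensor([-5.0]), torch.tensor([-9.0]),
    torch.tensor([-6.0]), torch.tensor([-8.0]),
)
```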

Code Repositories

dvlab-research/step-dpo (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metric
arithmetic-reasoning-on-gsm8k | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | Accuracy: 94.0
math-word-problem-solving-on-math | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | Accuracy: 70.8
