
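Abstract
Solving mathematical problems requires advanced reasoning abilities and poses notable challenges for large language models. Previous work usually synthesizes data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals a severe bias towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial for learning complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples. Using DART, we create new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous comparable datasets. Notably, our synthesis process relies solely on a 7B-parameter open-weight model, without the commonly used proprietary GPT-4. We fine-tune various base models ranging from 7B to 70B parameters on our datasets, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH significantly outperforms vanilla rejection tuning and is superior or comparable to the previous state of the art, despite using much smaller datasets and no proprietary models. Furthermore, our results indicate that these synthetic datasets are among the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.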
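The core idea in the abstract, giving difficult queries more sampling trials during synthesis, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the `sample_fn` and `is_correct` callbacks, the `max_trials` cap, and the rule in `difficulty_proportional_targets` that scales the target by a per-query fail rate are all hypothetical.

```python
from typing import Callable, Dict, List


def rejection_sample(
    queries: List[str],
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    target_correct: Dict[str, int],
    max_trials: int = 1000,
) -> Dict[str, List[str]]:
    """Sample each query until it reaches its target number of correct
    responses or the trial budget runs out.

    Because sampling continues until the target is met, queries that the
    model rarely solves end up consuming more trials than easy ones.
    """
    kept: Dict[str, List[str]] = {q: [] for q in queries}
    for q in queries:
        trials = 0
        while len(kept[q]) < target_correct[q] and trials < max_trials:
            response = sample_fn(q)  # hypothetical call to the generator model
            trials += 1
            if is_correct(q, response):  # reject responses with wrong answers
                kept[q].append(response)
    return kept


def uniform_targets(queries: List[str], k: int) -> Dict[str, int]:
    """Assumed strategy: the same number of correct responses per query."""
    return {q: k for q in queries}


def difficulty_proportional_targets(
    fail_rate: Dict[str, float], k_max: int
) -> Dict[str, int]:
    """Assumed strategy: more correct responses for queries with a higher
    fail rate (0.0 = always solved, 1.0 = never solved), at least one each."""
    return {q: max(1, round(k_max * r)) for q, r in fail_rate.items()}
```

In this sketch, vanilla rejection sampling would correspond to a fixed number of trials per query, which leaves hard queries with few or no kept responses. Setting a target on correct responses instead shifts the trial budget towards the hard queries.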
Code repository
hkust-nlp/dart-math
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | Accuracy: 82.5; Parameters (Billion): 8 |
| arithmetic-reasoning-on-gsm8k | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 82.6; Parameters (Billion): 7 |
| arithmetic-reasoning-on-gsm8k | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | Accuracy: 90.4; Parameters (Billion): 70 |
| arithmetic-reasoning-on-gsm8k | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 88.2; Parameters (Billion): 7 |
| arithmetic-reasoning-on-gsm8k | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 81.1; Parameters (Billion): 7 |
| arithmetic-reasoning-on-gsm8k | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 81.1; Parameters (Billion): 8 |
| arithmetic-reasoning-on-gsm8k | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 89.6; Parameters (Billion): 70 |
| arithmetic-reasoning-on-gsm8k | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 86.8; Parameters (Billion): 7 |
| math-word-problem-solving-on-math | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 45.5; Parameters (Billions): 7 |
| math-word-problem-solving-on-math | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | Accuracy: 45.3; Parameters (Billions): 8 |
| math-word-problem-solving-on-math | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 43.5; Parameters (Billions): 7 |
| math-word-problem-solving-on-math | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | Accuracy: 54.9; Parameters (Billions): 70 |
| math-word-problem-solving-on-math | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 56.1; Parameters (Billions): 70 |
| math-word-problem-solving-on-math | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 53.6; Parameters (Billions): 7 |
| math-word-problem-solving-on-math | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 46.6; Parameters (Billions): 8 |
| math-word-problem-solving-on-math | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 52.9; Parameters (Billions): 7 |
| natural-questions-on-theoremqa | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | Accuracy: 15.4 |
| natural-questions-on-theoremqa | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | Accuracy: 27.4 |
| natural-questions-on-theoremqa | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 16.4 |
| natural-questions-on-theoremqa | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 28.2 |
| natural-questions-on-theoremqa | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 32.2 |
| natural-questions-on-theoremqa | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 19.4 |
| natural-questions-on-theoremqa | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | Accuracy: 17.0 |
| natural-questions-on-theoremqa | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | Accuracy: 32.5 |