4 months ago

DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Abstract

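Solving mathematical problems requires advanced reasoning abilities and poses notable challenges for large language models. Previous work typically synthesizes data from proprietary models to augment existing datasets, then applies instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals a severe bias towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial for learning complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples. Using DART, we created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Notably, our synthesis process relies solely on a 7B-parameter open-weight model, without the commonly used proprietary GPT-4. We fine-tune various base models ranging from 7B to 70B parameters on our datasets, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH significantly outperforms vanilla rejection tuning and is superior or comparable to the previous state of the art, despite using much smaller datasets and no proprietary models. Furthermore, our results indicate that these synthetic datasets are among the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.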
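The core idea in the abstract, giving difficult queries more sampling trials so that they still end up with correct responses in the training set, can be illustrated with a small sketch. This is not the authors' implementation: the function name, the `sample_response` and `is_correct` callables, the `target_correct` and `max_trials` parameters, and the stopping rule (sample until a fixed number of correct responses is collected or a budget runs out) are all assumptions made for illustration. The toy sampler stands in for a real model and simply returns a correct answer on every n-th attempt.

```python
from typing import Callable, Dict, List


def difficulty_aware_rejection_sampling(
    queries: List[str],
    sample_response: Callable[[str], str],   # hypothetical: draws one candidate solution
    is_correct: Callable[[str, str], bool],  # hypothetical: checks a candidate against the reference
    target_correct: int = 4,                 # assumed: correct responses wanted per query
    max_trials: int = 64,                    # assumed: sampling budget per query
) -> Dict[str, dict]:
    """Keep sampling each query until it has enough correct responses.

    Easy queries stop after a few trials; hard queries automatically
    receive more trials because most of their samples are rejected.
    """
    results = {}
    for query in queries:
        kept: List[str] = []
        trials = 0
        while len(kept) < target_correct and trials < max_trials:
            response = sample_response(query)
            trials += 1
            if is_correct(query, response):  # rejection step: discard wrong answers
                kept.append(response)
        fail_rate = 1.0 - len(kept) / trials if trials else 0.0
        results[query] = {"responses": kept, "trials": trials, "fail_rate": fail_rate}
    return results


def make_toy_sampler(period_by_query: Dict[str, int]) -> Callable[[str], str]:
    """Toy stand-in for a model: every `period`-th attempt on a query is correct."""
    counters = {q: 0 for q in period_by_query}

    def sample(query: str) -> str:
        counters[query] += 1
        return "right" if counters[query] % period_by_query[query] == 0 else "wrong"

    return sample


# "easy" is solved on every attempt, "hard" only on every fifth attempt.
toy_sampler = make_toy_sampler({"easy": 1, "hard": 5})
toy_results = difficulty_aware_rejection_sampling(
    ["easy", "hard"],
    sample_response=toy_sampler,
    is_correct=lambda query, response: response == "right",
)
for name, info in toy_results.items():
    print(name, "trials:", info["trials"], "kept:", len(info["responses"]))
```

In this toy run both queries end up with the same number of correct responses, but the hard query consumes five times as many trials. A fixed per-query budget, as in vanilla rejection sampling, would instead leave the hard query with fewer or no correct responses.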

Code Repositories

hkust-nlp/dart-math
Official
pytorch
Mentioned in GitHub

Benchmarks

arithmetic-reasoning-on-gsm8k

| Method | Accuracy | Parameters (Billion) |
| --- | --- | --- |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 82.5 | 8 |
| DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 82.6 | 7 |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | 70 |
| DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 88.2 | 7 |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | 7 |
| DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | 8 |
| DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 89.6 | 70 |
| DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 86.8 | 7 |

math-word-problem-solving-on-math

| Method | Accuracy | Parameters (Billion) |
| --- | --- | --- |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 45.5 | 7 |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 45.3 | 8 |
| DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 43.5 | 7 |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 54.9 | 70 |
| DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 56.1 | 70 |
| DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 53.6 | 7 |
| DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 46.6 | 8 |
| DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 52.9 | 7 |

natural-questions-on-theoremqa

| Method | Accuracy |
| --- | --- |
| DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 15.4 |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 27.4 |
| DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 16.4 |
| DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 28.2 |
| DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 32.2 |
| DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 19.4 |
| DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 17.0 |
| DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 32.5 |
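Because the results listed above interleave the Uniform and Prop2Diff variants, a short script can make the comparison easier to read. The sketch below copies the accuracy values from the listing and computes, for each benchmark, the per-model and mean difference between the two variants. The short benchmark labels and the dictionary layout are illustrative choices, not part of the original page.

```python
# Accuracy (%) copied from the listing above, as (Uniform, Prop2Diff) per base model.
scores = {
    "GSM8K": {
        "Llama3-8B": (82.5, 81.1),
        "Mistral-7B": (82.6, 81.1),
        "Llama3-70B": (90.4, 89.6),
        "DSMath-7B": (88.2, 86.8),
    },
    "MATH": {
        "Llama3-8B": (45.3, 46.6),
        "Mistral-7B": (43.5, 45.5),
        "Llama3-70B": (54.9, 56.1),
        "DSMath-7B": (52.9, 53.6),
    },
    "TheoremQA": {
        "Llama3-8B": (15.4, 19.4),
        "Mistral-7B": (16.4, 17.0),
        "Llama3-70B": (27.4, 28.2),
        "DSMath-7B": (32.5, 32.2),
    },
}

# Mean of (Prop2Diff - Uniform) over the four base models, per benchmark.
mean_delta = {}
for benchmark, by_model in scores.items():
    deltas = {model: p2d - uni for model, (uni, p2d) in by_model.items()}
    mean_delta[benchmark] = sum(deltas.values()) / len(deltas)
    print(benchmark)
    for model, delta in deltas.items():
        print(f"  {model}: {delta:+.1f}")
    print(f"  mean: {mean_delta[benchmark]:+.1f}")
```

On these numbers, Prop2Diff trails Uniform by roughly 1.3 points on average on GSM8K, and leads by roughly 1.3 points on average on both MATH and TheoremQA, with DSMath-7B on TheoremQA as the one exception (32.2 versus 32.5).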
