
Abstract
Language models (LMs) with fewer than 100 billion parameters are known to perform worse than large language models at chain-of-thought (CoT) reasoning when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning ability through instruction tuning with CoT rationales. To this end, we first construct a new instruction-tuning dataset, the CoT Collection, which augments the existing Flan Collection (containing only 9 CoT tasks) with 1.84 million rationales spanning 1,060 tasks, substantially broadening the coverage of CoT data. Experiments show that fine-tuning Flan-T5 (3B and 11B) on the CoT Collection markedly improves the CoT reasoning ability of smaller LMs on unseen tasks: on the BIG-Bench-Hard (BBH) benchmark, average zero-shot task accuracy rises by +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B). We further find that instruction tuning with the CoT Collection yields stronger few-shot learning on 4 domain-specific tasks, improving accuracy by +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), and even outperforming ChatGPT by up to +13.98% when demonstrations are used up to the maximum input length. Our code, the CoT Collection dataset, and model checkpoints are publicly released for both academic and industrial use.
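The abstract describes training the model to produce a rationale before its answer. The sketch below illustrates how such a CoT instruction-tuning example might be formatted into a seq2seq (input, target) pair. The field names and the `[ANSWER]` delimiter are illustrative assumptions, not the CoT Collection's actual schema.

```python
# Minimal sketch of formatting a CoT instruction-tuning example.
# Field names and the "[ANSWER]" delimiter are illustrative assumptions,
# not the CoT Collection's actual schema.

def format_cot_example(instruction, source, rationale, answer,
                       delimiter="[ANSWER]"):
    """Build an (input, target) pair for seq2seq CoT fine-tuning:
    the model is trained to emit the rationale first, then the answer."""
    model_input = f"{instruction}\n\n{source}"
    target = f"{rationale} {delimiter} {answer}"
    return model_input, target

example = {
    "instruction": "Answer the question, reasoning step by step.",
    "source": "If a train travels 60 km in 30 minutes, what is its speed in km/h?",
    "rationale": "30 minutes is half an hour, so the train covers 60 km "
                 "in 0.5 h, i.e. 120 km per hour.",
    "answer": "120 km/h",
}

model_input, target = format_cot_example(**example)
```

At inference time, everything after the delimiter can be parsed out as the final answer, while the text before it serves as the model's reasoning trace.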
Code Repositories
kaist-lklab/cot-collection
Official
pytorch
Mentioned in GitHub
kaistai/cot-collection
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| common-sense-reasoning-on-winogrande | T0-3B (CoT fine-tuned) | Accuracy: 57.5 |
| coreference-resolution-on-winograd-schema | T0-3B (CoT fine-tuned) | Accuracy: 66 |
| few-shot-learning-on-casehold | CoT-T5-11B (1024 Shot) | Accuracy: 68.3 |
| few-shot-learning-on-mednli | CoT-T5-11B (1024 Shot) | Accuracy: 78.02 |
| few-shot-learning-on-pubmedqa | CoT-T5-11B (1024 Shot) | Accuracy: 73.42 |
| natural-language-inference-on-anli-test | T0-3B (CoT fine-tuned) | A1: 41.7 A2: 37.2 A3: 41.9 |
| natural-language-inference-on-rte | T0-3B (CoT fine-tuned) | Accuracy: 80.8 |
| question-answering-on-copa | T0-3B (CoT fine-tuned) | Accuracy: 90.9 |
| question-answering-on-pubmedqa | CoT-T5-11B (1024 Shot) | Accuracy: 73.42 |
| question-answering-on-storycloze | T0-3B (CoT fine-tuned) | Accuracy: 94.5 |
| word-sense-disambiguation-on-words-in-context | T0-3B (CoT fine-tuned) | Accuracy: 56.7 |