The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo

Abstract

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability through instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (which includes only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to perform better CoT reasoning on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT (which uses demonstrations up to the maximum input length) by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.
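The recipe described in the abstract is standard sequence-to-sequence instruction tuning, except that each training target contains a rationale before the final answer, so the fine-tuned model learns to produce step-by-step reasoning at inference time. The sketch below illustrates this with Hugging Face Transformers; note that the dataset identifier ("kaist-ai/CoT-Collection"), its field names ("source", "rationale", "target"), and the rationale/answer separator are assumptions made for illustration and may differ from the official release linked in the repositories below.

# Minimal sketch of CoT fine-tuning Flan-T5 on the CoT Collection.
# Assumptions (not confirmed by this page): the dataset is hosted on the
# Hugging Face Hub as "kaist-ai/CoT-Collection" with "source", "rationale",
# and "target" fields; the paper's actual preprocessing may differ.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-xl"  # 3B variant; use flan-t5-xxl for the 11B setting
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical dataset ID and field names; check the official repo for the real ones.
dataset = load_dataset("kaist-ai/CoT-Collection", split="train")

def preprocess(example):
    # Train the model to emit the rationale followed by the answer,
    # so step-by-step reasoning is generated before the prediction.
    inputs = tokenizer(example["source"], truncation=True, max_length=1024)
    target_text = example["rationale"] + " [ANSWER] " + example["target"]
    labels = tokenizer(text_target=target_text, truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="cot-t5",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()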

Code Repositories

kaist-lklab/cot-collection (Official, PyTorch, mentioned in GitHub)
kaistai/cot-collection (Official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
common-sense-reasoning-on-winogrande | T0-3B (CoT fine-tuned) | Accuracy: 57.5
coreference-resolution-on-winograd-schema | T0-3B (CoT fine-tuned) | Accuracy: 66
few-shot-learning-on-casehold | CoT-T5-11B (1024 Shot) | Accuracy: 68.3
few-shot-learning-on-mednli | CoT-T5-11B (1024 Shot) | Accuracy: 78.02
few-shot-learning-on-pubmedqa | CoT-T5-11B (1024 Shot) | Accuracy: 73.42
natural-language-inference-on-anli-test | T0-3B (CoT fine-tuned) | A1: 41.7, A2: 37.2, A3: 41.9
natural-language-inference-on-rte | T0-3B (CoT fine-tuned) | Accuracy: 80.8%
question-answering-on-copa | T0-3B (CoT fine-tuned) | Accuracy: 90.9
question-answering-on-pubmedqa | CoT-T5-11B (1024 Shot) | Accuracy: 73.42
question-answering-on-storycloze | T0-3B (CoT fine-tuned) | Accuracy: 94.5
word-sense-disambiguation-on-words-in-context | T0-3B (CoT fine-tuned) | Accuracy: 56.7
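The zero-shot results above come from prompting the CoT fine-tuned checkpoints and letting them generate a rationale before the answer. A minimal inference sketch is given below; the checkpoint ID ("kaist-ai/CoT-T5-3B") and the prompt format are assumptions, so consult the official repositories for the released model names and exact evaluation setup.

# Minimal zero-shot inference sketch for a CoT fine-tuned checkpoint.
# The Hub checkpoint ID below is an assumption, not confirmed by this page.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "kaist-ai/CoT-T5-3B"  # hypothetical Hub ID for the 3B checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Example WinoGrande-style query; the trailing cue asks for step-by-step reasoning.
prompt = (
    "Question: The trophy doesn't fit into the brown suitcase because it is too large. "
    "What is too large? Options: (A) the trophy (B) the suitcase\n"
    "Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))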
