
Abstract
We investigate the optimal model size and number of training tokens for training a Transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling up model size while keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 billion to 500 billion tokens, we find that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: every doubling of model size should be matched by a doubling of the number of training tokens. To test this hypothesis, we trained Chinchilla, a predicted compute-optimal model with 70B parameters and 4× as much data, using the same compute budget as Gopher. Chinchilla significantly and consistently outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a broad range of downstream evaluation tasks. This also means Chinchilla requires substantially less compute for fine-tuning and inference, greatly facilitating downstream use. As a highlight, Chinchilla reaches an average accuracy of 67.5% on the MMLU benchmark, more than a 7 percentage point improvement over Gopher.
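The abstract's scaling rule (double the training tokens whenever the model size doubles) can be turned into a back-of-the-envelope allocator. Below is a minimal sketch, not taken from the paper's code, that assumes the common C ≈ 6·N·D FLOPs approximation and a fixed tokens-per-parameter ratio of about 20, the ratio implied by Chinchilla's 70B parameters and roughly 1.4T training tokens; the function name and the example budget are illustrative assumptions.

```python
# Minimal sketch: split a FLOP budget into a parameter count and a token count,
# assuming C ≈ 6 * N * D (N = parameters, D = training tokens) and a fixed
# tokens-per-parameter ratio at the compute optimum, so N and D both grow
# as sqrt(C) -- i.e. they are scaled in equal proportion.
import math

def compute_optimal_allocation(flop_budget: float, tokens_per_param: float = 20.0):
    """Return (n_params, n_tokens) for a given FLOP budget.

    tokens_per_param is the assumed D/N ratio at the optimum; ~20 is the
    ratio implied by Chinchilla (70B parameters, roughly 1.4T tokens).
    """
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly a Gopher/Chinchilla-scale budget (~5.9e23 FLOPs, for illustration).
    n, d = compute_optimal_allocation(5.88e23)
    print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e9:.0f}B")  # ≈ 70B params, ≈ 1400B tokens
```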
Code Repositories
karpathy/llama2.c
pytorch
Mentioned in GitHub
nkluge-correa/teenytinyllama
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| analogical-similarity-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 38.1 |
| analytic-entailment-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 67.1 |
| common-sense-reasoning-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 54.7 |
| common-sense-reasoning-on-big-bench-causal | Chinchilla-70B (few-shot, k=5) | Accuracy: 57.4 |
| common-sense-reasoning-on-big-bench-date | Chinchilla-70B (few-shot, k=5) | Accuracy: 52.3 |
| common-sense-reasoning-on-big-bench-known | Chinchilla-70B (few-shot, k=5) | Accuracy: 65.2 |
| common-sense-reasoning-on-big-bench-logical | Chinchilla-70B (few-shot, k=5) | Accuracy: 64.1 |
| common-sense-reasoning-on-big-bench-sports | Chinchilla-70B (few-shot, k=5) | Accuracy: 71 |
| common-sense-reasoning-on-big-bench-winowhy | Chinchilla-70B (few-shot, k=5) | Accuracy: 62.5 |
| common-sense-reasoning-on-winogrande | Chinchilla 70B (0-shot) | Accuracy: 74.9 |
| crash-blossom-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 47.6 |
| crass-ai-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 75.0 |
| dark-humor-detection-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 66.2 |
| discourse-marker-prediction-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 13.1 |
| empirical-judgments-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 67.7 |
| english-proverbs-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 82.4 |
| entailed-polarity-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 94 |
| epistemic-reasoning-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 60.6 |
| evaluating-information-essentiality-on-big | Chinchilla-70B (few-shot, k=5) | Accuracy: 17.6 |
| fantasy-reasoning-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 69 |
| figure-of-speech-detection-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 63.3 |
| general-knowledge-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 94.3 |
| gre-reading-comprehension-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 53.1 |
| human-organs-senses-multiple-choice-on-big | Chinchilla-70B (few-shot, k=5) | Accuracy: 85.7 |
| identify-odd-metapor-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 68.8 |
| implicatures-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 75 |
| implicit-relations-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 49.4 |
| intent-recognition-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 92.8 |
| irony-identification-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 73.0 |
| lambada-on-big-bench | Chinchilla-70B (zero-shot) | Accuracy: 77.4 |
| language-modelling-on-lambada | Chinchilla (Zero-Shot) | Accuracy: 77.7 |
| logical-args-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 56.2 |
| logical-reasoning-on-big-bench-formal | Chinchilla-70B (few-shot, k=5) | Accuracy: 52.1 |
| logical-reasoning-on-big-bench-logic-grid | Chinchilla-70B (few-shot, k=5) | Accuracy: 44 |
| logical-reasoning-on-big-bench-logical | Chinchilla-70B (few-shot, k=5) | Accuracy: 72.1 |
| logical-reasoning-on-big-bench-penguins-in-a | Chinchilla-70B (few-shot, k=5) | Accuracy: 48.7 |
| logical-reasoning-on-big-bench-reasoning | Chinchilla-70B (few-shot, k=5) | Accuracy: 59.7 |
| logical-reasoning-on-big-bench-strategyqa | Chinchilla-70B (few-shot, k=5) | Accuracy: 68.3 |
| logical-reasoning-on-big-bench-temporal | Chinchilla-70B (few-shot, k=5) | Accuracy: 32.0 |
| mathematical-induction-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 47.3 |
| metaphor-boolean-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 93.1 |
| misconceptions-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 65.3 |
| moral-permissibility-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 57.3 |
| movie-dialog-same-or-different-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 54.5 |
| multi-task-language-understanding-on-mmlu | Chinchilla-70B (5-shot) | Average (%): 67.5 |
| multiple-choice-question-answering-mcqa-on-27 | Chinchilla-70B (few-shot, k=5) | Accuracy: 54.2 |
| multiple-choice-question-answering-mcqa-on-28 | Chinchilla-70B (few-shot, k=5) | Accuracy: 75.6 |
| multiple-choice-question-answering-mcqa-on-29 | Chinchilla-70B (few-shot, k=5) | Accuracy: 52.6 |
| multiple-choice-question-answering-mcqa-on-30 | Chinchilla-70B (few-shot, k=5) | Accuracy: 47.1 |
| multiple-choice-question-answering-mcqa-on-31 | Chinchilla-70B (few-shot, k=5) | Accuracy: 65.6 |
| nonsense-words-grammar-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 78 |
| odd-one-out-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 70.9 |
| phrase-relatedness-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 94 |
| physical-intuition-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 79 |
| physics-mc-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 65.5 |
| presuppositions-as-nli-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 49.9 |
| question-answering-on-boolq | Chinchilla 70B (0-shot) | Accuracy: 83.7 |
| question-answering-on-natural-questions | Chinchilla (few-shot, k=64) | EM: 35.5 |
| question-answering-on-piqa | Chinchilla 70B (0-shot) | Accuracy: 81.8 |
| question-answering-on-social-iqa | Chinchilla (zero-shot) | Accuracy: 51.3 |
| question-selection-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 52.6 |
| riddle-sense-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 85.7 |
| sarcasm-detection-on-big-bench-snarks | Chinchilla-70B (few-shot, k=5) | Accuracy: 58.6 |
| sentence-ambiguity-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 71.7 |
| similarities-abstraction-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 87 |
| timedial-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 68.8 |
| understanding-fables-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 60.3 |
| word-sense-disambiguation-on-big-bench | Chinchilla-70B (few-shot, k=5) | Accuracy: 69.1 |
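Most rows above report "few-shot, k=5" accuracy, i.e. five solved demonstrations are prepended to each test question. The sketch below, which is not taken from the paper, illustrates what that protocol typically looks like for a multiple-choice task; `model_score` is a hypothetical stand-in for whatever log-likelihood API the evaluated model exposes.

```python
# Minimal sketch of few-shot (k=5) multiple-choice evaluation: build a prompt
# from k demonstrations plus the test question, then pick the candidate answer
# the model scores highest. model_score is an assumed interface, not a real API.
from typing import Callable, Sequence

def build_few_shot_prompt(examples: Sequence[tuple[str, str]], question: str) -> str:
    """Concatenate k (question, answer) demonstrations followed by the test question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def pick_answer(
    model_score: Callable[[str, str], float],  # assumed: log p(continuation | prompt)
    examples: Sequence[tuple[str, str]],
    question: str,
    choices: Sequence[str],
) -> str:
    """Return the candidate answer with the highest model score given the prompt."""
    prompt = build_few_shot_prompt(examples, question)
    return max(choices, key=lambda c: model_score(prompt, " " + c))
```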