Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron; Louis Martin; Kevin Stone; Peter Albert; Amjad Almahairi; Yasmine Babaei; Nikolay Bashlykov; Soumya Batra; Prajjwal Bhargava; Shruti Bhosale; Dan Bikel; Lukas Blecher; Cristian Canton Ferrer; Moya Chen; Guillem Cucurull; David Esiobu; Jude Fernandes; Jeremy Fu; Wenyin Fu; Brian Fuller; Cynthia Gao; Vedanuj Goswami; Naman Goyal; Anthony Hartshorn; Saghar Hosseini; Rui Hou; Hakan Inan; Marcin Kardas; Viktor Kerkez; Madian Khabsa; Isabel Kloumann; Artem Korenev; Punit Singh Koura; Marie-Anne Lachaux; Thibaut Lavril; Jenya Lee; Diana Liskovich; Yinghai Lu; Yuning Mao; Xavier Martinet; Todor Mihaylov; Pushkar Mishra; Igor Molybog; Yixin Nie; Andrew Poulton; Jeremy Reizenstein; Rashi Rungta; Kalyan Saladi; Alan Schelten; Ruan Silva; Eric Michael Smith; Ranjan Subramanian; Xiaoqing Ellen Tan; Binh Tang; Ross Taylor; Adina Williams; Jian Xiang Kuan; Puxin Xu; Zheng Yan; Iliyan Zarov; Yuchen Zhang; Angela Fan; Melanie Kambadur; Sharan Narang; Aurelien Rodriguez; Robert Stojnic; Sergey Edunov; Thomas Scialom

Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
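The fine-tuned chat models described in the abstract are distributed as checkpoints that can be queried directly. As a minimal sketch of how one might prompt a Llama 2-Chat model, assuming the Hugging Face transformers library (with accelerate installed for device placement) and access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint; the repo id and generation settings here are illustrative, not prescribed by the paper:

```python
# Minimal sketch: querying a Llama 2-Chat checkpoint for dialogue.
# Assumes `transformers` + `accelerate` are installed and that access to the
# gated "meta-llama/Llama-2-7b-chat-hf" repo has been granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Llama 2-Chat was fine-tuned on dialogue wrapped in [INST] ... [/INST] tags,
# so single-turn prompts follow the same template.
prompt = "[INST] Explain what a large language model is in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```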
Benchmarks
| Benchmark | Model (evaluation setting) | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | LLaMA 2 70B (one-shot) | Accuracy: 56.8, Parameters (Billion): 70 |
| code-generation-on-mbpp | Llama 2 34B (0-shot) | Accuracy: 33 |
| code-generation-on-mbpp | Llama 2 7B (0-shot) | Accuracy: 20.8 |
| code-generation-on-mbpp | Llama 2 70B (0-shot) | Accuracy: 45 |
| code-generation-on-mbpp | Llama 2 13B (0-shot) | Accuracy: 30.6 |
| math-word-problem-solving-on-mawps | LLaMA 2-Chat | Accuracy (%): 82.4 |
| math-word-problem-solving-on-svamp | LLaMA 2-Chat | Execution Accuracy: 69.2 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 13B (5-shot) | Average (%): 54.8 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 34B (5-shot) | Average (%): 62.6 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 7B (5-shot) | Average (%): 45.3 |
| multiple-choice-question-answering-mcqa-on-25 | Llama2-7B | Accuracy: 43.38 |
| multiple-choice-question-answering-mcqa-on-25 | Llama2-7B-chat | Accuracy: 40.07 |
| question-answering-on-boolq | LLaMA 2 13B (0-shot) | Accuracy: 81.7 |
| question-answering-on-boolq | LLaMA 2 34B (0-shot) | Accuracy: 83.7 |
| question-answering-on-boolq | LLaMA 2 7B (0-shot) | Accuracy: 77.4 |
| question-answering-on-boolq | LLaMA 2 70B (0-shot) | Accuracy: 85 |
| question-answering-on-multitq | LLaMA2 | Hits@1: 18.5 |
| question-answering-on-natural-questions | LLaMA 2 70B (one-shot) | EM: 33.0 |
| question-answering-on-piqa | LLaMA 2 13B (0-shot) | Accuracy: 80.5 |
| question-answering-on-piqa | LLaMA 2 34B (0-shot) | Accuracy: 81.9 |
| question-answering-on-piqa | LLaMA 2 7B (0-shot) | Accuracy: 78.8 |
| question-answering-on-piqa | LLaMA 2 70B (0-shot) | Accuracy: 82.8 |
| question-answering-on-pubchemqa | Llama2-7B-chat | BLEU-2: 0.075, BLEU-4: 0.009, METEOR: 0.149, ROUGE-1: 0.184, ROUGE-2: 0.043, ROUGE-L: 0.142 |
| question-answering-on-triviaqa | LLaMA 2 70B (one-shot) | EM: 85 |
| question-answering-on-uniprotqa | Llama2-7B-chat | BLEU-2: 0.019, BLEU-4: 0.002, METEOR: 0.052, ROUGE-1: 0.103, ROUGE-2: 0.060, ROUGE-L: 0.009 |
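Several rows above report EM (exact match) on open-domain QA benchmarks such as Natural Questions and TriviaQA. As a minimal sketch of how such a score is commonly computed, here is the SQuAD-style normalization and matching rule; the exact normalization used for the leaderboard numbers may differ, so treat this as illustrative:

```python
# Minimal sketch of SQuAD-style exact-match (EM) scoring, the kind of metric
# reported above for Natural Questions and TriviaQA. Illustrative only; the
# leaderboard's own normalization may differ.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> bool:
    """A prediction scores 1 if it matches any reference after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)

# Dataset-level EM is the percentage of questions answered exactly.
preds = [("Paris", ["Paris", "paris, france"]), ("42", ["forty-two"])]
em = 100 * sum(exact_match(p, refs) for p, refs in preds) / len(preds)
print(f"EM: {em:.1f}")  # -> EM: 50.0
```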