
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned models, called Llama 2-Chat, are optimized for dialogue use cases. These models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, they may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat, in order to enable the community to build on our work and contribute to the responsible development of LLMs.
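As a minimal sketch of how one might query the released Llama 2-Chat weights, assuming the Hugging Face `transformers` library (with `accelerate` installed) and access to the gated `meta-llama/Llama-2-7b-chat-hf` checkpoint; this is an illustrative usage example, not the authors' pipeline:

```python
# A minimal sketch, not the authors' pipeline: querying Llama 2-Chat via
# Hugging Face transformers. Assumes access to the gated
# meta-llama/Llama-2-7b-chat-hf checkpoint has been granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # requires the accelerate package
)

# Llama 2-Chat was fine-tuned with an [INST] ... [/INST] dialogue template.
prompt = "[INST] Explain what a language model is in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```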
Code Repositories

| Repository | Framework | Notes |
|---|---|---|
| xverse-ai/xverse-13b | pytorch | Mentioned in GitHub |
| coastalcph/eu-politics-llms | pytorch | Mentioned in GitHub |
| facebookresearch/llama | pytorch | Official |
| IBM/Dromedary | pytorch | Mentioned in GitHub |
| squeezeailab/squeezellm | pytorch | Mentioned in GitHub |
| zurichnlp/contradecode | pytorch | Mentioned in GitHub |
| eternityyw/tram-benchmark | | Mentioned in GitHub |
| xuetianci/pacit | pytorch | Mentioned in GitHub |
| young-geng/easylm | jax | Mentioned in GitHub |
| meetyou-ai-lab/can-mc-evaluate-llms | pytorch | Mentioned in GitHub |
| llamafamily/llama-chinese | pytorch | Mentioned in GitHub |
| glb400/Toy-RecLM | pytorch | Mentioned in GitHub |
| rijgersberg/geitje | pytorch | Mentioned in GitHub |
| flagalpha/llama2-chinese | pytorch | Mentioned in GitHub |
| usyd-fsalab/fp6_llm | pytorch | Mentioned in GitHub |
| idiap/abroad-re | pytorch | Mentioned in GitHub |
| ninglab/ecellm | pytorch | Mentioned in GitHub |
| Lightning-AI/lit-gpt | pytorch | Mentioned in GitHub |
| xzhang97666/alpacare | | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | LLaMA 2 70B (one-shot) | Accuracy: 56.8; Parameters (Billion): 70 |
| code-generation-on-mbpp | Llama 2 34B (0-shot) | Accuracy: 33 |
| code-generation-on-mbpp | Llama 2 7B (0-shot) | Accuracy: 20.8 |
| code-generation-on-mbpp | Llama 2 70B (0-shot) | Accuracy: 45 |
| code-generation-on-mbpp | Llama 2 13B (0-shot) | Accuracy: 30.6 |
| math-word-problem-solving-on-mawps | LLaMA 2-Chat | Accuracy (%): 82.4 |
| math-word-problem-solving-on-svamp | LLaMA 2-Chat | Execution Accuracy: 69.2 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 13B (5-shot) | Average (%): 54.8 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 34B (5-shot) | Average (%): 62.6 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 7B (5-shot) | Average (%): 45.3 |
| multiple-choice-question-answering-mcqa-on-25 | Llama2-7B | Accuracy: 43.38 |
| multiple-choice-question-answering-mcqa-on-25 | Llama2-7B-chat | Accuracy: 40.07 |
| question-answering-on-boolq | LLaMA 2 13B (0-shot) | Accuracy: 81.7 |
| question-answering-on-boolq | LLaMA 2 34B (0-shot) | Accuracy: 83.7 |
| question-answering-on-boolq | LLaMA 2 7B (0-shot) | Accuracy: 77.4 |
| question-answering-on-boolq | LLaMA 2 70B (0-shot) | Accuracy: 85 |
| question-answering-on-multitq | LLaMA2 | Hits@1: 18.5 |
| question-answering-on-natural-questions | LLaMA 2 70B (one-shot) | EM: 33.0 |
| question-answering-on-piqa | LLaMA 2 13B (0-shot) | Accuracy: 80.5 |
| question-answering-on-piqa | LLaMA 2 34B (0-shot) | Accuracy: 81.9 |
| question-answering-on-piqa | LLaMA 2 7B (0-shot) | Accuracy: 78.8 |
| question-answering-on-piqa | LLaMA 2 70B (0-shot) | Accuracy: 82.8 |
| question-answering-on-pubchemqa | Llama2-7B-chat | BLEU-2: 0.075; BLEU-4: 0.009; METEOR: 0.149; ROUGE-1: 0.184; ROUGE-2: 0.043; ROUGE-L: 0.142 |
| question-answering-on-triviaqa | LLaMA 2 70B (one-shot) | EM: 85 |
| question-answering-on-uniprotqa | Llama2-7B-chat | BLEU-2: 0.019; BLEU-4: 0.002; METEOR: 0.052; ROUGE-1: 0.103; ROUGE-2: 0.060; ROUGE-L: 0.009 |
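
The EM (exact match) numbers above, as used for TriviaQA and Natural Questions, score a prediction as correct if it equals any gold answer after normalization. As a hedged illustration only (normalization rules vary per benchmark, and this is not the paper's evaluation harness), a minimal EM scorer might look like:

```python
# A minimal sketch of exact-match (EM) scoring for QA benchmarks such as
# TriviaQA and Natural Questions; not the evaluation harness used in the
# paper. Normalization conventions differ between benchmarks.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip articles, punctuation, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction counts if it matches any gold answer after normalization."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}

# Toy example: two questions, one answered correctly.
predictions = [("Paris", ["Paris", "paris, france"]), ("Lyon", ["Paris"])]
em = sum(exact_match(p, golds) for p, golds in predictions) / len(predictions)
print(f"EM: {em * 100:.1f}")  # -> EM: 50.0
```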