
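Abstract
We introduce Claude 3, a new family of large multimodal models: Claude 3 Opus, our most capable model; Claude 3 Sonnet, which balances capability and speed; and Claude 3 Haiku, our fastest and least expensive model. All of the new models have vision capabilities that let them process and analyze image data. The Claude 3 family performs strongly across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves state-of-the-art results on evaluations including GPQA [1], MMLU [2], and MMMU [3], among others. Claude 3 Haiku performs as well as or better than Claude 2 [4] on most pure-text tasks, while Sonnet and Opus significantly outperform it. These models also show markedly improved fluency in non-English languages, which makes them more useful to a global audience. In this report, we provide an in-depth analysis of our evaluation results, focusing on core capabilities, safety, societal impacts, and the catastrophic risk assessments we committed to in our Responsible Scaling Policy.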
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | Claude 3 Sonnet (0-shot chain-of-thought) | Accuracy: 92.3 |
| arithmetic-reasoning-on-gsm8k | Claude 3 Haiku (0-shot chain-of-thought) | Accuracy: 88.9 |
| arithmetic-reasoning-on-gsm8k | Claude 3 Opus (0-shot chain-of-thought) | Accuracy: 95 |
| code-generation-on-mbpp | Claude 3 Haiku | Accuracy: 80.4 |
| code-generation-on-mbpp | Claude 3 Sonnet | Accuracy: 79.4 |
| code-generation-on-mbpp | Claude 3 Opus | Accuracy: 86.4 |
| common-sense-reasoning-on-winogrande | Claude 3 Opus (5-shot) | Accuracy: 88.5 |
| common-sense-reasoning-on-winogrande | Claude 3 Sonnet (5-shot) | Accuracy: 75.1 |
| common-sense-reasoning-on-winogrande | Claude 3 Haiku (5-shot) | Accuracy: 74.2 |
| long-context-understanding-on-mmneedle | Claude 3 Opus | 1 Image, 2*2 Stitching, Exact Accuracy: 52.25; 1 Image, 4*4 Stitching, Exact Accuracy: 12.3; 1 Image, 8*8 Stitching, Exact Accuracy: 1.6; 10 Images, 1*1 Stitching, Exact Accuracy: 66.93; 10 Images, 2*2 Stitching, Exact Accuracy: 4.6; 10 Images, 4*4 Stitching, Exact Accuracy: 0.4; 10 Images, 8*8 Stitching, Exact Accuracy: 0 |
| multi-task-language-understanding-on-mmlu | Claude 3 Haiku (5-shot) | Average (%): 75.2 |
| multi-task-language-understanding-on-mmlu | Claude 3 Sonnet (5-shot) | Average (%): 79 |
| question-answering-on-pubmedqa | Claude 3 Opus (5-shot) | Accuracy: 75.8 |
| question-answering-on-pubmedqa | Claude 3 Opus (zero-shot) | Accuracy: 74.9 |