
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
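The predictable-scaling claim can be made concrete with a short curve-fitting sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' code or data: it assumes final training loss follows a power law in compute with an irreducible term, L(C) = a·C^b + c, fits that form to invented results from small runs, and extrapolates roughly 1,000x beyond them. The abstract says only that "some aspects" of performance were predicted this way; everything concrete in the code (values, function names, the scipy-based fit) is an assumption for illustration.

```python
# Minimal sketch of scaling-law extrapolation, assuming final loss follows
# L(C) = a * C**b + c (a power law in compute plus an irreducible term).
# All numbers are invented; this is not the authors' actual procedure.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c_norm, a, b, irreducible):
    """Loss as a power law in (normalized) training compute."""
    return a * c_norm**b + irreducible

# Hypothetical (compute, final loss) pairs from small training runs.
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])   # FLOPs
loss = np.array([2.95, 2.78, 2.62, 2.50, 2.40])

# Normalize compute so the fit is well conditioned.
c_norm = compute / compute.min()
(a, b, irreducible), _ = curve_fit(power_law, c_norm, loss, p0=[1.5, -0.2, 1.5])

# Extrapolate to a run with ~1,000x the compute of the largest small model.
target = 1e24 / compute.min()
print(f"predicted final loss at 1e24 FLOPs: {power_law(target, a, b, irreducible):.3f}")
```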
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-4o-2024-08-06 | Macro F1: 0.3087 |
| arithmetic-reasoning-on-gsm8k | GPT-3.5 (few-shot, k=5) | Accuracy: 57.1 |
| common-sense-reasoning-on-arc-challenge | GPT-4 (few-shot, k=25) | Accuracy: 96.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3.5 (few-shot, k=25) | Accuracy: 85.2 |
| common-sense-reasoning-on-winogrande | GPT-4 (5-shot) | Accuracy: 87.5 |
| common-sense-reasoning-on-winogrande | GPT-3.5 (5-shot) | Accuracy: 81.6 |
| few-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 61.911 |
| fs-mevqa-on-sme | GPT-4-1106-Vision-Preview | #Learning Samples (N): 16; ACC: 42.30; BLEU-4: 45.51; CIDEr: 269.68; Detection: 7.00; METEOR: 35.17; ROUGE-L: 52.67; SPICE: 37.67 |
| legal-reasoning-on-legalbench-rule-recall | GPT-4 | Balanced Accuracy: 59.2 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-0125 | 1k: 73.5; 2k: 73.5; 4k: 65.5; 6k: 63.0; 8k: 56.5; 12k: 52.0; 16k: 44.5; 32k: 30.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-1106 | 1k: 74.0; 2k: 73.5; 4k: 67.5; 6k: 59.5; 8k: 53.5; 12k: 49.5; 16k: 44.0; 32k: 16.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-0125 | 2k: 15.5; 4k: 16.5; 8k: 8.5; 16k: 5.5; 32k: 2.0; 64k: 4.0; 128k: 2.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-1106 | 2k: 18.5; 4k: 15.5; 8k: 7.5; 16k: 3.5; 32k: 6.0; 64k: 6.0; 128k: 6.0 |
| long-context-understanding-on-mmneedle | GPT-4V | 1 Image, 2*2 Stitching, Exact Accuracy: 86.09; 1 Image, 4*4 Stitching, Exact Accuracy: 54.72; 1 Image, 8*8 Stitching, Exact Accuracy: 7.3; 10 Images, 1*1 Stitching, Exact Accuracy: 72.36; 10 Images, 2*2 Stitching, Exact Accuracy: 34.24; 10 Images, 4*4 Stitching, Exact Accuracy: 7.58; 10 Images, 8*8 Stitching, Exact Accuracy: 0 |
| long-context-understanding-on-mmneedle | GPT-4o | 1 Image, 2*2 Stitching, Exact Accuracy: 94.6; 1 Image, 4*4 Stitching, Exact Accuracy: 83; 1 Image, 8*8 Stitching, Exact Accuracy: 19; 10 Images, 1*1 Stitching, Exact Accuracy: 97; 10 Images, 2*2 Stitching, Exact Accuracy: 81.8; 10 Images, 4*4 Stitching, Exact Accuracy: 26.9; 10 Images, 8*8 Stitching, Exact Accuracy: 1 |
| multi-task-language-understanding-on-mmlu | GPT-3.5 Turbo | Average (%): 70.0 |
| object-rearrangement-on-open6dor-v2 | GPT-4V | 6-DoF: -; pos-level0: 39.1; pos-level1: 46.8; rot-level0: 9.1; rot-level1: 6.9; rot-level2: 11.7 |
| question-answering-on-drop-test | GPT-4 (few-shot, k=3) | F1: 80.9 |
| question-answering-on-drop-test | GPT-3.5 (few-shot, k=3) | F1: 64.1 |
| question-answering-on-peerqa | GPT-4o-2024-08-06-128k | AlignScore: 0.1224; Prometheus-2 Answer Correctness: 3.4612; Rouge-L: 0.2266 |
| question-answering-on-tiq | GPT-4 | P@1: 28.6 |
| question-answering-on-triviaqa | GPT-4-0613 (Zero-shot) | EM: 84.8 |
| question-answering-on-truthfulqa | GPT-4 (RLHF) | MC1: 0.59 |
| spatial-reasoning-on-embspatial-bench | GPT-4V | Generation: 36.07 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (0-shot) | Wasserstein Distance (WD): 82.5; # Correct Groups: 114; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.6; Adjusted Rand Index (ARI): 18.4; Fowlkes Mallows Score (FMS): 34.0 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (1-shot) | Wasserstein Distance (WD): 82.3; # Correct Groups: 123; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.2; Adjusted Rand Index (ARI): 18.2; Fowlkes Mallows Score (FMS): 34.4 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (3-shot) | Wasserstein Distance (WD): 80.9; # Correct Groups: 140; # Solved Walls: 0; Adjusted Mutual Information (AMI): 24.7; Adjusted Rand Index (ARI): 21.3; Fowlkes Mallows Score (FMS): 36.8 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (5-shot) | Wasserstein Distance (WD): 80.6; # Correct Groups: 149; # Solved Walls: 2; Adjusted Mutual Information (AMI): 25.4; Adjusted Rand Index (ARI): 22.0; Fowlkes Mallows Score (FMS): 37.3 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (10-shot) | Wasserstein Distance (WD): 81.2; # Correct Groups: 137; # Solved Walls: 2; Adjusted Mutual Information (AMI): 24.0; Adjusted Rand Index (ARI): 20.4; Fowlkes Mallows Score (FMS): 36.1 |
| task-1-grouping-on-ocw | GPT-4 (0-shot) | Wasserstein Distance (WD): 75.8; # Correct Groups: 239; # Solved Walls: 6; Adjusted Mutual Information (AMI): 30.7; Adjusted Rand Index (ARI): 27.2; Fowlkes Mallows Score (FMS): 41.5 |
| task-1-grouping-on-ocw | GPT-4 (1-shot) | Wasserstein Distance (WD): 73.4; # Correct Groups: 262; # Solved Walls: 4; Adjusted Mutual Information (AMI): 33.5; Adjusted Rand Index (ARI): 29.7; Fowlkes Mallows Score (FMS): 43.7 |
| task-1-grouping-on-ocw | GPT-4 (3-shot) | Wasserstein Distance (WD): 73.7; # Correct Groups: 272; # Solved Walls: 5; Adjusted Mutual Information (AMI): 33.6; Adjusted Rand Index (ARI): 29.9; Fowlkes Mallows Score (FMS): 43.9 |
| task-1-grouping-on-ocw | GPT-4 (5-shot) | Wasserstein Distance (WD): 72.9; # Correct Groups: 269; # Solved Walls: 7; Adjusted Mutual Information (AMI): 32.8; Adjusted Rand Index (ARI): 29.1; Fowlkes Mallows Score (FMS): 43.4 |
| task-1-grouping-on-ocw | GPT-4 (100-shot) | Wasserstein Distance (WD): 73.6; # Correct Groups: 249; # Solved Walls: 3; Adjusted Mutual Information (AMI): 32.3; Adjusted Rand Index (ARI): 28.5; Fowlkes Mallows Score (FMS): 42.8 |
| visual-question-answering-on-benchlmm | GPT-4V | GPT-3.5 score: 58.37 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:high | GPT-4 score: 67.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 69.3±0.1 |
| visual-question-answering-on-mm-vet | gpt-4o-mini-2024-07-18 | GPT-4 score: 68.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4V | GPT-4 score: 67.7±0.3 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:low | GPT-4 score: 60.2±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-11-20) | GPT-4 score: 72.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 71.0±0.2 |
| visual-question-answering-on-mm-vet-v2 | gpt-4o-mini-2024-07-18 | GPT-4 score: 66.8±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4 Turbo (gpt-4-0125-preview) | GPT-4 score: 66.3±0.2 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (bbox): 60.7; GPT-4 score (human): 59.9 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (bbox): 52.8; GPT-4 score (human): 51.4 |
| visual-question-answering-vqa-on-core-mm | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44 |
| visual-question-answering-vqa-on-core-mm-1 | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44; Params: - |
| zero-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 52.489 |
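Several of the visual-question-answering rows above report judge-based results such as "GPT-4 score: 67.6±0.1". As a reading aid only, here is a minimal sketch assuming the ± denotes the spread over repeated runs of an LLM grader; the exact aggregation protocol is defined by each benchmark, and the per-run numbers below are invented.

```python
# Reading aid for entries such as "GPT-4 score: 67.6±0.1" (MM-Vet-style
# benchmarks grade answers with an LLM and repeat the evaluation several
# times). The per-run scores here are invented for illustration.
import statistics

run_scores = [67.5, 67.6, 67.7]  # hypothetical judge scores from repeated runs
mean = statistics.mean(run_scores)
spread = statistics.stdev(run_scores)
print(f"GPT-4 score: {mean:.1f}±{spread:.1f}")  # -> GPT-4 score: 67.6±0.1
```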