GPT-4 Technical Report

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
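The abstract's claim about predicting large-scale performance from much smaller training runs can be illustrated with a minimal sketch. It assumes, purely for illustration, that loss follows a simple power law in compute, loss ≈ a * compute**b, fitted by least squares in log-log space. The abstract does not state the functional form used, the function names `fit_power_law` and `predict` are hypothetical, and the data points are synthetic rather than taken from the report.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss ~ a * compute**b in log-log space.

    Assumption: a pure power law with no irreducible-loss term.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space (the exponent)
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back to linear scale
    return a, b


def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic, illustrative data (NOT from the report): compute is normalized so
# that the full-scale run equals 1.0, and every small run uses at most
# 1/1,000th of that compute.
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]
small_loss = [2.0 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted_full_scale = predict(a, b, 1.0)
print(f"a={a:.4f}, b={b:.4f}, predicted loss at full scale={predicted_full_scale:.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters. Real training runs are noisy, and a practical fit would likely need more points and possibly an additional constant term.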

Code Repositories

openai/evals (Official)
lflage/openfactscore (Mentioned in GitHub)
gpt4life/alpagasus (PyTorch; Mentioned in GitHub)
shmsw25/factscore (PyTorch; Mentioned in GitHub)
eternityyw/tram-benchmark (Mentioned in GitHub)
emrgnt-cmplxty/zero-shot-replication (PyTorch; Mentioned in GitHub)
AUCOHL/RTL-Repo (PyTorch; Mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| answerability-prediction-on-peerqa | GPT-4o-2024-08-06 | Macro F1: 0.3087 |
| arithmetic-reasoning-on-gsm8k | GPT-3.5 (few-shot, k=5) | Accuracy: 57.1 |
| common-sense-reasoning-on-arc-challenge | GPT-4 (few-shot, k=25) | Accuracy: 96.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3.5 (few-shot, k=25) | Accuracy: 85.2 |
| common-sense-reasoning-on-winogrande | GPT-4 (5-shot) | Accuracy: 87.5 |
| common-sense-reasoning-on-winogrande | GPT-3.5 (5-shot) | Accuracy: 81.6 |
| few-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 61.911 |
| fs-mevqa-on-sme | GPT-4-1106-Vision-Preview | #Learning Samples (N): 16; ACC: 42.30; BLEU-4: 45.51; CIDEr: 269.68; Detection: 7.00; METEOR: 35.17; ROUGE-L: 52.67; SPICE: 37.67 |
| legal-reasoning-on-legalbench-rule-recall | GPT-4 | Balanced Accuracy: 59.2 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-0125 | 1k: 73.5; 2k: 73.5; 4k: 65.5; 6k: 63.0; 8k: 56.5; 12k: 52.0; 16k: 44.5; 32k: 30.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-1106 | 1k: 74.0; 2k: 73.5; 4k: 67.5; 6k: 59.5; 8k: 53.5; 12k: 49.5; 16k: 44.0; 32k: 16.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-0125 | 2k: 15.5; 4k: 16.5; 8k: 8.5; 16k: 5.5; 32k: 2.0; 64k: 4.0; 128k: 2.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-1106 | 2k: 18.5; 4k: 15.5; 8k: 7.5; 16k: 3.5; 32k: 6.0; 64k: 6.0; 128k: 6.0 |
| long-context-understanding-on-mmneedle | GPT-4V | 1 Image, 2*2 Stitching, Exact Accuracy: 86.09; 1 Image, 4*4 Stitching, Exact Accuracy: 54.72; 1 Image, 8*8 Stitching, Exact Accuracy: 7.3; 10 Images, 1*1 Stitching, Exact Accuracy: 72.36; 10 Images, 2*2 Stitching, Exact Accuracy: 34.24; 10 Images, 4*4 Stitching, Exact Accuracy: 7.58; 10 Images, 8*8 Stitching, Exact Accuracy: 0 |
| long-context-understanding-on-mmneedle | GPT-4o | 1 Image, 2*2 Stitching, Exact Accuracy: 94.6; 1 Image, 4*4 Stitching, Exact Accuracy: 83; 1 Image, 8*8 Stitching, Exact Accuracy: 19; 10 Images, 1*1 Stitching, Exact Accuracy: 97; 10 Images, 2*2 Stitching, Exact Accuracy: 81.8; 10 Images, 4*4 Stitching, Exact Accuracy: 26.9; 10 Images, 8*8 Stitching, Exact Accuracy: 1 |
| multi-task-language-understanding-on-mmlu | GPT-3.5 Turbo | Average (%): 70.0 |
| object-rearrangement-on-open6dor-v2 | GPT-4V | 6-DoF: -; pos-level0: 39.1; pos-level1: 46.8; rot-level0: 9.1; rot-level1: 6.9; rot-level2: 11.7 |
| question-answering-on-drop-test | GPT-4 (few-shot, k=3) | F1: 80.9 |
| question-answering-on-drop-test | GPT-3.5 (few-shot, k=3) | F1: 64.1 |
| question-answering-on-peerqa | GPT-4o-2024-08-06-128k | AlignScore: 0.1224; Prometheus-2 Answer Correctness: 3.4612; Rouge-L: 0.2266 |
| question-answering-on-tiq | GPT-4 | P@1: 28.6 |
| question-answering-on-triviaqa | GPT-4-0613 (Zero-shot) | EM: 84.8 |
| question-answering-on-truthfulqa | GPT-4 (RLHF) | MC1: 0.59 |
| spatial-reasoning-on-embspatial-bench | GPT-4V | Generation: 36.07 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (0-shot) | Wasserstein Distance (WD): 82.5; # Correct Groups: 114; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.6; Adjusted Rand Index (ARI): 18.4; Fowlkes Mallows Score (FMS): 34.0 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (1-shot) | Wasserstein Distance (WD): 82.3; # Correct Groups: 123; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.2; Adjusted Rand Index (ARI): 18.2; Fowlkes Mallows Score (FMS): 34.4 |
| task-1-grouping-on-ocw | GPT-4 (1-shot) | Wasserstein Distance (WD): 73.4; # Correct Groups: 262; # Solved Walls: 4; Adjusted Mutual Information (AMI): 33.5; Adjusted Rand Index (ARI): 29.7; Fowlkes Mallows Score (FMS): 43.7 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (10-shot) | Wasserstein Distance (WD): 81.2; # Correct Groups: 137; # Solved Walls: 2; Adjusted Mutual Information (AMI): 24.0; Adjusted Rand Index (ARI): 20.4; Fowlkes Mallows Score (FMS): 36.1 |
| task-1-grouping-on-ocw | GPT-4 (5-shot) | Wasserstein Distance (WD): 72.9; # Correct Groups: 269; # Solved Walls: 7; Adjusted Mutual Information (AMI): 32.8; Adjusted Rand Index (ARI): 29.1; Fowlkes Mallows Score (FMS): 43.4 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (5-shot) | Wasserstein Distance (WD): 80.6; # Correct Groups: 149; # Solved Walls: 2; Adjusted Mutual Information (AMI): 25.4; Adjusted Rand Index (ARI): 22.0; Fowlkes Mallows Score (FMS): 37.3 |
| task-1-grouping-on-ocw | GPT-4 (0-shot) | Wasserstein Distance (WD): 75.8; # Correct Groups: 239; # Solved Walls: 6; Adjusted Mutual Information (AMI): 30.7; Adjusted Rand Index (ARI): 27.2; Fowlkes Mallows Score (FMS): 41.5 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (3-shot) | Wasserstein Distance (WD): 80.9; # Correct Groups: 140; # Solved Walls: 0; Adjusted Mutual Information (AMI): 24.7; Adjusted Rand Index (ARI): 21.3; Fowlkes Mallows Score (FMS): 36.8 |
| task-1-grouping-on-ocw | GPT-4 (100-shot) | Wasserstein Distance (WD): 73.6; # Correct Groups: 249; # Solved Walls: 3; Adjusted Mutual Information (AMI): 32.3; Adjusted Rand Index (ARI): 28.5; Fowlkes Mallows Score (FMS): 42.8 |
| task-1-grouping-on-ocw | GPT-4 (3-shot) | Wasserstein Distance (WD): 73.7; # Correct Groups: 272; # Solved Walls: 5; Adjusted Mutual Information (AMI): 33.6; Adjusted Rand Index (ARI): 29.9; Fowlkes Mallows Score (FMS): 43.9 |
| visual-question-answering-on-benchlmm | GPT-4V | GPT-3.5 score: 58.37 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:high | GPT-4 score: 67.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 69.3±0.1 |
| visual-question-answering-on-mm-vet | gpt-4o-mini-2024-07-18 | GPT-4 score: 68.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4V | GPT-4 score: 67.7±0.3 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:low | GPT-4 score: 60.2±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-11-20) | GPT-4 score: 72.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 71.0±0.2 |
| visual-question-answering-on-mm-vet-v2 | gpt-4o-mini-2024-07-18 | GPT-4 score: 66.8±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4 Turbo (gpt-4-0125-preview) | GPT-4 score: 66.3±0.2 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (bbox): 60.7; GPT-4 score (human): 59.9 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (bbox): 52.8; GPT-4 score (human): 51.4 |
| visual-question-answering-vqa-on-core-mm | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44 |
| visual-question-answering-vqa-on-core-mm-1 | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44; Params: - |
| zero-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 52.489 |
