4 months ago

GPT-4 Technical Report

OpenAI; Josh Achiam; Steven Adler; Sandhini Agarwal; Lama Ahmad; Ilge Akkaya; Florencia Leoni Aleman; Diogo Almeida; Janko Altenschmidt; Sam Altman; Shyamal Anadkat; Red Avila; Igor Babuschkin; Suchir Balaji; Valerie Balcom; Paul Baltescu; Haiming Bao; Mohammad Bavarian; Jeff Belgum; Irwan Bello; Jake Berdine; Gabriel Bernadett-Shapiro; Christopher Berner; Lenny Bogdonoff; Oleg Boiko; Madelaine Boyd; Anna-Luisa Brakman; Greg Brockman; Tim Brooks; Miles Brundage; Kevin Button; Trevor Cai; Rosie Campbell; Andrew Cann; Brittany Carey; Chelsea Carlson; Rory Carmichael; Brooke Chan; Che Chang; Fotis Chantzis; Derek Chen; Sully Chen; Ruby Chen; Jason Chen; Mark Chen; Ben Chess; Chester Cho; Casey Chu; Hyung Won Chung; Dave Cummings; Jeremiah Currier; Yunxing Dai; Cory Decareaux; Thomas Degry; Noah Deutsch; Damien Deville; Arka Dhar; David Dohan; Steve Dowling; Sheila Dunning; Adrien Ecoffet; Atty Eleti; Tyna Eloundou; David Farhi; Liam Fedus; Niko Felix; Simón Posada Fishman; Juston Forte; Isabella Fulford; Leo Gao; Elie Georges; Christian Gibson; Vik Goel; Tarun Gogineni; Gabriel Goh; Rapha Gontijo-Lopes; Jonathan Gordon; Morgan Grafstein; Scott Gray; Ryan Greene; Joshua Gross; Shixiang Shane Gu; Yufei Guo; Chris Hallacy; Jesse Han; Jeff Harris; Yuchen He; Mike Heaton; Johannes Heidecke; Chris Hesse; Alan Hickey; Wade Hickey; Peter Hoeschele; Brandon Houghton; Kenny Hsu; Shengli Hu; Xin Hu; Joost Huizinga; Shantanu Jain; Shawn Jain; Joanne Jang; Angela Jiang; Roger Jiang; Haozhun Jin; Denny Jin; Shino Jomoto; Billie Jonn; Heewoo Jun; Tomer Kaftan; Łukasz Kaiser; Ali Kamali; Ingmar Kanitscheider; Nitish Shirish Keskar; Tabarak Khan; Logan Kilpatrick; Jong Wook Kim; Christina Kim; Yongjik Kim; Jan Hendrik Kirchner; Jamie Kiros; Matt Knight; Daniel Kokotajlo; Łukasz Kondraciuk; Andrew Kondrich; Aris Konstantinidis; Kyle Kosic; Gretchen Krueger; Vishal Kuo; Michael Lampe; Ikai Lan; Teddy Lee; Jan Leike; Jade Leung; Daniel Levy; Chak Ming Li; Rachel Lim; Molly Lin; Stephanie Lin; Mateusz Litwin; Theresa Lopez; Ryan Lowe; Patricia Lue; Anna Makanju; Kim Malfacini; Sam Manning; Todor Markov; Yaniv Markovski; Bianca Martin; Katie Mayer; Andrew Mayne; Bob McGrew; Scott Mayer McKinney; Christine McLeavey; Paul McMillan; Jake McNeil; David Medina; Aalok Mehta; Jacob Menick; Luke Metz; Andrey Mishchenko; Pamela Mishkin; Vinnie Monaco; Evan Morikawa; Daniel Mossing; Tong Mu; Mira Murati; Oleg Murk; David Mély; Ashvin Nair; Reiichiro Nakano; Rajeev Nayak; Arvind Neelakantan; Richard Ngo; Hyeonwoo Noh; Long Ouyang; Cullen O'Keefe; Jakub Pachocki; Alex Paino; Joe Palermo; Ashley Pantuliano; Giambattista Parascandolo; Joel Parish; Emy Parparita; Alex Passos; Mikhail Pavlov; Andrew Peng; Adam Perelman; Filipe de Avila Belbute Peres; Michael Petrov; Henrique Ponde de Oliveira Pinto; Michael Pokorny; Michelle Pokrass; Vitchyr H. Pong; Tolly Powell; Alethea Power; Boris Power; Elizabeth Proehl; Raul Puri; Alec Radford; Jack Rae; Aditya Ramesh; Cameron Raymond; Francis Real; Kendra Rimbach; Carl Ross; Bob Rotsted; Henri Roussez; Nick Ryder; Mario Saltarelli; Ted Sanders; Shibani Santurkar; Girish Sastry; Heather Schmidt; David Schnurr; John Schulman; Daniel Selsam; Kyla Sheppard; Toki Sherbakov; Jessica Shieh; Sarah Shoker; Pranav Shyam; Szymon Sidor; Eric Sigler; Maddie Simens; Jordan Sitkin; Katarina Slama; Ian Sohl; Benjamin Sokolowsky; Yang Song; Natalie Staudacher; Felipe Petroski Such; Natalie Summers; Ilya Sutskever; Jie Tang; Nikolas Tezak; Madeleine B. Thompson; Phil Tillet; Amin Tootoonchian; Elizabeth Tseng; Preston Tuggle; Nick Turley; Jerry Tworek; Juan Felipe Cerón Uribe; Andrea Vallone; Arun Vijayvergiya; Chelsea Voss; Carroll Wainwright; Justin Jay Wang; Alvin Wang; Ben Wang; Jonathan Ward; Jason Wei; CJ Weinmann; Akila Welihinda; Peter Welinder; Jiayi Weng; Lilian Weng; Matt Wiethoff; Dave Willner; Clemens Winter; Samuel Wolrich; Hannah Wong; Lauren Workman; Sherwin Wu; Jeff Wu; Michael Wu; Kai Xiao; Tao Xu; Sarah Yoo; Kevin Yu; Qiming Yuan; Wojciech Zaremba; Rowan Zellers; Chong Zhang; Marvin Zhang; Shengjia Zhao; Tianhao Zheng; Juntang Zhuang; William Zhuk; Barret Zoph

Abstract

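We report the development of GPT-4, a large-scale, multimodal model that accepts image and text inputs and produces text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process improves its performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance from models trained with no more than 1/1,000th the compute of GPT-4.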
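The predictability claim in the abstract can be illustrated with a small extrapolation exercise. The sketch below is not the authors' code: it assumes a power law with an irreducible term, L(C) = a * C^b + c, fits it to a handful of small training runs, and extrapolates to a run with 1,000 times the compute of the largest one. All data points are synthetic, and the names `fit_power_law` and `predict_loss` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def fit_power_law(compute, loss, n_grid=2000):
    """Fit loss ~= a * compute**b + c (assumed functional form).

    Grid-searches the irreducible term c; for each candidate c the
    remaining parameters come from a straight-line fit in log-log space.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    log_compute = np.log(compute)
    best = None
    # c must stay below the smallest observed loss so log(loss - c) is defined.
    for c in np.linspace(0.0, loss.min() * 0.999, n_grid):
        b, log_a = np.polyfit(log_compute, np.log(loss - c), 1)
        pred = np.exp(log_a) * compute**b + c
        err = float(np.sum((pred - loss) ** 2))
        if best is None or err < best[0]:
            best = (err, float(np.exp(log_a)), float(b), float(c))
    _, a, b, c = best
    return a, b, c


def predict_loss(compute, a, b, c):
    """Evaluate the fitted curve at a new compute budget."""
    return a * np.asarray(compute, dtype=float) ** b + c


# Synthetic small runs (made-up numbers, not results from the report).
# Compute is expressed relative to the target run, which has compute 1.0,
# so the largest small run uses 1/1,000th of the target compute.
small_compute = np.logspace(-6, -3, 8)
true_a, true_b, true_c = 0.5, -0.15, 1.2
small_loss = true_a * small_compute**true_b + true_c

a, b, c = fit_power_law(small_compute, small_loss)
predicted = float(predict_loss(1.0, a, b, c))
print(f"fitted a={a:.3f}, b={b:.3f}, c={c:.3f}; predicted loss at target: {predicted:.3f}")
```

The grid search over `c` keeps the example dependency-light; a nonlinear least-squares routine would be a reasonable alternative when noisy measurements make the choice of `c` less clear-cut.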

Benchmarks

Benchmark | Method | Metrics
answerability-prediction-on-peerqa | GPT-4o-2024-08-06 | Macro F1: 0.3087
arithmetic-reasoning-on-gsm8k | GPT-3.5 (few-shot, k=5) | Accuracy: 57.1
common-sense-reasoning-on-arc-challenge | GPT-4 (few-shot, k=25) | Accuracy: 96.4
common-sense-reasoning-on-arc-challenge | GPT-3.5 (few-shot, k=25) | Accuracy: 85.2
common-sense-reasoning-on-winogrande | GPT-4 (5-shot) | Accuracy: 87.5
common-sense-reasoning-on-winogrande | GPT-3.5 (5-shot) | Accuracy: 81.6
few-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 61.911
fs-mevqa-on-sme | GPT-4-1106-Vision-Preview | #Learning Samples (N): 16; ACC: 42.30; BLEU-4: 45.51; CIDEr: 269.68; Detection: 7.00; METEOR: 35.17; ROUGE-L: 52.67; SPICE: 37.67
legal-reasoning-on-legalbench-rule-recall | GPT-4 | Balanced Accuracy: 59.2
long-context-understanding-on-ada-leval | GPT-4-Turbo-0125 | 1k: 73.5; 2k: 73.5; 4k: 65.5; 6k: 63.0; 8k: 56.5; 12k: 52.0; 16k: 44.5; 32k: 30.0; 64k: 0.0; 128k: 0.0
long-context-understanding-on-ada-leval | GPT-4-Turbo-1106 | 1k: 74.0; 2k: 73.5; 4k: 67.5; 6k: 59.5; 8k: 53.5; 12k: 49.5; 16k: 44.0; 32k: 16.0; 64k: 0.0; 128k: 0.0
long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-0125 | 2k: 15.5; 4k: 16.5; 8k: 8.5; 16k: 5.5; 32k: 2.0; 64k: 4.0; 128k: 2.0
long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-1106 | 2k: 18.5; 4k: 15.5; 8k: 7.5; 16k: 3.5; 32k: 6.0; 64k: 6.0; 128k: 6.0
long-context-understanding-on-mmneedle | GPT-4V | 1 Image, 2*2 Stitching, Exact Accuracy: 86.09; 1 Image, 4*4 Stitching, Exact Accuracy: 54.72; 1 Image, 8*8 Stitching, Exact Accuracy: 7.3; 10 Images, 1*1 Stitching, Exact Accuracy: 72.36; 10 Images, 2*2 Stitching, Exact Accuracy: 34.24; 10 Images, 4*4 Stitching, Exact Accuracy: 7.58; 10 Images, 8*8 Stitching, Exact Accuracy: 0
long-context-understanding-on-mmneedle | GPT-4o | 1 Image, 2*2 Stitching, Exact Accuracy: 94.6; 1 Image, 4*4 Stitching, Exact Accuracy: 83; 1 Image, 8*8 Stitching, Exact Accuracy: 19; 10 Images, 1*1 Stitching, Exact Accuracy: 97; 10 Images, 2*2 Stitching, Exact Accuracy: 81.8; 10 Images, 4*4 Stitching, Exact Accuracy: 26.9; 10 Images, 8*8 Stitching, Exact Accuracy: 1
multi-task-language-understanding-on-mmlu | GPT-3.5 Turbo | Average (%): 70.0
object-rearrangement-on-open6dor-v2 | GPT-4V | 6-DoF: -; pos-level0: 39.1; pos-level1: 46.8; rot-level0: 9.1; rot-level1: 6.9; rot-level2: 11.7
question-answering-on-drop-test | GPT-4 (few-shot, k=3) | F1: 80.9
question-answering-on-drop-test | GPT 3.5 (few-shot, k=3) | F1: 64.1
question-answering-on-peerqa | GPT-4o-2024-08-06-128k | AlignScore: 0.1224; Prometheus-2 Answer Correctness: 3.4612; Rouge-L: 0.2266
question-answering-on-tiq | Gpt-4 | P@1: 28.6
question-answering-on-triviaqa | GPT-4-0613 (Zero-shot) | EM: 84.8
question-answering-on-truthfulqa | GPT-4 (RLHF) | MC1: 0.59
spatial-reasoning-on-embspatial-bench | GPT-4V | Generation: 36.07
task-1-grouping-on-ocw | GPT-3.5-turbo (0-shot) | Wasserstein Distance (WD): 82.5; # Correct Groups: 114; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.6; Adjusted Rand Index (ARI): 18.4; Fowlkes Mallows Score (FMS): 34.0
task-1-grouping-on-ocw | GPT-3.5-turbo (1-shot) | Wasserstein Distance (WD): 82.3; # Correct Groups: 123; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.2; Adjusted Rand Index (ARI): 18.2; Fowlkes Mallows Score (FMS): 34.4
task-1-grouping-on-ocw | GPT-4 (1-shot) | Wasserstein Distance (WD): 73.4; # Correct Groups: 262; # Solved Walls: 4; Adjusted Mutual Information (AMI): 33.5; Adjusted Rand Index (ARI): 29.7; Fowlkes Mallows Score (FMS): 43.7
task-1-grouping-on-ocw | GPT-3.5-turbo (10-shot) | Wasserstein Distance (WD): 81.2; # Correct Groups: 137; # Solved Walls: 2; Adjusted Mutual Information (AMI): 24.0; Adjusted Rand Index (ARI): 20.4; Fowlkes Mallows Score (FMS): 36.1
task-1-grouping-on-ocw | GPT-4 (5-shot) | Wasserstein Distance (WD): 72.9; # Correct Groups: 269; # Solved Walls: 7; Adjusted Mutual Information (AMI): 32.8; Adjusted Rand Index (ARI): 29.1; Fowlkes Mallows Score (FMS): 43.4
task-1-grouping-on-ocw | GPT-3.5-turbo (5-shot) | Wasserstein Distance (WD): 80.6; # Correct Groups: 149; # Solved Walls: 2; Adjusted Mutual Information (AMI): 25.4; Adjusted Rand Index (ARI): 22.0; Fowlkes Mallows Score (FMS): 37.3
task-1-grouping-on-ocw | GPT-4 (0-shot) | Wasserstein Distance (WD): 75.8; # Correct Groups: 239; # Solved Walls: 6; Adjusted Mutual Information (AMI): 30.7; Adjusted Rand Index (ARI): 27.2; Fowlkes Mallows Score (FMS): 41.5
task-1-grouping-on-ocw | GPT-3.5-turbo (3-shot) | Wasserstein Distance (WD): 80.9; # Correct Groups: 140; # Solved Walls: 0; Adjusted Mutual Information (AMI): 24.7; Adjusted Rand Index (ARI): 21.3; Fowlkes Mallows Score (FMS): 36.8
task-1-grouping-on-ocw | GPT-4 (100-shot) | Wasserstein Distance (WD): 73.6; # Correct Groups: 249; # Solved Walls: 3; Adjusted Mutual Information (AMI): 32.3; Adjusted Rand Index (ARI): 28.5; Fowlkes Mallows Score (FMS): 42.8
task-1-grouping-on-ocw | GPT-4 (3-shot) | Wasserstein Distance (WD): 73.7; # Correct Groups: 272; # Solved Walls: 5; Adjusted Mutual Information (AMI): 33.6; Adjusted Rand Index (ARI): 29.9; Fowlkes Mallows Score (FMS): 43.9
visual-question-answering-on-benchlmm | GPT-4V | GPT-3.5 score: 58.37
visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:high | GPT-4 score: 67.6±0.1
visual-question-answering-on-mm-vet | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 69.3±0.1
visual-question-answering-on-mm-vet | gpt-4o-mini-2024-07-18 | GPT-4 score: 68.6±0.1
visual-question-answering-on-mm-vet | GPT-4V | GPT-4 score: 67.7±0.3
visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:low | GPT-4 score: 60.2±0.3
visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-11-20) | GPT-4 score: 72.1±0.2
visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 71.0±0.2
visual-question-answering-on-mm-vet-v2 | gpt-4o-mini-2024-07-18 | GPT-4 score: 66.8±0.3
visual-question-answering-on-mm-vet-v2 | GPT-4 Turbo (gpt-4-0125-preview) | GPT-4 score: 66.3±0.2
visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (bbox): 60.7; GPT-4 score (human): 59.9
visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (bbox): 52.8; GPT-4 score (human): 51.4
visual-question-answering-vqa-on-core-mm | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44
visual-question-answering-vqa-on-core-mm-1 | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44; Params: -
zero-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 52.489
