OpenAI; Achiam, Josh; Adler, Steven; Agarwal, Sandhini; Ahmad, Lama; Akkaya, Ilge; Aleman, Florencia Leoni; Almeida, Diogo; Altenschmidt, Janko; Altman, Sam; Anadkat, Shyamal; Avila, Red; Babuschkin, Igor; Balaji, Suchir; Balcom, Valerie; Baltescu, Paul; Bao, Haiming; Bavarian, Mohammad; Belgum, Jeff; Bello, Irwan; Berdine, Jake; Bernadett-Shapiro, Gabriel; Berner, Christopher; Bogdonoff, Lenny; Boiko, Oleg; Boyd, Madelaine; Brakman, Anna-Luisa; Brockman, Greg; Brooks, Tim; Brundage, Miles; Button, Kevin; Cai, Trevor; Campbell, Rosie; Cann, Andrew; Carey, Brittany; Carlson, Chelsea; Carmichael, Rory; Chan, Brooke; Chang, Che; Chantzis, Fotis; Chen, Derek; Chen, Sully; Chen, Ruby; Chen, Jason; Chen, Mark; Chess, Ben; Cho, Chester; Chu, Casey; Chung, Hyung Won; Cummings, Dave; Currier, Jeremiah; Dai, Yunxing; Decareaux, Cory; Degry, Thomas; Deutsch, Noah; Deville, Damien; Dhar, Arka; Dohan, David; Dowling, Steve; Dunning, Sheila; Ecoffet, Adrien; Eleti, Atty; Eloundou, Tyna; Farhi, David; Fedus, Liam; Felix, Niko; Fishman, Simón Posada; Forte, Juston; Fulford, Isabella; Gao, Leo; Georges, Elie; Gibson, Christian; Goel, Vik; Gogineni, Tarun; Goh, Gabriel; Gontijo-Lopes, Rapha; Gordon, Jonathan; Grafstein, Morgan; Gray, Scott; Greene, Ryan; Gross, Joshua; Gu, Shixiang Shane; Guo, Yufei; Hallacy, Chris; Han, Jesse; Harris, Jeff; He, Yuchen; Heaton, Mike; Heidecke, Johannes; Hesse, Chris; Hickey, Alan; Hickey, Wade; Hoeschele, Peter; Houghton, Brandon; Hsu, Kenny; Hu, Shengli; Hu, Xin; Huizinga, Joost; Jain, Shantanu; Jain, Shawn; Jang, Joanne; Jiang, Angela; Jiang, Roger; Jin, Haozhun; Jin, Denny; Jomoto, Shino; Jonn, Billie; Jun, Heewoo; Kaftan, Tomer; Kaiser, Łukasz; Kamali, Ali; Kanitscheider, Ingmar; Keskar, Nitish Shirish; Khan, Tabarak; Kilpatrick, Logan; Kim, Jong Wook; Kim, Christina; Kim, Yongjik; Kirchner, Jan Hendrik; Kiros, Jamie; Knight, Matt; Kokotajlo, Daniel; Kondraciuk, Łukasz; Kondrich, Andrew; Konstantinidis, Aris; Kosic, Kyle; Krueger, Gretchen; Kuo, Vishal; Lampe, Michael; Lan, Ikai; Lee, Teddy; Leike, Jan; Leung, Jade; Levy, Daniel; Li, Chak Ming; Lim, Rachel; Lin, Molly; Lin, Stephanie; Litwin, Mateusz; Lopez, Theresa; Lowe, Ryan; Lue, Patricia; Makanju, Anna; Malfacini, Kim; Manning, Sam; Markov, Todor; Markovski, Yaniv; Martin, Bianca; Mayer, Katie; Mayne, Andrew; McGrew, Bob; McKinney, Scott Mayer; McLeavey, Christine; McMillan, Paul; McNeil, Jake; Medina, David; Mehta, Aalok; Menick, Jacob; Metz, Luke; Mishchenko, Andrey; Mishkin, Pamela; Monaco, Vinnie; Morikawa, Evan; Mossing, Daniel; Mu, Tong; Murati, Mira; Murk, Oleg; Mély, David; Nair, Ashvin; Nakano, Reiichiro; Nayak, Rajeev; Neelakantan, Arvind; Ngo, Richard; Noh, Hyeonwoo; Ouyang, Long; O'Keefe, Cullen; Pachocki, Jakub; Paino, Alex; Palermo, Joe; Pantuliano, Ashley; Parascandolo, Giambattista; Parish, Joel; Parparita, Emy; Passos, Alex; Pavlov, Mikhail; Peng, Andrew; Perelman, Adam; Peres, Filipe de Avila Belbute; Petrov, Michael; Pinto, Henrique Ponde de Oliveira; Pokorny, Michael; Pokrass, Michelle; Pong, Vitchyr H.; Powell, Tolly; Power, Alethea; Power, Boris; Proehl, Elizabeth; Puri, Raul; Radford, Alec; Rae, Jack; Ramesh, Aditya; Raymond, Cameron; Real, Francis; Rimbach, Kendra; Ross, Carl; Rotsted, Bob; Roussez, Henri; Ryder, Nick; Saltarelli, Mario; Sanders, Ted; Santurkar, Shibani; Sastry, Girish; Schmidt, Heather; Schnurr, David; Schulman, John; Selsam, Daniel; Sheppard, Kyla; Sherbakov, Toki; Shieh, Jessica; Shoker, Sarah; Shyam, Pranav; Sidor, Szymon; Sigler, Eric; Simens, Maddie; Sitkin, Jordan; Slama, Katarina; Sohl, Ian; Sokolowsky, Benjamin; Song, Yang; Staudacher, Natalie; Such, Felipe Petroski; Summers, Natalie; Sutskever, Ilya; Tang, Jie; Tezak, Nikolas; Thompson, Madeleine B.; Tillet, Phil; Tootoonchian, Amin; Tseng, Elizabeth; Tuggle, Preston; Turley, Nick; Tworek, Jerry; Uribe, Juan Felipe Cerón; Vallone, Andrea; Vijayvergiya, Arun; Voss, Chelsea; Wainwright, Carroll; Wang, Justin Jay; Wang, Alvin; Wang, Ben; Ward, Jonathan; Wei, Jason; Weinmann, CJ; Welihinda, Akila; Welinder, Peter; Weng, Jiayi; Weng, Lilian; Wiethoff, Matt; Willner, Dave; Winter, Clemens; Wolrich, Samuel; Wong, Hannah; Workman, Lauren; Wu, Sherwin; Wu, Jeff; Wu, Michael; Xiao, Kai; Xu, Tao; Yoo, Sarah; Yu, Kevin; Yuan, Qiming; Zaremba, Wojciech; Zellers, Rowan; Zhang, Chong; Zhang, Marvin; Zhao, Shengjia; Zheng, Tianhao; Zhuang, Juntang; Zhuk, William; Zoph, Barret

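Abstract
We report the development of GPT-4, a large-scale, multimodal model that accepts image and text inputs and produces text outputs. Although it remains less capable than humans in many real-world scenarios, GPT-4 exhibits performance close to human level on a variety of professional and academic benchmarks, including a score in the top 10% of test takers on a simulated bar exam. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. A post-training alignment process markedly improves its performance on measures of factuality and adherence to desired behavior. A core component of the project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance from models trained with no more than 1/1,000th of GPT-4's compute.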
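The abstract's last claim, predicting a large model's performance from much smaller runs, can be illustrated with a simple extrapolation. The sketch below is not the report's method: it assumes a pure power law, loss ≈ a · C^b, and fits it by least squares in log-log space. The function names and the data points are hypothetical; the data are synthetic, generated from a known curve so that the fit can be checked.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss ≈ a * compute**b, done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the line = power-law exponent
    a = math.exp(mean_y - b * mean_x)   # intercept of the line = log of the prefactor
    return a, b


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b


# Hypothetical small-run measurements, generated from loss = 10 * C**-0.05
# purely for illustration; these are NOT figures from the report.
# Compute is expressed as a fraction of the target model's budget.
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]
small_loss = [10.0 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted = predict_loss(a, b, 1.0)  # extrapolate to the full budget
print(f"exponent b = {b:.4f}, predicted loss at full compute = {predicted:.4f}")
```

All fitted runs here use at most 1/1,000th of the target compute, mirroring the setting the abstract describes. Real measurements are noisy and may need extra terms (for example an irreducible-loss constant), so treat this as a starting point only.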
Code Repositories
lflage/openfactscore
Mentioned in GitHub
gpt4life/alpagasus
pytorch
Mentioned in GitHub
shmsw25/factscore
pytorch
Mentioned in GitHub
zach-zhiling-zheng/reticular_chemist
Mentioned in GitHub
eternityyw/tram-benchmark
Mentioned in GitHub
unispac/visual-adversarial-examples-jailbreak-large-language-models
pytorch
Mentioned in GitHub
ethz-spylab/superhuman-ai-consistency
Mentioned in GitHub
emrgnt-cmplxty/zero-shot-replication
pytorch
Mentioned in GitHub
AUCOHL/RTL-Repo
pytorch
Mentioned in GitHub
ethz-privsec/superhuman-ai-consistency
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-4o-2024-08-06 | Macro F1: 0.3087 |
| arithmetic-reasoning-on-gsm8k | GPT-3.5 (few-shot, k=5) | Accuracy: 57.1 |
| common-sense-reasoning-on-arc-challenge | GPT-4 (few-shot, k=25) | Accuracy: 96.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3.5 (few-shot, k=25) | Accuracy: 85.2 |
| common-sense-reasoning-on-winogrande | GPT-4 (5-shot) | Accuracy: 87.5 |
| common-sense-reasoning-on-winogrande | GPT-3.5 (5-shot) | Accuracy: 81.6 |
| few-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 61.911 |
| fs-mevqa-on-sme | GPT-4-1106-Vision-Preview | #Learning Samples (N): 16; ACC: 42.30; BLEU-4: 45.51; CIDEr: 269.68; Detection: 7.00; METEOR: 35.17; ROUGE-L: 52.67; SPICE: 37.67 |
| legal-reasoning-on-legalbench-rule-recall | GPT-4 | Balanced Accuracy: 59.2 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-0125 | 1k: 73.5; 2k: 73.5; 4k: 65.5; 6k: 63.0; 8k: 56.5; 12k: 52.0; 16k: 44.5; 32k: 30.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval | GPT-4-Turbo-1106 | 1k: 74.0; 2k: 73.5; 4k: 67.5; 6k: 59.5; 8k: 53.5; 12k: 49.5; 16k: 44.0; 32k: 16.0; 64k: 0.0; 128k: 0.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-0125 | 2k: 15.5; 4k: 16.5; 8k: 8.5; 16k: 5.5; 32k: 2.0; 64k: 4.0; 128k: 2.0 |
| long-context-understanding-on-ada-leval-tsort | GPT-4-Turbo-1106 | 2k: 18.5; 4k: 15.5; 8k: 7.5; 16k: 3.5; 32k: 6.0; 64k: 6.0; 128k: 6.0 |
| long-context-understanding-on-mmneedle | GPT-4V | 1 Image, 2×2 Stitching, Exact Accuracy: 86.09; 1 Image, 4×4 Stitching, Exact Accuracy: 54.72; 1 Image, 8×8 Stitching, Exact Accuracy: 7.3; 10 Images, 1×1 Stitching, Exact Accuracy: 72.36; 10 Images, 2×2 Stitching, Exact Accuracy: 34.24; 10 Images, 4×4 Stitching, Exact Accuracy: 7.58; 10 Images, 8×8 Stitching, Exact Accuracy: 0 |
| long-context-understanding-on-mmneedle | GPT-4o | 1 Image, 2×2 Stitching, Exact Accuracy: 94.6; 1 Image, 4×4 Stitching, Exact Accuracy: 83; 1 Image, 8×8 Stitching, Exact Accuracy: 19; 10 Images, 1×1 Stitching, Exact Accuracy: 97; 10 Images, 2×2 Stitching, Exact Accuracy: 81.8; 10 Images, 4×4 Stitching, Exact Accuracy: 26.9; 10 Images, 8×8 Stitching, Exact Accuracy: 1 |
| multi-task-language-understanding-on-mmlu | GPT-3.5 Turbo | Average (%): 70.0 |
| object-rearrangement-on-open6dor-v2 | GPT-4V | 6-DoF: -; pos-level0: 39.1; pos-level1: 46.8; rot-level0: 9.1; rot-level1: 6.9; rot-level2: 11.7 |
| question-answering-on-drop-test | GPT-4 (few-shot, k=3) | F1: 80.9 |
| question-answering-on-drop-test | GPT-3.5 (few-shot, k=3) | F1: 64.1 |
| question-answering-on-peerqa | GPT-4o-2024-08-06-128k | AlignScore: 0.1224; Prometheus-2 Answer Correctness: 3.4612; Rouge-L: 0.2266 |
| question-answering-on-tiq | GPT-4 | P@1: 28.6 |
| question-answering-on-triviaqa | GPT-4-0613 (Zero-shot) | EM: 84.8 |
| question-answering-on-truthfulqa | GPT-4 (RLHF) | MC1: 0.59 |
| spatial-reasoning-on-embspatial-bench | GPT-4V | Generation: 36.07 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (0-shot) | Wasserstein Distance (WD): 82.5; # Correct Groups: 114; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.6; Adjusted Rand Index (ARI): 18.4; Fowlkes Mallows Score (FMS): 34.0 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (1-shot) | Wasserstein Distance (WD): 82.3; # Correct Groups: 123; # Solved Walls: 0; Adjusted Mutual Information (AMI): 21.2; Adjusted Rand Index (ARI): 18.2; Fowlkes Mallows Score (FMS): 34.4 |
| task-1-grouping-on-ocw | GPT-4 (1-shot) | Wasserstein Distance (WD): 73.4; # Correct Groups: 262; # Solved Walls: 4; Adjusted Mutual Information (AMI): 33.5; Adjusted Rand Index (ARI): 29.7; Fowlkes Mallows Score (FMS): 43.7 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (10-shot) | Wasserstein Distance (WD): 81.2; # Correct Groups: 137; # Solved Walls: 2; Adjusted Mutual Information (AMI): 24.0; Adjusted Rand Index (ARI): 20.4; Fowlkes Mallows Score (FMS): 36.1 |
| task-1-grouping-on-ocw | GPT-4 (5-shot) | Wasserstein Distance (WD): 72.9; # Correct Groups: 269; # Solved Walls: 7; Adjusted Mutual Information (AMI): 32.8; Adjusted Rand Index (ARI): 29.1; Fowlkes Mallows Score (FMS): 43.4 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (5-shot) | Wasserstein Distance (WD): 80.6; # Correct Groups: 149; # Solved Walls: 2; Adjusted Mutual Information (AMI): 25.4; Adjusted Rand Index (ARI): 22.0; Fowlkes Mallows Score (FMS): 37.3 |
| task-1-grouping-on-ocw | GPT-4 (0-shot) | Wasserstein Distance (WD): 75.8; # Correct Groups: 239; # Solved Walls: 6; Adjusted Mutual Information (AMI): 30.7; Adjusted Rand Index (ARI): 27.2; Fowlkes Mallows Score (FMS): 41.5 |
| task-1-grouping-on-ocw | GPT-3.5-turbo (3-shot) | Wasserstein Distance (WD): 80.9; # Correct Groups: 140; # Solved Walls: 0; Adjusted Mutual Information (AMI): 24.7; Adjusted Rand Index (ARI): 21.3; Fowlkes Mallows Score (FMS): 36.8 |
| task-1-grouping-on-ocw | GPT-4 (100-shot) | Wasserstein Distance (WD): 73.6; # Correct Groups: 249; # Solved Walls: 3; Adjusted Mutual Information (AMI): 32.3; Adjusted Rand Index (ARI): 28.5; Fowlkes Mallows Score (FMS): 42.8 |
| task-1-grouping-on-ocw | GPT-4 (3-shot) | Wasserstein Distance (WD): 73.7; # Correct Groups: 272; # Solved Walls: 5; Adjusted Mutual Information (AMI): 33.6; Adjusted Rand Index (ARI): 29.9; Fowlkes Mallows Score (FMS): 43.9 |
| visual-question-answering-on-benchlmm | GPT-4V | GPT-3.5 score: 58.37 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:high | GPT-4 score: 67.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 69.3±0.1 |
| visual-question-answering-on-mm-vet | gpt-4o-mini-2024-07-18 | GPT-4 score: 68.6±0.1 |
| visual-question-answering-on-mm-vet | GPT-4V | GPT-4 score: 67.7±0.3 |
| visual-question-answering-on-mm-vet | GPT-4V-Turbo-detail:low | GPT-4 score: 60.2±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-11-20) | GPT-4 score: 72.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | GPT-4o (gpt-4o-2024-05-13) | GPT-4 score: 71.0±0.2 |
| visual-question-answering-on-mm-vet-v2 | gpt-4o-mini-2024-07-18 | GPT-4 score: 66.8±0.3 |
| visual-question-answering-on-mm-vet-v2 | GPT-4 Turbo (gpt-4-0125-preview) | GPT-4 score: 66.3±0.2 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (bbox): 60.7; GPT-4 score (human): 59.9 |
| visual-question-answering-on-vip-bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (bbox): 52.8; GPT-4 score (human): 51.4 |
| visual-question-answering-vqa-on-core-mm | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44 |
| visual-question-answering-vqa-on-core-mm-1 | GPT-4V | Abductive: 77.88; Analogical: 69.86; Deductive: 74.86; Overall score: 74.44; Params: - |
| zero-shot-learning-on-medconceptsqa | gpt-4-0125-preview | Accuracy: 52.489 |