Visual Question Answering on MM-Vet v2

Evaluation Metric

GPT-4 score

Evaluation Results

Performance of each model on this benchmark.

| Model | GPT-4 score | Paper Title | Repository |
| --- | --- | --- | --- |
| gemini-2.0-flash-exp | 77.1±0.1 | - | - |
| GPT-4o (gpt-4o-2024-11-20) | 72.1±0.2 | GPT-4 Technical Report | |
| Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 71.8±0.2 | Claude 3.5 Sonnet Model Card Addendum | - |
| GPT-4o (gpt-4o-2024-05-13) | 71.0±0.2 | GPT-4 Technical Report | |
| InternVL2-Llama3-76B | 68.4±0.3 | - | - |
| Qwen2-VL-72B (qwen-vl-max-0809) | 66.9±0.3 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | |
| Gemini 1.5 Pro | 66.9±0.2 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| gpt-4o-mini-2024-07-18 | 66.8±0.3 | GPT-4 Technical Report | |
| GPT-4 Turbo (gpt-4-0125-preview) | 66.3±0.2 | GPT-4 Technical Report | |
| InternVL2-40B | 63.8±0.2 | - | - |
| Gemini Pro Vision | 57.2±0.2 | Gemini: A Family of Highly Capable Multimodal Models | |
| Qwen-VL-Max | 55.8±0.2 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | |
| Claude 3 Opus (claude-3-opus-20240229) | 55.8±0.2 | - | - |
| InternVL-Chat-V1-5 | 51.5±0.2 | How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | |
| LLaVA-NeXT-34B | 50.9±0.1 | - | - |
| InternVL-Chat-V1-2 | 45.5±0.1 | - | - |
| CogVLM-Chat | 45.1±0.2 | CogVLM: Visual Expert for Pretrained Language Models | |
| IXC2-VL-7B | 42.5±0.3 | InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | |
| Emu2-Chat | 38.0±0.1 | Generative Multimodal Models are In-Context Learners | |
| CogAgent-Chat | 34.7±0.2 | CogAgent: A Visual Language Model for GUI Agents | |