Visual Question Answering on MM-Vet

Evaluation Metric

GPT-4 score

Results

Performance of each model on this benchmark

| Model | GPT-4 score | Paper Title | Repository |
| --- | --- | --- | --- |
| gemini-2.0-flash-exp | 81.2±0.4 | - | - |
| gemini-exp-1206 | 78.1±0.2 | - | - |
| Gemini 1.5 Pro (gemini-1.5-pro-002) | 76.9±0.1 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| MMCTAgent (GPT-4 + GPT-4V) | 74.24 | MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | - |
| Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 74.2±0.2 | Claude 3.5 Sonnet Model Card Addendum | - |
| Qwen2-VL-72B | 74.0 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | |
| InternVL2.5-78B | 72.3 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | |
| GPT-4o +text rationale +IoT | 72.2 | Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | - |
| Lyra-Pro | 71.4 | Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | |
| GLM-4V-Plus | 71.1 | CogVLM2: Visual Language Models for Image and Video Understanding | |
| Phantom-7B | 70.8 | Phantom of Latent for Large Language and Vision Models | |
| GPT-4o (gpt-4o-2024-05-13) | 69.3±0.1 | GPT-4 Technical Report | |
| InternVL2.5-38B | 68.8 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | |
| gpt-4o-mini-2024-07-18 | 68.6±0.1 | GPT-4 Technical Report | |
| GPT-4V | 67.7±0.3 | GPT-4 Technical Report | |
| GPT-4V-Turbo-detail:high | 67.6±0.1 | GPT-4 Technical Report | |
| Qwen-VL-Max | 66.6±0.5 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | |
| Gemini 1.5 Pro (gemini-1.5-pro) | 65.8±0.1 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| InternVL2-26B (SGP, token ratio 64%) | 65.60 | A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | |
| Baichuan-Omni (7B) | 65.4 | Baichuan-Omni Technical Report | |