Visual Question Answering on VQA v2 test-dev

Evaluation Metric

Accuracy
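As a minimal sketch: the VQA v2 benchmark scores each answer with the standard VQA accuracy formula, under which a prediction counts as fully correct when at least 3 of the 10 human annotators gave that answer. (The official evaluation additionally averages this score over all subsets of 9 of the 10 annotators and applies answer normalization; the function name below is illustrative, not from an official API.)

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Per-question VQA accuracy: min(#annotators matching the prediction / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# 2 of 10 annotators said "yes", so predicting "yes" earns partial credit.
humans = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("yes", humans))  # 2/3 ≈ 0.667
print(vqa_accuracy("no", humans))   # 8 matches, capped at 1.0
```

The reported leaderboard number is this per-question score averaged over the full test-dev split.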

Evaluation Results

Performance of each model on this benchmark.

| Model | Accuracy (%) | Paper |
| --- | --- | --- |
| PaLI | 84.3 | PaLI: A Jointly-Scaled Multilingual Language-Image Model |
| BEiT-3 | 84.19 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| VLMo | 82.78 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts |
| ONE-PEACE | 82.6 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| mPLUG (Huge) | 82.43 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| CuMo-7B | 82.2 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts |
| X2-VLM (large) | 81.9 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| MMU | 81.26 | Achieving Human Parity on Visual Question Answering |
| Lyrics | 81.2 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects |
| InternVL-C | 81.2 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| X2-VLM (base) | 80.4 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| XFM (base) | 80.4 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| VAST | 80.23 | - |
| SimVLM | 80.03 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| VALOR | 78.46 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| Prismer | 78.43 | Prismer: A Vision-Language Model with Multi-Task Experts |
| X-VLM (base) | 78.22 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| VK-OOD | 77.9 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| ALBEF (14M) | 75.84 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| Oscar | 73.82 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |