Visual Question Answering on VQA v2 test-std

Evaluation Metric

overall

Evaluation Results

Performance of each model on this benchmark

| Model | Overall | Paper Title |
| --- | --- | --- |
| BEiT-3 | 84.03 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| mPLUG-Huge | 83.62 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| ONE-PEACE | 82.52 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| X2-VLM (large) | 81.8 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| VLMo | 81.30 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts |
| SimVLM | 80.34 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| X2-VLM (base) | 80.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| VAST | 80.19 | -- |
| VALOR | 78.62 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| Prompt Tuning | 78.53 | Prompt Tuning for Generative Multimodal Pretrained Models |
| Prismer | 78.49 | Prismer: A Vision-Language Model with Multi-Task Experts |
| MSR + MS Cog. Svcs., X10 models | 77.45 | VinVL: Revisiting Visual Representations in Vision-Language Models |
| MSR + MS Cog. Svcs. | 76.63 | VinVL: Revisiting Visual Representations in Vision-Language Models |
| ALBEF (14M) | 76.04 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| BGN, ensemble | 75.92 | Bilinear Graph Networks for Visual Question Answering |
| ERNIE-ViL (single model) | 74.93 | ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph |
| Single, w/o VLP | 74.16 | In Defense of Grid Features for Visual Question Answering |
| Single, w/o VLP | 73.86 | Deep Multimodal Neural Architecture Search |
| UNITER (Large) | 73.4 | UNITER: UNiversal Image-TExt Representation Learning |
| X-101 grid features + MCAN | 72.71 | In Defense of Grid Features for Visual Question Answering |