Visual Question Answering on MSVD-QA

Evaluation Metric

Accuracy

Evaluation Results

Performance of each model on this benchmark

Model	Accuracy	Paper Title	Repository
VLAB	0.61	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
MA-LMM	0.606	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
MaMMUT (ours)	0.602	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
VAST	0.60	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
COSA	0.60	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
VALOR	0.60	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
mPLUG-2	0.581	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
VideoCoCa	0.569	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	-
GIT	0.568	GIT: A Generative Image-to-text Transformer for Vision and Language
FrozenBiLM+	0.558	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
HiTeA	0.556	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
InternVideo	0.555	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
UMT-L (ViT-L/16)	0.552	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
vid-TLDR (UMT-L)	0.549	vid-TLDR: Training Free Token merging for Light-weight Video Transformer
MuLTI	0.547	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	-
VIOLETv2	0.547	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
X2-VLM (large)	0.546	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2-VLM (base)	0.528	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Clover	0.524	Clover: Towards A Unified Video-Language Alignment and Fusion Model
VIOLET + MELTR	0.517	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
