Visual Question Answering on MSRVTT-QA

Evaluation Metrics

Accuracy

Evaluation Results

Performance of each model on this benchmark

Model	Accuracy	Paper Title	Repository
VLAB	0.496	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
MaMMUT	0.495	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
mPLUG-2	0.480	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
MuLTI	0.478	MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	-
Flamingo	0.474	Flamingo: a Visual Language Model for Few-Shot Learning
UMT-L (ViT-L/16)	0.471	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
InternVideo	0.471	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
vid-TLDR (UMT-L)	0.470	vid-TLDR: Training Free Token merging for Light-weight Video Transformer
FrozenBiLM+	0.470	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
VideoCoCa	0.463	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	-
HBI	0.462	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
HiTeA	0.459	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
EMCL-Net	0.458	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Co-Tokenization	0.457	Video Question Answering with Iterative Video-Text Co-Tokenization	-
X2-VLM (large)	0.455	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2-VLM (base)	0.450	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
All-in-one-B	0.443	All in One: Exploring Unified Video-Language Pre-training
Clover	0.441	Clover: Towards A Unified Video-Language Alignment and Fusion Model
OmniVL	0.441	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks	-
AIO+MIF	0.440	Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models
