Visual Question Answering on MSVD-QA

Evaluation Metric

Accuracy
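
The leaderboard ranks models by answer accuracy. As a rough illustration only (not any official evaluation script), the metric reduces to the fraction of questions whose predicted answer matches the single ground-truth answer; for open-ended benchmarks such as MSVD-QA this is typically an exact match after light normalization. Function names in the sketch below are illustrative.

```python
# Minimal sketch of the Accuracy metric for open-ended video QA such as
# MSVD-QA: a prediction counts as correct only if it matches the single
# ground-truth answer after light normalization. Names here are
# illustrative, not taken from any official evaluation toolkit.


def normalize(answer: str) -> str:
    """Lowercase and trim whitespace before comparison."""
    return answer.strip().lower()


def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of questions whose predicted answer matches the ground truth."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must be the same length")
    correct = sum(
        normalize(pred) == normalize(gt)
        for pred, gt in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


if __name__ == "__main__":
    preds = ["dog", "play guitar", "two"]
    gts = ["dog", "playing guitar", "two"]
    print(f"Accuracy: {accuracy(preds, gts):.3f}")  # 0.667
```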

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Paper Title |
|---|---|---|
| VLAB | 0.61 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MA-LMM | 0.606 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| MaMMUT | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| VAST | 0.60 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 0.60 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| VALOR | 0.60 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| mPLUG-2 | 0.581 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| GIT | 0.568 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| FrozenBiLM+ | 0.558 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| HiTeA | 0.556 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 0.549 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| VIOLETv2 | 0.547 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |