Zero-Shot Video Question Answering on MSVD-QA

Evaluation Metrics

Accuracy
Confidence Score

Evaluation Results

Performance of each model on this benchmark.

| Model | Accuracy | Confidence Score | Paper Title |
|---|---|---|---|
| Flash-VStream | 80.3 | 3.9 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| Tarsier (34B) | 80.3 | 4.2 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| LinVT-Qwen2-VL (7B) | 80.2 | 4.4 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| VILA1.5-40B | 80.1 | - | VILA: On Pre-training for Visual Language Models |
| SlowFast-LLaVA-34B | 79.9 | 4.1 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| PLLaVA (34B) | 79.9 | 4.2 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| IG-VLM-34B | 79.6 | 4.1 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| TS-LLaVA-34B | 79.4 | 4.1 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| PPLLaVA-7B | 77.1 | 4.0 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| Elysium | 75.8 | 3.7 | Elysium: Exploring Object-level Perception in Videos via MLLM |
| MovieChat | 75.2 | 2.9 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
| ST-LLM | 74.6 | 3.9 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| MiniGPT4-video-7B | 73.92 | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
| Video-LaVIT | 73.2 | 3.9 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| VideoGPT+ | 72.4 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| LLaVA-Mini | 70.9 | 4.0 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| Video-LLaVA-7B | 70.7 | 3.9 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| VideoChat2 | 70.0 | 3.9 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| LLaMA-VID-13B (2 Token) | 70.0 | 3.7 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 69.7 | 3.7 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |