Zero-Shot Video Question Answering on MSRVTT-QA

Evaluation Metrics

Accuracy
Confidence Score

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Confidence Score | Paper Title |
| --- | --- | --- | --- |
| Flash-VStream | 72.4 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| PLLaVA (34B) | 68.7 | 3.6 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| Elysium | 67.5 | 3.2 | Elysium: Exploring Object-level Perception in Videos via MLLM |
| SlowFast-LLaVA-34B | 67.4 | 3.7 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| Tarsier (34B) | 66.4 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| TS-LLaVA-34B | 66.2 | 3.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| LinVT-Qwen2-VL (7B) | 66.2 | 4.0 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| PPLLaVA-7B | 64.3 | 3.5 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| IG-VLM | 63.8 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| ST-LLM | 63.2 | 3.4 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| CAT-7B | 62.1 | 3.5 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios |
| VideoGPT+ | 60.6 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| Vista-LLaMA-7B | 60.5 | 3.3 | Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens |
| MiniGPT4-video-7B | 59.73 | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
| LLaVA-Mini | 59.5 | 3.6 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| Video-LaVIT | 59.3 | 3.3 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| Video-LLaVA-7B | 59.2 | 3.5 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| LLaMA-VID-13B (2 Token) | 58.9 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 57.7 | 3.2 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| SUM-shot+Vicuna | 56.8 | - | Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos |