Zero-Shot Video Question Answering on ActivityNet

Evaluation Metrics

Accuracy
Confidence Score

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Confidence Score | Paper Title |
| --- | --- | --- | --- |
| Tarsier (34B) | 61.6 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| PLLaVA (34B) | 60.9 | 3.7 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| PPLLaVA-7B | 60.7 | 3.6 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| LinVT-Qwen2-VL(7B) | 60.1 | 3.6 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| SlowFast-LLaVA-34B | 59.2 | 3.5 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| TS-LLaVA-34B | 58.9 | 3.5 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| IG-VLM | 58.4 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| LLaVA-Mini | 53.5 | 3.5 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| Flash-VStream | 51.9 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| ST-LLM | 50.9 | 3.3 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| VideoGPT+ | 50.6 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| CAT-7B | 50.2 | 3.5 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios |
| Video-LaVIT | 50.1 | 3.3 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| VideoChat2 | 49.1 | 3.3 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| LLaMA-VID-13B (2 Token) | 47.5 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 47.4 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| Chat-UniVi-13B | 46.4 | 3.6 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
| MiniGPT4-video-7B | 46.3 | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
| BT-Adapter (zero-shot) | 46.1 | 3.2 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| Chat-UniVi | 46.1 | 3.3 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |