Zero-Shot Video Question Answering on EgoSchema 1

Evaluation Metric

Accuracy
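Accuracy here is the percentage of multiple-choice questions for which the model's predicted option matches the answer key. A minimal sketch of that computation (the dictionary-based question/answer format is an assumption for illustration, not the official evaluation harness):

```python
# Minimal sketch: accuracy for a multiple-choice VideoQA benchmark.
# The {question_id: option_letter} format is an assumed example format.
def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Percentage of questions whose predicted option matches the answer key."""
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for qid, answer in ground_truth.items()
        if predictions.get(qid) == answer
    )
    return 100.0 * correct / len(ground_truth)

preds = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
gold  = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}
print(f"Accuracy: {accuracy(preds, gold):.1f}")  # Accuracy: 75.0
```

Unanswered questions count against the model, since a missing prediction never matches the key.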

Evaluation Results

Performance of each model on this benchmark:

| Model | Accuracy | Paper Title |
| --- | --- | --- |
| BIMBA-LLaVA-Qwen2-7B | 71.14 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering |
| LinVT-Qwen2-VL (7B) | 69.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| LongVU (7B) | 67.6 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |
| Video-RAG (based on LLaVA-Video) | 66.7 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension |
| VideoLLaMA2 (72B) | 63.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| Tarsier (34B) | 61.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| VideoTree (GPT4) | 61.1 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos |
| LVNet | 61.1 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA |
| InternVideo2-6B | 60.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VideoChat2_phi3 | 56.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_HD_mistral | 55.8 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_mistral | 54.4 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| Vamos (GPT-4o) | 53.6 | Vamos: Versatile Action Models for Video Understanding |
| TraveLER | 53.3 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering |
| LLoVi (GPT-3.5) | 50.3 | A Simple LLM Framework for Long-Range Video Question-Answering |
| Video ReCap | 50.23 | Video ReCap: Recursive Captioning of Hour-Long Videos |
| Vamos (GPT-4) | 48.3 | Vamos: Versatile Action Models for Video Understanding |
| LangRepo (12B) | 41.2 | Language Repository for Long Video Understanding |
| MVU (13B) | 37.6 | Understanding Long Videos with Multimodal Language Models |
| Vamos (13B) | 36.7 | Vamos: Versatile Action Models for Video Understanding |