Zero-Shot Video Question Answering on EgoSchema 1

Evaluation Metric

Accuracy
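Accuracy here is the percentage of multiple-choice questions for which the model's predicted option matches the answer key. A minimal sketch of that computation (the dictionary-based question/answer format is an assumption for illustration, not the official evaluation harness):

```python
# Minimal sketch: accuracy for a multiple-choice VideoQA benchmark.
# The {question_id: option_letter} format is an assumed example format.
def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Percentage of questions whose predicted option matches the answer key."""
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for qid, answer in ground_truth.items()
        if predictions.get(qid) == answer
    )
    return 100.0 * correct / len(ground_truth)

preds = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
gold  = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}
print(f"Accuracy: {accuracy(preds, gold):.1f}")  # Accuracy: 75.0
```

Unanswered questions count against the model, since a missing prediction never matches the key.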

Evaluation Results

Performance of each model on this benchmark:

| Model | Accuracy | Paper Title |
| --- | --- | --- |
| BIMBA-LLaVA-Qwen2-7B | 71.14 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering |
| LinVT-Qwen2-VL (7B) | 69.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| LongVU (7B) | 67.6 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |
| Video-RAG (based on LLaVA-Video) | 66.7 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension |
| VideoLLaMA2 (72B) | 63.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| Tarsier (34B) | 61.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| VideoTree (GPT4) | 61.1 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos |
| LVNet | 61.1 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA |
| InternVideo2-6B | 60.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VideoChat2_phi3 | 56.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_HD_mistral | 55.8 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_mistral | 54.4 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| Vamos (GPT-4o) | 53.6 | Vamos: Versatile Action Models for Video Understanding |
| TraveLER | 53.3 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering |
| LLoVi (GPT-3.5) | 50.3 | A Simple LLM Framework for Long-Range Video Question-Answering |
| Video ReCap | 50.23 | Video ReCap: Recursive Captioning of Hour-Long Videos |
| Vamos (GPT-4) | 48.3 | Vamos: Versatile Action Models for Video Understanding |
| LangRepo (12B) | 41.2 | Language Repository for Long Video Understanding |
| MVU (13B) | 37.6 | Understanding Long Videos with Multimodal Language Models |
| Vamos (13B) | 36.7 | Vamos: Versatile Action Models for Video Understanding |