Zero-Shot Video Question Answering on NExT-QA

Evaluation Metric

Accuracy
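
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of questions for which the predicted answer option matches the ground-truth option. The function name `accuracy` and the sample data are hypothetical and are not taken from any official evaluation script for this benchmark.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted option equals the ground truth.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical data: index of the chosen answer option for four questions.
predicted = [0, 3, 2, 4]
ground_truth = [0, 3, 1, 4]
score = accuracy(predicted, ground_truth)
print(f"{score:.1f}")
```

Under this reading, a score in the table is the share of benchmark questions a model answers correctly, expressed as a percentage.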

Results

Performance of each model on this benchmark

Model	Accuracy	Paper Title	Repository
Tarsier (34B)	79.2	Tarsier: Recipes for Training and Evaluating Large Video Description Models
ENTER	75.1	ENTER: Event Based Interpretable Reasoning for VideoQA	-
TS-LLaVA-34B	73.6	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
VideoTree (GPT4)	73.5	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
LVNet(GPT-4o)	72.9	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
VideoAgent (GPT-4)	71.3	VideoAgent: Long-form Video Understanding with Large Language Model as Agent
IG-VLM(LLaVA v1.6)	70.9	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
VidCtx (7B)	70.7	VidCtx: Context-aware Video Question Answering with Image Models
MoReVQA(PaLM-2)	69.2	MoReVQA: Exploring Modular Reasoning Models for Video Question Answering	-
IG-VLM (GPT-4)	68.6	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
TraveLER (GPT-4)	68.2	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
LLoVi (GPT-4)	67.7	A Simple LLM Framework for Long-Range Video Question-Answering
LongVA(32 frames)	67.1	Long Context Transfer from Language to Vision
Q-ViD	66.3	Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
ProViQ	64.6	Zero-Shot Video Question Answering with Procedural Programs	-
SlowFast-LLaVA-34B	64.2	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
SeViLA (4B)	63.6	Self-Chained Image-Language Model for Video Localization and Question Answering
VideoChat2	61.7	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
DeepStack-L(7B)	61.0	DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs	-
LangRepo (12B)	60.9	Language Repository for Long Video Understanding
