Zero-Shot Video Question Answering on NExT-QA

Evaluation Metrics

Accuracy
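
As a minimal sketch of how an accuracy figure like the ones reported on this page might be computed, assuming each question is multiple-choice and a prediction counts as correct only when its option index exactly matches the reference. The function name and sample data are hypothetical and are not taken from the benchmark's official evaluation script.

```python
def accuracy(predictions, references):
    """Return the percentage of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted option equals the reference option.
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical option indices for four questions (illustration only).
preds = [0, 3, 2, 4]
golds = [0, 1, 2, 4]
print(f"{accuracy(preds, golds):.1f}")  # 3 of 4 correct -> 75.0
```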

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| Tarsier (34B) | 79.2 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| ENTER | 75.1 | ENTER: Event Based Interpretable Reasoning for VideoQA | - |
| TS-LLaVA-34B | 73.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |
| VideoTree (GPT4) | 73.5 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | |
| LVNet (GPT-4o) | 72.9 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | |
| VideoAgent (GPT-4) | 71.3 | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | |
| IG-VLM (LLaVA v1.6) | 70.9 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| VidCtx (7B) | 70.7 | VidCtx: Context-aware Video Question Answering with Image Models | |
| MoReVQA (PaLM-2) | 69.2 | MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | - |
| IG-VLM (GPT-4) | 68.6 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| TraveLER (GPT-4) | 68.2 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | |
| LLoVi (GPT-4) | 67.7 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| LongVA (32 frames) | 67.1 | Long Context Transfer from Language to Vision | |
| Q-ViD | 66.3 | Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | |
| ProViQ | 64.6 | Zero-Shot Video Question Answering with Procedural Programs | - |
| SlowFast-LLaVA-34B | 64.2 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | |
| Sevila (4B) | 63.6 | Self-Chained Image-Language Model for Video Localization and Question Answering | |
| VideoChat2 | 61.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| DeepStack-L (7B) | 61.0 | DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | - |
| LangRepo (12B) | 60.9 | Language Repository for Long Video Understanding | |