Zero Shot Video Question Answer On Egoschema 1
Evaluation metric
Accuracy
Results
Performance of each model on this benchmark.
| Model | Accuracy | Paper Title |
| --- | --- | --- |
| BIMBA-LLaVA-Qwen2-7B | 71.14 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering |
| LinVT-Qwen2-VL(7B) | 69.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| LongVU (7B) | 67.6 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |
| Video-RAG (Based on LLaVA-Video) | 66.7 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension |
| VideoLLaMA2 (72B) | 63.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| Tarsier (34B) | 61.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| VideoTree (GPT4) | 61.1 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos |
| LVNet | 61.1 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA |
| InternVideo2-6B | 60.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VideoChat2_phi3 | 56.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_HD_mistral | 55.8 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VideoChat2_mistral | 54.4 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| Vamos (GPT-4o) | 53.6 | Vamos: Versatile Action Models for Video Understanding |
| TraveLER | 53.3 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering |
| LLoVi (GPT-3.5) | 50.3 | A Simple LLM Framework for Long-Range Video Question-Answering |
| Video ReCap | 50.23 | Video ReCap: Recursive Captioning of Hour-Long Videos |
| Vamos (GPT-4) | 48.3 | Vamos: Versatile Action Models for Video Understanding |
| LangRepo (12B) | 41.2 | Language Repository for Long Video Understanding |
| MVU (13B) | 37.6 | Understanding Long Videos with Multimodal Language Models |
| Vamos (13B) | 36.7 | Vamos: Versatile Action Models for Video Understanding |
Showing 20 of 27 entries.