Temporal Relation Extraction On Vinoground

Evaluation Metrics

Group Score
Text Score
Video Score
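
The page lists the three metrics without defining them. The sketch below shows one way they could be computed, assuming the Winoground-style definitions usually associated with Vinoground: each item is a counterfactual pair of two videos and two captions, the text score credits a pair only when the right caption is chosen for both videos, the video score only when the right video is chosen for both captions, and the group score only when all four choices are right. The `PairResult` record and the `vinoground_style_scores` function are hypothetical names used for illustration, not part of any official evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Hypothetical record of one counterfactual pair (assumed layout)."""
    text_ok_v1: bool   # right caption chosen for video 1
    text_ok_v2: bool   # right caption chosen for video 2
    video_ok_c1: bool  # right video chosen for caption 1
    video_ok_c2: bool  # right video chosen for caption 2


def vinoground_style_scores(results: List[PairResult]) -> Dict[str, float]:
    """Return text, video, and group scores as percentages of pairs."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one pair result")

    # A pair counts for a metric only if every question it covers is correct.
    text_hits = sum(r.text_ok_v1 and r.text_ok_v2 for r in results)
    video_hits = sum(r.video_ok_c1 and r.video_ok_c2 for r in results)
    group_hits = sum(
        r.text_ok_v1 and r.text_ok_v2 and r.video_ok_c1 and r.video_ok_c2
        for r in results
    )

    return {
        "text": 100.0 * text_hits / n,
        "video": 100.0 * video_hits / n,
        "group": 100.0 * group_hits / n,
    }


# Toy data, made up purely to exercise the function.
example = [
    PairResult(True, True, True, True),
    PairResult(True, True, False, True),
    PairResult(False, True, True, True),
    PairResult(False, False, False, False),
]
scores = vinoground_style_scores(example)
```

Under these assumed definitions the group score can never exceed the text score or the video score, because a pair that counts toward the group score also counts toward both of the others.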

Results

Performance of each model on this benchmark

| Model | Group Score | Text Score | Video Score | Paper Title |
| GPT-4o (CoT) | 35 | 59.2 | 51 | -- |
| GPT-4o | 24.6 | 54 | 38.2 | -- |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer |
| Qwen2-VL-72B | 17.4 | 50.4 | 32.6 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| MiniCPM-2.6 | 11.2 | 32.6 | 29.2 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | -- |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| InternLM-XC-2.5 (CoT) | 9 | 30.8 | 28.4 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | -- |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | -- |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | -- |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments |
| LLaVA-NeXT-Video-34B (CoT) | 5.2 | 25.8 | 22.2 | -- |