8 months ago

Abstract

We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
