8 months ago

Abstract

Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. To tackle these challenges, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide range of video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it into an LLM reasoning model to answer the query. Our experiments show that our method improves both reasoning accuracy and efficiency. Specifically, VideoTree outperforms existing training-free approaches on EgoSchema and NExT-QA with less inference time, achieving 61.1% and 75.6% accuracy on the test set without additional video-specific training. Moreover, on the long split of Video-MME (average 44 minutes), VideoTree achieves better performance than GPT-4V and many other MLLMs that were extensively trained on video data.

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

8 months ago

Video Understanding

Visual Question Answering

Multimodal Representation

Multimodality

Computer Vision

Task/Problem

Ziyang Wang* Shoubin Yu* Elias Stengel-Eskin* Jaehong Yoon Feng Cheng Gedas Bertasius Mohit Bansal

Abstract

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

8 months ago

Video Understanding

Visual Question Answering

Multimodal Representation

Multimodality

Computer Vision

Task/Problem

Ziyang Wang* Shoubin Yu* Elias Stengel-Eskin* Jaehong Yoon Feng Cheng Gedas Bertasius Mohit Bansal

Abstract

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang* Shoubin Yu* Elias Stengel-Eskin* Jaehong Yoon Feng Cheng Gedas Bertasius Mohit Bansal

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang* Shoubin Yu* Elias Stengel-Eskin* Jaehong Yoon Feng Cheng Gedas Bertasius Mohit Bansal

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang* Shoubin Yu* Elias Stengel-Eskin* Jaehong Yoon Feng Cheng Gedas Bertasius Mohit Bansal

Abstract

Build AI with AI

HyperAI Newsletters