8 months ago

Video Processing

Visual Document Retrieval

Computer Vision

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius

Abstract

Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD). The code is available at https://github.com/Tanveer81/ReVisionLLM.
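
The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `recursive_localize`, `toy_score`, and `temporal_iou`, the fixed four-way split, and the 10-second stopping length are all assumptions made for illustration, and the scorer is a stand-in for a model's relevance estimate. The sketch also shows how a top-1 prediction would count as a hit under R1@0.1, that is, when its temporal IoU with the ground truth is at least 0.1.

```python
from typing import Callable, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recursive_localize(span: Span, score: Callable[[Span], float],
                       branches: int = 4, min_len: float = 10.0) -> Span:
    """Repeatedly keep the best-scoring sub-segment until it is short enough.

    Hypothetical simplification: each level splits the span into equal
    parts and recurses into the single best one.
    """
    start, end = span
    if end - start <= min_len:
        return span
    step = (end - start) / branches
    children = [(start + i * step, start + (i + 1) * step)
                for i in range(branches)]
    best = max(children, key=score)
    return recursive_localize(best, score, branches, min_len)


# Toy setup: a hidden 8-second event inside a one-hour video.
EVENT: Span = (2710.0, 2718.0)


def toy_score(segment: Span) -> float:
    """Stand-in relevance score: seconds of overlap with the hidden event."""
    return max(0.0, min(segment[1], EVENT[1]) - max(segment[0], EVENT[0]))


prediction = recursive_localize((0.0, 3600.0), toy_score)
hit_at_0_1 = temporal_iou(prediction, EVENT) >= 0.1  # counts toward R1@0.1
print(prediction, hit_at_0_1)
```

A fixed equal split can cut an event that straddles a segment boundary, so a practical system would need to refine the boundaries at each level rather than only select a child segment; the sketch is meant only to convey the narrowing search.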
