4 months ago

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Hannan Tanveer ; Islam Md Mohaiminul ; Gu Jindong ; Seidl Thomas ; Bertasius Gedas


Abstract

Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD). The code is available at https://github.com/Tanveer81/ReVisionLLM.
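The coarse-to-fine search described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the `score` callable is a hypothetical stand-in for whatever relevance signal a vision-language model would produce for a segment, and `branches`, `min_len`, and `toy_score` are names assumed here purely for illustration. The sketch only shows the general idea of starting from a broad window and recursively narrowing to the most promising sub-segment.

```python
from typing import Callable, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def recursive_ground(
    start: float,
    end: float,
    score: Callable[[float, float], float],
    branches: int = 4,
    min_len: float = 10.0,
) -> Segment:
    """Greedy coarse-to-fine search: keep the best-scoring child segment.

    `score` is a hypothetical relevance function (an assumption of this
    sketch), standing in for a model's judgement of a segment.
    """
    # Stop once the window is short enough to count as a localized moment.
    if end - start <= min_len:
        return (start, end)
    step = (end - start) / branches
    children = [(start + i * step, start + (i + 1) * step) for i in range(branches)]
    # Keep only the single most relevant child and recurse into it.
    best = max(children, key=lambda seg: score(seg[0], seg[1]))
    return recursive_ground(best[0], best[1], score, branches, min_len)


def toy_score(s: float, e: float, target: Segment = (1830.0, 1835.0)) -> float:
    """Toy relevance: seconds of overlap with a known target event."""
    return max(0.0, min(e, target[1]) - max(s, target[0]))


# Search a one-hour video (3600 s) for the toy target event.
segment = recursive_ground(0.0, 3600.0, toy_score)
```

A purely greedy descent like this can miss an event that straddles a segment boundary or mis-rank a coarse segment, which is one reason a real system would need more than this sketch shows.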

Code Repositories

tanveer81/revisionllm (official, PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
language-based-temporal-localization-on | ReVisionLLM | R1@.9: 15.2
natural-language-moment-retrieval-on-mad | ReVisionLLM | R@1,IoU=0.1: 17.3; R@1,IoU=0.3: 12.7; R@1,IoU=0.5: 6.7
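For readers unfamiliar with the R@1,IoU=m notation, the following is a minimal sketch of how such a metric is commonly computed: the percentage of queries whose top-ranked predicted segment overlaps the ground-truth segment with temporal IoU of at least m. The function names and the toy data are assumptions for illustration, and the exact evaluation protocol behind the numbers listed here is not stated on this page.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Toy data (made up for illustration): one prediction per query.
preds = [(0.0, 10.0), (20.0, 30.0)]
gts = [(5.0, 15.0), (100.0, 110.0)]
r_at_03 = recall_at_1(preds, gts, 0.3)  # first query has IoU 5/15, second has 0
r_at_05 = recall_at_1(preds, gts, 0.5)
```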
