Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

Abstract
Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to mimic the long-video paradigm closely during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
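To make the unified retrieval-and-grounding idea concrete, below is a minimal PyTorch-style sketch of an encoder that shares one feature backbone between a clip-level retrieval head and a frame-level grounding head. All module names, the additive query fusion, and the fixed-length clip pooling are illustrative assumptions; the actual RG-Encoder additionally uses a sparse attention mechanism and an attention loss that this simplified sketch does not reproduce.

```python
# Hypothetical sketch of a unified retrieval-and-grounding encoder.
# Not the official RGNet implementation; names and fusion scheme are assumptions.
import torch
import torch.nn as nn


class UnifiedRetrievalGroundingEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=2, clip_len=32):
        super().__init__()
        self.clip_len = clip_len  # number of frames pooled into one clip
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        # Shared encoder: the same features feed both retrieval and grounding.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.retrieval_head = nn.Linear(dim, 1)   # clip-level relevance score
        self.grounding_head = nn.Linear(dim, 2)   # frame-level start/end logits

    def forward(self, frame_feats, query_feat):
        # frame_feats: (B, T, D) frame features of a long video
        # query_feat:  (B, D)    sentence-level text query embedding
        B, T, D = frame_feats.shape
        # Condition frames on the query via simple additive fusion (illustrative).
        fused = frame_feats + query_feat.unsqueeze(1)
        enc = self.encoder(fused)                              # (B, T, D)
        # Clip granularity: pool frames into fixed-length clips for retrieval.
        n_clips = T // self.clip_len
        clips = enc[:, : n_clips * self.clip_len].reshape(
            B, n_clips, self.clip_len, D).mean(dim=2)          # (B, n_clips, D)
        clip_scores = self.retrieval_head(clips).squeeze(-1)   # (B, n_clips)
        # Frame granularity: predict start/end logits for grounding.
        boundary_logits = self.grounding_head(enc)             # (B, T, 2)
        return clip_scores, boundary_logits


if __name__ == "__main__":
    model = UnifiedRetrievalGroundingEncoder()
    video = torch.randn(1, 256, 256)   # 256 frames, 256-dim features
    query = torch.randn(1, 256)
    scores, logits = model(video, query)
    print(scores.shape, logits.shape)  # (1, 8) clip scores, (1, 256, 2) boundaries
```

Because both heads read the same encoded features, a retrieval error and a grounding error backpropagate into the same representation, which is the mutual-optimization property the abstract attributes to RG-Encoder.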
Benchmarks
| Benchmark | Method | R@1, IoU=0.1 | R@1, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.1 | R@5, IoU=0.3 | R@5, IoU=0.5 |
|---|---|---|---|---|---|---|---|
| Natural Language Moment Retrieval on MAD | RGNet | 12.43 | 9.48 | 5.61 | 25.12 | 18.72 | 10.86 |
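For reference, R@k at a given IoU counts a query as answered correctly if any of the top-k predicted moments overlaps the ground-truth moment with at least that temporal IoU. A minimal sketch of that computation follows; the function and variable names are illustrative, not taken from the RGNet codebase.

```python
# Illustrative Recall@k at a temporal-IoU threshold, as reported in the table.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k=1, iou_thresh=0.5):
    """predictions: per-query list of ranked (start, end) moments;
    ground_truths: one (start, end) moment per query."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)
```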