Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

Abstract
Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to mimic the long-video paradigm closely during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
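To make the unified retrieval-and-grounding idea concrete, below is a minimal PyTorch-style sketch of an encoder that shares one feature backbone between a clip-level retrieval head and a frame-level grounding head. All module names, the additive query fusion, and the fixed-length clip pooling are illustrative assumptions; the actual RG-Encoder additionally uses a sparse attention mechanism and an attention loss that this simplified sketch does not reproduce.

```python
# Hypothetical sketch of a unified retrieval-and-grounding encoder.
# Not the official RGNet implementation; names and fusion scheme are assumptions.
import torch
import torch.nn as nn


class UnifiedRetrievalGroundingEncoder(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=2, clip_len=32):
        super().__init__()
        self.clip_len = clip_len  # number of frames pooled into one clip
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        # Shared encoder: the same features feed both retrieval and grounding.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.retrieval_head = nn.Linear(dim, 1)   # clip-level relevance score
        self.grounding_head = nn.Linear(dim, 2)   # frame-level start/end logits

    def forward(self, frame_feats, query_feat):
        # frame_feats: (B, T, D) frame features of a long video
        # query_feat:  (B, D)    sentence-level text query embedding
        B, T, D = frame_feats.shape
        # Condition frames on the query via simple additive fusion (illustrative).
        fused = frame_feats + query_feat.unsqueeze(1)
        enc = self.encoder(fused)                              # (B, T, D)
        # Clip granularity: pool frames into fixed-length clips for retrieval.
        n_clips = T // self.clip_len
        clips = enc[:, : n_clips * self.clip_len].reshape(
            B, n_clips, self.clip_len, D).mean(dim=2)          # (B, n_clips, D)
        clip_scores = self.retrieval_head(clips).squeeze(-1)   # (B, n_clips)
        # Frame granularity: predict start/end logits for grounding.
        boundary_logits = self.grounding_head(enc)             # (B, T, 2)
        return clip_scores, boundary_logits


if __name__ == "__main__":
    model = UnifiedRetrievalGroundingEncoder()
    video = torch.randn(1, 256, 256)   # 256 frames, 256-dim features
    query = torch.randn(1, 256)
    scores, logits = model(video, query)
    print(scores.shape, logits.shape)  # (1, 8) clip scores, (1, 256, 2) boundaries
```

Because both heads read the same encoded features, a retrieval error and a grounding error backpropagate into the same representation, which is the mutual-optimization property the abstract attributes to RG-Encoder.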
Benchmarks
| Benchmark | Method | R@1, IoU=0.1 | R@1, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.1 | R@5, IoU=0.3 | R@5, IoU=0.5 |
|---|---|---|---|---|---|---|---|
| Natural Language Moment Retrieval on MAD | RGNet | 12.43 | 9.48 | 5.61 | 25.12 | 18.72 | 10.86 |
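For reference, R@k at a given IoU counts a query as answered correctly if any of the top-k predicted moments overlaps the ground-truth moment with at least that temporal IoU. A minimal sketch of that computation follows; the function and variable names are illustrative, not taken from the RGNet codebase.

```python
# Illustrative Recall@k at a temporal-IoU threshold, as reported in the table.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k=1, iou_thresh=0.5):
    """predictions: per-query list of ranked (start, end) moments;
    ground_truths: one (start, end) moment per query."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)
```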