Gordeev Aleksandr ; Dokholyan Vladimir ; Tolstykh Irina ; Kuprashevich Maksim

Abstract
Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| highlight-detection-on-qvhighlights | SG-DETR | Hit@1: 69.13 mAP: 43.76 |
| highlight-detection-on-qvhighlights | SG-DETR (w/ PT) | Hit@1: 71.00 mAP: 44.70 |
| highlight-detection-on-tvsum | SG-DETR | mAP: 87.1 |
| highlight-detection-on-youtube-highlights | SG-DETR | mAP: 76.7 |
| highlight-detection-on-youtube-highlights | SG-DETR (w/ PT) | mAP: 78.0 |
| moment-retrieval-on-charades-sta | SG-DETR | R@1 IoU=0.5: 70.20 R@1 IoU=0.7: 49.50 |
| moment-retrieval-on-charades-sta | SG-DETR (w/ PT) | R@1 IoU=0.5: 71.10 R@1 IoU=0.7: 52.80 |
| moment-retrieval-on-qvhighlights | SG-DETR | R@1 IoU=0.5: 72.20 R@1 IoU=0.7: 56.60 mAP: 54.10 mAP@0.5: 73.20 mAP@0.75: 55.80 |
| moment-retrieval-on-qvhighlights | SG-DETR (w/ PT) | R@1 IoU=0.5: 74.20 R@1 IoU=0.7: 60.40 mAP: 58.80 mAP@0.5: 76.20 mAP@0.75: 60.80 |
| natural-language-moment-retrieval-on-tacos | SG-DETR | R@1,IoU=0.3: 56.71 R@1,IoU=0.5: 44.70 R@1,IoU=0.7: 29.90 mIoU: 40.90 |
| natural-language-moment-retrieval-on-tacos | SG-DETR (w/ PT) | R@1,IoU=0.3: 58.10 R@1,IoU=0.5: 46.40 R@1,IoU=0.7: 33.90 mIoU: 42.40 |
| zero-shot-moment-retrieval-on-qvhighlights | SG-DETR (ZS) | R1@0.5: 63.90 R1@0.7: 49.60 mAP: 48.30 mAP@0.5: 67.50 mAP@0.75: 49.00 |
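
The moment-retrieval rows above report R@1 at a temporal IoU threshold. As a rough illustration of how such a number is typically computed, here is a minimal Python sketch. It is not the benchmarks' official evaluation code. It assumes one ground-truth span per query (some benchmarks allow several), and the names `temporal_iou` and `recall_at_1` are hypothetical helpers introduced for this example only.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. This representation is an assumption
# made for the sketch, not something specified by the listing above.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_threshold: float) -> float:
    """Percentage of queries whose top-ranked prediction reaches the IoU threshold.

    Assumes exactly one ground-truth span per query (a simplification).
    """
    if not gts:
        raise ValueError("need at least one query")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Two toy queries: the first prediction matches exactly, the second overlaps weakly.
    preds = [(0.0, 10.0), (2.0, 8.0)]
    truths = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(preds, truths, iou_threshold=0.5))
```

Under this reading, a score such as "R@1 IoU=0.7" counts a query as correct only when the single best-ranked predicted span overlaps the annotated moment with IoU of at least 0.7, which is why the IoU=0.7 figures in the table are consistently lower than the IoU=0.5 ones.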