Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
WonJun Moon; Sangeek Hyun; SangUk Park; Dongchan Park; Jae-Pil Heo

Abstract
Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has grown sharply. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability to exploit the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building a query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
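The negative-pair idea in the abstract can be illustrated with a minimal sketch. It is not taken from the QD-DETR repository: the function names `make_negative_pairs` and `negative_saliency_loss`, the cyclic-shift pairing, and the exact loss form (mean of -log(1 - sigmoid(s)) over clips) are assumptions chosen only to show how irrelevant video-query pairs might be built and pushed toward low saliency.

```python
import math


def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next sample in the batch.

    Hypothetical helper: a cyclic shift by one is one simple way to get
    mismatched (irrelevant) video-query pairs from a batch of matched ones.
    """
    if len(videos) != len(queries):
        raise ValueError("videos and queries must have the same length")
    if len(videos) < 2:
        raise ValueError("need at least two samples to form negative pairs")
    shifted_queries = queries[1:] + queries[:1]
    return list(zip(videos, shifted_queries))


def negative_saliency_loss(saliency_logits):
    """Mean of -log(1 - sigmoid(s)) over clip-wise saliency logits.

    Assumed loss form. It equals softplus(s) = log(1 + exp(s)), computed
    here in a numerically stable way, and shrinks as the scores get lower.
    """
    total = 0.0
    for s in saliency_logits:
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(saliency_logits)


# Usage: three matched pairs become three mismatched pairs.
pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
low = negative_saliency_loss([-5.0, -4.0])   # low scores, small loss
high = negative_saliency_loss([5.0, 4.0])    # high scores, large loss
```

Note that a shifted pair is only truly irrelevant if the borrowed query does not also describe the video; the sketch does not check for that.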
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| highlight-detection-on-qvhighlights | QD-DETR (only Video w/ PT) | Hit@1: 61.91 |
| highlight-detection-on-qvhighlights | QD-DETR | Hit@1: 62.87 mAP: 39.04 |
| highlight-detection-on-qvhighlights | QD-DETR (w/ PT) | Hit@1: 62.27 mAP: 38.52 |
| highlight-detection-on-qvhighlights | QD-DETR (only Video) | Hit@1: 62.40 mAP: 38.94 |
| highlight-detection-on-tvsum | QD-DETR | mAP: 86.6 |
| highlight-detection-on-tvsum | QD-DETR (only Video) | mAP: 85.0 |
| moment-retrieval-on-charades-sta | QD-DETR (only Video) | R@1 IoU=0.5: 57.31 R@1 IoU=0.7: 32.55 |
| moment-retrieval-on-qvhighlights | QD-DETR (only Video) | R@1 IoU=0.5: 62.40 R@1 IoU=0.7: 44.98 mAP: 39.86 mAP@0.5: 62.52 mAP@0.75: 39.88 |
| moment-retrieval-on-qvhighlights | QD-DETR (w/ audio) | R@1 IoU=0.5: 63.06 R@1 IoU=0.7: 45.10 mAP: 40.19 mAP@0.5: 63.04 mAP@0.75: 40.10 |
| moment-retrieval-on-qvhighlights | QD-DETR (w/ PT) | R@1 IoU=0.5: 64.1 R@1 IoU=0.7: 46.1 mAP: 40.62 mAP@0.5: 64.3 mAP@0.75: 40.5 |
| moment-retrieval-on-qvhighlights | QD-DETR (only Video w/ PT ASR Captions) | R@1 IoU=0.5: 63.2 R@1 IoU=0.7: 45.2 mAP: 40.0 mAP@0.5: 63.4 mAP@0.75: 40.4 |
| video-grounding-on-qvhighlights | QD-DETR | R@1 IoU=0.5: 62.40 R@1 IoU=0.7: 44.98 |
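
For readers unfamiliar with the R@1 IoU=0.5 and R@1 IoU=0.7 columns, the following sketch shows how such a metric is typically computed for temporal moments. It is an illustration, not the official evaluation script of any benchmark listed here: the function names are hypothetical, each query is assumed to have a single ground-truth window given as `(start, end)` in seconds, and a prediction is assumed to count as a hit when its IoU is greater than or equal to the threshold.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time windows."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, ground_truths, iou_threshold):
    """Percentage of queries whose top-1 predicted window reaches the IoU threshold."""
    if len(top1_preds) != len(ground_truths):
        raise ValueError("need one top-1 prediction per ground-truth window")
    hits = sum(
        1
        for pred, gt in zip(top1_preds, ground_truths)
        if temporal_iou(pred, gt) >= iou_threshold
    )
    return 100.0 * hits / len(ground_truths)


# Usage: the first prediction matches exactly, the second overlaps by one third.
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(0.0, 10.0), (5.0, 15.0)]
r1_at_05 = recall_at_1(preds, gts, 0.5)
```

Under these assumptions, a stricter threshold such as 0.7 can only keep or lower the score, which matches the pattern in the table where each IoU=0.7 value is below its IoU=0.5 counterpart.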