LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Zhao Pengcheng ; He Zhixian ; Zhang Fuwei ; Lin Shujin ; Zhou Fan

Abstract
Video Moment Retrieval and Highlight Detection aim to find the content in a video that corresponds to a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode the multimodal information. However, existing methods face several issues: (1) overlapping semantic information between different samples in the dataset hinders the model's multimodal alignment performance; (2) existing models are not able to efficiently extract local features of the video; (3) the Transformer Decoder used by existing models cannot adequately decode multimodal features. To address these issues, we propose the LD-DETR model for the Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distill the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we design a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we feed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluate LD-DETR on four public benchmarks and conduct extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the state-of-the-art models on the QVHighlights, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
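The "loop decoder" idea stated in the abstract, feeding the decoder's output back into the same decoder, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the names `loop_decode` and `toy_decoder` and the parameter `num_loops` are hypothetical, and the toy decoder is a stand-in arithmetic function rather than a real Transformer Decoder.

```python
from typing import Callable, List

Matrix = List[List[float]]  # rows are feature vectors


def loop_decode(
    decoder: Callable[[Matrix, Matrix], Matrix],
    queries: Matrix,
    memory: Matrix,
    num_loops: int,
) -> Matrix:
    """Apply the same decoder num_loops times.

    Each pass receives the previous pass's output as its queries, so the
    decoder weights are reused rather than stacking new layers.
    """
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    out = queries
    for _ in range(num_loops):
        out = decoder(out, memory)
    return out


def toy_decoder(queries: Matrix, memory: Matrix) -> Matrix:
    """Hypothetical stand-in for a decoder (assumption, not the paper's model).

    Moves every query halfway toward the mean of the memory vectors.
    """
    dim = len(memory[0])
    mean = [sum(row[d] for row in memory) / len(memory) for d in range(dim)]
    return [[0.5 * q[d] + 0.5 * mean[d] for d in range(dim)] for q in queries]


memory = [[2.0, 4.0], [4.0, 8.0]]   # mean is [3.0, 6.0]
queries = [[0.0, 0.0]]
one_pass = loop_decode(toy_decoder, queries, memory, num_loops=1)
two_pass = loop_decode(toy_decoder, queries, memory, num_loops=2)
```

In this toy setting each additional pass moves the query closer to the memory mean, which mirrors the intuition that repeated decoding refines the queries; how many passes the actual model uses is not stated in the abstract.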
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| moment-retrieval-on-charades-sta | LD-DETR | R@1 IoU=0.3: 73.92; R@1 IoU=0.5: 62.58; R@1 IoU=0.7: 41.56; mIoU: 53.44 |
| moment-retrieval-on-qvhighlights | LD-DETR | R@1 IoU=0.5: 66.80; R@1 IoU=0.7: 51.04; mAP: 46.41; mAP@0.5: 67.61; mAP@0.75: 46.99 |
| natural-language-moment-retrieval-on-tacos | LD-DETR | R@1 IoU=0.3: 57.61; R@1 IoU=0.5: 44.31; R@1 IoU=0.7: 26.24; mIoU: 40.30 |
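For readers unfamiliar with the metrics in the table, the sketch below shows how R@1 at an IoU threshold and mIoU are commonly computed for temporal spans. It is an illustration only, not the benchmarks' official evaluation code: the function names are hypothetical, spans are assumed to be `(start, end)` pairs in seconds, and counting a hit when IoU is greater than or equal to the threshold is an assumption.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(top1_preds: List[Span], gts: List[Span]) -> float:
    """Average IoU of the top-1 predictions, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts))
    return 100.0 * total / len(gts)


# Made-up spans for illustration: one exact match, one partial overlap, one miss.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
r1_03 = recall_at_1(preds, gts, 0.3)
r1_05 = recall_at_1(preds, gts, 0.5)
miou = mean_iou(preds, gts)
```

With these made-up spans the individual IoUs are 1, 1/3 and 0, so two of three predictions pass the 0.3 threshold and one passes 0.5.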