
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo


Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
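To make the TEG idea concrete, the sketch below shows one way the semantic-aware label assignment between sentences and event proposals could be implemented: embeddings from both modalities live in a joint space, a cost matrix mixes cross-modal (semantic) distance with a span-localization term, and Hungarian matching produces a one-to-one assignment. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name `semantic_aware_match` and the weights `w_sem` / `w_loc` are hypothetical and do not come from the GVL codebase.

```python
# Minimal sketch (not the GVL implementation) of semantic-aware matching between
# sentences and event proposals. All names below are illustrative assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def semantic_aware_match(text_emb, event_emb, text_spans, event_spans,
                         w_sem=1.0, w_loc=1.0):
    """Assign each sentence to one event proposal.

    text_emb:  (M, d) sentence embeddings in the joint semantic space
    event_emb: (N, d) event-proposal embeddings in the joint semantic space
    text_spans, event_spans: (M, 2) / (N, 2) normalized [start, end] times
    Returns (row_idx, col_idx) of the one-to-one assignment.
    """
    # Semantic cost: negative cosine similarity in the joint space.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(event_emb, dim=-1).T
    cost_sem = -sim                                        # (M, N)

    # Localization cost: L1 distance between annotated and proposed spans.
    cost_loc = torch.cdist(text_spans, event_spans, p=1)   # (M, N)

    # A semantic-aware total cost lets the semantic term dominate, which is one
    # way to reduce the influence of ambiguous boundary annotations.
    cost = w_sem * cost_sem + w_loc * cost_loc

    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return row_idx, col_idx


if __name__ == "__main__":
    M, N, d = 3, 5, 256
    text_emb, event_emb = torch.randn(M, d), torch.randn(N, d)
    text_spans = torch.rand(M, 2).sort(-1).values
    event_spans = torch.rand(N, 2).sort(-1).values
    print(semantic_aware_match(text_emb, event_emb, text_spans, event_spans))
```

In this toy setup, raising `w_sem` relative to `w_loc` shifts the assignment toward semantic agreement and away from strict boundary overlap, which mirrors the motivation for a semantic-aware cost described in the abstract.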

Code Repositories

zjr2000/gvl (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
dense-video-captioning-on-activitynet | GVL | CIDEr: 33.33, METEOR: 10.03, SODA: 7.11
dense-video-captioning-on-youcook2 | GVL | CIDEr: 26.52, METEOR: 5.01, SODA: 4.91
natural-language-moment-retrieval-on | GVL | R@1,IoU=0.5: 49.18, R@1,IoU=0.7: 29.69
natural-language-moment-retrieval-on | GVL (paragraph-level) | R@1,IoU=0.5: 60.67, R@1,IoU=0.7: 38.55
natural-language-moment-retrieval-on-tacos | GVL (paragraph-level) | R@1,IoU=0.3: 48.29, R@1,IoU=0.5: 36.07
natural-language-moment-retrieval-on-tacos | GVL | R@1,IoU=0.3: 45.92, R@1,IoU=0.5: 34.57
