
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo


Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
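To make the TEG idea concrete, the sketch below shows one way the semantic-aware label assignment between sentences and event proposals could be implemented: embeddings from both modalities live in a joint space, a cost matrix mixes cross-modal (semantic) distance with a span-localization term, and Hungarian matching produces a one-to-one assignment. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name `semantic_aware_match` and the weights `w_sem` / `w_loc` are hypothetical and do not come from the GVL codebase.

```python
# Minimal sketch (not the GVL implementation) of semantic-aware matching between
# sentences and event proposals. All names below are illustrative assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def semantic_aware_match(text_emb, event_emb, text_spans, event_spans,
                         w_sem=1.0, w_loc=1.0):
    """Assign each sentence to one event proposal.

    text_emb:  (M, d) sentence embeddings in the joint semantic space
    event_emb: (N, d) event-proposal embeddings in the joint semantic space
    text_spans, event_spans: (M, 2) / (N, 2) normalized [start, end] times
    Returns (row_idx, col_idx) of the one-to-one assignment.
    """
    # Semantic cost: negative cosine similarity in the joint space.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(event_emb, dim=-1).T
    cost_sem = -sim                                        # (M, N)

    # Localization cost: L1 distance between annotated and proposed spans.
    cost_loc = torch.cdist(text_spans, event_spans, p=1)   # (M, N)

    # A semantic-aware total cost lets the semantic term dominate, which is one
    # way to reduce the influence of ambiguous boundary annotations.
    cost = w_sem * cost_sem + w_loc * cost_loc

    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return row_idx, col_idx


if __name__ == "__main__":
    M, N, d = 3, 5, 256
    text_emb, event_emb = torch.randn(M, d), torch.randn(N, d)
    text_spans = torch.rand(M, 2).sort(-1).values
    event_spans = torch.rand(N, 2).sort(-1).values
    print(semantic_aware_match(text_emb, event_emb, text_spans, event_spans))
```

In this toy setup, raising `w_sem` relative to `w_loc` shifts the assignment toward semantic agreement and away from strict boundary overlap, which mirrors the motivation for a semantic-aware cost described in the abstract.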

Code Repositories

zjr2000/gvl (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
dense-video-captioning-on-activitynet | GVL | CIDEr: 33.33, METEOR: 10.03, SODA: 7.11
dense-video-captioning-on-youcook2 | GVL | CIDEr: 26.52, METEOR: 5.01, SODA: 4.91
natural-language-moment-retrieval-on | GVL | R@1,IoU=0.5: 49.18, R@1,IoU=0.7: 29.69
natural-language-moment-retrieval-on | GVL (paragraph-level) | R@1,IoU=0.5: 60.67, R@1,IoU=0.7: 38.55
natural-language-moment-retrieval-on-tacos | GVL (paragraph-level) | R@1,IoU=0.3: 48.29, R@1,IoU=0.5: 36.07
natural-language-moment-retrieval-on-tacos | GVL | R@1,IoU=0.3: 45.92, R@1,IoU=0.5: 34.57
