End-to-End Dense Video Captioning with Parallel Decoding

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo

Abstract

Dense video captioning aims to generate multiple captions, together with their temporal locations, from a video. Previous methods follow a sophisticated "localize-then-describe" scheme that relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of an appropriate size; (2) in contrast to the two-stage scheme, we feed the enhanced representations of event queries into the localization head and the caption head in parallel, making these two sub-tasks deeply interrelated and mutually promoted through optimization; (3) without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results, surpassing state-of-the-art two-stage methods when its localization accuracy is on par with theirs. Code is available at https://github.com/ttengwang/PDVC.
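
The set-prediction idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see the official repository for that); it only assumes, for illustration, that each event query yields a segment, a confidence, and a caption, that queries are matched one-to-one to ground-truth events by temporal IoU alone, and that the event counter outputs an integer used to keep the most confident queries. The names `EventPrediction`, `match_queries_to_events`, and `select_events` are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class EventPrediction:
    start: float       # predicted start time in seconds
    end: float         # predicted end time in seconds
    confidence: float  # localization confidence in [0, 1]
    caption: str       # caption decoded for this event query


def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_queries_to_events(pred_segments, gt_segments):
    """One-to-one assignment of ground-truth events to queries.

    Assumption: the matching score is temporal IoU only. The search is
    brute force over permutations, so it suits toy inputs only, and it
    requires at least as many queries as ground-truth events.
    """
    best_score, best_assignment = -1.0, ()
    for perm in permutations(range(len(pred_segments)), len(gt_segments)):
        score = sum(
            temporal_iou(pred_segments[q], gt)
            for q, gt in zip(perm, gt_segments)
        )
        if score > best_score:
            best_score, best_assignment = score, perm
    # best_assignment[i] is the query index matched to ground-truth event i
    return list(best_assignment)


def select_events(predictions, predicted_count):
    """Keep the `predicted_count` most confident events, sorted by start time.

    Assumption: `predicted_count` stands in for the event counter's output.
    """
    top = sorted(predictions, key=lambda p: p.confidence, reverse=True)
    return sorted(top[:predicted_count], key=lambda p: p.start)


# Toy example with four event queries and two ground-truth events.
preds = [
    EventPrediction(0.0, 10.0, 0.9, "a man enters the kitchen"),
    EventPrediction(30.0, 45.0, 0.2, "the man leaves"),
    EventPrediction(12.0, 28.0, 0.8, "he chops vegetables"),
    EventPrediction(1.0, 9.0, 0.4, "a man walks in"),
]
gt = [(0.0, 10.0), (12.0, 30.0)]

assignment = match_queries_to_events([(p.start, p.end) for p in preds], gt)
final_events = select_events(preds, predicted_count=2)
```

Because only two events are kept, the near-duplicate query covering 1.0 to 9.0 seconds is dropped without any non-maximum suppression step. For realistic numbers of queries, a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.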

Code Repositories

ttengwang/pdvc (Official, PyTorch; mentioned in GitHub)
aim3-ruc/youmakeup_challenge2022 (PyTorch; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | BLEU-4 | CIDEr | METEOR | SODA |
|---|---|---|---|---|---|
| dense-video-captioning-on-activitynet | PDVC (TSP features, no SCST) | 2.17 | 31.14 | 9.03 | 6.05 |
| dense-video-captioning-on-youcook2 | PDVC (TSN features, no SCST) | 0.8 | 22.71 | 4.74 | 4.42 |
