Multimodal Pretraining for Dense Video Captioning

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut

Abstract
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
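For intuition, below is a minimal sketch of a multimodal sequence-to-sequence captioner of the general kind the abstract describes: precomputed video frame features feed a Transformer encoder, and a text decoder generates the caption for a segment. This is an illustrative PyTorch toy under assumed dimensions, not the authors' architecture; the class name `VideoTextSeq2Seq` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class VideoTextSeq2Seq(nn.Module):
    """Toy multimodal seq2seq model: video frame features are projected
    into the encoder's embedding space, and a Transformer decoder emits
    caption tokens conditioned on the encoded video."""

    def __init__(self, vocab_size=30522, d_model=256, video_feat_dim=1024,
                 nhead=4, num_layers=2):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # frame features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)    # caption tokens -> model dim
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)         # decoder states -> vocab logits

    def forward(self, video_feats, caption_ids):
        # video_feats: (batch, num_frames, video_feat_dim) precomputed frame features
        # caption_ids: (batch, seq_len) token ids of the target segment caption
        src = self.video_proj(video_feats)
        tgt = self.token_emb(caption_ids)
        # Causal mask so each position only attends to earlier caption tokens.
        causal_mask = self.transformer.generate_square_subsequent_mask(caption_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.lm_head(out)

# Toy usage: one 8-frame clip and a 5-token caption prefix.
model = VideoTextSeq2Seq()
logits = model(torch.randn(1, 8, 1024), torch.randint(0, 30522, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 30522])
```

Pretraining on unsupervised video/text pairs and finetuning on YouCook2 or ViTT would both optimize a cross-entropy loss over these logits; the sketch only shows the forward pass.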
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| dense-video-captioning-on-youcook2 | E2vidD6-MASSalign-BiD | ROUGE-L: 39.03 |
| video-captioning-on-youcook2 | E2vidD6-MASSvid-BiD | BLEU-4: 12.04, CIDEr: 1.22, METEOR: 18.32, ROUGE-L: 39.03 |
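The metrics above are standard text-overlap scores between generated and reference captions, conventionally reported as F-measure or score ×100. As a hedged illustration, the snippet below computes ROUGE-L with the `rouge-score` package and BLEU-4 with NLTK on a made-up caption pair; CIDEr and METEOR are usually computed with the COCO caption evaluation toolkit and are omitted here.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk

# Hypothetical reference/prediction pair for a single video segment.
reference = "add the chopped onions to the pan"
prediction = "add onions to the pan"

# ROUGE-L: longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# BLEU-4: 4-gram precision with brevity penalty; smoothing avoids
# zero scores on short sentences.
bleu_4 = sentence_bleu([reference.split()], prediction.split(),
                       smoothing_function=SmoothingFunction().method1)

print(f"ROUGE-L: {100 * rouge_l:.2f}")  # scaled x100, as in the table
print(f"BLEU-4:  {100 * bleu_4:.2f}")
```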