HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Abstract
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| long-video-retrieval-background-removed-on | Text-Video Embedding | Cap. Avg. R@1: 46.6; Cap. Avg. R@5: 74.3; Cap. Avg. R@10: 83.7 |
| temporal-action-localization-on-crosstask | Text-Video Embedding | Recall: 33.6 |
| video-retrieval-on-lsmdc | Text-Video Embedding | text-to-video R@1: 7.2; text-to-video R@5: 19.6; text-to-video R@10: 27.9; text-to-video Median Rank: 40 |
| video-retrieval-on-msr-vtt | Text-Video Embedding | text-to-video R@1: 14.9; text-to-video R@5: 40.2; text-to-video R@10: 52.8; text-to-video Median Rank: 9 |
| video-retrieval-on-msr-vtt-1ka | HT-Pretrained | text-to-video R@1: 14.9; text-to-video R@5: 40.2; text-to-video R@10: 52.8; text-to-video Median Rank: 9 |
| video-retrieval-on-msr-vtt-1ka | HT | text-to-video R@1: 12.1; text-to-video R@5: 35.0; text-to-video R@10: 48.0; text-to-video Median Rank: 12 |
| video-retrieval-on-youcook2 | Text-Video Embedding | text-to-video R@1: 8.2; text-to-video R@5: 24.5; text-to-video R@10: 35.3; text-to-video Median Rank: 24 |
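
For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of how R@K and Median Rank are commonly computed from a caption-by-video similarity matrix. It is an illustration, not the authors' evaluation code: the function names, the toy scores, the assumption that caption i matches video i, and the optimistic handling of ties are all hypothetical choices made for this example.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each caption.

    similarity[i][j] is the score between caption i and video j.
    Assumption for this sketch: video i is the correct match for caption i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher than
        # the correct one (ties are resolved in favor of the correct video).
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of captions whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video across all captions."""
    return median(ranks)


# Toy 4x4 similarity matrix with made-up scores.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.6, 0.1],
    [0.2, 0.4, 0.7, 0.1],
    [0.3, 0.9, 0.2, 0.6],
]

ranks = retrieval_ranks(similarity)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@2:", recall_at_k(ranks, 2))
print("Median Rank:", median_rank(ranks))
```

Higher R@K is better, while a lower Median Rank is better, which is why the two move in opposite directions across the rows above.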