HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Miech Antoine ; Zhukov Dimitri ; Alayrac Jean-Baptiste ; Tapaswi Makarand ; Laptev Ivan ; Sivic Josef

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

Code Repositories

antoine77340/S3D_HowTo100M (PyTorch; mentioned in GitHub)
antoine77340/milnce_howto100m (PyTorch; mentioned in GitHub)
antoine77340/MIL-NCE_HowTo100M (PyTorch; mentioned in GitHub)
roudimit/AVLnet (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
long-video-retrieval-background-removed-on | Text-Video Embedding | Cap. Avg. R@1: 46.6; Cap. Avg. R@5: 74.3; Cap. Avg. R@10: 83.7
temporal-action-localization-on-crosstask | Text-Video Embedding | Recall: 33.6
video-retrieval-on-lsmdc | Text-Video Embedding | text-to-video R@1: 7.2; R@5: 19.6; R@10: 27.9; Median Rank: 40
video-retrieval-on-msr-vtt | Text-Video Embedding | text-to-video R@1: 14.9; R@5: 40.2; R@10: 52.8; Median Rank: 9
video-retrieval-on-msr-vtt-1ka | HT-Pretrained | text-to-video R@1: 14.9; R@5: 40.2; R@10: 52.8; Median Rank: 9
video-retrieval-on-msr-vtt-1ka | HT | text-to-video R@1: 12.1; R@5: 35.0; R@10: 48.0; Median Rank: 12
video-retrieval-on-youcook2 | Text-Video Embedding | text-to-video R@1: 8.2; R@5: 24.5; R@10: 35.3; Median Rank: 24
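
The retrieval results above are reported as R@K (recall at K, in percent) and median rank. As a rough illustration of how such metrics are commonly computed, and not the authors' evaluation code, the sketch below assumes a square text-to-video similarity matrix in which query i matches video i. The function names and the toy scores are hypothetical, and ties are resolved optimistically (only strictly higher scores push the correct video down).

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each text query.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct video over all queries (lower is better)."""
    return median(ranks)


# Toy 4x4 similarity matrix with made-up scores, for illustration only.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.6, 0.1],
    [0.2, 0.4, 0.7, 0.1],
    [0.3, 0.9, 0.2, 0.6],
]

ranks = retrieval_ranks(toy_similarity)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@2:", recall_at_k(ranks, 2))
print("Median Rank:", median_rank(ranks))
```

Higher R@K and lower median rank both indicate better retrieval, which is why the table lists the two kinds of metric side by side.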
