8 months ago

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
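To make the text-to-video retrieval setting concrete, here is a minimal sketch of how a joint text-video embedding might be used once it has been trained. It assumes that captions and clips have already been mapped to vectors in a shared space, ranks clips by cosine similarity to a text query, and scores the ranking with recall at K. The similarity measure, the metric, the function names (`cosine`, `rank_clips`, `recall_at_k`) and the toy vectors are all assumptions made for illustration; none of them is taken from the abstract or from the authors' released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_clips(text_vec, clip_vecs):
    """Return clip indices sorted from most to least similar to the query."""
    scores = [cosine(text_vec, c) for c in clip_vecs]
    return sorted(range(len(clip_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_vecs, clip_vecs, k):
    """Fraction of queries whose paired clip (same index) is in the top k."""
    hits = sum(
        1 for i, t in enumerate(text_vecs) if i in rank_clips(t, clip_vecs)[:k]
    )
    return hits / len(text_vecs)


# Hypothetical, hand-made embeddings: query i is paired with clip i.
clip_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.6, 0.0, 0.5]]

if __name__ == "__main__":
    print(rank_clips(text_vecs[2], clip_vecs))
    print(recall_at_k(text_vecs, clip_vecs, k=1))
    print(recall_at_k(text_vecs, clip_vecs, k=2))
```

In this toy setup the third query sits closer to the first clip than to its own, so it counts as a miss at K=1 and a hit at K=2. In a real system the vectors would come from trained text and video encoders rather than being written by hand.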

Antoine Miech Dimitri Zhukov Jean-Baptiste Alayrac Makarand Tapaswi Ivan Laptev Josef Sivic

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
