WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork


Abstract

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by number of image-text examples, by 3x (at the time of writing). Second, WIT is massively multilingual (the first of its kind), covering 100+ languages (each with at least 12K examples) and providing cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real-world entities than previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example.
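WIT is distributed through the GitHub repository linked above as tab-separated files of image-text examples. A minimal sketch of parsing such a file into per-example records (the column names below are illustrative assumptions, not the exact released schema; consult the repository documentation for the real field names):

```python
import csv
import io

# Illustrative WIT-style TSV content. The columns "language",
# "image_url", and "caption" are assumptions for this sketch.
SAMPLE_TSV = (
    "language\timage_url\tcaption\n"
    "en\thttps://example.org/cat.jpg\tA cat sitting on a wall\n"
    "de\thttps://example.org/dom.jpg\tDer Koelner Dom bei Nacht\n"
)

def parse_wit_rows(tsv_text):
    """Yield one dict per image-text example in a WIT-style TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        yield row

rows = list(parse_wit_rows(SAMPLE_TSV))
```

In practice one would stream the compressed TSV shards from disk rather than holding them in memory, since the full dataset contains tens of millions of rows.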

Code Repositories

google-research-datasets/wit (Official)
clip-italian/clip-italian (JAX)
paullerner/viquae (PyTorch)

Benchmarks

Benchmark                Methodology                Metrics
image-retrieval-on-wit   WIT-ALL                    R@1: 0.346, R@5: 0.642
image-retrieval-on-wit   CC (Conceptual Captions)   R@1: 0.048, R@5: 0.122
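The R@K figures above are standard Recall@K for retrieval: the fraction of queries whose correct match appears among the top K ranked candidates. A minimal sketch of the metric over a similarity matrix (the toy scores below are hypothetical, and the true match for query i is assumed to be candidate i):

```python
def recall_at_k(similarity, k):
    """Recall@K: similarity[i][j] scores query i against candidate j;
    the correct candidate for query i is assumed to be j == i.
    Returns the fraction of queries whose true match ranks in the top k."""
    hits = 0
    for i, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix (rows: text queries, columns: images).
sim = [
    [0.9, 0.2, 0.1],  # query 0: true match ranked 1st
    [0.3, 0.1, 0.8],  # query 1: true match ranked 3rd
    [0.2, 0.7, 0.6],  # query 2: true match ranked 2nd
]
```

On this toy matrix, recall_at_k(sim, 1) is 1/3 and recall_at_k(sim, 2) is 2/3; a real evaluation would compute the scores with a trained image and text encoder over the test split.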

