
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Laurençon Hugo ; Saulnier Lucile ; Tronchon Léo ; Bekman Stas ; Singh Amanpreet ; Lozhkov Anton ; Wang Thomas ; Karamcheti Siddharth ; Rush Alexander M. ; Kiela

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
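
To give a rough sense of scale, the totals quoted in the abstract can be turned into per-document averages. The sketch below is only back-of-the-envelope arithmetic on those three rounded figures; the constant and function names are illustrative and not taken from the OBELICS code, and the real per-document distribution may be far from uniform.

```python
# Rounded totals as quoted in the abstract (assumed exact for this sketch).
NUM_DOCUMENTS = 141_000_000        # web pages
NUM_IMAGES = 353_000_000           # associated images
NUM_TEXT_TOKENS = 115_000_000_000  # text tokens


def per_document_averages(num_documents, num_images, num_text_tokens):
    """Return (mean images per document, mean text tokens per document)."""
    if num_documents <= 0:
        raise ValueError("num_documents must be positive")
    return num_images / num_documents, num_text_tokens / num_documents


images_per_doc, tokens_per_doc = per_document_averages(
    NUM_DOCUMENTS, NUM_IMAGES, NUM_TEXT_TOKENS
)
print(f"about {images_per_doc:.1f} images per document")
print(f"about {tokens_per_doc:.0f} text tokens per document")
```

Under these figures, an average document carries roughly 2.5 images and a little over 800 text tokens.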

Code Repositories

MindSpore-scientific-2/code-14/tree/main/idefics

mindspore

huggingface/obelics

Official

Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
mmr-total-on-mrr-benchmark	Idefics-80B	Total Column Score: 139
