OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Abstract
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
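Since the abstract states that the dataset, models, and code are released, the sketch below shows one way the interleaved documents could be streamed with the Hugging Face `datasets` library. The hub identifier `HuggingFaceM4/OBELICS` and the per-document field names are assumptions about the release layout, not something confirmed by this page.

```python
# Minimal sketch: streaming a few OBELICS documents without downloading
# the full corpus. Assumes the dataset is hosted on the Hugging Face Hub
# under "HuggingFaceM4/OBELICS"; adjust the identifier to the actual release.
from datasets import load_dataset

# Streaming mode returns an IterableDataset, so the 141M documents are
# fetched lazily instead of being downloaded up front.
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

# Each document interleaves text passages and image references in reading
# order; the exact field names below are assumptions about the schema.
for doc in obelics.take(2):
    print(doc.keys())
```

A streaming loader is a natural fit here because the dataset is web-scale; the same pattern extends to filtering or tokenizing documents on the fly before training.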
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| mmr-total-on-mrr-benchmark | Idefics-80B | Total Column Score: 139 |