OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.

Benchmarks

Benchmark: mmr-total-on-mrr-benchmark
Methodology: Idefics-80B
Metrics: Total Column Score: 139
