7 months ago

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf

Abstract

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
