4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf

Abstract

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
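
The abstract mentions rebalancing a dataset using both duplication count and quality but does not spell out the rule. The sketch below is only an illustration of that general idea, not the paper's actual method: the `Doc` fields, the `min_quality` threshold, and the `max_repeats` cap are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Doc:
    text: str
    dup_count: int   # hypothetical: number of duplicates this document stood for before deduplication
    quality: float   # hypothetical: quality score in [0, 1]


def repeat_count(doc: Doc, max_repeats: int = 4, min_quality: float = 0.5) -> int:
    """Return how many copies of `doc` to keep (assumed rule, not the paper's)."""
    if doc.quality < min_quality:
        # Low-quality documents are kept once and never upsampled.
        return 1
    # Higher-quality documents are repeated according to their duplication
    # count, capped so heavily duplicated pages cannot dominate the corpus.
    return max(1, min(doc.dup_count, max_repeats))


def rebalance(docs: Iterable[Doc]) -> List[str]:
    """Expand a deduplicated corpus into a rebalanced list of texts."""
    out: List[str] = []
    for doc in docs:
        out.extend([doc.text] * repeat_count(doc))
    return out


if __name__ == "__main__":
    corpus = [Doc("a", 10, 0.9), Doc("b", 10, 0.1), Doc("c", 1, 0.9)]
    print(rebalance(corpus))
```

The cap and the quality gate are the two design choices worth noting: the first bounds how much any single document can be repeated, and the second keeps duplication alone from promoting poor text.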

Code Repositories

huggingface/fineweb-2
Official
Mentioned in GitHub
