
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
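
The abstract alludes to a source-rephrasing paradigm ("which data to rephrase and how"), in which an LLM rewrites existing web documents into cleaner expository text that is then used for pretraining. Below is a minimal, hypothetical sketch of that general paradigm, not BeyondWeb's actual pipeline: the prompt wording, model choice, and length-based quality filter are all illustrative assumptions.

```python
# Hypothetical sketch of source rephrasing for synthetic pretraining data.
# Not BeyondWeb's method: prompt, model, and filter are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REPHRASE_PROMPT = (
    "Rewrite the following web text as clear, self-contained expository prose. "
    "Preserve all facts; drop boilerplate and navigation text.\n\n{doc}"
)

def rephrase(doc: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rephrase one web document (illustrative settings)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(doc=doc)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

def build_synthetic_corpus(web_docs: list[str], min_chars: int = 200) -> list[str]:
    """Rephrase each source document and keep only non-degenerate outputs.

    The length floor stands in for the real quality controls a production
    pipeline would need (deduplication, toxicity and factuality checks, etc.).
    """
    synthetic = []
    for doc in web_docs:
        rewritten = rephrase(doc)
        if len(rewritten) >= min_chars:
            synthetic.append(rewritten)
    return synthetic
```

In practice, such rephrased documents are typically mixed with the original web corpus rather than replacing it, which matches the paper's framing of synthetic data as extending, not substituting for, web-scale datasets.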
