
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
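
The abstract alludes to a source-rephrasing paradigm ("which data to rephrase and how"), in which an LLM rewrites existing web documents into cleaner expository text that is then used for pretraining. Below is a minimal, hypothetical sketch of that general paradigm, not BeyondWeb's actual pipeline: the prompt wording, model choice, and length-based quality filter are all illustrative assumptions.

```python
# Hypothetical sketch of source rephrasing for synthetic pretraining data.
# Not BeyondWeb's method: prompt, model, and filter are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REPHRASE_PROMPT = (
    "Rewrite the following web text as clear, self-contained expository prose. "
    "Preserve all facts; drop boilerplate and navigation text.\n\n{doc}"
)

def rephrase(doc: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rephrase one web document (illustrative settings)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(doc=doc)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

def build_synthetic_corpus(web_docs: list[str], min_chars: int = 200) -> list[str]:
    """Rephrase each source document and keep only non-degenerate outputs.

    The length floor stands in for the real quality controls a production
    pipeline would need (deduplication, toxicity and factuality checks, etc.).
    """
    synthetic = []
    for doc in web_docs:
        rewritten = rephrase(doc)
        if len(rewritten) >= min_chars:
            synthetic.append(rewritten)
    return synthetic
```

In practice, such rephrased documents are typically mixed with the original web corpus rather than replacing it, which matches the paper's framing of synthetic data as extending, not substituting for, web-scale datasets.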
