On the importance of pre-training data volume for compact language models

Vincent Micheli, Martin d'Hoffschmidt, François Fleuret


Abstract

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
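As a hedged illustration of the fine-tuned question-answering setting described above, the sketch below runs extractive QA on a FQuAD-style example through the Hugging Face transformers pipeline API. The checkpoint path is a placeholder rather than the authors' released model, and the question/context pair is invented for illustration.

```python
# Minimal sketch of extractive QA in the FQuAD setting (not the authors' code).
# The model path is a placeholder for any compact French BERT-style model that
# has been fine-tuned for span-extraction question answering.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="path/to/compact-french-qa-model",  # placeholder checkpoint
)

context = (
    "FQuAD est un jeu de données de questions-réponses en français, "
    "construit à partir d'articles de Wikipédia de qualité."
)
question = "À partir de quelle source FQuAD a-t-il été construit ?"

prediction = qa(question=question, context=context)
print(prediction["answer"], prediction["score"])
```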

Benchmarks

Benchmark: question-answering-on-fquad-1
Methodology: LePetit
Metrics: EM 57.2 / F1 70.71
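
For reference, the EM (exact match) and F1 figures above are the standard SQuAD-style span-comparison metrics, which FQuAD also uses. The sketch below shows one common way to compute them for a single prediction/gold pair; the answer normalization shown is the English SQuAD variant and is included only for illustration.

```python
# Sketch of SQuAD-style EM and F1 between a predicted and a gold answer span.
# Normalization here follows the English SQuAD convention (lowercasing,
# punctuation and article removal); FQuAD evaluation is analogous.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-overlap F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("la tour Eiffel", "La Tour Eiffel"))  # 1.0
print(f1_score("tour Eiffel", "la tour Eiffel"))        # 0.8
```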
