On the importance of pre-training data volume for compact language models
Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

Abstract
Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
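The experimental pipeline described above ends with fine-tuning a BERT-based model on FQuAD and reporting EM/F1. Below is a minimal sketch of such a fine-tuning step using HuggingFace Transformers, not the paper's own code: the model identifier `camembert-base` is a stand-in for a compact French model (the paper's LePetit checkpoint is not referenced here), and the flattened JSON file names and their `question`/`context`/`answer_text`/`answer_start` schema are assumptions, since FQuAD is distributed in SQuAD format and requires registration.

```python
# Hedged sketch: fine-tuning a compact French BERT-style model on a
# SQuAD-format QA dataset such as FQuAD. Model name and file paths are
# placeholders, not taken from the paper.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer, default_data_collator)

model_name = "camembert-base"  # placeholder for a compact French model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Assumed: the SQuAD-format FQuAD files were flattened to one example per row
# with fields "question", "context", "answer_text", "answer_start".
raw = load_dataset("json", data_files={"train": "fquad_train_flat.json",
                                       "validation": "fquad_valid_flat.json"})

def preprocess(example):
    enc = tokenizer(example["question"], example["context"],
                    truncation="only_second", max_length=384,
                    padding="max_length", return_offsets_mapping=True)
    start_char = example["answer_start"]
    end_char = start_char + len(example["answer_text"])
    seq_ids = enc.sequence_ids()
    start_tok = end_tok = 0  # default to [CLS] if the answer was truncated away
    for i, (s, e) in enumerate(enc["offset_mapping"]):
        if seq_ids[i] != 1:  # only look at tokens belonging to the context
            continue
        if s <= start_char < e:
            start_tok = i
        if s < end_char <= e:
            end_tok = i
    enc["start_positions"] = start_tok
    enc["end_positions"] = end_tok
    enc.pop("offset_mapping")
    return enc

tokenized = raw.map(preprocess, remove_columns=raw["train"].column_names)

args = TrainingArguments(output_dir="fquad-finetune", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=3e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=default_data_collator).train()
```

Evaluation with the official FQuAD EM/F1 script (or an equivalent SQuAD-style metric) would then be run on the fine-tuned checkpoint to reproduce numbers comparable to the table below.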
Benchmarks
| Benchmark | Model | EM | F1 |
|---|---|---|---|
| question-answering-on-fquad-1 | LePetit | 57.2 | 70.71 |