Influence-driven Curriculum Learning for Pre-training on Limited Data
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula outperform models trained in random order by over 10 percentage points on benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
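The core idea described in the abstract, ordering the pre-training data by an influence score rather than by a human-centered difficulty metric, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `influence_score` function is a hypothetical stand-in for whatever training-data-influence estimator is used, and the batching/training loop is only indicated in comments.

```python
# Minimal sketch of an influence-ordered curriculum (illustrative, not the paper's code).
# Assumes a hypothetical influence_score(example) that estimates the effect of a
# training example on the model's output, as computed by some influence estimator.

def build_curriculum(examples, influence_score, easiest_first=True):
    """Sort training examples by their estimated training-data influence."""
    scored = [(influence_score(ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=not easiest_first)
    return [ex for _, ex in scored]

# Usage sketch: present the ordered examples to the trainer instead of shuffling.
# curriculum = build_curriculum(train_set, influence_score)
# for batch in batched(curriculum, batch_size):   # batched() and train_step() are placeholders
#     train_step(model, batch)
```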
- Schoenegger, Loris
- Thoma, Lukas
- Blevins, Terra
- Roth, Benjamin
Category | Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title | The First BabyLM Workshop at the Conference on Empirical Methods in Natural Language Processing 2025 (EMNLP)
Divisions | Data Mining and Machine Learning
Event Location | Suzhou, China
Event Type | Workshop
Event Dates | 08.11.2025
Publisher | Association for Computational Linguistics
Page Range | pp. 356-379
Date | 8 November 2025
Official URL | https://aclanthology.org/2025.babylm-main.26/
