RecombiText: Compositional Data Augmentation for Enhancing LLM Pre-Training Datasets in Low-Resource Scenarios
We introduce RecombiText Augmentation (RTA), a novel, purely statistical NLP method for compositional data augmentation that enables data-efficient LLM pre-training in low-resource scenarios. RTA identifies lexically and semantically similar sentences within the corpus and generates synthetic sentence pairs from them while preserving the underlying patterns of the corpus. We pre-train GPT-2 and RoBERTa language models on a domain-specific, low-resource corpus of 10 million words, with different proportions of augmented data, and compare our RTA-augmented model variants to a baseline model trained on the full original dataset. Zero-shot results show that the language models pre-trained on synthetic data improve on entity tracking, self-paced reading, and morphological generalization benchmarks; on other tasks, performance is comparable to the baseline model. We demonstrate that low-resource datasets can be expanded two- to four-fold without compromising benchmark performance, solely through statistical processing of the available data.
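The abstract describes a two-step pipeline: statistically identify similar sentence pairs in the corpus, then recombine them into synthetic sentences. The sketch below illustrates that general idea only; the TF-IDF cosine similarity measure, the similarity threshold, and the half-splice recombination are assumptions chosen for illustration, not the actual RTA procedure, which is not specified on this page.

```python
# Minimal sketch of compositional augmentation in the spirit of RTA.
# Assumptions (not from the paper): TF-IDF cosine similarity to find
# lexically similar sentence pairs, and a simple half-splice to
# recombine each pair into two synthetic sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recombine_similar_sentences(sentences, threshold=0.5):
    """Generate synthetic sentences by splicing halves of similar pairs."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sims = cosine_similarity(tfidf)
    np.fill_diagonal(sims, 0.0)  # ignore self-similarity
    synthetic = []
    for i, j in zip(*np.where(sims >= threshold)):
        if i < j:  # visit each unordered pair once
            a, b = sentences[i].split(), sentences[j].split()
            # splice: first half of one sentence + second half of the other
            synthetic.append(" ".join(a[: len(a) // 2] + b[len(b) // 2:]))
            synthetic.append(" ".join(b[: len(b) // 2] + a[len(a) // 2:]))
    return synthetic


corpus = [
    "the cat sat on the mat",
    "the cat slept on the sofa",
    "a dog barked at the mailman",
]
print(recombine_similar_sentences(corpus))
```

In this toy run only the two "cat" sentences clear the similarity threshold, so the output contains two spliced variants of that pair, roughly doubling the usable data for that pattern.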
- Tampier, Alexander
- Thoma, Lukas
- Schoenegger, Loris
- Roth, Benjamin
Category | Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title | Proceedings of the First BabyLM Workshop
Divisions | Data Mining and Machine Learning
Event Location | Suzhou, China
Event Type | Workshop
Event Dates | 08.11.2025
Publisher | Association for Computational Linguistics
Page Range | pp. 548-565
Date | November 2025
Official URL | https://aclanthology.org/2025.babylm-main.40/
