Language-Adaptive Pretraining Framework
LAPT is a ready-to-use toolkit for adapting multilingual language models to new languages. It handles the engineering so you can focus on your language.
Adapting a pretrained model to a new language involves several moving parts: loading and mixing datasets, optionally training a new tokenizer, initializing embeddings for new vocabulary, and tracking experiments across different configurations. LAPT provides a modular framework for all of this, built on HuggingFace Transformers and Datasets.
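To make the moving parts concrete, here is a minimal sketch of these steps written against the underlying HuggingFace APIs directly. This is an illustration, not LAPT's actual interface: the file names, mixing ratios, vocabulary size, and the embedding-initialization heuristic are all assumptions.

```python
# A rough sketch of the steps LAPT automates, written against plain
# HuggingFace Transformers/Datasets APIs rather than LAPT's own interface.
# File names, mixing ratios, vocabulary size, and the embedding-initialization
# heuristic below are illustrative assumptions.
import torch
from datasets import load_dataset, interleave_datasets
from transformers import AutoModelForMaskedLM, AutoTokenizer

# 1. Load monolingual corpora and mix them at chosen sampling ratios.
corpus_a = load_dataset("text", data_files="corpus_lang_a.txt", split="train")
corpus_b = load_dataset("text", data_files="corpus_lang_b.txt", split="train")
mixed = interleave_datasets([corpus_a, corpus_b], probabilities=[0.6, 0.4], seed=42)

# 2. Optionally train a new tokenizer on the mixed corpus, reusing the
#    base model's tokenizer class and special tokens.
base_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def batch_iterator(batch_size=1000):
    for i in range(0, len(mixed), batch_size):
        yield mixed[i : i + batch_size]["text"]

new_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000)

# 3. Resize the embedding matrix and initialize rows for the new vocabulary:
#    copy embeddings for tokens shared with the old vocabulary, and fall back
#    to the mean of the old embeddings for everything else (a common heuristic).
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_embeddings = model.get_input_embeddings().weight.detach().clone()
mean_embedding = old_embeddings.mean(dim=0)
old_vocab = base_tokenizer.get_vocab()

model.resize_token_embeddings(len(new_tokenizer))
with torch.no_grad():
    embeddings = model.get_input_embeddings().weight
    for token, new_id in new_tokenizer.get_vocab().items():
        old_id = old_vocab.get(token)
        embeddings[new_id] = old_embeddings[old_id] if old_id is not None else mean_embedding
```

LAPT wraps steps like these in a configurable pipeline so they can be swept across experiment settings; see the repository for the actual entry points.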
LAPT is designed for researchers and practitioners working on low-resource language adaptation who want a solid starting point rather than having to build training infrastructure from scratch.
Links: GitHub Repository · Related Paper (EMNLP 2024)
If you use LAPT, please cite the repository and the related work on targeted multilingual adaptation:
@inproceedings{downey-etal-2024-targeted,
    title = "Targeted Multilingual Adaptation for Low-resource Language Families",
    author = "Downey, C. M. and Blevins, Terra and Serai, Dhwani and Parikh, Dwija and Steinert-Threlkeld, Shane",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.918",
}