LAPT

Language-Adaptive Pretraining Framework

What is LAPT?

LAPT is a ready-to-use toolkit for adapting multilingual language models to new languages. It handles the engineering so you can focus on your language.

Adapting a pretrained model to a new language involves several moving parts: loading and mixing datasets, optionally training a new tokenizer, initializing embeddings for new vocabulary, and tracking experiments across different configurations. LAPT provides a modular framework for all of this, built on HuggingFace Transformers and Datasets.
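
To make these steps concrete, here is what the embedding step looks like when done naively with plain HuggingFace Transformers. The model name is only an example and the tokenizer path is a placeholder; note that the added rows come out randomly initialized, which is exactly what the vocabulary-replacement feature below improves on.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Example base model; the tokenizer path is a placeholder for a
    # target-language tokenizer trained separately.
    model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M")
    tokenizer = AutoTokenizer.from_pretrained("path/to/target-tokenizer")

    # Naive baseline: resize the embedding matrix to the new vocabulary size.
    # The new rows are randomly initialized.
    model.resize_token_embeddings(len(tokenizer))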

Key Features

  • Flexible data mixing — Load from OSCAR, HuggingFace Hub, or local files. Combine sources with temperature-scaled multinomial sampling for controlled language ratios (sketched after this list).
  • Vocabulary replacement — Optional integration with FOCUS for training target-language tokenizers and initializing their embeddings via FastText rather than starting from random weights (see the conceptual sketch below).
  • Composable configuration — Hydra-based YAML configs make it easy to define dataset mixtures and sweep hyperparameters (illustrated below).
  • Intelligent caching — Selective cache invalidation speeds up iteration when only part of the configuration changes.
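
The temperature-scaled sampling in the first bullet is the standard heuristic from multilingual pretraining: a language holding a share q_i of the total data is sampled with probability proportional to q_i^(1/T), so T = 1 keeps natural ratios and T > 1 upweights smaller languages. A minimal sketch of the computation (the function name and numpy usage are illustrative, not LAPT's API):

    import numpy as np

    def temperature_sampling_probs(sizes, temperature=2.0):
        """Turn per-language corpus sizes into sampling probabilities."""
        shares = np.asarray(sizes, dtype=float)
        shares = shares / shares.sum()
        scaled = shares ** (1.0 / temperature)  # T > 1 flattens toward small languages
        return scaled / scaled.sum()

    # Example: a 9:1 corpus is sampled at roughly 3:1 when T = 2.
    print(temperature_sampling_probs([9_000_000, 1_000_000]))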
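
For the vocabulary-replacement feature, the idea behind FOCUS-style initialization is to express each new token's embedding as a similarity-weighted combination of the model embeddings of tokens shared by the old and new vocabularies, with similarity measured in an auxiliary FastText space. The following is a conceptual numpy sketch, not the FOCUS package's API; among other simplifications, it uses softmax weights where the published method uses sparsemax.

    import numpy as np

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def focus_style_init(new_tokens, overlap_tokens, src_emb, ft_vec, k=10):
        """Initialize embeddings for new tokens from overlapping ones.

        src_emb: token -> model embedding, for tokens in BOTH vocabularies.
        ft_vec:  token -> FastText vector (auxiliary similarity space).
        """
        out = {}
        for tok in new_tokens:
            sims = np.array([cosine(ft_vec[tok], ft_vec[o]) for o in overlap_tokens])
            top = np.argsort(sims)[-k:]  # k most similar overlapping tokens
            weights = np.exp(sims[top]) / np.exp(sims[top]).sum()
            out[tok] = sum(w * src_emb[overlap_tokens[i]] for w, i in zip(weights, top))
        return out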
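
Configuration composition follows standard Hydra conventions. The config path, name, and fields below are hypothetical stand-ins for LAPT's actual schema, but the mechanics are the same:

    import hydra
    from omegaconf import DictConfig, OmegaConf

    # "conf" and "train" are placeholder names; Hydra merges the YAML
    # group files it finds there into a single DictConfig.
    @hydra.main(version_base=None, config_path="conf", config_name="train")
    def main(cfg: DictConfig) -> None:
        print(OmegaConf.to_yaml(cfg))  # inspect the composed configuration

    if __name__ == "__main__":
        main()

Hyperparameter sweeps then reduce to Hydra's multirun mode, e.g. python train.py -m training.lr=1e-4,3e-4 (the override key is again a placeholder).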

Who is this for?

LAPT is designed for researchers and practitioners working on low-resource language adaptation who want a solid starting point rather than building training infrastructure from scratch. It's particularly useful if you're:

  • Adapting a multilingual model (like XGLM or mGPT) to a new target language
  • Experimenting with vocabulary specialization for better tokenization
  • Combining data from multiple related languages with controlled sampling

Links

  • GitHub Repository
  • Related Paper (EMNLP 2024)

Citation

If you use LAPT, please cite the repository and the related work on targeted multilingual adaptation:

@inproceedings{downey-etal-2024-targeted,
  title = "Targeted Multilingual Adaptation for Low-resource Language Families",
  author = "Downey, C. M. and Blevins, Terra and Serai, Dhwani
            and Parikh, Dwija and Steinert-Threlkeld, Shane",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
  year = "2024",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.findings-emnlp.918",
}