University of Rochester Under-Resourced NLP Lab

About

We develop methods to improve NLP tools for low-resource languages—those lacking the abundant data needed to train modern machine learning models.

Most ML approaches require vast amounts of text (hundreds of gigabytes), available only for high-resource languages like English and Chinese. This leaves the majority of the world's languages behind, undermining the vital role these systems can play in tools like keyboard autocorrect, speech recognition, and machine translation—tools that help languages thrive in the digital era.

To address this gap, we focus on machine learning techniques that work with limited data:

  • Unsupervised/self-supervised learning — training on raw text so that only much smaller amounts of specialized data are needed
  • Multilingual modeling — pooling language data by training on multiple languages at once
  • Transfer learning — leveraging models from higher-resource languages for new, low-resource ones

News

Feb 2026
Invited Talk at University at Buffalo

Prof. Downey gave an invited talk at the University at Buffalo Department of Linguistics on computational tools for under-resourced and endangered languages.

Feb 2026
Paper Selected for AfricaNLP 2026

Work by Fei-Yueh Chen (MS Linguistics), Lateef Adeleke (PhD Linguistics), and C.M. Downey on linguistically informed evaluation of multilingual ASR for African languages has been selected to appear at the AfricaNLP workshop.

Dec 2025
Empire AI Compute Allocation Awarded

Our project on rapid adaptation of ASR models in data-scarce scenarios received 5,000 service units on Empire AI Beta.

Aug 2025
New Lab Member

Welcome to Ifeoma Okoh, a new PhD student working on low-resource NLP!

People

C.M. Downey

Assistant Professor

Linguistics and Data Science

Personal Website
Ifeoma Okoh

PhD Student

Low-resource NLP, ASR

Website
Robert J. Chen

MS Student

Multilingual speech technology, targeted transfer learning

Projects

LAPT: Language-Adaptive Pretraining

A modular toolkit for adapting multilingual language models to new languages. Handles dataset mixing, vocabulary replacement, and embedding initialization so you can focus on your language, not the infrastructure.

language adaptation · low-resource · open-source
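The embedding-initialization step mentioned above can be sketched roughly as follows. This is an illustrative assumption about one common strategy (initializing each new token from the old model's embeddings of its pieces), not LAPT's actual API; `init_new_embeddings` and its decomposition rule are hypothetical names:

```python
import numpy as np


def init_new_embeddings(old_emb: np.ndarray,
                        old_vocab: dict,
                        new_vocab: list,
                        rng: np.random.Generator) -> np.ndarray:
    """Initialize embeddings for a replacement vocabulary.

    Each new token reuses the old embedding if it exists verbatim;
    otherwise it is set to the mean of the old embeddings of its
    characters. Tokens with no overlap keep a small random vector.
    """
    dim = old_emb.shape[1]
    # Small random init as the fallback for unseen tokens.
    new_emb = rng.normal(scale=0.02, size=(len(new_vocab), dim))
    for i, token in enumerate(new_vocab):
        if token in old_vocab:
            new_emb[i] = old_emb[old_vocab[token]]
        else:
            pieces = [old_vocab[c] for c in token if c in old_vocab]
            if pieces:
                new_emb[i] = old_emb[pieces].mean(axis=0)
    return new_emb


rng = np.random.default_rng(0)
old_emb = np.arange(12, dtype=float).reshape(4, 3)
old_vocab = {"a": 0, "b": 1, "ab": 2, "c": 3}
new_emb = init_new_embeddings(old_emb, old_vocab, ["ab", "ba", "z"], rng)
```

Real toolkits vary in the decomposition rule (tokenizer pieces vs. characters) and the fallback distribution; the point is that new-vocabulary embeddings start near the old model's space rather than from scratch.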

Low-Resource ASR

Exploring methods for rapidly adapting speech recognition models to endangered and under-documented languages, drawing on recordings from the Endangered Languages Archive.

speech recognition · endangered languages

Publications

For a complete list of publications, see lab members' individual pages.

Selected Recent Publications

  • Linguistically Informed Evaluation of Multilingual ASR for African Languages
    Fei-Yueh Chen, Lateef Adeleke, C.M. Downey
    7th Workshop on African Natural Language Processing (AfricaNLP), 2026
    Paper
  • Targeted Multilingual Adaptation for Low-resource Language Families
    C.M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld
    Findings of EMNLP, 2024
    Paper

Join Us

We're always looking for students motivated by low-resource language technology!

Prospective Students

If you're interested in joining the lab as a PhD student, I'm currently accepting students to the PhD program in Linguistics. Please mention Professor Downey and/or UR2NLP in your application.

Current UR Students

Undergraduate and MS students interested in research opportunities are encouraged to reach out to c.m.downey@rochester.edu.