Researchers release largest-known African language dataset to narrow AI language gap
New dataset seeks to make AI tools usable in Africa’s largely oral languages as scholars warn millions risk being left behind

Researchers have released what they describe as the largest known dataset of African languages in an effort to close a widening gap between the continent’s linguistic diversity and modern artificial intelligence systems.
The dataset, announced in recent days, is intended to provide the training material many AI models need but which is scarce for most African languages. “We think in our own languages, dream in them and interpret the world through them. If technology doesn’t reflect that, a whole group risks being left behind,” the University of Pretoria said in a statement accompanying the release.
AI models that power widely used tools such as ChatGPT have been trained predominantly on English and other major European and Asian languages, for which vast quantities of digital text are available. Many African languages are primarily oral or have far less written text online, limiting the raw data that developers use to build and fine-tune language models. That scarcity of data, along with uneven investment in language technology, has left millions of speakers with fewer AI-driven services in their native languages.
The gap affects practical applications across sectors. In rural and low-resource settings, voice and text interfaces trained on local languages can help deliver agricultural advice, health information and educational content. The BBC reported the example of farmer Kelebogile Mosime, who uses an AI application that speaks her language — an illustration of how language-capable tools can make technology accessible to people who do not use English or other dominant languages.
Researchers behind the dataset said it is a step toward building more inclusive systems, enabling developers to train models that understand and generate a wider range of African languages. The work addresses one of the primary technical barriers: creating sufficiently large, diverse and labelled corpora that can be used to train statistical and neural language models.
The release follows years of academic and community-driven efforts to expand resources for African language computing. Experts say technical work must be paired with funding, local participation and ethical data-collection practices. Challenges include gathering representative samples across dialects, developing consistent orthographies where writing systems vary, and protecting the rights and privacy of language contributors.
Researchers and institutions involved in the effort emphasized that the dataset alone will not end linguistic exclusion. They described it as a foundation that can accelerate further research, product development and partnerships among technology companies, universities and local communities. Progress will also depend on integrating such data into commercial and open-source models and on evaluating the resulting systems for accuracy and cultural sensitivity.
The dataset release was reported by the BBC from Johannesburg and published on the BBC News website. Advocates of language technology said the move could help expand the reach of AI, but they also warned that ongoing investment and community-led governance will be necessary to make those gains durable and equitable.

Developers, researchers and policymakers say the new resource should make it easier to create applications tailored to African linguistic realities. They called for continued collaboration to translate that technical advance into real-world tools that serve education, health, agriculture and civic needs across the continent.