Despite the growing popularity of generative artificial intelligence (GenAI) tools, including ChatGPT and Gemini, these technologies often fail to authentically or accurately represent smaller languages.
The root of this asymmetry lies in how these technologies are trained. Of the more than 7,000 languages spoken worldwide, only a fraction are represented online. Large language models—the computer programs that enable GenAI to create full, coherent text rather than single-word outputs—are trained on this data. Languages with lots of available data are considered “high-resource” languages, while those with insufficient data are frequently misrepresented, reduced to versions that fail to capture the nuance of how people speak.
AI and language assimilation
GenAI’s inaccurate output of the Estonian language can be thought of as a consequence of technological colonialism. As described by AI and data science expert Dr. Seth Dobrin, technological colonialism refers to the process by which a small number of powerful corporations control the development of emerging technologies, resulting in the imposition of these entities' cultural values, biases, and societal norms on a global scale.
The United States is leading the world in the development of generative AI. The dominance of US-centric English language within these technologies, reflected in the data they are trained on and the outputs they produce, falls into the patterns that Dr. Dobrin describes.
If this cycle persists, smaller languages, including Estonian, risk being engulfed by US cultural and linguistic norms. This concern echoes the historical patterns of language loss observed by linguistic anthropologist Anna Luisa Daigneault. She explains that European colonial powers in North and South America imposed their languages on local communities over time, replacing Indigenous languages with Spanish, Portuguese, English, French, and Dutch:
“Language assimilation can be forced through violence and oppression, but can also be more of a gradual shift,” according to Daigneault in an article in School of Journalism.
The notion of a gradual shift is particularly relevant in the context of GenAI. The integration of these technologies into everyday life pressures speakers of smaller languages to conform to using English for convenience or access. Daigneault refers to this process as chain endangerment, which happens when “smaller, local languages are taken over by regional languages, those regional languages are taken over by other regional languages, and those ones are taken over by larger colonial languages.”
The national Estonian Language Corpus
Estonian officials are aware of these issues. To protect the language, some support providing GenAI developers with the national Estonian Language Corpus, a database of all digitally available Estonian texts curated and managed by the Institute of the Estonian Language (EKI).
In a press release issued by the Ministry of Justice and Digital Affairs earlier this year, Minister Liisa Pakosta noted that “it is crucial for the sustainability of our language and culture that open data of the Estonian language corpus be available to language model developers.”
“Sharing Estonian-language data creates the precondition for large language models to understand the cultural context of Estonia and become more proficient in using the Estonian language. At the same time, this enables the development of better services for Estonian-speaking users in various AI-based applications—such as chatbots, translation systems, and other language technology solutions,” reads the document.
Public backlash
Soon after the press release was published, a controversial chain of events unfolded. The original version stated that “Meta is the second company developing large language models to be given access to the corpus…” suggesting a deal had already been made.
Following criticism from Estonian Prime Minister Kristen Michal and Minister of Culture Heidy Purga that Estonia’s media archives—including material from news publications and other cultural institutions—“should not be given away lightly and for free,” the press release was later edited to say that Meta was the second company to be “interested” in the corpus.
Director of the Institute of the Estonian Language Arvi Tavast later clarified that Meta had no exclusive access to the corpus: “Since 2020, the Estonian state has been working at both the official and political levels to improve the representation of the Estonian language in large language models, including attempts to persuade major developers to use our corpus data. So far, without success,” according to ERR.
Language sovereignty in the age of Big Tech
Questions of cost and licensing aside, many government ministries and cultural institutions agree that giving the corpus to companies developing GenAI is vital for protecting the Estonian language.
“If we don't want Estonian to become a language that disappears from technology as technology develops, that is that it doesn't use Estonian, and is not capable of developing further—and we don't want that—then I think it is absolutely the right decision to make it available to major language models,” said Minister of Education and Research Kristina Kallas in ERR.
But this raises another question: is it useful to Big Tech, whose priorities lie in profit and market dominance, to preserve the integrity of local languages, and by extension, national identity?
Big Tech companies are not neutral actors. In fact, the vast infrastructure required to power GenAI contributes to environmental degradation and the displacement of many communities. In this light, framing Big Tech as a protector of linguistic or cultural integrity is not only deeply contradictory but also risks another form of assimilation, in which a language’s survival depends on compliance with foreign corporate interests.
Moving forward
Even if Estonia succeeds in integrating the corpus into GenAI, AI entrepreneur Indrek Seppo sees this as just a starting point. “[The corpus] allows AI to learn the Estonian language better, but it is not enough to grasp the Estonian mindset,” he told Estonian World. “For that, our cultural heritage needs to be made accessible. Otherwise, our children may speak Estonian with AI, but with an American mindset.” With this in mind, Liina Kersna, chair of the Cultural Affairs Committee, told ERR that one option might be to develop a domestic large language model using Estonia’s own language data.
These tensions reveal the value of local knowledge: oral tradition and written records keep cultural nuance and heritage alive. Without agency over how a language is transmitted, its preservation risks succumbing to corporate power under the façade of technological progress.
This article was written by Natalie Jenkins as part of the Local Journalism Initiative.