DEVELOP AI OP-ED

Not all languages are equal in the artificial intelligence boom

Language is at the centre of the AI boom and that essentially means English (with a dash of French) is being prioritised. Minority languages, particularly in Africa, are in danger of being left behind.

Journalist, iHubOnline founder and general nice guy Mallick Mnela has become an ambassador for artificial intelligence (AI) in Malawi. He is increasingly passionate about inserting African languages into these monstrous, English-dominated large language models (LLMs).

I wanted to know from Mnela whether this problem of language representation in AI is solvable, or whether the resources needed to make a difference are too massive.

Mnela was in journalism for 20 years, but quit his job in 2019 to be an entrepreneur. I met him in October last year in Namibia and he was inhaling every AI development, app and piece of code he could find. 

In Malawi, he says, the majority of the population is digitally excluded. “Most indigenous knowledge is not properly documented and even less is published online,” he says. Still, a few people in Malawi are pushing the AI sector forward.

He’s busy working with Microsoft to make the Malawian language Chichewa better represented in the AI ecosystem. He built a Chichewa version of ChatGPT and bundled it with a form of reinforcement learning based on human feedback. A user sends feedback if they aren’t satisfied with an answer, and a human checks the interaction and makes changes where necessary. In effect, it crowdsources the tweaks needed to improve the model.

Mnela found that religious Chichewa content was generously available online (and there were plenty of available translations to stack it next to). So, he used this to train his model and it meant that a bias manifested: the LLM used more “biblical” language. But it also made the LLM more tolerant and philosophical, he says.

The big problem is that if we don’t start training these models on niche languages now, we aren’t going to be able to develop them as easily in the future as the tech accelerates. 

“For English, the LLMs perform well in handling complicated language queries. But they can be pretentious and give false information that appears real,” Mnela says. To reduce this, retrieval-augmented generation (RAG) is used: relevant documents are retrieved and handed to the LLM as context, which lowers the chance of it spewing nonsense. However, if the language is low-resourced (like Chichewa), you can’t augment it in this way. “The model just produces chaos when you try,” Mnela says.
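For readers who want to see the mechanics, here is a toy sketch of the retrieval step in Python. It is not Mnela's system or any vendor's API. The function names (`retrieve`, `build_prompt`) and the word-overlap scoring are assumptions made for illustration, and real systems typically use learned embeddings rather than counting shared words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, document):
    """Count the word occurrences shared by the query and the document."""
    shared = Counter(tokenize(query)) & Counter(tokenize(document))
    return sum(shared.values())


def retrieve(query, documents, k=2):
    """Return up to k documents that share words with the query, best first."""
    scored = [(overlap_score(query, doc), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated documents
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, documents, k=2):
    """Assemble the text that would be sent to an LLM (hypothetical format)."""
    context = retrieve(query, documents, k)
    if not context:
        # Nothing relevant found: the model gets no grounding at all.
        return f"Question: {query}\nAnswer:"
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )
```

Calling `build_prompt("What is the capital of Malawi?", documents)` with a small list of reference sentences returns a prompt whose context section holds the best-matching sentences. When nothing matches, the prompt carries no context, so the quality of the retrieved material sets the ceiling on how much the technique can help.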

Who should pay for all this?

“Media companies may not have the money but they have the data. I therefore strongly propose a collaborative approach,” Mnela says. “The best idea is to forge partnerships so that the media can provide the much-needed datasets while big tech focuses on the technology.”

If a news outlet has been publishing bilingual news daily for years then it already has an incredibly rich dataset. It has the masses of sentence pairs needed to train a custom model. And this model has far-reaching applications (beyond media) that could create a revenue stream for a newsroom.
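To make "sentence pairs" concrete, here is a minimal Python sketch of how a newsroom's bilingual archive might be turned into training pairs. It assumes each story exists as two texts whose sentences appear in the same order, and it simply skips stories where the sentence counts differ. The function names are hypothetical, and a production pipeline would need a proper sentence aligner and human review.

```python
import re


def split_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def align_article(source_text, target_text):
    """Pair sentences by position; return no pairs if the counts differ."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        return []  # naive safeguard: a mismatch suggests the texts diverge
    return list(zip(source, target))


def build_pairs(articles):
    """Collect sentence pairs from an iterable of (source, target) article texts."""
    pairs = []
    for source_text, target_text in articles:
        pairs.extend(align_article(source_text, target_text))
    return pairs


# Placeholder strings stand in for real Chichewa text.
archive = [
    ("Rain fell today. Roads were closed.", "CHICHEWA-1. CHICHEWA-2."),
    ("One sentence only.", "CHICHEWA-3. CHICHEWA-4."),  # skipped: counts differ
]
pairs = build_pairs(archive)
```

In this example only the first story yields pairs, because the second has one sentence on one side and two on the other. Skipping loses data but avoids pairing sentences that do not correspond.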

“We need to start looking at how our newsrooms are structured,” Mnela says. “We need to locate the tech-savvy journalists and empower them to provide a bridge between newsrooms and tech spaces.”

As Mnela points out, the big tech companies need to bring as many languages as possible along with them as this is their future user base. However, they are going to do this with or without the cooperation of the media and content creators. 

“As long as journalists focus on generating content, AI companies will be scraping it from their platforms with little or no acknowledgement at all,” Mnela says. It is better for the media to be actively involved, generate cash and produce a far better product.

This week’s AI tool for people to use 

I demonstrated this tool last week during my class on “Data Visualisation and AI” for The British University in Egypt. Google Looker Studio (previously known as Google Data Studio) is the Google Docs of data visualisation. It integrates with all your other Google tools and lets you quickly turn your data into a cool-looking graph or dashboard.

What AI was used in creating this piece?

I used ChatGPT for the main image above. It took more than a dozen tries to remove African beads from the picture. They would either be on the man’s neck or wrists. Even when ChatGPT said that it had removed the beads, they were still in the picture. And the AI kept complaining that it couldn’t see or review what it had produced … like it was caught in a kind of creative torture.

In the news 

The bad: AI and jobs. The Hard Fork podcast, with Kevin Roose and Casey Newton, has that insufferable chirpy tone characteristic of podcasts from The New York Times, but if you can get through that there are some gems in this episode. They take a solid look at what AI is doing to jobs and the economy in the US.

The good: people are being paid. A new report from Reuters shows how tech giants are racing to secure vast quantities of online data to feed their AI models. And they are actually paying. Meta, Google, Amazon and Apple all reached a deal with Shutterstock in 2022. The agreement included hundreds of millions of images, videos and music files for AI training. Prices for training data range from a few cents per image to hundreds of dollars per hour of video. If you are a data-heavy platform it is time to cash in. DM

Develop AI is an innovative company that reports on AI, builds AI-focused projects and provides training on how to use AI responsibly.
