WRITTEN RECORDS

TRAINING on LOW-RESOURCE LANGUAGES
https://restofworld.org/2023/internet-most-used-languages
https://restofworld.org/2024/filipino-ai-local-languages
https://restofworld.org/2025/mongolia-egune-ai-llm
https://restofworld.org/2025/chatgpt-latin-america-alternative
https://restofworld.org/2025/chatgpt-india-alternative-ai-llms
ChatGPT is huge in India. These locally focused startups found a way to compete
by Tauseef Ahmad and Sajid Raina / 17 November 2025

When Amrith Shenava began experimenting with large language models shortly after the launch of ChatGPT, he quickly realized that Tulu — the language he and some 2 million people spoke in the southern Indian state of Karnataka — had virtually no digital data set. He decided to build one. Shenava, who has a degree in computer science from Kent State University in Ohio, had earlier launched a translation app and a language-learning app for Tulu. To build the data set for the LLM, he had to collect voice and text data from native speakers, including teachers, professionals, homemakers, and members of the Tulu diaspora. “Most AI systems are built in the U.S. They don’t understand Indian languages or contexts,” Shenava, the 27-year-old founder of TuluAI, told Rest of World. “We need our own models that represent us.”

India has more than 1,600 languages and dialects, but most artificial intelligence systems cater to those that are widely spoken. OpenAI’s ChatGPT supports more than a dozen Indian languages including Hindi, Tamil, and Kannada, the dominant language in Karnataka. Google’s Gemini can chat with users in nine Indian languages. Spurred by the success of these tools, and keen to be a part of the rapid global transition to AI, a handful of Indian startups are building AI tools for so-called low-resource languages such as Tulu, Bodo, and Kashmiri, which have a limited online presence and few written records. The startups are having to build data sets nearly from scratch.

TuluAI holds storytelling sessions and workshops in rural areas, in which local residents — particularly women and elders — narrate their stories, or are asked to read texts and simulate everyday conversations. Participants are taught to record and label the data. Each one- to two-day workshop produces over 150 hours of labeled voice and text data, Shenava said. The startup also collects WhatsApp voice notes from anyone who wishes to send one, with annotators checking transcripts and labels for accuracy. “Major translation tools miss the context that gives meaning to words. The only way to fix that is to use authentic, human-recorded data that reflects real-life language use,” Shenava said. “The goal is for the model to talk like a native speaker. We want it to understand humor, idioms, and cultural context. So we’re building slowly, verifying every sample.”

Across the country, in the northeastern state of Assam, Kabyanil Talukdar, the 25-year-old co-founder of Aakhor AI, follows a similar process to build data sets in Bodo and Assamese. Talukdar’s team conducts community workshops and classes, and holds voice-note drives via WhatsApp groups, with simple daily prompts like “Talk about your morning tea.” Each submission is tagged with metadata such as dialect, region, and speaker demographics to ensure diversity. The clips, 20–60 seconds long, are processed, transcribed, and anonymized. Each three-month campaign produces over 5,000 voice samples, Talukdar told Rest of World. “When people see that their voices help preserve their language, they feel ownership,” he said. “They are driven by the shared goal of creating AI that understands and speaks their native language.”
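The pipeline Talukdar describes — each clip tagged with dialect, region, and speaker demographics, then transcribed and anonymized, with only 20–60 second submissions kept — can be pictured as a simple record type plus a filter. A minimal sketch; the field names and schema here are illustrative assumptions, not Aakhor AI’s actual data format:

```python
from dataclasses import dataclass

@dataclass
class VoiceClip:
    """One voice-note submission with its labeling metadata (hypothetical schema)."""
    clip_id: str
    duration_s: float         # clip length in seconds
    dialect: str              # regional variety of Bodo or Assamese
    region: str
    speaker_age: int
    speaker_gender: str
    transcript: str = ""      # filled in by human annotators
    anonymized: bool = False  # personal details stripped before training

def usable(clip: VoiceClip) -> bool:
    """Keep only clips in the 20-60 second window the article describes,
    and only once they have been transcribed and anonymized."""
    return 20 <= clip.duration_s <= 60 and clip.anonymized and bool(clip.transcript)
```

Tagging every sample this way is what lets the team later check that no dialect, region, or demographic group dominates the corpus.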

Big tech LLMs such as OpenAI’s GPT and Meta’s Llama are trained on a wide range of data, including in languages other than English. Yet their performance in low-resource languages can be unpredictable, particularly in dialects and local idioms. Countries keen to support their languages and become self-sufficient in AI are building their own multilingual LLMs, which can support translation, speech recognition, and tools for customer service, education, health care, and other applications. These include the Chile-led LatamGPT project, Southeast Asia’s SEA-LION, and efforts by Masakhane — a grassroots organization that aims to build AI data sets and tools in African languages. India’s BharatGPT and Sarvam support many major Indian languages, and the government is building open-source models for several languages under the Bhashini project.

It is not easy. Tulu’s ancient script lacks a Unicode standard that would allow computational processing of text. Shenava’s team is digitizing literature written in the script, and training the model to identify patterns. While more complicated, the process helps capture the cultural nuance that is often lost in translation, he said. The team avoids AI-generated or machine-translated data, which is often riddled with grammatical errors, made-up words and phrases, and other inaccuracies, he said. “Even open-source models produce text that doesn’t make sense. That’s why we decided to build it from scratch,” Shenava said. This also ensures ethical data use, he said. “We don’t use any personal data without explicit permission.” Aakhor AI’s models are voice-first, targeting areas with low literacy and weak internet access. The company recruits speakers from underrepresented areas to prevent dominant dialects from overshadowing smaller ones, and ensure “balanced sampling,” Talukdar said.

For Saqlain Yousef, it was the fear that Kashmiri — a language spoken by about 7 million people in India — might disappear that drove him to build the KashmiriGPT app using OpenAI’s application programming interface. The platform accepts input in English as well as Kashmiri written in the Roman script, and generates responses in the Kashmiri script, Roman Kashmiri script, and English. “Our language is vulnerable and at risk of disappearing. So I took matters into my own hands,” the 25-year-old told Rest of World. “This will help preserve Kashmiri in the AI age.”

Yousef is right to be concerned, C. Vanlalawmpuia, an independent researcher in language and AI, told Rest of World. “These languages are already marginalized, and without proper digital representation, they risk disappearing from online spaces entirely,” he said. AI makes it easier to preserve a language through translation tools, transcription systems, and data sets that can make a language more visible and accessible, according to Vanlalawmpuia. But a lack of digital resources and funding remains a challenge, and community-led efforts are one way to sustain the platforms, he said. AI platforms from deep-pocketed big tech firms including OpenAI, Google, and Perplexity are also targeting India.

The country is already the biggest market for ChatGPT outside the U.S., and OpenAI this month offered its ChatGPT Go service free for a year to users in India. Aakhor AI is aware of the challenge. “We don’t compete with GPT on scale,” Talukdar said. “We compete on relevance.” By sourcing data from the ground, the community is involved in preserving linguistic diversity and advancing linguistic inclusion, Shenava said. “Anyone can contribute. That’s how language preservation will happen,” he said. “If AI can help keep it alive, that’s worth all the effort.” For Rita D’Souza, a 32-year-old primary schoolteacher in coastal Karnataka, TuluAI is already making a difference, helping students improve their pronunciation and spelling, she told Rest of World.

PREVIOUSLY

GLOBAL EXOCORTEX
https://spectrevision.net/2025/02/28/global-exocortex/
TOO LONG; DIDN’T READ
https://spectrevision.net/2024/11/28/too-long-didnt-read/
PROTEIN LANGUAGE MODELS
https://spectrevision.net/2024/05/08/protein-language-models/

MACHINE READABLE
https://spectrevision.net/2024/04/25/machine-readable/
GROKKING AI
https://spectrevision.net/2024/03/27/grokking-ai/
LANGUAGE CITY
https://spectrevision.net/2024/03/07/language-city/