Episode Transcript
[00:00:00] Why Generative AI Still Struggles with Indian Languages, by Srushti Chapia. It almost feels like there was one world before ChatGPT and another after it. Judging by the stories emerging worldwide about artificial intelligence, with AI-powered chatbots passing medical exams, writing poetry, and even mimicking human emotions, it seems like AI has cracked the code of human language.
[00:00:27] But beneath this impressive progress lies a significant gap, one that becomes glaringly obvious when we look beyond English and widely spoken Latin script languages.
[00:00:38] How does AI truly navigate the complexities of non English languages, each with its own grammar, idioms and cultural nuances?
[00:00:47] This question is especially pertinent when it comes to Indian languages.
[00:00:52] With over 22 official languages and hundreds of dialects, India is one of the most linguistically diverse countries in the world.
[00:01:01] But AI systems struggle to accurately understand, translate, and generate text in many Indian languages, especially those with complex grammar, rich oral traditions, and non Latin scripts like Devanagari, Tamil, and Bengali. This article explores the reasons why generative AI still struggles with Indian languages, detailing linguistic intricacies, data limitations, sociocultural factors, and technological gaps.
[00:01:32] The main bottleneck. A big obstacle preventing generative AI from mastering Indian languages is the lack of large, high-quality datasets for training.
[00:01:43] Unlike English, which has billions of digitized texts, transcripts and annotated datasets, most Indian languages remain underrepresented in the digital world.
[00:01:55] Many Indian languages, especially regional ones, have a strong oral tradition but a weak digital footprint.
[00:02:02] For example, while Hindi and Bengali have some presence online, languages like Bodo, Konkani, and Santali have minimal digitized resources.
[00:02:12] Even widely spoken languages like Punjabi and Marathi lack a comprehensive corpus of diverse texts (for example, legal documents and scientific papers), limiting AI's ability to train on rich, varied linguistic inputs.
[00:02:28] Non-standardized spelling and script variations. Another significant challenge is the lack of standardization. Unlike English, where spelling is largely standardized, many Indian languages lack uniformity in writing.
[00:02:43] For instance, Bengali and Assamese share the same script but have different linguistic rules, making it harder for AI to distinguish between them. Similarly, Urdu and Hindi share vocabulary but use different scripts (Perso-Arabic and Devanagari, respectively), which adds another layer of complexity.
[00:03:04] AI models trained on inconsistent or fragmented data struggle to learn accurate linguistic patterns, resulting in flawed translations and unnatural text generation.
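To make the script-sharing point concrete, here is a minimal Python sketch (the sample words are illustrative choices, not taken from the article) that uses the standard unicodedata module to look up the Unicode script of each letter. Bengali and Assamese words resolve to the same Bengali block, so script detection alone cannot separate the two languages, while Hindi and Urdu words resolve to different blocks despite their shared vocabulary.

```python
import unicodedata

def scripts_of(text: str) -> set[str]:
    """Return the Unicode script name(s) of the letters in `text`."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # unicodedata.name() gives e.g. 'BENGALI LETTER KA' or
            # 'DEVANAGARI LETTER KA'; the first word names the script block.
            found.add(unicodedata.name(ch).split()[0])
    return found

# Bengali and Assamese letters live in the same Unicode block ('BENGALI'),
# so a script check alone cannot tell the two languages apart.
print(scripts_of("আমি"))   # Bengali word for 'I'   -> {'BENGALI'}
print(scripts_of("মই"))    # Assamese word for 'I'  -> {'BENGALI'}

# Hindi and Urdu share much vocabulary but use different scripts
# (Devanagari versus Perso-Arabic, which sits in the Unicode ARABIC block).
print(scripts_of("पानी"))   # Hindi 'paani' (water)  -> {'DEVANAGARI'}
print(scripts_of("پانی"))   # Urdu 'paani' (water)   -> {'ARABIC'}
```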
[00:03:20] Code-switching and contextual ambiguities. Code-switching, the blending of elements of two or more languages, and contextual ambiguity are common in India, and AI models struggle to handle these linguistic complexities.
[00:03:31] Many Indians are at least bilingual, seamlessly switching between their native language (for example, Hindi, the country's lingua franca) and English. In the sentence in Figure 1, "amazing" and "movie" are English words embedded in a Marathi sentence.
[00:03:49] AI faces multiple challenges here.
[00:03:52] Language tagging: the model must correctly identify which words belong to which language and apply the appropriate linguistic rules (a toy sketch of this tagging step follows the list of challenges).
[00:04:01] Grammar adaptation: "amazing" is an adjective following English grammar, but in Marathi, adjectives often agree in gender with the noun. AI needs to adjust its grammatical processing accordingly.
[00:04:15] Multilingual processing: this sentence includes both Marathi and English, requiring AI to recognize and process multiple linguistic frameworks simultaneously.
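As promised above, here is a toy Python sketch of the language-tagging step. It is only an assumption about how such a component might be wired up: the word lists and the romanized Marathi-English sentence are invented for illustration, and real systems use trained classifiers rather than dictionary lookups.

```python
# Toy token-level language tagger for romanized code-mixed text.
# Real systems use trained classifiers; this dictionary lookup only
# sketches what the tagging step has to decide for every token.

ENGLISH_WORDS = {"amazing", "movie", "the", "was"}
MARATHI_WORDS = {"kalcha", "khupach", "hota", "chitrapat"}  # illustrative romanizations

def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Label each whitespace-separated token as 'en', 'mr', or 'unk'."""
    tags = []
    for token in sentence.lower().split():
        if token in ENGLISH_WORDS:
            tags.append((token, "en"))
        elif token in MARATHI_WORDS:
            tags.append((token, "mr"))
        else:
            tags.append((token, "unk"))
    return tags

# Invented Marathi-English code-mixed sentence ("yesterday's movie was really amazing").
print(tag_tokens("kalcha movie khupach amazing hota"))
# -> [('kalcha', 'mr'), ('movie', 'en'), ('khupach', 'mr'), ('amazing', 'en'), ('hota', 'mr')]
```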
[00:04:27] Because most AI training datasets are monolingual, models fail to handle these dynamic, real-world language patterns.
[00:04:36] Moreover, resource poor languages like Marathi lack high quality annotated datasets for code switching, making it difficult for AI to learn the correct rules.
[00:04:46] This results in AI-generated translations sounding robotic, making them ineffective for real-life use cases in multilingual India. Technological limitations in AI models. Finally, even the most advanced AI models today struggle with low-resource languages and dialects due to fundamental technological barriers. Most AI models rely on two key techniques to handle languages with limited training data. Few-shot learning: AI learns from a small number of examples and generalizes patterns.
[00:05:22] Few shot learning works best when at least a small but diverse dataset is available for AI to detect patterns.
[00:05:30] However, many Indian dialects have little to no written resources.
[00:05:36] Most exist primarily as spoken languages with few digital texts, dictionaries or formal linguistic studies.
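In practice, few-shot learning with a general-purpose model often amounts to putting a handful of demonstrations into the prompt. The sketch below only builds such a prompt; the translation pairs are placeholders, and sending the prompt to an actual model endpoint is left out because the article does not name one.

```python
# Minimal sketch of few-shot prompting for a low-resource translation task.
# The demonstration pairs are placeholders; in practice they would be a
# small set of curated sentence pairs in the target dialect.

FEW_SHOT_PAIRS = [
    ("How are you?", "<dialect translation 1>"),
    ("Where is the market?", "<dialect translation 2>"),
    ("It will rain today.", "<dialect translation 3>"),
]

def build_few_shot_prompt(query: str) -> str:
    """Assemble the demonstrations plus the new query into one prompt."""
    lines = ["Translate English into the target dialect."]
    for source, target in FEW_SHOT_PAIRS:
        lines.append(f"English: {source}\nDialect: {target}")
    lines.append(f"English: {query}\nDialect:")
    return "\n\n".join(lines)

print(build_few_shot_prompt("The train leaves at noon."))
# The resulting prompt would be sent to whichever LLM is available; the model
# is expected to generalize the pattern from just these three demonstrations.
```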
[00:05:44] Zero-shot learning: here, AI attempts to interpret text without any prior training by transferring knowledge from similar high-resource languages.
[00:05:55] This works by leveraging similarities between an unknown language and a known one.
[00:06:01] So, for instance, an AI trained on Spanish can understand Catalan quite well because the two languages share many similarities.
[00:06:10] Hindi and Bhojpuri also have overlapping vocabulary and grammar. However, AI struggles with Bhojpuri-specific expressions, pronunciation, and informal structures that don't exist in standard Hindi. In the example in Figure 2, Bhojpuri uses a verb structure that doesn't exist in standard Hindi.
[00:06:34] Lacking Bhojpuri-specific training, AI incorrectly assumes a progressive tense ("are going"), distorting the sentence's meaning.
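One intuition behind this kind of transfer is plain lexical overlap between the related languages: the more word forms they share, the further a high-resource model can stretch. The sketch below estimates that overlap with a Jaccard score; the romanized Hindi and Bhojpuri word samples are made-up placeholders, not measured data.

```python
# Rough sketch: zero-shot transfer leans on overlap between a high-resource
# language and a related low-resource one, approximated here as a Jaccard
# score over two (placeholder) vocabulary samples.

def jaccard_overlap(vocab_a: set[str], vocab_b: set[str]) -> float:
    """Share of word forms common to both vocabularies."""
    if not vocab_a or not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Made-up romanized samples; real estimates would come from large corpora.
hindi_sample = {"paani", "ghar", "khana", "raha", "hai", "kal"}
bhojpuri_sample = {"paani", "ghar", "khana", "rahal", "baate", "kal"}

print(f"approximate overlap: {jaccard_overlap(hindi_sample, bhojpuri_sample):.2f}")
# High overlap is what makes transfer from Hindi plausible at all, but the
# dialect-specific forms (here 'rahal', 'baate') are exactly where a
# Hindi-trained model misreads Bhojpuri.
```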
[00:08:29] Training AI models. What are we doing to train the models?
[00:08:34] Despite the challenges, several innovative efforts to improve AI's understanding of Indian dialects are underway.
[00:08:42] Here's a look at how researchers, tech companies, and linguistic communities are working to bridge the gap. Government initiatives. The Indian government's Bhashini project, under the Digital India initiative, aims to create AI-powered translation models and speech recognition tools for all 22 scheduled languages and several dialects. The National Language Translation Mission seeks to make government documents and educational materials accessible in regional languages using AI. Crowdsourced language data collection. Tech companies and research institutions are encouraging native speakers to contribute text and speech samples to improve AI's accuracy in regional languages.
[00:09:27] AI research and development focused on Indian dialects. Institutions like the Indian Institute of Technology Madras and the International Institute of Information Technology Hyderabad are developing AI-driven linguistic innovations tailored to Indian dialects.
[00:09:45] Microsoft Research India's Project ELLORA is using deep learning to create adaptable AI models for dialect-heavy regions.
[00:09:53] Advanced AI models like mBERT and XLM-R are being trained to better handle code-mixed and dialect-heavy conversations.
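For readers who want to poke at one of these models themselves, here is a minimal sketch assuming the Hugging Face transformers and sentencepiece packages and the publicly available xlm-roberta-base checkpoint; it simply tokenizes an invented Hinglish sentence to show that one shared subword vocabulary covers both Devanagari and Latin script, which is what makes fine-tuning on code-mixed data feasible.

```python
# Minimal sketch (assumes `pip install transformers sentencepiece`):
# tokenize a code-mixed Hinglish sentence with XLM-R's shared subword vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Invented Hinglish example mixing Devanagari and Latin script in one sentence.
sentence = "कल की movie बहुत amazing थी"
print(tokenizer.tokenize(sentence))
# Both scripts map into the same subword vocabulary, so a single model can be
# fine-tuned on code-mixed data instead of forcing users to pick one language.
```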
[00:10:03] Development of AI models that support code-switching. Meta's No Language Left Behind models are being trained on Hinglish (Hindi-English), Tanglish (Tamil-English), and Banglish (Bengali-English) to improve AI's ability to process mixed-language inputs naturally.
[00:10:21] Chatbots and voice assistants are being optimized to interpret and respond to code mixed queries without forcing users to choose between English and their native language.
[00:10:32] Although generative AI has made remarkable strides in language processing, its struggle with Indian dialects highlights the deep rooted challenges of data scarcity, code switching complexities and low digital representation of regional languages.
[00:10:49] The future of AI in India isn't just about supporting dominant languages. It's about ensuring that every dialect, every voice and every linguistic identity finds its place in the digital world.
[00:11:03] This article was written by Srushti Chapia. She is the CEO and co-founder of A Language World, a language service provider specializing in Indian and Southeast Asian languages for more than 10 years.
[00:11:16] Currently based in London, Chapia is a council member at the Association of Translation Companies, UK.
[00:11:24] She holds a B.Tech and an MA in Translation, combining tech expertise with deep industry insight. Originally published in MultiLingual magazine, issue 239.