Towards Inclusive Natural Language Processing

March 03, 2026 00:29:49
Localization Today

Hosted By

Eddie Arrieta

Show Notes

Why modern language models perpetuate bias and what to do about it

By Anika Schaefer

This article explores the linguistic theories that laid the foundation for natural language processing, confronts the ethical risks of allowing machines to speak on our behalf, and presents a new vision for socially responsible AI language models.

Episode Transcript

[00:00:00] Towards Inclusive Natural Language Processing: Why Modern Language Models Perpetuate Bias and What to Do About It, by Anika Schaefer. By now, it's common knowledge that the artificial intelligence language models that society increasingly relies on in everyday life do not understand meaning or think like humans, but rather operate purely on surface-level patterns gleaned from torrents of online data. [00:00:28] What is less widely known is that this data is essentially the digital exhaust of our collective speech and therefore reflects the brilliance, biases, creativity, and cruelty of humanity itself. And that's where the trouble begins. Because the output of language models can appear fluent and impartial, we have a tendency to treat them as linguistic equals or, worse, neutral authorities. [00:00:52] When we do so, we risk reinforcing the very systems of inequality and exclusion that human language is already tangled up in. In this article, I explore the linguistic theories that laid the foundation for natural language processing (NLP), confront the ethical risks of allowing machines to speak on our behalf, and present a new vision for inclusive models. [00:01:15] Because the truth is, machines don't talk. They echo. And what they echo reflects not just our language but our values, blind spots, and power structures. From Rule-Based to Statistical Models. When people first started trying to make computers understand language, the idea was: if humans follow rules when they speak, then maybe machines can learn those rules too. In its early stages, NLP was deeply influenced by formal linguistic theory, most notably Noam Chomsky's 1965 work on generative grammar, Aspects of the Theory of Syntax. [00:01:52] Chomsky suggested that humans are born with an innate set of grammatical rules, akin to an internal framework that enables language acquisition.
He also made an important distinction between competence, the idealized, internalized knowledge of language, and performance, the way language is actually used in everyday situations, complete with errors, pauses, and improvisation. [00:02:15] For early NLP systems, this was a welcome distinction. [00:02:19] Rather than grappling with the chaotic way humans actually speak, developers focused on clean, rule-based representations of grammar. They built systems that could parse sentence structures using syntactic rules, logic trees, and handcrafted dictionaries. But human language is rarely clean. It's ambiguous, context-sensitive, and full of idioms, metaphors, and exceptions. Rule-based systems struggled to handle the vast variability of real-world communication. [00:02:49] In the 1990s, researchers moved away from manually coded linguistic rules and started embracing statistical models. [00:02:57] Instead of explicitly programming linguistic rules, researchers began supplying computers with massive collections of real-world text, often comprising millions of words, and allowed the models to infer patterns on their own. [00:03:11] The emphasis shifted away from theoretical frameworks toward empirical data. [00:03:16] Language was no longer approached as something to be understood; it was treated as something to be counted and predicted. This shift paved the way for today's language models. Systems like Google's BERT or OpenAI's GPT don't rely on grammatical rules or linguistic theory. Instead, they learn from usage based on the principle of distributional semantics, the idea that a word's meaning is shaped by the words around it. These models turn language into numbers. [00:03:46] A word becomes a point in a high-dimensional space, defined not by its dictionary definition but by statistical proximity to other words. [00:03:55] Modern models are built on transformer architectures, a neural network design that excels at capturing long-range dependencies in text.
[00:04:04] These models break down input into tokens, or subword units, and learn contextual relationships between them using billions of parameters. Over time, they learn to map words to embeddings, mathematical vectors that capture how words relate to others in similar contexts. At its most basic level, a language model is trained to predict the next word in a sentence. Give it the beginning of a phrase, say "The cat sat on the...", and it will draw on all the data it has consumed to guess what's likely to come: "mat," "floor," or "sofa," for instance. [00:04:40] With enough data and computing power, these predictions can become remarkably precise and even convincing. It's an impressively powerful approach, but it comes at a cost. [00:04:51] Downsides of Modern NLP Systems. Modern large language models (LLMs) lack any genuine grasp of meaning, intention, or situational context. They are indifferent to who is speaking, what relationship exists between speaker and listener, or the purpose behind the exchange. [00:05:11] Subtle distinctions such as irony versus sincerity or metaphor versus manipulation routinely escape them. They operate on statistical associations without engaging with the deeper structures that enable understanding. [00:05:25] Beneath the surface, a set of hidden assumptions shapes how NLP systems work: that language is a predictable, rule-governed system; that statistical patterns can fully replace human understanding of meaning; and that variation and diversity in language are noise, not signal. [00:05:43] These ideas weren't chosen maliciously. They're the legacy of how the field evolved. But they carry real consequences. [00:05:50] When machines learn to speak based on these assumptions, they reflect a narrow, idealized view of what language is and who gets to define it. This data-centric approach has enabled extraordinary leaps in performance, but it also comes with serious blind spots. [00:06:06] Most modern NLP systems have no access to pragmatics or intention.
[00:06:11] They do not model how meaning shifts depending on who is speaking, to whom, and in what context, all of which are central to human communication. They simply generate text by predicting likely word sequences. In that sense, not much has changed over the decades. In the 1960s, Joseph Weizenbaum developed ELIZA, a program that mimicked a psychotherapist by using simple rule-based pattern matching. [00:06:38] If the user typed "I feel sad," ELIZA might respond, "Why do you feel sad?" It was a clever illusion, relying on scripted responses without any real understanding. Some users were so taken with ELIZA's responses that they began attributing empathy to the machine. What separates ELIZA from systems like GPT is not intelligence, but scale. The underlying approach, simulating language without meaning, remains remarkably consistent. [00:07:08] As machines get better at sounding human, it becomes easier to forget that they are not thinking in any human sense. They are echoing us and the data we feed them. [00:07:18] The Illusion of Intelligence. It's easy to be impressed and even unnerved by how fluent LLMs have become. [00:07:27] They generate poems, draft emails, answer legal questions, and mimic human conversation with eerie smoothness. [00:07:35] But the key word here is mimic. These systems are stochastic parrots, repeating patterns in training data without understanding what the words are for. What LLMs do not do is model truth, beliefs, or intentions. They have no memory of past conversations unless explicitly designed to retain them, and no awareness of real-world consequences. [00:07:57] This lack of grounding creates a curious paradox. [00:08:01] LLMs sound intelligent, even profound, but they cannot distinguish between fact and fiction unless that distinction happens to be encoded in the probabilities they've seen. They may generate grammatically perfect and confident-sounding sentences that are also wildly inaccurate or deeply biased.
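The rule-based pattern matching behind ELIZA can be sketched in a few lines of Python. This is an illustrative reconstruction, not Weizenbaum's original script: the patterns, responses, and the `eliza_respond` function name are all invented for the example, and the real program used a much larger, ranked script of keywords and reassembly rules.

```python
import re

# A few illustrative ELIZA-style rules: a regex pattern paired with a
# response template that reuses the user's own captured words.
RULES = [
    (re.compile(r"i feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.+)", re.I), "Tell me more about your {0}."),
]

def eliza_respond(utterance: str) -> str:
    """Return a scripted response by matching surface patterns only.

    There is no model of meaning here: the program simply reflects
    the user's words back inside a canned template.
    """
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Please go on."  # fallback when no rule matches

print(eliza_respond("I feel sad"))    # Why do you feel sad?
print(eliza_respond("Nice weather"))  # Please go on.
```

The point of the sketch is how little machinery produces the illusion: a handful of regular expressions is enough to sound attentive while understanding nothing.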
The illusion of intelligence is reinforced by our own human tendency to anthropomorphize, that is, to project human traits, emotions, or intentions onto non-human entities. [00:08:32] When a machine speaks fluently, we instinctively assume it knows what it's saying. But fluency is not cognition, and coherence is not comprehension. As LLMs become increasingly embedded in everyday tools, from chatbots and search engines to virtual tutors and customer support systems, it becomes even more important to keep this distinction in mind. [00:08:55] The danger lies not just in what the model can't do, but in what we mistakenly believe it can do. [00:09:01] Bias In, Bias Out: The Politics of Data. Language models may be built on math, but they're made of people. [00:09:10] Every word they know, every sentence they complete, and every story they generate comes from human-produced data: websites, books, forums, articles, transcripts, and social media posts. [00:09:23] This data is not random. It reflects who has access to digital platforms, who publishes content, who dominates public discourse, and who is systematically silenced. This is the heart of the problem. [00:09:36] LLMs don't just learn language, they learn the worldviews embedded in that language. And the world they inherit from the Internet is deeply unequal. Researchers have repeatedly shown that LLMs reproduce and often amplify racist, sexist, ableist, and culturally imperialist biases. For example, models trained on web data have been found to associate names typically perceived as African American with criminality, or to rank male pronouns as more closely related to science and leadership than female ones. In machine translation, gender-neutral job titles in one language frequently end up being rendered with masculine forms in English, a pattern that reinforces gender bias rather than correcting it.
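The kind of association measurement behind such findings can be illustrated with cosine similarity over word embeddings. The tiny three-dimensional vectors below are invented purely for the sketch; real embeddings have hundreds of dimensions, are learned from corpora by models like word2vec or GloVe, and are audited with more careful batteries of tests (WEAT-style comparisons, for instance). Only the mechanics are shown here: comparing how close a target word sits to two attribute words.

```python
import math

def cosine(u, v):
    """Cosine similarity: how closely two vectors point in embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy, hand-made embeddings purely for illustration; in a real audit
# these vectors would come from a trained model.
emb = {
    "science": [0.9, 0.1, 0.2],
    "he":      [0.8, 0.2, 0.1],
    "she":     [0.1, 0.9, 0.3],
}

# A positive gap means "science" sits closer to "he" than to "she"
# in this (deliberately biased) toy space.
bias = cosine(emb["science"], emb["he"]) - cosine(emb["science"], emb["she"])
print(f"association gap: {bias:.3f}")
```

Because the geometry is learned from usage, whatever associations dominate the training text, including stereotyped ones, end up encoded as distances like this.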
[00:10:22] Similarly, when asked to generate depictions of roles like CEOs, doctors, or engineers, LLMs tend to default to white, male, Western representations, reflecting the dominant cultural narratives embedded in their training data. This is not a bug; it's a consequence of the data. [00:10:41] Much of the content scraped from the Internet, especially forums and comment sections, skews toward dominant perspectives: white, male, Western, and affluent. Marginalized voices are often underrepresented, misrepresented, or targeted. [00:10:58] When such imbalances are baked into training data, the models reflect them back, not just passively, but with the false confidence and fluency that make machine-generated text so persuasive. This is what's often described in software development as the garbage in, garbage out principle. But in this case, the garbage isn't random noise. It's structured, historical, and ideological. [00:11:23] Machine learning amplifies the patterns it finds, and when those patterns encode inequality, so does the model. There's a persistent myth that ML is objective, that data is neutral, and that algorithms remove human bias. [00:11:40] In reality, every step of the process, from what data is collected to how it's labeled, who tunes the model, and who tests the outputs, involves subjective choices, and those choices are often shaped by power. In that sense, language models are not just technical systems, they're political artifacts. They reflect which voices are considered representative, which ideas are seen as normal, and which ways of speaking are worthy of being learned. [00:12:07] Language Ideology at Scale. If you want to teach a machine to understand language, the first question is: whose language? [00:12:15] Most LLMs are trained primarily on data in English, and not just any English, but a particular variety: standardized, formal, often United States-centric, and steeped in the values of dominant institutions.
[00:12:30] In practice, this means that language models not only absorb patterns but also internalize language ideologies: beliefs about which forms of speech are correct, valuable, or neutral. This process of encoding linguistic norms often reflects the same social hierarchies that shape schools, media, and corporate communication, what sociolinguists refer to as standard language ideology. [00:12:57] Within this framework, linguistic variation is not treated as richness but as deviation. [00:13:03] Dialects, regional accents, non-Western languages, and informal styles are frequently perceived as flawed, unreliable, or inappropriate for proper communication. [00:13:14] Language models replicate this bias by design: during dataset curation, anything that deviates from the standard, whether non-standard grammar, slang, or content from underrepresented languages, is often filtered out as noise. [00:13:30] In doing so, these systems encode a narrow view of what language should be while sidelining the vast diversity of how people actually speak. Low-resource languages, especially those spoken outside of North America and Europe, are often dramatically underrepresented in training corpora. Even when such languages are included, they are rarely prioritized in model optimization, which means performance suffers and the gap between dominant and marginalized linguistic communities deepens. This dynamic isn't limited to language form; it extends to content as well. The training data that fuels LLMs is saturated with the opinions, metaphors, cultural assumptions, and social narratives of privileged groups. [00:14:15] As a result, these models often reproduce such dominant perspectives by default, unless specifically directed otherwise. Even more concerning is how this replication plays out at scale.
[00:14:28] When LLMs are used to generate everyday content, from marketing emails and product descriptions to automated lessons and customer-facing scripts, they propagate a superficial idea of what language should be and, by implication, what counts as legitimate or authoritative knowledge. [00:14:47] In doing so, they risk reinforcing cultural hierarchies under the guise of fluency and efficiency. This is especially problematic in global contexts. Companies that use English-only LLMs to generate customer support scripts in India, Brazil, or Nigeria, for instance, may unknowingly impose a Western corporate tone that feels alien or even patronizing to local users. [00:15:11] Even translation models, when trained on biased or imbalanced datasets, can end up erasing culturally specific expressions or smoothing out linguistic nuance, all in the pursuit of efficiency and scalability. [00:15:25] Put simply, NLP systems have the potential to reinforce existing hierarchies tied to language, class, geography, and race, often invisibly and without accountability. [00:15:36] When we scale language through machines, we unwittingly scale the ideologies embedded in that language. [00:15:44] Societal Risks and Ethical Frontiers. Language models don't just generate text, they generate consequences. [00:15:52] As language models move beyond research settings and into real-world systems, the societal risks are no longer theoretical. These technologies are already influencing how people are informed, evaluated, categorized, and monitored. [00:16:08] Misinformation, Disinformation, and the Illusion of Credibility. One of the most immediate concerns is how authoritative machine-generated content can sound, even when it is entirely incorrect. With the right input, LLMs can produce coherent and contextually appropriate text that blurs the boundary between fact and fiction.
[00:16:31] This makes them potent engines for both misinformation (false content spread unknowingly) and disinformation (deliberately deceptive material). These aren't hypothetical risks. Such content is already being created and disseminated at scale in environments where speed and fluency are mistaken for accuracy, like social media and real-time customer service. This is a serious problem. [00:16:55] AI-generated news summaries, medical advice, or legal explanations may carry a veneer of credibility that masks their unreliability. [00:17:05] And because LLMs do not know whether something is true, they can't warn users when they hallucinate. The result is a growing credibility crisis. Readers are faced with fluent but unverified text, and users interact with systems that are optimized for plausibility, not truth. [00:17:23] Surveillance and Language-Based Profiling. As LLMs become embedded in surveillance infrastructure, the complex systems of cameras, sensors, automated moderation tools, and algorithmic policing used by platforms and authorities, they introduce new mechanisms for language-based profiling. [00:17:43] These models are deployed to scan vast volumes of text, like posts, comments, and chats, for signs of suspicion or deviance. However, what counts as suspicious is often shaped by training data that privileges standardized English and reflects dominant mainstream language norms. As a result, models may misclassify informal, culturally specific, or non-standard varieties of language, such as African American Vernacular English (AAVE), regional dialects, or Internet slang, as offensive, unreliable, or harmful. [00:18:17] In doing so, these systems risk penalizing the very linguistic diversity they should be designed to recognize and respect. [00:18:26] In fact, social media posts written in AAVE are 1.5 times more likely to be flagged as offensive by hate speech detection tools than standard American English posts, even when the actual content is not hateful.
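Disparities like the one just described are typically surfaced by auditing a moderation model's false-positive rate per dialect group: among posts a human labeled as not hateful, how often did the model flag them anyway? The sketch below shows only the bookkeeping; the audit records, group labels, and numbers are made up for illustration, and a real audit would use a labeled evaluation set and an actual classifier's decisions.

```python
from collections import defaultdict

def false_positive_rates(examples):
    """Per-group false-positive rate: the share of non-hateful posts
    that the moderation model nonetheless flagged as offensive."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_hateful, was_flagged in examples:
        if not is_hateful:          # only human-labeled non-hateful posts count
            total[group] += 1
            flagged[group] += was_flagged
    return {g: flagged[g] / total[g] for g in total}

# Hypothetical audit records: (dialect group, gold label, model decision).
audit = [
    ("AAVE", False, True), ("AAVE", False, False),
    ("AAVE", False, True), ("AAVE", False, False),
    ("SAE", False, True), ("SAE", False, False),
    ("SAE", False, False), ("SAE", False, False),
]

rates = false_positive_rates(audit)
print(rates)  # in this toy data, AAVE posts are flagged twice as often
```

Comparing these per-group rates, rather than a single overall accuracy number, is what makes the disparity visible at all.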
Automated systems trained primarily on formal language data often treat non-standard grammar or spelling as signs of risk, illegitimacy, or poor communication, thereby reproducing classist and racist assumptions. [00:18:53] This kind of linguistic filtering acts as a form of gatekeeping, disproportionately silencing speakers of marginalized dialects and varieties. [00:19:02] Most users remain unaware that they've been flagged or shadow-banned, making these systems opaque instruments of control. [00:19:09] Under the pretext of maintaining quality or safety, such technologies end up stifling linguistic diversity. The risks are even more severe in authoritarian settings, where NLP systems have already been deployed to monitor the use of minority languages, suppress dissenting speech, and reinforce state power. [00:19:29] Even in democratic societies, chat logs, emails, and social media posts are increasingly filtered through sentiment analysis and automated flagging tools, often without the user's knowledge or consent. [00:19:42] This creates new pathways for discrimination, especially for speakers of marginalized dialects, non-native speakers, or those who use coded or subcultural language. [00:19:53] Automation of Bias in Hiring, Law, and Education. LLMs are increasingly integrated into high-stakes decision-making processes. In recruitment, for instance, AI tools have been used to screen curricula vitae (CVs) or evaluate candidate responses in real time. [00:20:14] Yet these systems often reproduce patterns of racial, gender, or socioeconomic bias found in the data they were trained on, reinforcing structural inequalities rather than correcting them. In education, automated tutors and grading systems may favor responses that conform to standardized academic norms while penalizing linguistic variation, creativity, or culturally specific styles of expression.
[00:20:40] This can disproportionately disadvantage students from minoritized language backgrounds, even when their ideas are strong. [00:20:47] Meanwhile, in the legal domain, LLMs are being used to summarize court documents, draft legal memos, and even contribute to sentencing recommendations. [00:20:58] Yet these systems operate without the ethical accountability, contextual judgment, or oversight we demand from human professionals, raising serious questions about transparency, fairness, and due process. [00:21:10] These uses may save time, but they also automate the prejudices present in their training data and often provide no transparency into how decisions are made. When biased predictions are wrapped in the language of efficiency, they gain power and escape scrutiny. [00:21:27] The Illusion of Objectivity. Perhaps the most dangerous risk of all is the illusion that machines are neutral. [00:21:35] When an LLM outputs a result, it often appears objective, the product of pure data, not human judgment. [00:21:43] But as we've seen, every model is shaped by choices about data, labels, architecture, and purpose. [00:21:50] The outputs reflect those choices, whether we see them or not. This illusion of objectivity becomes a shield that allows institutions to say "the model decided" rather than interrogating who trained the model, on what data, and with what goals. Towards More Just Machines. [00:22:10] The problems are real, but so are the possibilities. [00:22:14] In response to the persistent biases and exclusions embedded in language technologies, a growing community of researchers, linguists, ethicists, and activists is shifting the conversation. [00:22:26] No longer satisfied with chasing performance metrics or celebrating technological milestones, they are posing more fundamental and more urgent questions. Whose languages are being represented, whose voices are elevated, and whose are filtered out?
What would it take to build language technologies that don't just function efficiently but function ethically? [00:22:47] These are not merely technical questions; they are deeply political and ethical. For decades, the development of NLP systems has prioritized scale, spending, speed, and surface-level fluency, often at the expense of nuance, inclusivity, and accountability. [00:23:04] But a new vision is gaining ground, one that treats NLP not as a purely computational problem but as a site of cultural negotiation and social responsibility, a space where questions of power, access, and linguistic justice must be taken seriously. This shift marks a decisive move away from techno-optimism and toward the principles of responsible AI. At its core is a commitment to inclusive NLP: systems designed with an awareness of linguistic variation, cultural specificity, and systemic inequality, and with the intention to actively mitigate harm rather than reproduce it. [00:23:43] Inclusive NLP and Responsible AI. Inclusive NLP starts with a recognition that language is not universal, neutral, or one-size-fits-all. [00:23:54] Instead of treating English as a default, new approaches emphasize multilingual parity, dialectal awareness, and context-sensitive modeling. Projects like Masakhane, a grassroots NLP initiative focused on African languages, demonstrate the power of community-led data curation. [00:24:13] By involving local speakers in the creation and validation of language resources, Masakhane is helping to ensure that underrepresented communities have a say in the technologies built using their languages. [00:24:25] At the same time, emerging responsible AI frameworks are broadening the scope of evaluation.
[00:24:31] Rather than focusing solely on accuracy or performance benchmarks, these approaches incorporate additional metrics to assess fairness, representational harm, toxicity, and inclusivity, dimensions essential for understanding how language models impact people in the real world. [00:24:49] Leading research labs are experimenting with techniques to de-bias embeddings, audit training data, and design interventions that interrupt harmful outputs, not just after deployment but at the training stage. [00:25:03] Better Data Curation. Instead of blindly scraping massive datasets from the Internet, some researchers are creating smaller, carefully sourced corpora that reflect linguistic diversity, include metadata for context, and center the voices of historically marginalized communities. [00:25:21] Others are building data nutrition labels: transparent documentation that outlines what data was collected, how, from where, and with what biases in mind. [00:25:31] The aim is not merely to build models that are cleaner or more efficient, but ones that are accountable: systems in which the sources of data are transparent and decisions about inclusion are deliberate rather than accidental. Centering Linguists, Ethicists, and Communities. Creating language technologies that engage ethically with human communication requires the expertise of those who understand language not just technically but socially and culturally. Linguists, with their insights into variation, context, and meaning-making; ethicists, who can anticipate harm and ask uncomfortable questions; and, crucially, communities whose languages and voices are being modeled, all must have a seat at the table. Projects like Data Statements for NLP have shown that when developers include detailed sociolinguistic context about a dataset, including demographics, setting, and communicative goals, models perform more reliably and ethically.
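Documentation of this kind can be as lightweight as a structured record attached to each corpus. The sketch below loosely follows the spirit of data statements and data nutrition labels; the field names, the `DataStatement` class, and every sample value are illustrative assumptions, not a standardized schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DataStatement:
    """Minimal dataset documentation in the spirit of 'Data Statements
    for NLP': who is in the data, how it was collected, known gaps."""
    name: str
    languages: list
    curation_rationale: str
    speaker_demographics: str
    speech_situation: str
    known_limitations: list = field(default_factory=list)

# Illustrative example; every value here is invented.
statement = DataStatement(
    name="community-forum-corpus-v1",
    languages=["yo", "sw", "en"],
    curation_rationale="Center under-resourced African languages.",
    speaker_demographics="Adult forum users; self-reported L1 speakers.",
    speech_situation="Informal, asynchronous written discussion.",
    known_limitations=["Urban speakers overrepresented", "No audio data"],
)

# The record can be serialized and shipped alongside the corpus.
print(asdict(statement)["languages"])
```

The design choice that matters is not the exact fields but that the documentation travels with the dataset, so downstream model builders inherit its known limitations rather than rediscovering them after deployment.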
[00:26:31] Similarly, participatory design, where AI tools are co-developed with affected communities, is emerging as a best practice in global NLP development. [00:26:42] Language Equity as an Ethical Imperative. In many ways, language equity is the ethical frontier of AI. Just as we advocate for fairness in healthcare, housing, and education, we must also demand equity in how machines speak, listen, and make decisions. [00:27:00] When AI systems marginalize the way someone communicates, they are not simply misinterpreting the message; they are, in effect, denying the speaker full participation in digital life. Language is not just a medium of expression; it is a marker of identity, belonging, and power. [00:27:18] Systems that fail to recognize this risk reinforcing the very inequalities they claim to transcend. [00:27:25] If we want machines to serve humanity, they must be trained on a fuller picture of humanity's voices, not just the loudest ones. [00:27:33] This work is ongoing and difficult, and it often runs counter to commercial incentives for scale and speed. But it is necessary, because machines don't just learn from us; they shape the world we build. Efforts such as inclusive data practices, multilingual modeling, and participatory research represent meaningful progress, but they are only the first steps. [00:27:57] The more difficult work lies in shifting not just the way we build language technologies, but the way we think about them. As machines become increasingly fluent, the temptation grows to treat them as conversational partners, reliable experts, or even arbiters of truth. But to do so is to mistake mimicry for meaning and to overlook the systems of power that shape who gets to be heard in the first place. [00:28:21] Choosing Inclusion. Today's language models are not neutral mirrors. [00:28:27] What they produce is not speech but the illusion of it, built from the scaffolding of past human utterances, repeated and recombined at unimaginable scale.
This doesn't make them useless, but it does make them dangerous if we forget what they are. When we treat machines as neutral, we risk outsourcing not only our tasks but also our judgments. We scale linguistic dominance without accountability and mask exclusion as efficiency. [00:28:56] In short, we let statistical echoes stand in for human voices. But we can choose differently. We can build language technologies that listen more carefully, honor variation, and serve rather than silence. [00:29:10] We can move beyond the question of what machines can say and start asking what they should say, and for whom. Because in the end, machines don't talk; we do. And it's our responsibility to make sure they echo something worth saying. [00:29:25] This article was written by Anika Schaefer. She is a localization quality expert with a background in business, marketing, and psychology. [00:29:34] She drives quality strategy for global brands and explores how language shapes identity, society, and the way we see the world. Originally published in MultiLingual magazine, Issue 249, February 2026.
