Mitigating Hallucinations in AI-Powered Translation

March 03, 2026 00:11:51
Localization Today

Hosted By

Eddie Arrieta

Show Notes

By Olga Beregovaya and Alex Yanishevsky

While AI translation tools can generate hallucinations, the good news is that there are many practical ways to detect and mitigate them. This article delves into the reasons behind the problem and the techniques that can be used to address it.


Episode Transcript

[00:00:00] Mitigating Hallucinations in AI-Powered Translation, by Olga Beregovaya and Alex Yanishevsky. Artificial intelligence has revolutionized translation, providing high-quality output at unprecedented speed and scale. [00:00:19] Today's AI-powered translation tools convert speech, text, and images into different languages in seconds, enabling global communication and collaboration like never before. [00:00:31] But these massive capabilities come with massive shortcomings. Human communication is complex and ever evolving. [00:00:40] Our languages encompass more than words. They require a deep knowledge of tone, context, emotions, intent, and shared experiences. [00:00:50] Although generative AI-powered translation systems are amazingly fluent, they sometimes struggle with relevance and accuracy when it comes to language specificities such as voice, tone, cultural nuance, and even verifiable facts. When it comes to translation, large language models (LLMs) can and do make mistakes. [00:01:14] One of the biggest challenges in AI translation is the presence of hallucinations. [00:01:19] Unlike simple mistranslations, hallucinated outputs introduce false information by generating details that were not present in the original text. What's more, these hallucinations usually appear credible and are delivered with the same confidence as accurate outputs. Even minor inaccuracies in translation can have major consequences. [00:01:41] False fluency, where the text sounds natural in the target language but is actually wrong, can lead to misunderstandings, misinformation, and ultimately distrust. [00:01:54] In extreme circumstances, it can even result in legal proceedings. To realize the full potential of AI translation tools, we must first recognize their pitfalls. The good news is that there are many practical ways to detect and mitigate AI hallucinations in translation.
[00:02:13] This article delves into the reasons behind the problem and the techniques that can be used to address it.

What causes AI hallucinations?

[00:02:21] Unpredictable model behavior in language translation can be caused by a mix of technical limitations in models' architecture, noisy training datasets, vague or ambiguous source inputs, unstable model decoding parameters, and the simple fact that generalized foundational models were not initially built for translation tasks. The ability to perform translation is an ancillary benefit due to predominantly English-centric training data. There is a higher risk of AI hallucination for languages other than English, especially under-resourced languages and languages that are both complex and linguistically distant from English, such as Estonian or Turkish. Different translation models have different blind spots. [00:03:09] Neural machine translation (NMT) models excel at delivering actionable, consistent, and predictable translations, especially in well-supported languages with abundant training data. Through continuous retraining and adaptation, purpose-built NMT models also tend to gradually produce more relevant, on-brand translations. [00:03:32] However, they tend to be more limited in language fluency and understanding of the context present in the source language. [00:03:40] On the other hand, LLMs leverage diverse datasets to perform a multitude of tasks, from coding to search and document summarization, with translation being just one of many. While LLMs have the ability to incorporate rich cross-domain knowledge, they carry a higher risk of distorting the meaning of source material. [00:04:02] Both types of systems rely on tokens rather than full words to process text. These tokens can be words, subwords, or even punctuation. [00:04:13] Tokenization can be a fundamental constraint even with large context windows.
[00:04:19] For example, due to the absence of word delimiters, character-based tokenization of Asian languages increases the risk of translation unpredictability and model hallucinations. [00:04:31] Model decoding parameters such as temperature lend further variance and unpredictability to generated outputs. Temperature, which balances creativity with predictability, can lead to increased AI hallucinations. [00:04:47] Higher temperature settings use more randomized tokens to predict and generate text. This can produce more natural-sounding translations, but risks going too far by introducing errors or even completely fabricated details while producing false fluency. Non-deterministic AI models can give different outputs from the same input. It's impossible to tell if a translation prompt will work by just reading it. In LLM-based translation, surface accuracy can be misleading. A prompt with spelling mistakes or logical gaps might still yield a fluent output, while a carefully detailed prompt can fail unexpectedly. [00:05:27] Variations in domain or tone further increase the risk of hallucinations, producing inconsistent or unreliable results.

How to Mitigate AI Hallucinations

[00:05:35] Deploying practical mitigation strategies can reduce the risk of hallucinations, enhance linguist productivity, and optimize overall translation quality. One of the simplest ways to identify hallucinations is by running a mechanical, rule-based check, much like a conventional spelling or grammar checker. This automated verification process reviews structural issues such as word count, punctuation, spelling, and terminology, and quickly flags potential hallucinations. [00:06:12] Take the source-to-target length ratio as an example. If an input is 20 words and the resulting output is 200 words, there is probably a hallucination in the output. Otherwise, where did the extra words come from?
A drastic change in word count often indicates hallucinations or unnecessary additions that require automated correction or, even better, human intervention. Another strategy involves applying machine learning-based techniques, including reviewing semantic similarity and lexical accuracy with both modern LLMs and state-of-the-art embedding models such as text-multilingual-embedding-002 in Google's Vertex AI platform. [00:06:58] It's also possible to experiment with more advanced approaches such as semantic entropy algorithms and log probability analysis. This meaning-based approach analyzes ambiguous phrasing and inconsistent outputs, which can indicate a potential hallucination. [00:07:16] The higher the semantic dissimilarity, the higher the probability that something is off with the translation. The following five approaches have also proven successful at mitigating hallucinations.

Properly Structured Prompts. [00:07:28] It sounds obvious, but AI prompting techniques that reinforce fidelity mean that AI translation sticks to the source material instead of trying to fill in the blanks. [00:07:42] Be as specific as possible, providing the model with explicit instructions. [00:07:47] For example, safeguard instructions in the form of an LLM prompt could include: if you're not certain, reply with "I don't know" as part of the reply. [00:07:58] This is especially critical for domain-specific content such as legal, medical, or technical documents.

Proper Model for the Task. [00:08:06] There are diverse sets of models available for AI translation. Their size, cost, speed, and quality can have dramatic impacts on your ability to translate at scale.
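The mechanical length-ratio check and the embedding-based similarity check described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the ratio threshold is an arbitrary assumption, and the plain-Python cosine function stands in for whatever embedding model (such as the Vertex AI model mentioned above) would supply the vectors in practice.

```python
def length_ratio_flag(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Flag a possible hallucination when the target word count diverges
    wildly from the source word count. The 3.0 threshold is illustrative."""
    src_len = max(len(source.split()), 1)  # avoid division by zero
    tgt_len = len(target.split())
    ratio = tgt_len / src_len
    return ratio > max_ratio or ratio < 1 / max_ratio

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.
    In practice the vectors would come from a multilingual embedding model."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)
```

A pipeline might flag any segment where the length ratio trips the rule-based check, or where the source and target embeddings fall below a tuned similarity threshold, and route those segments to a human reviewer.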
[00:08:21] Understanding the languages supported, the optimal prompting styles, the models' fundamental ability to perform specific linguistic tasks, the prompt size you can execute before losing quality, the appropriate parameter setup, and the quality-versus-cost tradeoffs of the various models is all necessary to optimize the AI translation flow.

Retrieval-Augmented Generation. RAG technology combines industry-leading AI language models with retrieved knowledge sources to provide context. RAG enables translations based on original linguistic assets such as translation memory matches, style preferences, available glossary terms, and more. The result is richer, more accurate translations based on your preferences.

Model as a Judge. [00:09:14] The model-as-a-judge approach uses LLMs to evaluate another model's outputs. This practical mitigation strategy implements a self-healing loop to correct identified issues through automated post-editing and smoothing.

Human in the Loop. [00:09:32] While AI brings speed and consistency to translation, human reviewers bring unmatched creativity, emotional intelligence, cultural knowledge, and critical thinking skills. With AI-powered translation tools, subject matter experts can verify machine outputs and correct mistakes towards better AI translation.

[00:09:58] The need for fast, accurate multilingual translations spans all industries, and with stakeholder safety, trust, and finances on the line, bad translations are not an option. At least with current AI technology, hallucinations can't be completely eliminated. [00:10:16] Until AI models have enough data and real-world knowledge, errors are inevitable. [00:10:22] This is especially true for lower-resource languages such as many Indic, African, and long-tail Slavic languages, which remain at higher risk of hallucinations and inaccuracies due to insufficient lexical and corpus coverage.
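The structured-prompting and RAG ideas above can be combined in practice: a fidelity safeguard ("if you're not certain, say so") plus retrieved glossary terms and translation-memory matches assembled into one prompt. The sketch below is illustrative only; the function name, prompt wording, and data shapes are assumptions, and a real system would pull the glossary and TM entries from an actual retrieval layer before calling the model.

```python
def build_translation_prompt(source_text: str,
                             glossary: dict[str, str],
                             tm_matches: list[str],
                             target_lang: str = "French") -> str:
    """Assemble a RAG-style translation prompt: fidelity safeguards plus
    retrieved linguistic assets. Hypothetical wording, not a vetted prompt."""
    glossary_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
    tm_lines = "\n".join(f"- {m}" for m in tm_matches)
    return (
        f"Translate the text below into {target_lang}.\n"
        "Stay strictly faithful to the source; do not add, omit, or embellish.\n"
        "If you are not certain about any part, reply with 'I don't know' "
        "for that part instead of guessing.\n"
        f"Use these glossary terms:\n{glossary_lines}\n"
        f"Reference translations from translation memory:\n{tm_lines}\n"
        f"Source text:\n{source_text}"
    )
```

The same assembled-context pattern also supports the model-as-a-judge loop: a second prompt can present the source, the candidate translation, and the same retrieved assets, and ask the judging model to flag additions or omissions.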
Overcoming these challenges will require new resources, diverse datasets, fine-tuned models, and output validation that is impossible with machines alone. [00:10:48] The future of high-fidelity translations will combine proactive mitigation and human oversight to bring about seamless human-model collaboration. [00:10:58] As AI translation models evolve, human oversight is and will continue to be essential to detecting hallucinations, correcting inaccuracies, and creating meaningful connections across cultures.

Olga Beregovaya is vice president of AI at Smartling. She has more than 25 years of experience in natural language processing, machine learning, AI model development, and global content delivery. Olga serves as a technology program sponsor for Women in Localization. [00:11:31] Alex Yanishevsky is senior director of AI Solutions at Smartling. His areas of expertise include AI, natural language processing, machine translation, data mining, and computational linguistics. He has written numerous articles for industry journals and has presented at industry conferences.

Other Episodes

Episode 153

March 12, 2024 00:09:18

AI in Globalization

Will globalization go the way of the dinosaurs? Edith Bendermacher and Mimi Moore argue that by exploring new directions, learning from others, and leading...


Episode 201

September 06, 2022 00:03:45

Developing machine translation to help Indigenous refugees navigate immigration courts

To help ease the linguistic challenges for refugees from Central America and Mexico seeking asylum in the United States, a team of researchers at...


Episode 174

May 02, 2024 00:11:07

Throughlines of Genius: Al-Kindi and the origins of machine translation

By Cameron Rasmusson By analyzing the frequency of letters in an Arabic text, ninth-century scholar Abu Yusuf al-Kindi established a framework for identifying patterns...
