Episode Transcript
[00:00:00] Arabic speakers deserve high-quality machine translation.
[00:00:04] By Dima Jaradat. Machine translation (MT) is a key driver of the future of the language industry, and while it has proven effective for several languages, it is less robust for Arabic. As MT continues to proliferate, it is crucial that linguists working with Arabic join the conversation so that they contribute to refining the technology to better augment their roles. After all, Arabic is one of the world's most commonly spoken languages. According to WorldData.info, Arabic is spoken by about 328 million people as their first language and by around 400 million people as a foreign language. It is one of the six official and working languages of the United Nations General Assembly. Arabic is also widely spoken in the United States (U.S.). As of 2022, U.S. Census Bureau statistics indicate that Arabic is spoken at home by more than a million people.
[00:00:58] Not only that, but today's sophisticated MT engines actually have roots in Arabic, building on the legacy of 9th-century Arab cryptographer Abu Yusuf Al-Kindi (see "Through Lines of Genius" from Multilingual Magazine's May 2024 issue). Three techniques that Al-Kindi identified serve as the basis for modern MT: the quantitative technique, which compares the frequency of plaintext and ciphertext letters; the qualitative technique, which analyzes which letters can or cannot be combined and determines the proper order of letter sequences; and the probable word technique, which leverages the predictability of language to fill in missing information with likely word choices. Challenges of Arabic. Despite the historical link between MT and Arabic, in the 21st century, MT tools are underperforming for speakers of Al-Kindi's language. Several linguistic features of Arabic pose challenges for modern MT engines. One factor is intralinguistic diversity, referred to as diglossia, a phenomenon whereby a language has distinct varieties used in different contexts within the same community.
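The quantitative technique mentioned in this segment, comparing letter frequencies in plaintext and ciphertext, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not from the article: the English sample sentence, the simple shift cipher (a special case of letter substitution), the use of "e" as the expected most frequent letter, and the function names.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def shift_text(text, shift):
    """Shift each ASCII letter by `shift` positions; leave other characters unchanged."""
    out = []
    for ch in text.lower():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def most_frequent_letter(text):
    """Return the most common letter; ties are broken alphabetically."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    return max(sorted(counts), key=counts.get)


def guess_shift(ciphertext, reference_letter="e"):
    """Assume the most frequent ciphertext letter stands for `reference_letter`."""
    top = most_frequent_letter(ciphertext)
    return (ALPHABET.index(top) - ALPHABET.index(reference_letter)) % 26


# Illustrative sample in which "e" is clearly the most frequent letter.
plaintext = "the secret meeting is set for seven in the evening near the eastern gate"
ciphertext = shift_text(plaintext, 3)

shift = guess_shift(ciphertext)
recovered = shift_text(ciphertext, -shift)
print(shift, recovered)
```

The guess only works when the text is long enough for its letter frequencies to resemble those of the language as a whole, which is the same dependence on predictable language that the article goes on to discuss.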
[00:02:08] In addition to the formal Modern Standard Arabic variant, there are at least 22 distinct dialects. These include, but are not limited to, the dialects of the Levant, the Persian Gulf, and Northern Africa. While diglossia enriches the collective linguistic repertoire of the Arabic-speaking world, it is a pain point in translation. By virtue of its very nature, professional translation transfers content not only cross-linguistically but also cross-culturally. Therefore, understanding the linguistic and cultural nuances of Arabic-speaking locales is essential to accurate translation. Another translation challenge that arises when working with Arabic is polysemy, where a word can bear multiple meanings or senses. For instance, as a native speaker of Jordanian Arabic, I grew up saying ya'tik al-afiya, an expression that literally translates to "May God give you good health." It is one of many possible ways to express gratitude or even greet someone. In this phrase, afiya means health, but the word does not have the same meaning in all Arabic-speaking locales. For example, afiya means fire in the Moroccan Darija dialect, so unless you want to see someone set on fire, you would want to use caution when using this expression in Morocco. This is but one example of how polysemy might cause translation challenges for human linguists and MT engines alike. Arabic also features inconsistent term usage across dialects, which undermines the probable word cryptanalysis technique. In terminology management, the practice and study of collecting, defining, standardizing, translating, localizing, and maintaining terms for domain-specific use in a terminology database or glossary, a term designates one and only one concept (a 1:1 ratio).
[00:03:56] For example, "target language" is a translation studies term that refers to the language into which a source text is being translated. Through consistent use, it has been standardized to mean exactly that, facilitating understanding among speakers in that specific domain. Alas, the situation is more complex in Arabic. Consider the English term "authority" in the sense of an official governmental body, such as the Federal Labor Relations Authority, which has the following Arabic equivalents in three different locales: maslaha in Egypt, hay'a in Kuwait, and jiha in the United Arab Emirates. While these Arabic terms are all semantically acceptable translations of the English term "authority," they are not interchangeable across Arabic-speaking locales.
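One way to picture the locale-dependent terminology described in this segment is a glossary keyed by both term and locale. The sketch below is hypothetical: the locale tags (ar-EG, ar-KW, ar-AE), the function names, and the romanized spellings of the three Arabic terms are assumptions for illustration, not a description of how any particular MT engine or terminology tool works.

```python
# Toy glossary built from the article's "authority" example.
# Locale tags and romanized spellings are illustrative assumptions.
GLOSSARY = {
    "authority": {
        "ar-EG": "maslaha",  # Egypt
        "ar-KW": "hay'a",    # Kuwait
        "ar-AE": "jiha",     # United Arab Emirates
    }
}


def lookup_term(term, locale, glossary=GLOSSARY):
    """Return the locale-specific equivalent, or None when no entry exists."""
    return glossary.get(term.lower(), {}).get(locale)


def check_translation(term, locale, candidate, glossary=GLOSSARY):
    """Classify a candidate rendering as 'ok', 'mismatch', or 'unknown'."""
    expected = lookup_term(term, locale, glossary)
    if expected is None:
        return "unknown"
    return "ok" if candidate == expected else "mismatch"


print(lookup_term("authority", "ar-KW"))
print(check_translation("authority", "ar-EG", "jiha"))
```

Keying on the locale keeps the 1:1 relationship between term and concept within each locale, while making visible that a rendering accepted in one locale counts as a mismatch in another.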
[00:04:43] This is an already challenging undertaking for human linguists, let alone MT engines. Prior to being trained on nuanced datasets, MT engines often mistranslated such terms, and in many cases they still do. The irregularity in the usage of terms across Arabic dialects breaks down the predictability of language, leading to MT outputs that lack accuracy, consistency, and cultural sensitivity, all of which are hallmarks of successful translations that meet the specific needs of target readers. A Call for Improvement. In the era of human-machine collaboration, it is essential to develop a reliable ally that effectively augments our roles moving forward. The solution is not to abandon MT engines in response to the issues outlined above.
[00:05:29] Rather, linguists working with Arabic, especially those at the forefront of MT training and refinement, should take corrective and preventive measures to enhance the efficacy and usability of MT for Arabic. Addressing multiple layers of complexity calls for a multifaceted approach to tackle the polysemy, diglossia, inconsistent term usage, and other language-specific challenges inherent in refining MT for Arabic. An ideal solution would be to establish a formal standardization of terminology across Arabic-speaking locales. This proposal is by no means an attempt to stifle the rich intralinguistic diversity of Arabic. Rather, it is a means of streamlining the use of Arabic terms in translation, making the process easier for both human translators and MT engines. Even if such a standardized framework materializes, cultural nuances will inevitably exist. This is where MT, no matter how sophisticated it becomes, will still fall short. Given that MT engines hinge on training data, developing and using datasets that fairly represent the full spectrum of Arabic-speaking locales is of paramount importance.
[00:06:38] Commendable efforts are already underway to make this a reality, and continuing to invest in this arena would yield more adaptable and reliable MT engines capable of handling the cultural subtleties of Arabic. We must also overcome the paradox that many current and future native Arabic-speaking linguists remain unaware of the historical roots of MT. Recognizing these Arab roots will encourage Arabic speakers to take the driver's seat and contribute to, rather than merely respond to, the future of MT.
[00:07:06] A good place to start is Arabic translation and interpreting classrooms. These are incubators of a vast pool of talent waiting to be tapped and are the spaces where the next generation of linguists is being prepared for the challenges of the future.
[00:07:20] Instead of graduating cohorts of passive users of technology, educators should encourage students to view themselves as active participants in the MT conversation. This article was written by Dima Jaradat, a PhD student in translation studies and a global literacy instructor at Kent State University.
[00:07:38] In her research, she's interested in language policy and justice. She has also worked as an English-Arabic linguist. Originally published in Multilingual Magazine, Issue 238, March 2025.