Episode Transcript
[00:00:00] Terminology in the Age of AI by Kara Warburton. From Carl Linnaeus's work on taxonomies in the pure sciences to Melvil Dewey's decimal classification system for libraries, and going back even further to the philosophers of antiquity, humankind has sought to organize concepts, objects, and phenomena in order to fully comprehend the world around us. From this quest emerged the discipline known as terminology.
[00:00:28] For decades, terminologists like myself have known that terminological resources, long digitized, can support various practical IT applications. Translation companies are using computer-assisted translation (CAT) tools, and global enterprises are increasingly adopting advanced natural language processing technologies for controlled authoring, content management, and other business processes.
[00:00:53] Companies and organizations are starting to see the value of developing their own in-house terminology databases (termbases). But what about artificial intelligence, which is also an NLP application? Let's reflect on the relationship between AI and terminology, starting by looking at microcontent as a type of linguistic resource.
[00:01:14] The shift from terminology to microcontent is what will enable terminology resources, e.g. termbases, to serve cutting-edge applications such as machine translation and generative artificial intelligence.
[00:01:30] First, we need to stop calling them terminology resources, as this leads many to think that they contain only terms in the strictest sense, that is, designations of concepts from a language for specific purposes, e.g. legal and medical terms. Calling these small units of language microcontent can help unchain the discipline from that narrow perception. Domain-specific terms are only one form of microcontent. Other forms include words from the general lexicon, fixed phrases, slogans, collocations, word associations, graphical symbols, even morphemes (subword units). Microcontent refers to small linguistic units that, if captured and discretely represented in digital form, can be consumed by an application or process to improve its output and performance.
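To make the definition concrete, here is a minimal sketch (not from the article) of how one unit of microcontent might be captured and discretely represented in digital form, with enough metadata for an application to decide how to consume it. The field names and sample data are hypothetical.

```python
# Hypothetical representation of microcontent units: small linguistic
# units of various kinds, each carrying metadata for downstream use.
from dataclasses import dataclass, field

@dataclass
class MicrocontentUnit:
    text: str            # the small linguistic unit itself
    kind: str            # "term", "collocation", "slogan", "morpheme", ...
    language: str        # language code, e.g. "en"
    domain: str = ""     # subject field; empty for the general lexicon
    metadata: dict = field(default_factory=dict)

units = [
    MicrocontentUnit("tort", "term", "en", domain="law"),
    MicrocontentUnit("file a claim", "collocation", "en", domain="law"),
    MicrocontentUnit("-itis", "morpheme", "en", domain="medicine"),
]

# An application filters for exactly the kinds of units it needs.
legal_units = [u.text for u in units if u.domain == "law"]
```

The point of the discrete representation is that a CAT tool, a controlled-authoring checker, and an SEO pipeline can each select different kinds of units from the same resource.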
[00:02:17] Conventional terminology resources have done just that: improved content production processes such as translation, authoring, public service distribution, and marketing messaging.
[00:02:27] However, we have an opportunity to extend the use of terminology resources further, to rapidly developing NLP technologies, specifically MT and GenAI. And these applications need more than just terms; they need microcontent.
[00:02:43] Microcontent in the service of AI. Since the public release of ChatGPT in November 2022, followed quickly by other GenAI chatbots, interest in GenAI has intensified, to put it mildly. There is no doubt that AI will transform society, but how it will do so remains to be seen.
[00:03:04] Some people fear jobs being replaced, education being impacted, and media being controlled. On the flip side, AI could enable more effective knowledge extraction to solve problems. We're entering unknown territory. Let's start our reflections by summarizing how AI works. GenAI uses a large language model (LLM), a framework trained on a large corpus of texts to infer knowledge from existing human-created content. Public GenAI applications such as OpenAI's ChatGPT and Microsoft's Copilot use general-purpose LLMs developed with openly available Internet content.
[00:03:40] These systems aren't trained on an organization's specific materials, so the information they produce will be equally generic, and thus sometimes irrelevant or even incorrect for a given context. Another problem is that the content of LLMs is dirty: giant blobs of unvetted, unstructured text rife with inconsistencies, inaccuracies, and ambiguity.
[00:04:03] The saying "garbage in, garbage out" could not be more fitting for LLMs and GenAI, and most would agree that cleaning LLMs is impractical due to their massive size. Organizations may be able to circumvent the dirtiness and lack of structure of generic LLMs by deploying an LLM based on their own content in a custom AI implementation.
[00:04:26] According to Donald DePalma and Arle Lommel, organizations will benefit from having large language models trained on appropriate data and focused on their use case. These focused models are an instance of vertical AI, which relies on models designed to address particular use cases, verticals, or challenges. However, for this to work, the organization's own corpus needs to be clean and well structured, which is often not the case. Adopting best practices when creating content can help to improve the performance of custom AI implementations. Leveraging corporate LLMs. The main guidelines here are consistency and accuracy in the corporate language. As companies increasingly aspire to deploy AI, they need to start producing clean content. This means having a termbase is key. Another method to improve AI is incorporating a different type of external linguistic asset, one based on structured semantics. Leveraging additional linguistic assets in GenAI is known as retrieval-augmented generation (RAG). Per IBM, RAG is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM's internal representation of information. This combination of clean LLMs and RAG reduces AI hallucinations (incorrect or misleading information).
[00:05:48] RAG is often touted as the key to driving improvements in AI. It incorporates external curated sources of knowledge into the GenAI engine to help interpret the large corpus of linguistic data that the AI system uses.
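The grounding step that RAG adds can be sketched in a few lines. This is a toy illustration, not a real LLM pipeline: the keyword-overlap retrieval and the `generate` stub are stand-ins for a vector search and an actual model call, and the termbase entries are invented.

```python
# Toy RAG sketch: retrieve curated termbase entries relevant to the
# query and prepend them as grounding context before generation.
termbase = {
    "termbase": "A database of concept-oriented terminological entries.",
    "concept orientation": "One entry per concept, grouping all its terms.",
}

def retrieve(query: str) -> list[str]:
    """Return curated definitions whose headword appears in the query."""
    q = query.lower()
    return [f"{t}: {d}" for t, d in termbase.items() if t in q]

def generate(prompt: str) -> str:
    # Stand-in for an LLM call; a real system would send `prompt`
    # to a language model here.
    return prompt

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

answer = rag_answer("What is a termbase?")
```

Because the model is grounded on the curated definition rather than on whatever the generic LLM absorbed from the web, the organization's own vetted wording reaches the prompt.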
[00:06:02] Literature on the RAG approach frequently references knowledge graphs: knowledge bases whose data is represented by, and operated on through, a graph-structured data model or topology.
[00:06:12] Ontologies, taxonomies, semantic networks, semantic stacks, Darwin Information Typing Architecture (DITA) architected topics, and other forms of intelligent content may also be leveraged in this context. Some of these resources are interdependent. For instance, a knowledge graph may contain data instances that align with an organization's ontology. The ontology may import labels from a taxonomy. The taxonomy may be derived from the termbase. The termbase may contain keywords used to semantically tag data, topics, and so forth. These assets have a common guiding principle, semantic rigor, which is core to the terminology discipline.
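The interdependence described above can be sketched as a tiny derivation chain, with the termbase at the bottom supplying the labels that the higher layers reuse. The data and relations here are hypothetical.

```python
# Sketch of the semantic stack: taxonomy relations between concept ids,
# with the termbase supplying the human-readable labels that are then
# reused in knowledge-graph triples, keeping every layer coherent.

# Termbase: concept id -> preferred English term
termbase = {"c1": "vehicle", "c2": "car", "c3": "truck"}

# Taxonomy: narrower concept -> broader concept (by id, not by label)
taxonomy = {"c2": "c1", "c3": "c1"}

# Knowledge graph: triples whose node labels are imported from the
# termbase rather than invented independently at each layer.
triples = [
    (termbase[child], "is_a", termbase[parent])
    for child, parent in taxonomy.items()
]
```

Because every layer points back to the same concept identifiers, renaming a term in the termbase propagates consistently instead of drifting apart across the taxonomy, ontology, and graph.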
[00:06:52] Michael Iantosca depicts these assets in his Semantic Content Maturity Model and adds that managed terminology is a prerequisite that must be developed before, or in parallel with, other semantic assets: an enterprise termbase is the source of truth for all the words used in our taxonomies, ontologies, knowledge graphs, and in the content itself. Anyone who has incorporated linguistic resources in NLP applications knows that to be machine-interpretable, they need to be consistent, granularly structured, semantically organized, associated with metadata, and representable in a machine-readable markup language. Properly developed termbases deliver all these features. Companies and organizations initially invested in their termbases to provide assistance to their translators, in the form of so-called bilingual glossaries. They started with primitive tools such as a word processor or a spreadsheet application.
[00:07:48] They soon realized the limitations of such tools and purchased software dedicated to developing multilingual terminology.
[00:07:55] Because the primary use case was translation, many organizations opted to use the terminology management software embedded in their CAT tool. Later came the desire to use the terminology data for controlled authoring (CA).
[00:08:09] This is where the problem starts. Terminology data in a CAT tool can't be easily transferred to a CA tool due to differences in their focus and structure, and CA requires different types of microcontent than CAT. If an organization later decides to use its termbase for other applications, such as search engine optimization (SEO), multilingual content management, or indexing, it faces more challenges due to data incompatibility and gaps. Standards governing the structure of termbases and the machine-readable representation of their content, in particular ISO 30042 (TermBase eXchange) and ISO 16642 (Terminological Markup Framework), increase the interoperability of microcontent resources across various applications. These standards require microcontent resources to adhere to certain basic principles.
[00:09:02] Concept orientation, data granularity, a wide range of metadata (also standardized), and a fixed metamodel. Concept orientation is what enables microcontent resources to comprise multiple languages, support the ranking of synonyms for CA, and enable search query expansion for SEO, among other functions. This principle also enables termbases to represent semantic relations between concepts.
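A concept-oriented entry can be sketched as follows: one entry per concept, with all languages and ranked synonyms attached to the concept rather than scattered across word-based entries. The structure and sample data are illustrative, not a standard termbase schema.

```python
# Sketch of a concept-oriented entry. Because terms in every language
# hang off one concept, the same entry serves CA (pick the preferred
# synonym), SEO (expand a query with all synonyms), and translation.
entry = {
    "id": "c42",
    "definition": "Portable computer that folds shut.",
    "terms": {
        # per language: synonyms ranked by preference
        "en": ["laptop", "notebook computer"],
        "fr": ["ordinateur portable"],
    },
}

def preferred(entry: dict, lang: str) -> str:
    """Controlled authoring: use the top-ranked synonym."""
    return entry["terms"][lang][0]

def expand_query(entry: dict, lang: str) -> list[str]:
    """SEO: expand a search query with every synonym for the concept."""
    return list(entry["terms"][lang])
```

A word-oriented glossary would need separate, unlinked entries for "laptop" and "notebook computer"; concept orientation is what makes the synonym ranking and the cross-language mapping possible at all.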
[00:09:30] Several of the more advanced terminology management software tools can represent these relations in graph form. This sounds a lot like a knowledge graph. What's really interesting is how the four principles align with the requirements for machine interpretability mentioned earlier. Take the third requirement, semantically organized, for example. This is precisely how term bases are structured.
[00:09:53] Concept orientation, and the semantic relations that it enables, is shared by the various types of external knowledge sources required by RAG: ontologies, taxonomies, knowledge graphs, etc. Regrettably, some AI experts don't realize that this principle was developed in the field of terminology centuries ago (yes, even before Wüster) and is practiced by terminologists every day.
[00:10:17] This lack of awareness about the synergy between terminology as a discipline and AI is a lost opportunity. Microcontent resources built on the proven theoretical and methodological foundations of terminology as a discipline can be leveraged to create intelligent knowledge resources to enhance AI. Because many organizations have already developed in-house termbases, the potential for reusing this data to build an AI solution should not be overlooked. For instance, we could leverage microcontent resources to: use synsets to map a word used in a query to other synonymous words in the LLM, expanding the search while maintaining accuracy; determine the knowledge domain of a term used in a query to more accurately identify the relevant content in the LLM; distinguish homonyms through associated keywords and other metadata (part of speech, domain, etc.); use accurate, curated definitions to determine the meaning of words used in a query; map words to their equivalents in other languages for multilingual AI; crawl up or down a semantic stack to broaden or narrow the information returned for a query; suggest related material based on semantic relations; and check facts against curated content. Having access to a trusted source of curated microcontent as a foundation will help to ensure that the other, derived resources (ontologies, taxonomies, knowledge graphs) are semantically coherent. As Iantosca says, a common terminology base serves as the lingua franca for your entire semantic stack; without it, one is building a stack of cards that will inevitably collapse. So it seems that microcontent resources, and the long-standing principles, methodologies, and standards inherited from terminology that are used to create them, can be the model for developing the external sources of knowledge for RAG. As Mike Dillinger pointed out in a Common Sense Advisory presentation, knowledge graphs will play a central role in the next stages of development of AI.
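One of the operations listed above, crawling up or down a semantic stack to broaden or narrow a query, can be sketched with a toy broader-term hierarchy. The hierarchy and terms here are hypothetical.

```python
# Toy semantic stack: each term maps to its broader term.
broader = {"espresso": "coffee", "coffee": "beverage"}

def broaden(term: str, steps: int = 1) -> str:
    """Walk up the hierarchy to broaden the information returned."""
    for _ in range(steps):
        term = broader.get(term, term)
    return term

# Invert the map to walk downward as well.
narrower: dict[str, list[str]] = {}
for child, parent in broader.items():
    narrower.setdefault(parent, []).append(child)

def narrow(term: str) -> list[str]:
    """Walk down the hierarchy to narrow a query to specific concepts."""
    return narrower.get(term, [])
```

A query for "espresso" that returns too little can be broadened to "coffee" or "beverage"; a query for "beverage" that returns too much can be narrowed to its subordinate concepts, exactly because the termbase records those relations explicitly.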
And we could say that terminology work is at the very heart of knowledge graph construction. In the voluminous discussions about GenAI, the potential for AI to be deployed in languages other than English is rarely mentioned, and for good reason. As shown in Figure 2, nearly half of the so-called Common Crawl data available to produce a generic LLM is in English, and another 38% is in European languages.
[00:12:38] There are hundreds, if not thousands, of under-resourced languages for which GenAI would be impossible to deploy. But terminologists and organizations across the world have been gathering multilingual microcontent and storing that data in robust IT systems for decades.
[00:12:54] It would seem, then, that the semantically based resources required to enhance LLMs could be ready to deploy in some non-English GenAI systems long before LLM training data is available for those languages.
[00:13:06] AI in the Service of Microcontent. We've seen that microcontent resources developed according to the principles of terminology as a discipline can be leveraged to improve AI. But can AI assist in the development of microcontent resources? Indeed it can, if used with discretion. Can it replace terminologists? I think not. We can use AI to craft definitions, albeit with caution. Antonio San Martín Pizarro, a professor at the Université du Québec à Trois-Rivières, Canada, shows the pros and cons of using GenAI for creating definitions.
[00:13:44] Barbara Inge Karsch of BIK Terminology suggests that prior to asking an AI app for a definition, you should already have one in your head, in particular knowing the correct superordinate (genus) term and perhaps some of the distinguishing characteristics.
[00:13:59] This seems to be a common recommendation: do not use AI to answer questions that you cannot answer yourself, or at least not without having a general idea of the correct answer. Otherwise, you won't know when it's wrong. AI can also be used to extract terms from a text, but to exercise due diligence, the results should be compared with those of purpose-built term extraction tools such as Sketch Engine and TermoStat, which use a reference corpus to measure the saliency (domain specificity) of a candidate term.
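The reference-corpus idea behind such extractors can be illustrated with a toy saliency score: compare a candidate's relative frequency in the domain corpus with its relative frequency in a general reference corpus. The simple ratio measure and the tiny corpora below are illustrative; real tools use larger corpora and statistically sound keyness measures.

```python
# Toy saliency score: how much more frequent is a word in the domain
# corpus than in a general reference corpus?
def saliency(word: str, domain: list[str], reference: list[str]) -> float:
    dom = domain.count(word) / len(domain)
    ref = reference.count(word) / len(reference)
    # Unseen in the reference corpus: maximally domain-specific.
    return dom / ref if ref else float("inf")

domain_corpus = "the tort claim cited the tort statute".split()
reference_corpus = "the cat sat on the mat near the door".split()

# "tort" never occurs in the reference corpus, so it scores as highly
# domain-specific; "the" is frequent everywhere and scores near 1.
score_tort = saliency("tort", domain_corpus, reference_corpus)
score_the = saliency("the", domain_corpus, reference_corpus)
```

A generic GenAI prompt has no such reference corpus to lean on, which is one reason its extracted "terms" need to be checked against a purpose-built tool.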
[00:14:30] These tools can also work with very large submitted corpora, which increases output quality, whereas GenAI prompts, at least those offered to the general public, are limited in how much text you can submit. GenAI can also be used to find synonyms, explore collocations (for instance, verbs that are frequently used with a certain noun), and identify equivalent terms in other languages.
[00:14:53] Again, caution should be exercised and the information checked. What AI likely cannot replace are terminologists themselves. Too many interrelated elements of terminology work require reasoning: assembling a coherent and standards-compliant terminological entry, verifying authenticity, ensuring contextually relevant and accurate data, establishing complex semantic relationships, and ensuring quality control. Some have suggested that AI could generate knowledge graphs. If intelligent, curated knowledge resources are necessary for next-gen AI, as the experts claim, then building them by using a machine with dirty data instead of a human with clean data would defeat the purpose, would it not? Dillinger thinks so: curated knowledge sources like ontologies and rich knowledge graphs are increasingly viewed as key contributors to AI reliability and safety.
[00:15:44] Developing these resources depends on a deep and often subtle understanding of semantics. I, for one, am excited about the opportunities for terminologists in the AI space, but they need to embrace the notion of microcontent and apply concept orientation and semantic relations with more rigor.
[00:16:03] This article was written by Dr. Kara Warburton, a terminologist and a leader in microcontent management. The author of The Corporate Terminologist, she has long advocated for extending the notion of terminology to microcontent to optimize repurposability.
[00:16:19] Kara operates Termologic, a consultancy in microcontent, and teaches university courses in terminology, comparative linguistics, localization, and translation. Originally published in MultiLingual magazine, Issue 237, February 2025.