Developing Language Technology Tools for Low-Resource Languages: The Sámi Case Study

Episode 216 | October 02, 2024 | 00:18:57
Localization Today

Hosted By

Eddie Arrieta

Show Notes

By Sjur Nørstebø Moshagen

For 20 years, Sjur Nørstebø Moshagen has led an effort to develop language technology tools for the indigenous Sámi languages of Northern Europe. While the project has produced many important tools, the limitations of commonly used hardware and software negatively affect their ease of use. Moshagen ends by proposing an “open language” model similar to the idea of open source.


Episode Transcript

[00:00:00] Developing Language Technology Tools for Low-Resource Languages: The Sámi Case Study, by Sjur Nørstebø Moshagen.

The Sámi languages are spoken by a few tens of thousands of indigenous northern Europeans in the region of Sápmi, which spans four countries: large areas of Norway and Sweden, the northern areas of Finland, and Russia's Kola Peninsula. In Norway, the country with the largest Sámi population, the people are represented by the Norwegian Sámi Parliament. In 2004, the parliament received funding from the Norwegian government for a project to ensure a digital future, and thus a future, for the Sámi languages. This kick-started a 20-year effort to develop language technology (LT) tools for these extremely low-resource languages, with the express goal of making Sámi languages as easy to use in the digital world as Norwegian or English.

The Norwegian Sámi Parliament created the Divvun research and development group to begin tackling the project, and I was appointed project manager and head of the group. In 2011, the group was transferred to UiT The Arctic University of Norway. Throughout the project, Divvun collaborated closely with another research group called Giellatekno, also at UiT. Initially, we targeted three of the eight Sámi languages: North Sámi (around 25,000 speakers), Lule Sámi (1,000 to 2,000 speakers), and South Sámi (about 600 speakers). Later on, we added more Sámi languages as well as other indigenous languages spoken in the Nordic countries and in Canada.

After two decades, the team has developed most LT tools considered necessary in today's digital environment: keyboards for all eight Sámi languages and for six platforms, spell and grammar checkers, hyphenators, electronic dictionaries, machine translation (MT), and speech synthesis. There is also an experimental version of speech recognition in the works. In many ways, the project has been a resounding success and serves as a model for developing LT tools for low-resource languages around the world. However, the limitations of commonly used hardware and software negatively affect the tools' accessibility and ease of use, calling the project's potential impact into question. This article discusses what it took to achieve the project's successes, why it could be considered a last-mile failure, and what lessons can be applied to future LT development projects.

Design principles

When working with Sámi languages, the scarcity of both digital and human resources is a constant challenge. For instance, due to previous assimilation policies in the Nordic countries, almost no one born before 1975 learned to read and write their own Sámi language. However, two guiding design principles helped us overcome the lack of existing resources: use grammar-based technologies, and make the source code reusable. Using grammar-based technologies and focusing on reusability of the source code in a way turned the development of the supporting infrastructure into a game of getting the most out of the code with as little effort as possible. Although this way of working is labor-intensive, compared to the amount of work needed to establish a corpus large enough for machine learning (ML) methods, the effort needed is almost negligible.

Grammar-based technologies

With no or very little pre-existing digital resources, such as dictionaries and text collections, building LT tools using ML methods is practically impossible. But as long as there are native speakers, it is possible to build almost anything based on grammar.
Grammar-based technologies have a long history in Finland, as Finnish, just like the Sámi languages, has a complex word structure and internal changes in the words as they are inflected. We relied on two grammar-based technologies: (1) finite state transducers (FSTs), using formalisms developed by researchers at the University of Helsinki and Xerox in the 1980s and 1990s, and (2) constraint grammar (CG), a robust formalism for syntactic parsing that was also developed at the University of Helsinki in the 1990s.

[00:04:34] FSTs are used for roughly word-level processing, including inflection, derivation, and sound changes, while CG is used for everything else. Both technologies have proven fast, efficient, and robust for a large range of purposes, and FSTs have proven capable of handling every human language thrown at them so far. CG processing takes the output from the FSTs and processes it further, depending on the purpose of the tool.

Grounding the LT tools in the grammar of the language means we needed, and wanted, native speakers on our team. Roughly half the team consists of native speakers, and the rest either speak or are learning one of the Sámi languages. Being able to take your own language knowledge and scholarly training and turn that into tools that are valuable for the language community is a strong motivator and a source of joy and pride in our work.

Reusable source code

With very limited pre-existing digital resources and a population of native speakers that is trying to fill all the needs of the community, we can't afford to do the same work twice. Besides being boring, it would be a huge waste of resources. Therefore, we try to build versatile resources that can be used for as many purposes as possible and that can be updated with more information relatively easily. The prototypical example is the difference between a descriptive analyzer, which should be able to tackle whatever text is thrown at it, and a normative tool like a spelling checker, which should flag everything not within the written standard. We solve this by tagging every word form outside the written standard with Err/xxx, where xxx is a descriptive string, so that we can subsequently remove all such word forms from the finite-state model of the language, leaving us with just the accepted standard. We follow a similar design principle for other use cases as well: for example, we tag the exceptions and remove them from all tools in which they are not wanted. This way we can, with a limited amount of work, support many uses of the same core description of a language.

Supporting scalability

As the infrastructure developed and achieved success for the initial three Sámi languages, other language communities wanted on board, both Sámi and others. Around the same time, expectations grew beyond the initial set of tools (spelling checkers and hyphenators). It became clear that we needed the supporting infrastructure to be scalable in two dimensions: languages and tools. The goal for language scalability was that adding a new language to the infrastructure should take almost no effort; in other words, the only thing needed should be the linguistics. In this effort we succeeded quite well: it now takes just a minute or two to add a new language in its most basic form, and 10 to 15 minutes to add additional metadata, configuration, and so on. The idea with tool scalability is that once a new tool has been developed for one language, it should be effortless to propagate support for that tool to all languages.
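To make the reuse principle described above concrete, here is a minimal sketch. The real Divvun sources are finite-state lexicons compiled into FSTs, not Python dictionaries, and the word forms, analyses, and Err/ tags below are invented for illustration; the only point is how a single tagged source can yield both a descriptive analyzer and a normative speller.

```python
# Toy illustration of one shared lexical source reused for two tools.
# Nonstandard forms carry an Err/ tag; the analyzer keeps them, while the
# speller is built with them removed. (Forms and tags are invented.)
LEXICON = {
    "viessu": "viessu+N+Sg+Nom",           # standard form
    "vieso":  "viessu+N+Sg+Nom+Err/Orth",  # hypothetical nonstandard spelling
    "viesus": "viessu+N+Sg+Loc",           # standard form
}

def descriptive_analyses(word: str) -> list[str]:
    """The analyzer accepts every listed form, Err-tagged or not,
    so it can handle whatever real-world text is thrown at it."""
    analysis = LEXICON.get(word)
    return [analysis] if analysis else []

def normative_wordlist() -> set[str]:
    """The speller is derived from the same source, but with all
    Err-tagged forms removed, leaving only the accepted standard."""
    return {form for form, analysis in LEXICON.items() if "+Err/" not in analysis}

def is_correctly_spelled(word: str) -> bool:
    return word in normative_wordlist()

if __name__ == "__main__":
    print(descriptive_analyses("vieso"))   # the analyzer still understands it
    print(is_correctly_spelled("vieso"))   # the speller flags it: False
    print(is_correctly_spelled("viessu"))  # True
```

In the real infrastructure, the corresponding filtering happens on the finite-state sources rather than on a Python dictionary, but the effect is the same: one core description of the language, many derived tools.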
Without going into detail, our experience so far is that this tool scalability is working as expected. Example tools we have added after first developing a prototype for a single language are grammar checkers and text processing for speech synthesis.

Over the course of the project, the team developed high-quality spelling checkers for many languages, advanced grammar checkers for some, MT for several language pairs, and speech synthesis for three Sámi languages. A lot of people in Sápmi, Greenland, and the Faroe Islands depend on our tools for their everyday writing support, and even more will do so in the future. As anyone who has developed LT tools knows, the work never ends: languages change constantly along with changes in society, and there is a steady need for updates to the source code. Still, when we measure the quality of our tools and compare them with similar tools for other languages, they are not worse; in fact, in some cases they are quite a bit better. Often they are simply different, with different strengths and weaknesses compared to machine-learned tools. The best quality assurance comes from our users, though, who have told us numerous times about the tools' importance in their daily work. We have even been given a poem by a very grateful user.

Finally, an important aspect of our work is that it almost ensures community engagement. Since a native speaker is needed as a crucial member of the development team, the language community is by definition involved. This in turn makes it much more likely that the tools will be used, as they are, in a way, owned by the language community.

Implementation challenges

Despite all the positive outcomes of the project, key challenges remain related to the technology platforms that users typically employ. Because our researchers don't own the platforms the users are on, our tools are not as compatible with those platforms as we would like, and the platform owners very often don't see the consequences of their actions for minority language communities. Let's take a look at a couple of hardware limitations.

[00:10:04] Tablets are popular in schools: they are relatively cheap and easy to grasp for youngsters growing up with mobile phones. But to be functional for writing, a tablet needs a physical keyboard attached, so tablets plus external keyboards are a common combination in Nordic schools, including in Sámi classrooms. The problem was that you couldn't write Sámi using the external keyboards, because third-party keyboard apps are not given access to them on either Android or iPadOS. So Sámi pupils would have to write either by using the on-screen keyboard only or by jumping back and forth between the physical keyboard and the on-screen one.

The situation for the Sámi languages changed late last year, when Apple published an update for all their platforms containing keyboards for eight Sámi languages, including support for hardware keyboards. This was of course very welcome for the Sámi languages, but it does not change the situation for all the other minority and indigenous languages in the world. Additionally, Apple's keyboards do not contain a speller, and Apple does not allow the speller in our third-party keyboard app access to text typed using Apple's own keyboards.

Software limitations

When we released the first spellers for North and Lule Sámi in 2007, word processors and other office software only existed on your local computer or terminal server; software as a service was hardly a thing for regular people. Google Docs went out of beta in 2009, and so our first three releases ran on local machines.
When office software moved to the cloud, so did the servers and computers running all the additional functionality, and the only languages being served were the ones that the software producers supported. [00:11:56] Suddenly, the Sámi community went back to square one. That is why, even today, most of our spellers are downloaded and installed on a local computer.

At some point, the makers of cloud-based office suites realized that they needed to allow extensions (plugins) to be installed in the cloud-based office suite for each user, providing functionality not found in the original software. A plugin author can request access to the text and provide changes to it. Crucially, though, the plugin cannot ask for the language of the text, set the language of the text, make red squiggles under misspelled words, or populate a right-click menu with correction suggestions. Every aspect of the traditional speller is off limits. [00:12:41] What is offered instead is a separate panel outside the actual document window, which the developer can use as best as possible. We have done so for the grammar checker and speller, but such a panel can only batch-process the text, and all traces of interactive, incremental processing are gone.

Causes of limitations

The examples above are just two in a long list of issues we have encountered in the project's last mile: getting our tools into the hands of the users based on where they are, not on where we as developers are allowed to go. To the extent that technology companies have been mentioned, it is only for illustration purposes; the various issues are industry-wide and concern every aspect of language localization and technology, from letter rendering to virtual assistants. The question soon presents itself: why have we ended up here? The way we see it, the following three reasons seem like plausible explanations.

Blind spots: Developers naturally focus on major markets and don't realize that their design decisions have consequences beyond their target market.

Security concerns: It is understandable that opening up a software platform to third parties can seem risky, so it is probably safe to assume that some of the limits we have met are based on security concerns.

Market blindness: Sure, the total Sámi-speaking population is only about 30,000, but according to CSA Research, 3 billion people face the same digital language access problems as the Sámi people. That is a huge potential market.

The endless list of seemingly careless limitations and restrictions causes uncertainty about whether we will be able to get the tools into the hands of the language community. For each item on the list, each language community has to ask every technology provider: please, can we get our language in? Clearly this does not scale, and it is one of the many forces driving language shift.

The open language proposal

So what can be done about it? We propose an "open language" model, similar to the idea of open source. In most cases, the software stack is built on many components that together define the business logic or processes of the system, the functionality that the software is supposed to give to its users. On top of that, there is a thin layer of localization strings and application programming interfaces (APIs) for language-related services. We believe this thin layer should be opened up to every language community. [00:15:25] This layer is what the user sees, and this is where every language matters.
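As a rough sketch of what such an open localization layer could look like, consider the following toy example. Nothing here corresponds to an existing platform API; the string class, catalog format, and directory layout are hypothetical, meant only to show how a localizable string type with an extractable base locale could resolve its text from catalogs supplied by the language community itself rather than by the app developer or platform owner.

```python
# Hypothetical sketch of a "thin localization layer" that is open to any
# language community. Catalogs are plain JSON files keyed by stable string
# identifiers; anyone can add a new locale without touching the app itself.
import json
from pathlib import Path

class LocalizableString:
    """A string whose base-locale text the platform can extract automatically,
    and whose runtime value comes from an external, community-supplied catalog."""
    def __init__(self, key: str, base_text: str):
        self.key = key              # stable identifier, e.g. "menu.save"
        self.base_text = base_text  # base locale, extractable with no developer effort

    def resolve(self, catalog: dict[str, str]) -> str:
        # Fall back to the base locale when a translation is missing.
        return catalog.get(self.key, self.base_text)

def load_catalog(locale: str, catalog_dir: Path) -> dict[str, str]:
    """Load a community-supplied catalog, e.g. one installed from an app store
    via the user's preferred language provider."""
    path = catalog_dir / f"{locale}.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

if __name__ == "__main__":
    save = LocalizableString("menu.save", "Save")
    catalog = load_catalog("sme", Path("catalogs"))  # "sme" = North Sámi
    print(save.resolve(catalog))  # the Sámi text if a catalog exists, otherwise "Save"
```

Keying strings by stable identifiers rather than by their English text is what would let a platform extract the base locale and let localizations be added or updated independently of the app.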
The idea is that by opening up this layer to all language communities, each one can decide what is crucial to its members, what they want, and whom they will cooperate with. Most importantly, in the open language model there is no need to ask for permission to see, speak, read, or write one's own language. For open language to actually be beneficial to the language community, three components are required: open access, integrated development environment (IDE) support, and easy distribution.

Open access: Platform owners should support open access by automatically making localization data available to developers, not only for their own apps and systems but also for all third-party apps using their ecosystem. Human language APIs and support systems should likewise be openly accessible. To access both data and APIs, it should be enough to be a registered and verified developer, just as for regular software development on a given system.

IDE support: The default data type for localizable strings should be designed so that platform owners can extract the base locale and make it available to localizers with no effort from the developer. It should also be made available to localizers automatically, at the latest when an app is released, and in a format that supports easy updating of an existing localization.

Easy distribution: When a package of localized apps and human language processing tools is ready for release, it should be easy to get support for the language into the hands of users. The existing app stores for the various systems and platforms would be an excellent avenue for this: in addition to getting apps, you would also get language support from your preferred language provider.

Conclusion

Open language means that using your language where and when you want to should be effortless. The burden on developers should diminish, as there would be no need to maintain localizations or to remember to use the correct string type; this should all be handled automatically and invisibly by the platform owners and the IDE. As a further bonus, platform owners can open up huge new markets, both for themselves and for third parties, with a modest investment in their platforms.

The core idea of open language was presented for the first time at UNESCO's Language Technologies for All (LT4All) conference in Paris in December 2019, at the end of the International Year of Indigenous Languages. The period of 2022 to 2032 has been declared the International Decade of Indigenous Languages. By the end of that decade, I hope all major players in the computing technology industry will adhere to open language principles and remove all barriers to entry for all of the world's more than 7,000 languages.

[00:18:29] This article was written by Sjur Nørstebø Moshagen, a chief engineer at UiT The Arctic University of Norway, who has been leading Sámi LT development for the past 20 years. Previously, he developed LT at Lingsoft in Helsinki, Finland. He holds a degree in general linguistics. Originally published in MultiLingual magazine, issue 232, September 2024.
