The Unicode Consortium: Inside the Global Mission

[00:00:02] Speaker A: Hello and welcome to Localization Today, the show where we explore the stories, people and technologies driving global communication. Today we are focusing on the Unicode Consortium, the nonprofit that develops the Unicode standard so that text and symbols display correctly everywhere in every language. Without Unicode, your device wouldn't know how to render emoji, accented letters or complex scripts from around the world. I'm Eddie Arrieta, CEO at Multilingual Media, and joining me are three key contributors to Unicode's mission. First, Steven Loomis is the chair of Unicode's Digitally Disadvantaged Languages Working Group and a veteran software internationalization engineer who has contributed high quality code to ICU. Elango Cheran is vice chair of Unicode's community engagement team and an internationalization engineer at Google, who contributes to ICU and ICU4X and helps drive message format standards. And last but not least, Bridget Chase is a language technologist specializing in digital tools for indigenous language communities, ensuring that underserved scripts and orthographies gain Unicode support. Today we'll discuss Unicode's nonprofit, open source model, how its technical structure enables global text interoperability and the ongoing work to bring every language into the digital world. Thank you all for being here in the multilingual ecosystem. It's great to have you here and I have to say that I've been working with the Unicode team for, I would say, over a year now. Elango, I don't know how long we have been in contact, collaborating and trying to do different things. [00:02:05] Speaker B: Yeah, maybe it's been a year, maybe it's more. There's so much work that's going on. Yeah. It's hard to keep track. [00:02:11] Speaker A: Yeah. And it's incredible to see amazing people doing amazing things. And that's what we're here to talk about. We'd love to get a little bit of background from each of you.
Steven, Bridget, Elango, give us the short version of your career and what has you here. And of course, if you can, tell us a little bit about your role in the broader Unicode ecosystem, how you first became involved and what your focus is today. Steven, Bridget, Elango. [00:02:43] Speaker C: Okay, very good. So I'm Steven Loomis and I am the owner of Code Hive Tx, LLC, which is my own software consultancy focused on internationalization. Before that I spent almost 23 years at IBM doing corporate globalization. We helped launch ICU and CLDR as open source projects. And as far as my role in the ecosystem, I'm the chair of the Digitally Disadvantaged Languages Working Group. But I also am a contractor working on CLDR enablement software, so some of the infrastructure that's used there, and I'm also the editor for the keyboard spec that's in CLDR. [00:03:26] Speaker D: Hi, my name is Bridget Chase. I am a language technologist. I've been working with indigenous communities for over a decade now, predominantly in so-called British Columbia, but across North America, to support language revitalization efforts with technology. And this includes everything from the foundational technologies required to have your language work in digital spaces, like Unicode, all the way up to building complex pieces of software in order to teach and learn language through revitalization strategies. And I'm calling in today from the unceded and ancestral territories of the Musqueam, Squamish and Tsleil-Waututh First Nations communities in what is commonly known now as Vancouver, British Columbia. [00:04:19] Speaker B: Hi, I'm Elango Cheran. My background's in computer science, but I specialized at first in bioinformatics, kind of genetics-related data, algorithms and such. That led my work into big data, you know, distributed computing. But my real passion was to try to find a way to help languages with computing skills.
And so that led me to internationalization and Unicode first. I was doing some kind of side projects on my own before I really even knew all that Unicode has to offer. But currently I've made my way to Google and, through Google, I work primarily in open source projects. So I contribute to the software projects, work on some of the standardizations and the community engagement. [00:05:24] Speaker A: Thank you, thank you all. And I love the next question because it talks a little bit about how you all came in contact with Unicode. I can tell you how I came in contact with Unicode. So I didn't know anything about localization or language industries. And I came into the industry about two years ago, and it must have been about a year and a half ago that Renato told me, "I think what people at Unicode are doing is great." And I was like, who are these people at Unicode? I have no idea who they are. So I tried to figure out who they were, and then I finally got to be in contact with Toral, and she was fully available and open to collaboration and making a difference. And that's how I came in. I guess there was a little bit of my own curiosity from a comment from someone who I respect, and then that curiosity led me to even volunteer some of our time and resources to the cause. So that's how I came in. I don't know how you want to do it. Who wants to go first? I'll let you all decide. [00:06:33] Speaker D: I feel like Steven's been around the longest with Unicode, if I'm not incorrect. So maybe, Steven, you want to go first? [00:06:40] Speaker C: Sure. I can kick this off. Yeah. I think one of the things that's interesting is that it's just the people involved. But when I first came into contact with Unicode, I was working as an intern at Taligent, the Apple-IBM joint venture, and I was doing network protocols and I was trying to write a chat client. Just a, you know, text-based chat back and forth.
And for some reason my strings were getting corrupted. I didn't know why. I could only get one character through at a time. Being on the network team, I fired up a network analyzer, plugged it into the network, and on inspection there was a null, there was a zero byte in between every other character of the text string. And I didn't know at the time, but it was actually Unicode text. It was in UTF-16, but at that point I was just wondering who was corrupting my text. So anyway, that was my first exposure to Unicode and sometime later I transferred to the international team that was there. [00:07:47] Speaker B: That's where ICU got created. Right? Taligent? [00:07:51] Speaker C: Yes. From Taligent and from IBM. Correct. [00:07:55] Speaker A: Then who's next? Is it Elango or Bridget? You are next. [00:07:58] Speaker B: Oh, am I next? [00:07:59] Speaker D: Yeah. I think you've been working with Unicode probably longer than me. So you go ahead. [00:08:04] Speaker B: Okay. I think my first exposure to Unicode was not from within. It was from without, as a member of a language community. So the language I speak with my parents is Tamil. It's spoken in South India, Northern Sri Lanka, Malaysia, Singapore. And there was a community computing event that they had back in 2002. And I remember somebody from Java, from Sun Microsystems, came and said, hey, Java 1.3 is coming out and here are some cool new things, like we can support Unicode and we can support fonts. And so he had a demo where he was actually typing into, I want to say it was an applet, a Java applet back in the day. And he was typing vanakkam, the word for hello. And I was like, what is this? This is amazing. How do I get this? And then I came back home and I'm still using Windows, you know, Windows XP or something. And I didn't really know, it just wasn't all set up, but it just was in my mind.
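The zero bytes in Steven's network trace are what UTF-16 looks like for plain ASCII text. The following is a minimal Python sketch of that effect, not a reconstruction of the original chat client: the sample message and the little-endian byte order are assumptions chosen for illustration.

```python
# Illustrative only: the message and byte order are assumed, not from the story.
message = "Hi!"

# UTF-16 stores each of these characters in two bytes; for ASCII one byte is zero.
utf16_bytes = message.encode("utf-16-le")
hex_dump = utf16_bytes.hex(" ")  # "48 00 69 00 21 00"

# Code that treats a zero byte as the end of a string (as C-style strings do)
# only sees the bytes before the first zero, so one character gets through.
first_chunk = utf16_bytes.split(b"\x00")[0]  # b"H"

# Decoding with the right encoding recovers the full text.
decoded = utf16_bytes.decode("utf-16-le")  # "Hi!"

print(hex_dump, first_chunk, decoded)
```

Read as a stream of single-byte characters, the text appears corrupted; read as UTF-16, it is intact.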
And you know, I also heard about Unicode here and there through the community, but it made me aware of the technologies that were underneath that were supporting internationalization. And it Just always kind of stuck in my mind as, like, this is where everything happens. I still, for the longest time though, and basically until right before joining the team, I didn't really know all the things that Unicode did. That's why when I was writing my side projects, I was doing my own segmentation code, because I was like, oh, this doesn't exist. I looked on the Internet, I can't find it. But, oh, there's so much, there's, there's that and so much more. [00:09:53] Speaker D: Yeah, Unicode is quite vast, hey, in terms of the, all of the things that it actually does and, and the different projects that, that it's involved with or it runs. I came across Unicode. My first sort of formal exposure to Unicode, similarly was a number of years ago when I was working with a community who was really struggling to figure out what was wrong with some of their text processing. And so I think Unicode is one of those things that again, most people have no idea that it exists or how it works because it works so well and so seamlessly that you don't have to think about it. And so hilariously, it was only when it wasn't working that well, that's when I was sort of first exposed to the concept of it. And it was because this community, this indigenous community, has an Alphabet, an orthography that uses special characters, and those special characters just weren't displaying quite right and things in the dictionary weren't alphabetizing properly as we were expecting it to. And I was sort of tasked to figure out why. And after doing some digging, I realized that the Unicode code points that they were using sort of differed throughout their text. 
You know, they were using a certain code point in some words and a very visually similar but technically different code point in other words, and that was causing all sorts of issues in the back end of this platform that they were uploading content into. And so that was really my first exposure to Unicode: oh, well, this isn't working quite right. And so then when I started digging into, okay, well, how does this actually work, and what's the background of this, and what's the setup of this? You know, I plunged head first into the deep end with all of the amazing complexities and histories of Unicode. And I've been working closely in the space ever since with communities, indigenous communities predominantly, across North America, who need to make sure that their languages can display properly in digital spaces, which is one of the many things that Unicode does. [00:12:24] Speaker A: Thank you so much. I think there is so much depth to the work that's been done at Unicode. I just wonder if in a faraway galaxy they'll talk about our languages as the indigenous languages. Oh, you remember the Earthly language, the archaic Earthly language, maybe. I of course would love to let the audience also have the opportunity to learn a little bit more about what Unicode is, what the Unicode Consortium is and why it exists. I'm going to switch gears a little bit. I'm going to start with Steven. Steven, you've been here the longest. We talked about it just a bit ago. Why was the Unicode Consortium established as a nonprofit organization? How does that start to shape Unicode's mission and operations? And then you can tell us a little bit about that mission and operations, just for the audience that might not know much about the consortium. [00:13:25] Speaker C: Certainly. Yeah.
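The mismatch Bridget describes, where two spellings look the same but use different code points, can be sketched in a few lines of Python. The characters below are illustrative assumptions, not the ones from the community's orthography. The sketch shows two cases: a pair that Unicode normalization treats as equivalent, and a lookalike pair that normalization leaves distinct.

```python
import unicodedata

# Case 1: the same letter written two ways (illustrative characters).
precomposed = "\u00e9"   # é as one code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

equal_raw = precomposed == combining  # False: different code point sequences
equal_nfc = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))  # True after normalizing

# Case 2: lookalikes that are different characters; normalization keeps them apart.
apostrophe = "\u0027"
modifier = "\u02bc"
names = [unicodedata.name(ch) for ch in (apostrophe, modifier)]
still_distinct = (unicodedata.normalize("NFC", apostrophe)
                  != unicodedata.normalize("NFC", modifier))  # True

print(equal_raw, equal_nfc, names, still_distinct)
```

In the second case a search, a sort, or a dictionary lookup sees two different strings, so the text has to be made consistent at the source, for example through the keyboard layout or a cleanup pass.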
And like a lot of technological underpinnings, you think of plumbing or electrical power, when everything works great, you don't notice it, you don't know it's there. And I think that Unicode is one of those where early on you might just assume, especially working in a majority language, let's say English or something else, everything seems to work fine. There must be vast amounts of people that are hard at work and making all these things happen when actually it's, you know, it's handfuls of people. But the, prior to the existence of Unicode as a standard, the basically there was no single way to represent all of human text. You had to basically switch to a different environment, a different code page. I remember going to the store and buying a box that had the Indian Language Kit, you know, this is for Mac, for example. And once I purchased that and got that home and installed it, now I can use the languages of India, right? So it was something where you had to, and you had to switch into that mode. And if I created a document and sent it to someone who didn't have that box, that software box, then they couldn't work with it at all. Unicode came about, I wasn't there at the beginning of Unicode certainly, but Unicode came about cross industry across multiple companies working together to say let's solve this issue of text encoding once for all. Let's just solve it across industry rather than making that a specific feature of one company. And so as a, as a result, I think that's, that's very key because if it's, if it's just a feature, then now you have market segmentation even among majority languages, not to mention ddls, as far as I know. I don't think that helping Smaller languages come up to speed in the digital world. I don't think that was the original goal of Unicode, but it certainly has fallen into place as Unicode became ubiquitous then it certainly was the way to get a new language, a new writing system supported. 
So Unicode was established as a nonprofit organization, I think, think I want to say 1993. I was just looking at this more recently, but. And because of that, its work is open, it's open standards, it's open source, and for that reason it really exists to serve the language communities. [00:16:44] Speaker A: I also would like to come back to you in a bit, Steven, on the digitally disadvantaged languages. Before that, I want to ask Elango something that you probably went through with your experience or you realize I need to have this. What is this thing? Once a developer wants to support multiple languages, how do they typically incorporate the Unicode or ICU into their application? What does the integration look like in practice? What happens? Because in my brain, I don't know, it's like I say yes and then it happens. That's how I think about it. Could you tell us a little bit more about Illustrators? [00:17:25] Speaker B: Sure. I think what's nice is that in a lot of cases, if you're creating an app, if you're creating a mobile app, your framework will actually have that integration built in. So you have the functions that do that can format numbers and currencies and things built in. And a lot of these frameworks will also force you to put all of your user facing strings in a certain resource bundle and then kind of give them an ID and so on. Basically what that means is that those frameworks are already forcing you to have good practices. And there's a particular API called Message Format. It's sort of the entry point for all of the other things. And it's what the application engineers, as well as your translators, they both know about that because the translators will have messages and they'll go through message format. So they'll need to know how it works, or at least if not the syntax. So you know, in one way or another we're dealing with that technology. 
And that's how if your message has some string and then there's a number in there, or there's a currency in there, and you kind of like mark it up with a placeholder in your message. And so then that's being powered by ICU in the backend. So that's because the framework has already kind of got all of the setup there for you. [00:18:57] Speaker A: And it's incredible because then you'll have to do that with every single language that you come across with. [00:19:06] Speaker B: Well, I'm not sure if that's. You need to tag, you'll need to translate that. What's nice is that you create these messages and you say, okay, here's a button here, like I have a menu. So every little word, every little entry and item in that menu has a message. But then if you translate them per language, then your job is done. So in the code you just say, I have this message that has this id, and then that's it. Your code is actually language agnostic. So you only have to write one code base. And I think back in the day, if you were naive about it and people did this, they actually wrote duplicate copies of their code base. So here's one for French, here's a whole code base for German. And it's like, oh, this takes forever to translate my code base from one language. It's like, that's not the way to do it. No, you just want one code base. For obvious reasons. Obvious, obvious reasons. But then also too, just for the programmers out there. When you're learning programming, you'll learn basic data types. And one of the basic data types is a string. It's a sequence of characters. And we're talking about user facing messages. Those are strings too. But when you learn programming, you just say a string is just a sequence of characters. And for some of these older programming languages, a character isn't a Unicode character, it comes before Unicode. So it might be here's eight bits, here's a byte, or here's 16 or something. And that's not. 
If you're dealing with a string that's facing a user and you use those built-in functions that came before Unicode, you're going to make some basic mistakes. These are rookie mistakes that you don't want to repeat. And that's why we have ICU. So when you do those operations and you're writing code, you also want to use ICU there as well. [00:20:57] Speaker A: It's basically that layer that enables that flexibility, especially if you care about this content, because then your job becomes: okay, how do you populate these places? Right. Would you call them content places? What are these places where this information is added? [00:21:16] Speaker B: Well, let's say that you've got an application and you need to split a word into characters. You want to deal with Unicode characters, not the programming language concept of a character. Then you use a library and you say, hey, ICU, give me everything in terms of every character. But more usefully, it's like, what is the character in terms of the language? What's a meaningful unit of text that shows up properly to the user? And so then that's a higher level concept and, again, we support it all. So back in the day before these libraries were used, like when I was looking at Indic text, for a human, what a character in the language is, what a letter in the language is, is actually the combination of multiple of these low-level concepts. And if it wasn't done right, then you would see a bunch of broken text, you would see some dotted circles here and there and some weird little loopy things, because they weren't using the libraries. So that's just basic string handling. But at a human level that's truly. [00:22:31] Speaker A: Internationalized and that is a fantastic accomplishment and a huge innovation. Saves tons of time, tons of resources everywhere and it opens up opportunities probably in very unsuspected ways.
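To make the gap between a programming-language "character" and a user-perceived character concrete, here is a rough Python sketch using only the standard library. The helper rough_clusters is hypothetical and deliberately simplified: it attaches each combining mark to the preceding base character, which is only an approximation of the full text segmentation rules that a library such as ICU implements. The sample word is the Tamil greeting mentioned earlier in the conversation.

```python
import unicodedata

def rough_clusters(text):
    """Hypothetical helper: group each combining mark with the character before it.

    A simplified approximation for illustration, not the full Unicode
    segmentation algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch   # combining mark joins the previous cluster
        else:
            clusters.append(ch)  # a new base character starts a cluster
    return clusters

# Tamil "vanakkam", written with escapes so the code points are explicit.
word = "\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae\u0bcd"

code_point_count = len(word)     # what a naive string length reports
clusters = rough_clusters(word)  # closer to what a reader would count
print(code_point_count, len(clusters))
```

For this word the naive length is 7 code points, while the sketch groups them into 5 units. Production code should rely on a maintained segmentation library rather than a helper like this one.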
And I presume then that's why your working group exists, Steven, right? The Digitally Disadvantaged Languages group. And it's very interesting, because my capitalistic brain would be like, all right, let's just roll it out to all the main languages, why even worry about the digitally disadvantaged? Why even worry about indigenous languages, right? I just recently learned that Colombia has 72 indigenous languages. I had no idea. If someone had asked me, I would have guessed six and I would have felt very ambitious with my guess. But 72, and disappearing rapidly. Now we're not even talking about those, we're talking about digitally disadvantaged languages. And that working group probably has very specific lines of action which we'd love to hear more about, Steven. So if you can tell us what the work group does and how that effort helps underrepresented language communities, and then we'll get a little bit onto your points, Bridget. [00:23:52] Speaker C: Yeah, certainly. As I mentioned, the origin of Unicode was companies such as Xerox, IBM and Apple working together to solve these problems for the languages that were on their budget for that year. Understandably, those were the major languages of the world; you can find those in any chart where you are. Where Bridget is especially has very rich local languages, quite a few. But what I was realizing in continuing the work with Unicode and CLDR is that, for the languages of the world, if there's a problem in French in Canada, if there's a problem even in Tamil, which, as far as indigenous languages go, is up there, there are people that will get to work and solve those issues. But there wasn't really any team, there wasn't really a place where other languages had advocacy, had a voice. And so kind of mulling through these things.
As I said, CLDR became an on-ramp for languages, where languages could be added and show up on platforms, because it became part of Unicode CLDR and it was added to libraries such as ICU and such. But there wasn't really a welcoming committee. There wasn't a central place where we were trying to solve problems from the perspective of these languages and provide advocacy for them. And so that's why a number of us started that DDL workgroup just a few years ago now. And one of the things we've done more recently is to launch a help page. Imagine that, launching a webpage for languages that's on the CLDR site. And one of the pieces of that is where a language organization can sign up and actually request sort of formal recognition within CLDR. So the structure is that anyone can sign up, anyone can go to the form and say, I want to contribute in my language. But if someone can contribute as part of a recognized organization, then we can advance the language much more quickly that way. And so we've actually had about 11 organizations sign up this year. They are not all set up, but some include an Institute of Linguistics in Kazakhstan and another institute in Chuvashia, the Mayan Languages Project, Cademia Siciliana, the Sinawar Welfare Society. So quite a few organizations are now signed up. And that's again a direct result of the DDL workgroup. A couple of years ago, we worked on having the workgroup validate data and help data move along in places where it just needed someone to check it, just needed someone to fix a couple of errors and move it along. And so there were, I think, a half dozen languages that were advanced. They were able to be added to ICU because of the work of that committee. All volunteer work at this point, but definitely some good solid progress in processes, documentation, and hopefully a better welcoming environment for language communities. [00:27:52] Speaker A: Thank you, Steven.
And I think this is a perfect segue to talk to you, Bridget. Define for those we've been listening to, we've heard Digital Disadvantage Language in the context of language access, which is a huge topic for multilingual in our magazine, our audience, we know we talk about it all the time. What does it look like on the ground when a community lacks proper support. And you talked a little bit about it, right? Like the characters. But of course, at the end, we would imagine is the lack of access to knowledge and to also share Identity and many other expressions, cultural expressions. But from your perspective, is there ever a point at which Unicode can claim all languages covered? I know I throw a lot at you. [00:28:39] Speaker D: Yeah, no problem. They're all great questions. I mean, first of all, the question of can Unicode ever claim that all languages are covered? That is a super complex and nuanced question that in theory, the answer is yes, but with an asterisk caveat that that language is always changing, it's always in flux. Right. We have communities whose languages look the same as they did for hundreds of years, and we have communities whose languages have changed, have evolved over the last hundred years. And so can Unicode cover all languages? Absolutely, it is a possibility. But we have to remember that Unicode is sort of a first step to a larger picture, which is, can languages actually be used and typed and rendered properly across all digital devices and systems? And that goes beyond just the actual baseline access to Unicode code points. Right. Unicode puts out updates with new versions that include new characters. As an example, the most recent update of Unicode included two new capital letters that were missing from the Unicode standard used in indigenous languages. A number of indigenous languages here in British Columbia. Those characters just hadn't been added to the standard. 
And so the team that I work with, the Indigenous North American Type team through Typotheque, worked with the Heiltsuk First Nation to go through the process of creating the proposal to have those missing characters added to the standard. And so now those characters have been released with the new version of Unicode, and they can be encoded properly across devices. However, the next step is that, you know, the major technology operating system providers have to update their systems to be able to properly accept and render those new code points across different devices and operating systems. And we're still waiting for that to happen in a number of spaces. So, you know, Unicode exists in a much larger technological ecosystem that plays a role in the bigger question of, is a language digitally disadvantaged or not? And when we think about digitally disadvantaged languages, it's easy to think about cases like some of the communities that I work with here in British Columbia, where there might be a handful of fluent speakers currently, you know, alive and working to revitalize the language. But there are many languages around the world, you know, India comes to mind, where there are languages that are classified as digitally disadvantaged languages that have millions of speakers. So it's not necessarily about how many people use or speak the language. It really comes down to a lot of factors about the history of technology and the history of inclusion of a particular language community within that technology. At a baseline, the question really just becomes, can a speaker of a language type their language onto a digital device? And can that text render properly, consistently and respectfully across different spaces and platforms? And if the answer is no, it might not be a Unicode problem, right? Often it isn't.
Unicode, for the most part, often works very well at this point for a huge portion of the world's languages, the majority of the world's languages. So usually what we find is that it's a next kind of step, like, whether it be it's a font problem or it's a keyboard problem, or it's a problem with the piece of software or the operating system or the website. But until those issues, sort of across the entire technological ecosystem can be resolved, communities will still continue to face digital disadvantages in using their language across these different spaces. [00:33:24] Speaker A: Thank you for giving us perspective. And I think we're probably going to have to write an article about the impact of Unicode on digitally disadvantaged languages. I definitely feel the pressures that it can generate in several conversations, like, if you don't have a technology to support it, then that's one thing, but if you have the technology to support it, but not the adoption, then some other things have to happen. I'm talking about Multilingual Magazine, if you allow me. For those that are listening, remember, the Multilingual Magazine is the main source of information in globalization, localization, internationalization, all the Asians with the. What is it? I18N. If you want a subscription of the Multilingual magazine, let us know and we will hook you up. Remember to hit subscribe to this podcast if you're listening to it on Spotify or if you're looking at the recording on YouTube. And remember that today we are talking about the Unicode Consortium with Stephen Loomis, Elango Sharan, and Bridget Chase. My name is Eddie Arrieta. I'm the CEO here at Multilingual Media. And with that, yes, I think the impact of Unicode is undeniable. It's very difficult to not recognize it there, Bridget. So you've been working with Unicode for some time. You've been vocal advocates for open source tools, indigenous language preservation. 
What has been the impact of Unicode's nonprofit open source model on its adoption? [00:35:01] Speaker D: Well, from a community perspective, it is absolutely critical that Unicode was set up in the way that it was. And we're very grateful that Unicode has this amazing open source, open access framework and model, because there are already so many barriers of access that communities Indigenous communities and digitally disadvantaged language communities have to face when it comes to revitalizing their languages and using their languages in technological spaces and settings. So many barriers to access. And so to have just one less barrier, right? One less thing that they have to deal with thinking about, oh, you know, we have to convince companies, you know, to make the decision to put financial resources or time and effort resources into supporting, you know, our language. That type of barrier, especially for languages that are, you know, dealing with severe levels of endangerment, communities who are working very hard to combat, you know, the oppressive forces of colonialism, to try and revitalize languages. Just having that simple barrier removed, of being able to say, yeah, we can make sure that our entire orthography, our entire Alphabet can be represented in technology. It's a first amazing step that is so critical for the adoption of these languages within technology. I know from a community perspective that the impact has been massive. I'm sure that Stephen and Elango can speak to the other end, the technology and sort of business perspective. In terms of Unicode's nonprofit and open source status. [00:37:12] Speaker A: I'm very curious now which Colombian indigenous languages have adopted the standard? I'm very curious. I will ask around there. There are some amazing professionals at our universities. It'd be amazing because I believe this to be very revolutionary. 
And in terms of what it can do for, of course, language overall and content overall, it will really determine a lot of our experience in the future. On Unicode's open source licensing, Steven, for ICU and CLDR: has it enabled or limited the ability to collaborate with community partners or industry partners? [00:37:56] Speaker C: Yeah, certainly. And licensing is one of those things where, you know, when people think open source, you could think, well, everything should just be open source. But that's a business decision. You know, everyone has to make that kind of decision. But really, when you think of licensing, sometimes people think of red tape and lawyers, and there are lawyers around the world, not just in the United States. But really, licensing means that a community retains ownership, right? So it's not giving away ownership, not saying this is yours. But licensing to Unicode through the processes means that that data, that code, those standards can then be picked up and adopted the world over, without the industry and the communities and implementations being concerned that they don't have the right to use this. Because all that red tape has been sorted out, it's all been solved. And so when that information about, let's say, a language of Colombia is in CLDR, then the implementers of the world, whether that's a large multinational corporation or a single open source developer on a project who's looking for some information, can pick that up and use it. And the terms are all clear. So it's maybe not normally one of the more exciting topics, but I look at licensing as a very important enabler in making it possible to do all these things.
It's what makes it possible for all these different parties to sit down together and collaborate on data, to collaborate on even code exchanges, and to really benefit the world at large. Licensing is very key in that respect. [00:40:09] Speaker A: It's a wonderful mission because we are so used to the transactional nature of how value is exchanged on the planet that whenever you are looking at high levels of value and impact, you immediately start thinking about, okay, how much is this worth? But of course, there is an ideal future. Even as I'm sitting here, I'm inspired to figure out, have Colombians thought about this? What is it that we could do to allow future generations to have access? In my mind, it's like this universal typewriter. It just can't do it. It just can't do it. But it needs to learn it somehow. If those alphabets are there, if those orthographies are there, then probably there is a future where a lot more insights can be gained from the understanding across different languages and cultures and all of that. But for anyone here, what does the ideal future look like for you? Just take a couple of minutes and I would love everyone's answer on this. What does the ideal future look like for you with Unicode? [00:41:23] Speaker B: I would love to see a future where, I know it sounds simple, but everybody can use technology to its fullest in their native language without delay. What I mean by that is we have the ability to support different languages. And so there's a breadth, like, how many languages do you support? But then on top of that, there's another dimension, which is, what are we supporting and to what level, what degree, how well are we doing that? And so then you could say breadth and depth, but I'll say breadth and height, because there's a metaphor that we have where there's sort of an internationalization stack. It's like, okay, have you gotten your language into the Unicode standard? 
Once you have the standard, then you define all the properties and stuff. Then you have these algorithms, then you put that in the library. On top of the library, then you have the fonts and then you have the typing. If we just have the fonts and we don't have the typing, then it's a one-way experience. But if we can type, now it's a truly interactive situation, and then it goes on from there. There's more and more. And then just race ahead to today: we're talking about AI and large language models, all this natural language processing, and what languages is that being done in? What languages is that not supporting, or not supporting well? And what that's exposing is that, you know, we haven't caught up all of the languages in terms of data. Like, we don't have all the infrastructure in place so that they have the body of text to make that meaningful for them yet. So they're kind of lagging behind. So it would be great to see that anyone can use technology. Like, when a new feature rolls out, it makes sense to roll it out for all the languages instantaneously. And it's not like certain languages have to wait. [00:43:15] Speaker D: Yeah, 100%, I would agree, and I would say that barriers towards using languages in digital spaces really negatively impact that community's ability to move ahead in language revitalization goals, if we're talking about languages that are in that process of revitalization within a community, and just for people to use their language reliably in day to day scenarios. Technology is so deeply integrated into our day to day lives now that not being able to use what we would consider at this point basic technologies within a language is going to cause issues and challenges for that community to be able to actively use and maintain their language. 
And so we believe, or I believe, that ensuring linguistic equality within technology is one piece of a much larger puzzle to advance the bigger goals of language diversity and vitality in a global context. And Unicode plays a massive role in that work. [00:44:30] Speaker C: I would agree 100% with both of those future visions. I would like to take a slightly different tack and look at it from the perspective of the language communities. And that is, I would like to see a world where language communities expect their language to be displayed, to be handled respectfully. I mean, now there are many places in the world where that's just not expected. You go to the mobile phone shop and you don't see your language anywhere. Right. But I would like to see where that's not the case, where people expect their language to be supported, where they don't feel like their language is too small, too unimportant, too complicated to be worthwhile on the world stage. And I'd like to see where people are making use of their languages, where they're making use of the tools that are available. So that's from the user's perspective. And thinking about the DDL, I'd like to see where we have a complete story and we have all the tools available, all the information available to language communities so that they can make use of those. They know where to jump in, they know the next steps to take for their languages. [00:46:00] Speaker A: And I love all of those perspectives because they also give a scope of everything that's there to be done. It's amazing the impact that Unicode can have. I'm very curious now about Latin America. How well are we doing? But of course there is a lot of volunteer work here in the Unicode Consortium. As for nonprofit organizations, I lead a small nonprofit as well, and I know how it goes. There is never enough capital to do things. 
So how does Unicode fund its operations and maintain its infrastructure? How does that structure shape its overall mission? Stephen, if you can tell us about that. [00:46:49] Speaker C: Certainly. So Unicode has, I think, three or three and a half full time employees, and much of the work is done through volunteers. It is supported by its members at all different levels of membership. It's supported by donations. I'm a lifetime member of Unicode. So it's supported in different ways, and there are even ways you can get involved in that. But I would say, because of being largely volunteer driven, there's certainly always room to help and plug in. If something piques your interest, certainly get in touch with Unicode. And I would also say that oftentimes, realistically, the level of support for something might vary depending on what volunteer time is available. With volunteers, as you know from nonprofit organizations, people are more or less engaged over time, and there are all those challenges and opportunities with that type of a structure. Like I said, the opportunity I think of with efforts such as the DDL is to at least have a place where people can talk about where needs are and provide a place for additional work to plug in. [00:48:26] Speaker A: Of course, we found that out at Multilingual as well. Those that are a part of our audience have seen the Unicode events being promoted on our website, and they are always great. We have a few different articles that we have worked on in the past and we look forward to working on more of those. Elango, as vice chair of the Community Engagement Team, we're here doing some community engagement. You must be loving it. How do you recruit? How do you support all the new contributors? And then tell us anything else that you want to tell us today. 
[00:49:04] Speaker B: Yeah, yeah, thanks for the question because I think it's a really important question because Unicode as a primarily volunteer organization has very few actual paid staff. I think just a few, just a little over three full time staff to do all of the work that we're talking about and we haven't even gotten into the details of it, but I think from what we said, you can just surmise just how much is going on. Being able to continue to punch above our weight is really, really important. I really like what Stephen was saying about his vision of Unicode as making things as easy for language communities to get involved as possible. Because I think that is kind of relevant to this question here too. How do we make the information that Unicode has as easily accessible to language communities so that there's more awareness and knowledge? The more that we share knowledge about Unicode, the more likely we have understanding and better productivity from language communities trying to get themselves into Unicode or to improve their situation. But that also naturally extends to how people contribute, whether it's to the language data in CLDR or if it's code in icu. We get a lot of bug fixes and language data fixes from outside open source contributors. So that type of information preservation and surfacing education, I think that's just so critical. And so I really like the strategy that we've taken in that regard to better engage. Because I think what's been the case for a long time is that we've got super, incredibly smart people who have just been so productive again punching above their weight just like for the past several decades. And they continue to do it and it's just like it's amazing to be within and be a spectator. But also, you know, I have to contribute but like it's just, you just like pop up and you just watch what they do. 
But then they're so in the trenches, they're so productive that there hasn't really been time to get this information out. We used to have different types of community engagement. I would say there was one primary event called the Unicode Conference. And I think before this inflection point, around 2007, 2008, they were still trying to make the case for we need to adopt Unicode and here's why it's important. And going around the world, they had that conference actually multiple times a year. Once Unicode became the de facto way that we get all the languages into technology, then it became more of a yearly cadence. But then the pandemic kind of threw a wrench in a lot of things. And then our last Unicode conference was the year after, so it was like 2021. And so then after that we said, okay, there's not going to be this event and this forum anymore, but what do we do now? And there was a need, because those types of in person engagement, even just among contributors, is important so that we can brainstorm and come up with new ideas. The ICU4X project, which was taking ICU and making sure that it can run in new devices that are a lot smaller, right? Portable wearable type devices, just any type of resource constrained thing, or even new programming languages. That project, which is part of the future of Unicode work, came from the hallway track. It came from conversations in a hallway at the sidelines of the Unicode conference. And so having those spaces available is important for the practitioners and for the implementers. But you know, on top of that it was like, well, let's not just recreate the same old thing. Let's just not have a few hundred people meeting somewhere in California every year in the same spot and it's the same people. 
So that's the, so in the, in kind of the, the planning for what to do next, you know, we, we kind of went back to basics of like, what does it mean to engage with the community, how, what communities are we trying to reach, what audiences? And so then that's also the origin of the community engagement team. And before jumping into in person events, we actually thought, well, there's a lot of low hanging fruit, there's a lot of engagement like online that we can do in which we can reach so many people who can't travel easily, but they have an interest and you know, travel is one thing, the cost is another. So we started with some building a library of content and so we have our YouTube video recordings. We also did webinars to say, here's an intro to, here are the basics of Unicode and internationalization, here are the basics of bi directional text, here's the basics of, you know, other things. And as we did that, it was actually incredible to see the numbers we would have these webinars and we announce them and we would see a few hundred people join, but we would also see them from 60, 70 countries in one webinar. And that is just amazing. We had never had that type of participant. How could you have that type of participation in an in person event? And we were seeing that for those, especially the really, really popular ones, we were seeing that. And it was just, it was just confirmation that we're on the right track and that we're, we're doing things to reach more of the people that we want to reach. Right. And we're, we're not, we're unlocking this information and we're being able to share it in real time with, with people. So, you know, and this is also to say, like the vision for Unicode, it's to do more of these types of events, but take it on the road, take it to the people. And how do we do that? 
How do we try to make sure that instead of always trying to have them meet us where we are in our time zone or in person, how do we kind of meet them where they are? So one of the things that we did was we said, we've got these webinars, we've got these recordings, and that allows us to have this immediate reach, this global reach, and it still has that real time feel. And you could see people talking in a chat. And that was great. But there's still no replacement for in person interaction. I still believe that. I know that this is after the pandemic, but I still believe we're social creatures. And so we did come up with a new event. I wouldn't say to replace, but to kind of create something new that's not the same as the Unicode Conference. And that's the Unicode Technology Workshop. So the Unicode Technology Workshop is now in its third year. We've had two instances that were hosted at Google in 2023 and 2024, and the upcoming one is going to be held at Microsoft. So we have a new host, slightly different town, one town over, and that's this year, November 11th through 13th. And the structure of the Unicode Technology Workshop and the name of it are meant to imply that there's something different about this. This isn't just a bunch of expert practitioners speaking at everybody else and saying, here's what I've done. We have that, because it's about sharing and you need some time to go into the details of cool things that you've done. But we also want to have it be more interactive. So we have deep dives where people can go and say, okay, here's a hands-on way to take this cutting edge thing and use it for the first time, and we'll walk you through it. So we have deep dives like that. We have unconferences, which you may have participated in at other events, where people who are there in person decide in real time before they start, like, what topics am I interested in? 
They kind of propose them, they vote on them, and then they actually vote with their feet. They kind of mingle. As the sessions are going in parallel, they go back and forth. And then you have, like, lightning talks and things. So it's very interactive in that sense. And it gives different people different ways to participate. This year, for the first time, we're going to have tutorials. So there's been a need, there's been a desire for people who are beginners to show up to an event and learn about things before they jump into the deep end of the pool. So we're going to have tutorials for the first time. We're excited about that because we think that there's still a lot of need for this knowledge. I think there's an increasing awareness that just average software engineers who are creating applications, they need to know this stuff. And, you know, when you already have all of these experts who've been doing this for so long, why not go to the source and learn it? So we're looking forward to that. And it's still early stages. There's still a long time between now and then. So we're still looking for people to propose sessions, things that they want to present, or deep dives, and to propose tutorials, and of course to get themselves and their teammates aware and signed up. [00:59:28] Speaker A: I think that sounds great. Where do we go to find all this wonderful information? [00:59:33] Speaker B: Yes, the website is unicode.org/events/utw, but if you just go to unicode.org/events you can find a link from there. [00:59:45] Speaker A: Fantastic. We'll find a way to include it in our promotional materials for those that are listening. Hopefully you can participate, or at least get to know what Unicode is doing. We are almost wrapping up, but before we go, we will do a little exercise as a segue to a question to Steven. What is the Adopt a Character program? Steven. 
But to get there, we're going to do your favorite Unicode character and why. We'll do Bridget, we'll do Elango, then Steven, and then we'll stay with you, Steven. [01:00:18] Speaker D: My favorite Unicode character out of the, how many are there now? Hundreds of thousands of options to choose from. I would say that my favorite Unicode character is the combining comma above, 0313, and that's because it's a really simple little character, but it shows up a lot across a lot of different indigenous languages here in so-called British Columbia. And fun fact, actually, when Stephen talks about the Adopt a Character program, everyone will understand, but a number of years ago, myself and some colleagues actually adopted the combining comma above character as a thank you gift for another colleague who was with our team for a number of years and was leaving to start a new role in a different company. He was Daniel Yonah. He's a massive champion of indigenous language technologies. And as a thank you from the rest of our team to him, we adopted the pesky combining comma above character that gave us so much trouble over the years of trying to make it work within the various things that we were building. So that is definitely my favorite Unicode character. [01:01:44] Speaker A: Elango. [01:01:46] Speaker B: For me, this is just more of a personal thing. I think my favorite character is the Tamil vowel sign E, or I, as hex code 0B87. And that's because it's the first letter in my name. But if it weren't that, then I think I would have the thinking face emoji. I like that. Just kind of like, nice. [01:02:10] Speaker A: I've always liked the shark. The shark, the one that's like biting in the air. It's like sideways. I love that emoji. I love it. I use it in very weird moments. Happy birthday. Here's a shark. There you go. It's like my signature. [01:02:25] Speaker C: Now. 
[01:02:28] Speaker B: I was wondering, how do you get a shark into a normal conversation? [01:02:31] Speaker A: But this is a weird one. Yeah. Stephen, what's your favorite character? [01:02:37] Speaker C: Yeah. Of all the 150,000 plus encoded ones to choose from, and there are many more code points in the 21 bit space, I would say my favorite is 0127, which in Maltese is called H maqtugħa, or cut H. It's H with a bar through it, sometimes confused with the Planck constant, but it's the voiced H in the Maltese language. Very dear to my heart. Spent a year there and many other times. But it's basically one I've used as a way to see, is Unicode fully supported here or not? And I even named a cat with that character in its name. But one of the ways you can support Unicode, and especially the work of the DDL, is through Unicode's Adopt a Character program. And you can actually adopt a character. So I've adopted the Maltese H. I even put it here on my business card. You get a little ring with the adoption certificate. And that work goes to fund a lot of the Consortium's work on digitally disadvantaged languages. I myself received two grants funded through this for my work as the editor of the keyboard specification, which is where we're bringing the world of keyboards into this locale data system. Yes, exactly. So that keyboards also work properly everywhere. But you can go sign up for this. You can just go to unicode.org and you can find the Adopt a Character program, and you can sign up at different levels. There are different benefits that you have at different levels. You can also see what other adoptions are there, and you can learn more about the different types of grants that are being supported by this program. [01:04:39] Speaker A: Thank you so much. Steven, Bridget, Elango, any final thoughts, comments, concerns, before we say goodbye to our wonderful audience? 
[01:04:49] Speaker B: I think it's really great that they're here trying to learn about this, because I think this is so impactful to so many people. I mean, this is everybody. This is everybody in the world having the ability to access technology in their language, which I think matters so much. And I'm glad that they're here, and I hope that they will realize that the more that they dig, the more that they'll keep finding. [01:05:20] Speaker C: Yeah, I would say go out and expect your language to be supported well. And if it's not, find out why not. And maybe there's something you, too, can do about it. [01:05:29] Speaker D: Absolutely. And there's lots of people out there, including folks right now on this podcast, who are here to help. So if you're listening and you're thinking, oh, I want my language to be properly supported and I want to be able to use my language in digital spaces, but there's something preventing you from doing that, reach out. There's people that can help answer questions and put you in touch with the right places to make it happen. [01:06:06] Speaker A: All right, thank you all. Thank you, Camila. And backstage, you can cue us out, Mila, with that. Wonderful. That's what we mean. Some hot tea, perhaps? Who knows? Who knows? But thank you, everyone who has listened to us. Today we had a conversation focused on the Unicode Consortium, the nonprofit that develops the Unicode standard so that text and symbols display correctly everywhere in every language. Now we have a much better understanding of what we are talking about. You can volunteer your time. You can volunteer your resources and your life, even, if you want. Thank you, Steven, Bridget, Elango, for joining us. [01:06:55] Speaker B: Thanks, Eddie. [01:06:56] Speaker C: Thank you. [01:06:57] Speaker D: Thank you very much. [01:06:58] Speaker A: All right. 
And for everyone who's listening, remember to subscribe, like, comment, and follow our podcast and our magazine. Our team will gladly follow up if you need a subscription as well. Without any further ado, this is the end of Localization Today. My name is Eddie Arrieta, CEO here at Multilingual Media. Until next time, goodbye.
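
For readers who want to look up the three favorite characters named near the end of the episode (0313, 0B87 and 0127), here is a minimal sketch using Python's standard unicodedata module. It is not something the speakers showed. The helper name describe and the sample string are illustrative assumptions. It also checks the "21 bit space" remark by measuring the highest code point.

```python
import sys
import unicodedata

# Code points quoted by the speakers (hex values as spoken in the episode).
FAVORITES = [0x0313, 0x0B87, 0x0127]


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME (general category)' for one code point."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for cp in FAVORITES:
    print(describe(cp))

# The "21 bit space": count the bits needed for the highest code point, U+10FFFF.
print(sys.maxunicode.bit_length())

# U+0313 is a combining mark, so it attaches to the base letter before it.
# The string below is one visible letter but two code points.
sample = "k\u0313"
print(len(sample), unicodedata.combining("\u0313"))
```

A nonzero combining class for U+0313 is one reason the character can be troublesome in practice: fonts, keyboards and text processing all have to treat the base letter and the mark as a single unit.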
