Episode Transcript
[00:00:03] Speaker A: Hello. Welcome to Localization Today, where we explore how language, technology, and community converge to unlock ideas for everyone, everywhere. I'm Eddie Arrieta, CEO here at MultiLingual Media, and today we're joined by two innovators in multilingual content. First, Helena Batt, director of localization at TED Conferences and author of the MultiLingual magazine article "Voices Without Borders: How AI Shapes Global Communication," who has led TED's evolution from working with volunteers to building an AI-driven, human-reviewed dubbing experience in 115 languages. And joining her is Guy Picard, CEO of Panjaya, whose AI deep-real technology delivers seamless lip-sync dubbing and natural voiceovers for global brands like TED and JFrog. Together they are transforming how video content reaches and resonates with audiences around the world. Helena and Guy, welcome, and thank you both for being here.
[00:01:12] Speaker B: Pleasure.
[00:01:13] Speaker A: It is great to have you here, and there is a good amount of planning that goes into every episode of Localization Today, so I'm really glad we have the opportunity to do it. And if you'll allow me, we'll do several questions for each of you. If you want to jump in at any point, just feel free to do so; our audience really appreciates that. We can begin with you, Helena. Your article in MultiLingual magazine opens with a stark statistic: that 90% of digital content is inaccessible to 6.5 billion people due to language.
How did that realization shift TED's strategies and priorities?
[00:01:54] Speaker C: Yeah, thanks, Eddie, and it's really a pleasure to be here.
This topic actually means quite a lot to me personally, because it's where two things I care about meet.
Making great ideas actually reach people and doing it in a way that feels real.
And that stat you mentioned, the 90%, you know, that really forced us to confront where we were falling short. At TED, we'd been struggling for years to help audiences beyond English fully engage with our content. And even with subtitles, the gap wasn't quite closing.
And so we reframed localization as part of the product experience, not just a nice-to-have or an afterthought. And yeah, we finally had the right tools to close the gap, and in a way that really felt more human. And that was kind of why we wanted to capture the journey with Panjaya, our partner, while it was still imperfect and real. But yeah, the main idea behind writing the article now was to show what thoughtful AI looks like at a moment when people are kind of questioning it.
[00:03:26] Speaker A: And Helena, there was a transition there for TED.
So you started with subtitles, and you mentioned that they didn't really close the gap there. So why did TED decide that subtitles weren't closing the gap? What sort of behaviors or feedback made that clear?
[00:03:45] Speaker C: Yeah, so, you know, subtitles are great, but what they do, essentially, is deliver information. Right? Dubbing delivers the experience.
And in a way, subtitles had always been a compromise: they gave access to the information, but not quite the experience.
And the experience is what makes people care and, you know, stay tuned in. So we've heard from our users, and we know that in various markets, dubbing isn't just preferred, it's expected. People would actually tell us, you know, "I want to hear the speakers in my own language, not just read the words on the screen."
[00:04:31] Speaker A: That is a huge part of the experience. And, you know, we've had guests on our Localization Today podcast who talk specifically about how some of these things are a matter of preference.
And at the end of the day, it's also about access.
So many people do not speak English. And for us, the lucky ones who have all these presentations in English, well, we don't feel the pain. But for those who only speak their original language, let's say Spanish, their access is already so limited without tools like this. I imagine this is no easy task, right? And we've seen solutions trying to get there. You know, you partnered with Panjaya to create AI-adapted video dubs. What was, for TED, for you, the aha moment when you realized that GenAI could actually replicate the emotion and lip sync to create that connection you're talking about?
[00:05:26] Speaker C: Yeah, I would say seeing some of our earlier dubs, some examples that Panjaya shared with us in the beginning, and really seeing the speaker's own rhythm and emotion. I think we chose French, which is one of the languages I speak, and it was just transformative.
Plus the perfect lip sync. Yeah, I remember it actually gave me chills, and it felt really real. And it showed us that this was really about preserving the humanity of the speaker while meeting the audiences where they are. And yeah, one thing to point out: I feel like a lot of people say they're tired of AI just being slapped onto everything, into all the experiences. And I personally get it; as a consumer, I feel that too. But, you know, at TED, what we really focus on is solving a real pain point, a real problem, not just trying to replace humans or chasing some sort of trend. You know, audiences don't really care that it's AI; what they care about is that it feels like the speaker is talking directly to them. And it's really important to use AI, you know, where it actually improves the experience for the people on the other side of the screen.
[00:06:53] Speaker A: It's very, very interesting. AI seems to fall in this category of things that you only care about if you feel they are there.
It seems like, through the experience that you're offering, things are a bit more organic and it feels real, just like you were saying: a person, perfect lip sync. So, Guy, what has been the process of building a technology that can deliver those chills, that level of perceived perfection?
[00:07:21] Speaker B: Well, you know, it's a pretty long and deep process, and I'll try to keep it as simple as possible. When you think about what needs to come together to create a perfect, let's say, reshoot, we actually consider our outputs a reshoot and a replacement, as if you would take the same content and recreate it in a different language. And there are so many technologies and processes that need to work together like an orchestra, and work really well together. So I'll talk a little bit about what goes on behind that. What we try to do here is really automate the whole process. So for any video that is uploaded to our pipeline, we start with channel separation, where we need to separate the speech, the speakers, and the ambience with frame-level precision, so we know exactly what happens at each second of the video. Then there's everything around transcription accuracy and speaker diarization, understanding who speaks when, in which segment. Then there's creating the perfect translation that is context-aware, and that's another part we've been investing a lot into: models that will understand the context, understand the personality, and create the right translation that fits dubbing, versus a translation that fits text. On top of that, we have everything around adapting the speech, so resynthesizing speech for a timing that fits automatically. As we know, different languages have different lengths; if you take English and just translate into German, you're probably about 30% off, and with other languages the gap between the original and the new language varies. We have to take all this into account and build that properly, and then create speech that sounds natural and right and captures the emotion of the original, reflected back into the new audio track. And on top of all of that, we had to build, and we built, what we call Real Face, which is our video sync model. That is pretty much a breakthrough, the way we saw it; it took us a couple of years to get there. The way we do it is that we modify the speaker's face to achieve accurate lip sync across the syllables of the new language, and we preserve the original speaker's expression as well as the video's lighting, color, and motion, all that, just to make sure it feels as natural as possible.
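To make the stages Guy lists easier to follow, here is a minimal sketch of how such a dubbing pipeline could be wired together. It is purely illustrative: every function name, the stubbed return values, and the character-count timing heuristic are assumptions for the example, not Panjaya's actual implementation.

```python
# Purely illustrative sketch of an automated dubbing pipeline of the kind Guy
# describes. Every name is hypothetical (not Panjaya's API); each stage is
# stubbed so the structure runs end to end.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker_id: str   # from diarization: who is speaking
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    text: str         # transcription of the original speech

def separate_channels(video_path: str):
    """Split speech from ambience/music so only the voice track gets replaced."""
    return "speech.wav", "ambience.wav"                                   # stub

def transcribe_and_diarize(speech_track: str) -> list[Segment]:
    """Frame-accurate transcription plus speaker diarization."""
    return [Segment("speaker_1", 0.0, 4.2, "Ideas are worth spreading.")]  # stub

def translate_for_dubbing(text: str, target_lang: str) -> str:
    """Context-aware translation tuned for spoken delivery, not on-screen text."""
    return text                                                            # stub

def time_stretch_factor(source: str, translated: str) -> float:
    """Rough timing fit: languages expand or shrink (English to German often
    runs ~30% longer), so the synthesized speech must be retimed to the slot."""
    expansion = max(len(translated), 1) / max(len(source), 1)
    return 1.0 / expansion        # < 1.0 means the new speech must be compressed

def dub(video_path: str, target_lang: str):
    speech, ambience = separate_channels(video_path)
    plan = []
    for seg in transcribe_and_diarize(speech):
        translated = translate_for_dubbing(seg.text, target_lang)
        stretch = time_stretch_factor(seg.text, translated)
        plan.append((seg.speaker_id, translated, stretch))
    # Downstream stages (not sketched): synthesize emotional speech per segment,
    # remix it over the untouched ambience track, then run the lip-sync model on
    # the speaker's face while preserving expression, lighting, and motion.
    return plan

print(dub("ted_talk.mp4", "de"))
```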
[00:09:38] Speaker A: And that is the seamless experience that Helena was talking about. It's very, very interesting.
We understand, you know, just barely, what the technical difficulties would be. So you're mentioning voice analysis, advanced translation, the whole lip sync, the whole expression aspect of it. It's very interesting how you kind of tilt or generate an expression for anything in any language.
Which step would you say was the most difficult in the years this took to evolve? And this is a question for both of you. Helena, probably, you know, as you are orchestrating this, as Guy nicely put it, all these players, right, all these speakers: do they want their talk to be in Chinese? Do they want their presentation in a different language?
So how do you both navigate those realities?
[00:10:32] Speaker C: Yeah, it's a good point, and Guy already touched on that. The lip sync, I think, was one of the hardest parts. You know, some of the earlier versions felt a little bit uncanny, I think, especially at side angles and things like that. So we worked with Panjaya, which, again, has been an incredible partner throughout this journey, and we added human review and kept iterating until it really felt natural. But I'd say beyond the technical side, you know, this is someone's voice and reputation. Right?
So from day one, we built safeguards, you know, to give speakers full visibility and full control over whether or not their talk is dubbed. And we made sure everything is clearly labeled as AI-adapted. Even those who were hesitant at first have started opting in, and not because we pushed, but really because the dubs feel authentic and respectful. Yeah, I think for us, permission and transparency aren't just, like, best practices. They are built into our process from the get-go.
[00:11:50] Speaker B: Yeah, I mean, I would echo Helena here, you know, on the lip sync itself, which is the tail end of this pipeline. The lip sync process is the last process that kicks in, and it pretty much wraps everything up; if it wraps it up the right way, the output video is going to be as natural as possible. To do that, there's a lot of prep work: all the pieces that happen before it in the pipeline produce the input for the lip sync model. And the challenges here were, you know, you can see out there some lip sync models where you have a single person facing the camera, and they work pretty well; that's derivative of avatar tech. In our case, since we're dealing with real people, with human performance, that's our camp, that's the genre we believe in, there are a lot of things that are imperfect. You have angles, you have occlusions, you have multiple screens. For example, the TED Talks had Jumbotrons in the background, and you want to capture this properly. The same speaker appears at different sizes in each of these Jumbotrons, plus the main stage, and we have to make sure we orchestrate the lip movement across all these screens at the same time. You might have multiple speakers, angles, lighting conditions that are inconsistent, or even non-studio-grade footage. So for all of that, we had to really train for a very long time and iterate and iterate until we nailed it. I would say that's one part that is really hard, but the other part is making sure that we maintain the context, the intention, and the delivery of the original in speech that's natural, that understands where to apply pauses, that understands how to deliver the context of the original talk. And all this with automation.
So it's one thing to be able to craft a high-quality dub with a lot of manual work through a studio. But I would say the innovation here is to get ninety-something percent of the way there through automation and then leave some 10% for the creative agency, the control, and the QC to really perfect it. And our goal is to create the best out-of-the-box output, to minimize the work of any human reviewer.
[00:13:58] Speaker A: And it's great that you mention the human review, and I'll ask you, Helena, a little bit about that. But Guy, before we get to ask Helena about how humans are involved in that review, what is the role of humans in your process at Panjaya?
Where do you see them as the most valuable assets in there?
[00:14:18] Speaker B: You know, while there are metrics to evaluate outputs in AI, what we find is that the most accurate metric, eventually, is human reviewers. So we have language people who understand the native aspects of each language, and we have a mechanism by which we're grading all the aspects of the delivery, the quality, the prosody, every aspect that is part of a video delivery. We have a way to rank and grade them, and we take that into account as we fine-tune and improve our models. That's something we cannot do without people involved. That's one aspect. The second aspect is that there's a lot of art and science involved here, and that's part of what we do. We're doing a lot of experiments; when we try to develop a new model, experimentation is a big piece of it. There are breakthroughs and there are setbacks, and we just need to iterate through that until we get to the point where we actually feel we have a stable model that can run and generate output in a consistent and, I would say, predictable way, output that is high quality.
[00:15:20] Speaker A: Quality and it's great. It always feels that we come back to the humans for that element of like creativity, innovation, experimentation. So really glad to hear from you, Helena. You know, in your case, how. Why is the human review step guarding against AI hallucinations? Non negotiable. How do you integrate volunteer translators specifically now into that workflow?
[00:15:44] Speaker C: You know, AI is something that really gives us that speed and scale. But I think humans, essentially, can help us tell if it still sounds like a real person. Yes, we have seen AI hallucinate; we've seen some instances where maybe cultural cues are missed or some meaning is distorted. But, you know, our translators, that's where they step in: to really refine the output from the AI, correcting any cultural mismatches and just making sure the speaker's voice stays true. You know, these may seem like small nuances, but they are pretty critical to get 100% right. And I would say, you know, from earlier tests, we know audiences don't just notice what you get right; they notice if something feels off. And so that's our job, basically, to make sure it never sounds off, but stays very natural and as seamless as possible.
[00:17:02] Speaker A: That's great. And we'll talk a little bit about the future in a bit. But Guy, you did tell us a bit about metrics and how humans look into quality. Could you tell us in more detail what metrics or checks your human experts use to ensure emotional fidelity and accuracy, which is, you know, the end goal here?
[00:17:24] Speaker B: Yeah, so, you know, we break down every aspect, from translation to speech delivery, quality of sync, and then eventually quality of lip sync, naturalness, identity. Each one has its own subcategories, and we rank it, you know, from 1 to 10, and then we collect all that information, summarize it, and create trends. And as we iterate through every release, we go through the same thing again and again: A, making sure there's no regression, but B, trying to figure out where there are opportunities for us to improve. There is the whole delivery at the end, and there are questions around, you know, how natural it felt, or could you watch through the whole thing, more generic questions. But as you break it down into the subcomponents, there's a lot more precision there. And we realized that there is variation between languages; we didn't get all the languages to the same quality, they're at different levels. But as we've improved and improved,
we found ways to improve subcategories in different languages, so the overall, I would say, sigma of the whole pipeline will produce a result that is not uncanny. Eventually, that's what we need to make sure of. Now, there's one level, which is to create a new talk, in this case, that feels natural and correct. And there's the next level, which is: did we actually capture the human performance?
So when someone is more excited, when someone pauses, when someone has a specific posture, those are the areas that the advancements in the models we train manage to capture. That's when we feel, okay, we nailed it. These are the specific pinnacles where, if you actually reach them, you feel you've created something that doesn't always feel like AI. You know, we're not necessarily leading with AI as the mechanism here; AI eventually helps us generate this, but it's just a process, it's like compute. The fact that this is AI is just a different way to create the compute and replicate. But almost like Helena said, this is a use case where the advancement of AI helped us create this product, but it's not what we're leading with. We're not leading with the AI aspects. We're eventually leading with the fact that we managed to, I would say, disrupt a traditional way of doing things, one that was maybe inefficient and expensive and slow, and actually add a lot of efficiency and hopefully create an experience that is better than what you could otherwise create if you had to use a dubbing studio, where there's no way to manipulate the lips the way you can with computer vision.
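As a rough illustration of the review loop Guy describes, grading each aspect from 1 to 10, summarizing per language, and checking each release for regressions, here is a small sketch. The aspect names, scores, and tolerance are invented for the example and are not Panjaya's actual metrics.

```python
# Illustrative sketch of a reviewer-scoring loop: native-language reviewers grade
# each aspect of a dub from 1 to 10, the grades are summarized per language, and
# each release is compared against the previous one to catch regressions.

from statistics import mean

# release -> language -> aspect -> list of reviewer grades (1-10), made up here
grades = {
    "v1": {"de": {"translation": [8, 7], "prosody": [6, 7], "lip_sync": [7, 8]}},
    "v2": {"de": {"translation": [8, 8], "prosody": [7, 8], "lip_sync": [6, 6]}},
}

def summarize(release: str) -> dict:
    """Average the grades per language and per aspect for one release."""
    return {lang: {aspect: mean(scores) for aspect, scores in aspects.items()}
            for lang, aspects in grades[release].items()}

def regressions(prev: str, curr: str, tolerance: float = 0.5) -> list:
    """Flag any aspect whose average dropped by more than the tolerance."""
    before, after = summarize(prev), summarize(curr)
    flagged = []
    for lang, aspects in after.items():
        for aspect, score in aspects.items():
            old = before.get(lang, {}).get(aspect)
            if old is not None and score < old - tolerance:
                flagged.append((lang, aspect, old, score))
    return flagged

print(summarize("v2"))          # per-language, per-aspect averages for a release
print(regressions("v1", "v2"))  # e.g. lip_sync dipped in German -> investigate
```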
[00:20:04] Speaker A: And Guy, I think the interesting, crazy, funny thing here is that, as Helena was mentioning, it looks like we, the consumers, are becoming a bit cynical. We expect perfect lip sync, we expect perfect dubbing, we expect zero mistakes; we don't want to even smell that AI is there. And it's so difficult to orchestrate all these elements. I guess that's the beauty of sophisticated, simple solutions: you don't really feel them. And things should improve; I mean, we're no longer surprised by the improvements in AI, we're kind of expecting the next big thing. So, looking ahead, for you, Guy and Helena, what AI capabilities excite you the most? Real-time dubbing, which would be crazy, adaptive personalization, real-time lip sync? What excites you the most for global communications, and, of course, Helena, for you, for the future of communicating TED's amazing content?
[00:21:01] Speaker B: I can start on a few. I think you pretty much mentioned a lot of it, you know, live dubbing. For me, live dubbing is not today a problem of architecture or streaming or quick processing; the challenge that we need to solve now is accuracy and quality. And as we've been working really closely with Helena and the team at TED, which is again an amazing partnership, the way we emphasize quality at every step of this pipeline is the foundation for eventually turning this into live dubbing while maintaining the quality bar. So I think that's a big piece, and there are use cases around sports and news where, if you can have high-quality commentating of a game and deliver it in a hundred languages in close to real time, and let people experience it in their own language, that for me is an amazing experience, and I don't think that's so far out. Then the next layer on top of it is adaptive personalization: being able to really receive communication that is relevant for you at the time, adapted, generated, and delivered in real time. I think that's where we can see a lot of breakthroughs in terms of use cases and applications. And I definitely see that we're going to get there. I don't think it's too far out, a couple of years, two, three years; we're probably going to see this work, and maybe even earlier.
[00:22:22] Speaker C: Yeah, I feel very similarly. I'm mostly interested in seeing how personal and meaningful we can make the experiences for people. You know, it would be amazing to see dubs that adapt to your dialect, not just your language, but very granularly, specific and personal to you. Yes, live events are, I think, a very awesome use case, you know, where everyone can hear the speaker in real time and not be left out; I think that would be really, really cool. And like I was saying earlier, if you use AI to really restore that connection between the speaker and the audience, and not try to replace it, I think, you know, that's when you're on the right track. And yeah, less chasing trends and more asking, you know, does this help people feel connected? But yeah, I think there are some exciting things ahead of us.
[00:23:30] Speaker A: This morning we were having a conversation as well, and we were talking about the shiny objects that some companies chase, even in content strategy and content marketing; they don't really look into some of the things that we've looked into in this conversation. Guy and Helena, I want to thank you, first of all, for your time.
There are lots of insights here, and we'll probably have follow-up conversations with each of you to learn more about what you're doing and how things are progressing. But before we go, is there anything else you'd like to add about how AI is redefining multilingual content, and what companies and organizations can learn from the experience you've both had dealing with TED in all these languages?
[00:24:15] Speaker C: So I think, from what I'm hearing across the industry, you do hear that dubbed content viewing is growing, especially in the U.S. I think there was some stat, like over 20%, maybe 25% year over year, I guess in 2024, showing especially younger viewers watching dubbed content.
I know I've heard Netflix has been adding more dubbed content over the years and seeing higher engagement and retention. So, you know, there are a few numbers that indicate there is a growing demand for multilingual content. And for a long time it's kind of been overlooked. I feel like now it's becoming more important, and companies are realizing, you know, there are gaps that could be closed, and with the technology we have access to now, I feel like it's a missed opportunity not to go out there and use it and really make sure you are speaking the language of your audience and reaching them. And I can only reiterate the reaction we've seen from our audience shortly after we published this content, the dubs in different languages: hearing the feedback from the viewers themselves was incredible. And then looking at the data and really seeing a huge spike in viewership and completion rates, and then people coming back to view this content. So all that.
Yeah, it was very encouraging to see, and I would encourage other companies to go out there and, you know, explore this kind of opportunity.
[00:26:22] Speaker B: Yeah, I would add that, you know, in the last two years we worked extremely hard, and we were extremely particular, to get to the level we've managed to get to with the pipeline today. And I truly believe that we are at an inflection point. Now we feel that we got to the level where it's not a gimmick, it's not something for just long-tail, who-cares content. It's really here for primetime content, and it delivers quality results that actually create a different level of engagement. I'll share one stat that we have. We have an edtech company, a music education company, that works with us. Originally they would keep their education content in Europe in English only, and now they've started translating and dubbing it with Panjaya into multiple languages. In markets where dubbing is expected, they've seen a 4x lift in engagement levels, and that blew their minds: people engage with this in the local language way more than they did with English only. And that's another, you know, proof that this works and it moves the needle. It's not just a gimmick.
[00:27:27] Speaker A: That is a great way to put it, a great way to conclude. And hopefully we'll be able to share with our audience an example of this, so you'll be able to watch it if you're on YouTube. If you're listening to this on Spotify or Apple Podcasts, make sure you listen, but also make sure you look for the video so you can see this perfect lip sync, which we are going to make sure we align on so we can share it with everyone. Right, Helena and Guy?
[00:27:56] Speaker C: Yes.
[00:27:57] Speaker A: Your perfect case, I guess.
[00:27:59] Speaker C: Definitely.
[00:28:00] Speaker B: Absolutely.
[00:28:03] Speaker A: Yes. Thank you so much once again, both of you, for being here. For our entire audience, this is going to be our goodbye.
So this wraps up our conversation with Helena Batt from TED and Guy Picard from Panjaya on AI-powered localization at TED. If you enjoyed this episode, visit multilingual.com for the full article, "Voices Without Borders." Thanks for listening to Localization Today. Please subscribe, rate us on your favorite podcast app, and follow us on LinkedIn and Twitter for updates. Until next time, Guy, Helena, goodbye.
[00:28:40] Speaker B: All right, thank you. Pleasure.