Beyond Text: Translated's Multimodal Tech

Episode 319 | August 12, 2025 | 00:35:23
Localization Today

Hosted By

Eddie Arrieta

Show Notes

Sébastien Bratières, Director of AI at Translated, outlines “DVPS”, a four-year EU project to build multimodal foundation models with partners including EPFL, Oxford, and ETH Zurich. We track why language tech must move beyond text to “physical AI,” integrating speech, video, handwriting, environment (satellite imagery), and cardiology data so models learn meaning from real-world context.


Episode Transcript

[00:00:03] Speaker A: Hi, I'm Sébastien Bratières, I'm the Director of AI at Translated. We're here in Rome, in Italy. Today we're launching Project DVPS, which stands for Diversibus Vis Plurima Solvo. It's a European project about multimodal foundation models with lots of amazing partners: we've got EPFL, we've got the University of Oxford, we've got ETH Zurich and many others on board. Please enjoy my conversation with MultiLingual.

[00:00:36] Speaker B: Hello and welcome to a bus conversation with Sébastien Bratières, Director of AI at Translated.

[00:00:50] Speaker A: It's an amazing day, especially for me. This is an effort that I started almost a year ago, August 16th. It's, you know, grant writing, proposal writing, and then having the intense emotion of actually handing in the proposal on the 18th of September, against all odds, because many of the people sitting on this bus were actually telling me, you're never going to make it, this is just too far-fetched, there's too little time. Everybody started in April, we're starting this thing in August. What are you thinking? We handed it in. And not only did we hand it in, but afterwards people went like, oh, you know, actually what we wrote isn't that bad. And then in January, good news came: we'd been retained, so the project was awarded. And here we go. So, 3rd of July 2025, the project is now starting, kicking off.

[00:01:45] Speaker B: And almost 10 years after you went to Cambridge, we were talking about this. Could you tell us a little bit, for those that don't know the story, how did you start working in artificial intelligence and machine learning before it was popular?

[00:02:00] Speaker A: Okay, yeah, of course. So my personal journey started, I guess, in 2000. I was finishing my engineering studies at École Centrale in Paris. I'm French, and I grew up bilingual; my other language, apart from French, is German. And ever since I was a kid, I've been reading too many grammar books, dictionaries and things like this, like, I guess, many language professionals or translators around. Certainly from an early age on, I was interested in language as a thing in itself, some might say pathologically so. But I was also interested in engineering and science. So when I did the engineering school back in France, at some point I decided, you know, I was looking into ways to bring language and engineering together. I had the opportunity at that point to do my last year at Cambridge in this master's, which has evolved but still exists today. On this master's I studied natural language and speech processing, an amazing mix. Like you were commenting, back in 2001 this was still a very specialized, niche thing. Performance, and obviously uses, were quite restricted; some things kind of worked, but very many things were just unachievable, and so forth. I was attracted by the field, and then I started working there, in speech-driven telephony applications to start with. I was with Tellme Networks, one of the early companies doing this, and then worked on chatbots and so forth. At some point I saw that the field was being shaken up in ways that were deep and fundamental by influences coming from statistical machine learning. And so what that meant for me was that I wanted to go back to research, get a very good grounding in machine learning, to make a difference in natural language processing and language technology. And so this is what I did, in the form of a PhD.
I went back to Cambridge to do research, then, formally on, you know, the maths and the predictions, the hairy things that we have in machine learning. Always with a view on language, but it was detached at the time. And then I was lucky enough to run into Translated and Marco Trombetti, my boss today, founder of Translated, back in 2017. It was when I arrived in Rome, more or less, and I was scoping out the tech scene there, and there were many fingers pointing to Translated, telling me, you know, there's this very special place here, you should speak to them. And yeah, there was a spark there, so I joined Translated, and now I'm the Director of AI. And, well, so many things have happened in the meantime, ever since I joined. I mean, 2017 is the year when the transformer was invented; that was in the summer of 2017. And I remember that very shortly after, we had Łukasz Kaiser, who's one of the inventors, come and give a talk about the Transformer architecture. Well, it's been a journey that many have been following ever since, because the transformer, originally built for translation, let's repeat this, has obviously been put into application, and we were one of the first companies to actually put it in production in machine translation. But then, as we know now, it's the building block for large language models and all of the rest of the AI wave that we are witnessing. So that's the story until today, I guess.

[00:05:40] Speaker B: Oh, thank you for giving us that context. And Sébastien, you were mentioning off the record, as we were preparing for the conversation, that this is a very important project for language. What does it mean for language? And what does language mean for this project?

[00:05:54] Speaker A: Yeah, indeed, this is important. So maybe a word about the project itself. DVPS is about building multimodal foundation models. Inside DVPS, we do the science, the engineering and the applications of these multimodal foundation models. The core scientific matter, I would say, in this project is: how do we build what we today know works on text, namely these LLMs, for all these other modalities that currently are not, I would say, catered for, that is, for which there's no support? How do we do this? This is something that a couple of people have been trying to do, for instance, for speech, for video and images. So you have models that mix text and images; you can generate, say, videos from a still image and a piece of text.

[00:07:01] Speaker B: Okay.

[00:07:02] Speaker A: But in this project, we're looking at modalities in a much wider, much more expansive way. For us, modalities mean things like, so imagine, in language, you're writing on a touchscreen with a stylus, okay? Handwriting: the dynamics of your stylus, the recording of the points where your stylus lands. Well, this is a piece of data, a signal, if you will, that is certainly correlated to language, in the sense that you're tracing the outlines of letters. A letter can be transcribed into a typed letter, if you will. Obviously, it can be aligned with text, it can be aligned with speech. So if you write down what somebody says, then this can be aligned too.
So these are modalities that have to do with language. And often people don't think, for instance, of writing as one of the modalities. Handwriting, OCR, is another one. But now we go much beyond that inside DVPS. We have two other, we call them application domains, inside the project. One is cardiology, so medicine. Another one is environment, specifically earth observation, meaning satellite imagery. And you might wonder then, what do we at Translated have to do with these domains? They seem very far out. So I can tell you why we're doing this. Essentially, in a nutshell, we think that the future of language technologies lies in multimodality. The catchword is physical AI. We've seen, in the evolution of language technology, machine translation very specifically in our case, that we've gone from machines that would translate sentence by sentence to systems that could take into account interactive corrections, so adaptive machine translation that would take into account the edits by a translation professional and then revise the translation in real time in order to produce a better output, you know, on a local basis. Next generation, we had systems that would translate in context. I mean, today's systems certainly would not only do it sentence by sentence; the meaning of a word would be correlated to what's going on inside a passage, a paragraph or a document. And as you absorb more and more context, you have more knowledge of the surroundings, of how the language you're interested in is produced. Well, the more you have that, the more accurate and the richer the understanding of the semantics of that language. And what that means very concretely is, well, speech conveys richer expressivity than just text. Because in speech you've got a lot of, well, emotions, you've got prosody, you've got the music of speech, you've got rhythm, you've got all sorts of nonverbal effects like breathing or sighs or laughing or hesitations, disfluencies, all these nonverbals, all stuff that is not present in text. If you go one step beyond and look at the video of a speaker, there's a lot more going on there than just the audio. There's gaze, there's face direction, there are facial expressions. I'm doing all these gestures with my hands now; I think I'm progressively becoming more and more Italian. So body language and hand gestures, these are all part of the expressivity as well. But you can go one step still beyond: the surroundings. So here we're speaking, we're sitting on a bus taking us to this evening's location for the project kickoff. And just the fact that we have these surroundings means that maybe I'll be speaking a bit louder to you because there's some surrounding noise, or sometimes I'll be interrupting myself and pointing you to the Colosseo Quadrato that we just passed minutes ago. And so if you can see the surroundings of where speech is produced, then you've got an even richer level of semantics in the language. Okay, so here I'm illustrating this for language in order to open up the horizon for why multimodality is so important for language technology. Simply because, essentially, we're grounded human beings living in a world, we have social interactions, and this is where language is produced and this is where language acquires its meaning. When babies learn to speak, they learn to speak because their parents and friends point things out to them, they repeat words.
Every time they see a ball, they say ball. Every time they find something blue, they say the word blue. And so there's this grounding, which is the relationship between language and the physical reality around us, that comes into play. And so far this has all been, I would say, super challenging, to say the least, for language technology. Now, with foundation models getting at it, we're starting to see technical solutions for incorporating all this context into the understanding of language. Understanding here could mean tasks like transcription, rewriting of speech, translation, of course translation that preserves expressivity from source to target speech. But it could also be things like producing or generating speech, or speech plus video, that makes social sense. Okay, so this is a task that is out there and that currently can't really be handled. But we're looking at this because we think this is the frontier for language technology.

[00:13:01] Speaker B: Of course, there are two things that come to mind immediately. We're talking about context, and that's very human; all of the things that we're talking about seem like more human cues. And this evolves over time, and that's one of the things that we always hear: it's always about the training data. What is happening with the information that's required to train models in this new way of looking at it? Are the models going to be needing the inputs constantly? And I remember having a conversation with Marco last year. He said, I cannot tell you about this. Or at the beginning of this year he said, I cannot tell you yet about it, but eventually you'll learn more about this. What's happening with the training data? What is going to be the evolution in this new way of looking at it?

[00:13:55] Speaker A: It's got many, many answers. One of the answers is: we don't really know. Another answer is that, on the face of it, the text data that can be harvested for LLMs is plateauing. We're getting to the bottom of the vats; there's no more data to be harvested here. It's even worse than that: the data we're harvesting is progressively being sort of polluted, you might say invaded, by machine-generated text, which is very difficult to tell apart from human-generated text. Our ex-colleague Marcello Federico, who was working on the MateCat and ModernMT projects, he's now a principal scientist at Amazon, wrote an article some time ago on specifically this phenomenon, showing how much of the text on the Internet is nowadays machine generated as opposed to genuinely human. So, okay, we're getting to the bottom of that, which kind of pushes the answer in this direction: in order to understand how, literally, like you're saying, humans communicate in socially meaningful contexts, you have to actually observe and record how they do it. And currently, this is something that is very difficult to do at scale. It's not like you can just record people's everyday conversations and pretend that this is okay, legit training data to use in our machine learning. This is not the way it works. What is true, however, is that there is, I would say, untapped language data out there in the form, for instance, of movies, or of all those YouTube videos whose licenses are compatible with training machine learning models.
There are some YouTube videos like this, obviously. So these are natural conversations, or, as I was saying before, socially meaningful conversations, in sometimes formal, sometimes informal, colloquial settings, with different types of people, adults, children, different subject matters altogether. I mean, you've got a host of situations there that currently are not being used at all. So it's examples of how humans interact, material that is getting completely ignored. When you have this multimodal approach, where you take into account the video of the speakers, where you take the smallest inflections in their speech audio into account, all these things, then you can start using these recordings, these data that are still there and that nobody has used. And the best that's happened so far is that YouTube videos have been transcribed and then the resulting text has been used to kind of do as if we had extra text material. But this is poor; the material you get is very poor with respect to what the original contained in terms of knowledge and understanding. For an alien that comes down to us, it makes a hell of a lot of difference whether they only have access to transcriptions of dialogues in text or whether they actually see how these humans behave in social interaction. So here we go. That's one answer to the data conundrum: it's data that already exists, but used in ways that exploit its depth and richness much more than so far.

[00:17:54] Speaker B: And we continue to see that at MultiLingual. As we talk with content creators, or with companies that are creating global content, they are starting to realize that this dilutes the effect and the impact that their content has. So this pendulum of going completely to AI- and machine-generated content is swinging back to more originality. And at the end of the day, humans are the ones that bring that originality.

[00:18:19] Speaker A: And that's lucky. I mean, fortunately there are these swings, and people realize that there are usages of technology that, let me rewind this, a better formulation would be: even the best professionals sometimes make big mistakes about the expected performance of technological systems in language technology. And it's very easy to get it wrong. That means they use algorithms for use cases where actually this is not gonna make it. So you keep seeing this, and obviously it makes for anecdotes, for sighs, for people going, told you so, for people telling you it's never gonna make it, for other people saying, oh, but this was wrong, they didn't use it right, wrong training data, whatever. So it cuts both ways. I would say there needs to be an empirical approach to using the right technology in the right context. It's not about overpromising, or dreaming that any use case can be solved by technology however poor the odds that this is going to work. I can just shrug and say: you need to have evaluations when you put systems in production, to make sure that the quality, the performance, is adequate for the level of quality that you need for your use case. And there's no way around this.

[00:19:51] Speaker B: I think the other part of the conversation, and we were talking about it before we started the recording, is the fear that people have.
And one of the evolutions we've seen in the conversation, also within MultiLingual, from people writing and people coming to us, is that fear is fading away, and people are starting to realize that, you know, there are those that are taking on the opportunity and those that are not. This new multimodality, what opportunities does it bring for language, for language technology and the industry as a whole?

[00:20:24] Speaker A: Okay, yeah, that's the difference. I mean, the point is, it's not going to be AI against the human professional. It's going to be the human professional with AI against the human professional who wants to ignore, or who is ignoring, AI solutions when we're speaking of professional work. This is really what it comes down to: whether there exists some technology that can support you and whether you can use it, or whether you need to do without it. So it might be sad, but it's really human against human in a sort of market competition. It's totally absurd to say it's human against machine, or full replacement. So your question is how multimodality changes that game. One way it does is this: until now, let's say, language professionals had a hard time going beyond tasks, or types of jobs, that were more than just text. It was difficult, typically because of the tooling required. It's easy for professionals working, for instance, in translation or in content creation to work with text, because it's easy to access a keyboard and text-processing documents. Now, doing semi-automatic subtitling or even dubbing is more difficult. For subtitling, and it's already the start of it, you need a bit of equipment to listen to audio and then transcribe it; but there you use your keyboard, so you can kind of make it work. Semi-automatic dubbing means you have a speech synthesis system that you use to voice the text that you, as a professional, have perhaps translated, and it gets resynthesized to produce the target-language audio track for some video that you're working on. Okay, this requires quite a few more algorithms than the scenario I described before. But, you know, this is what's happening, and it means that nowadays language professionals can actually access these jobs as well. And what I'm saying is: the more modalities we add, the more job types language professionals can access, and the fuzzier the boundaries between these job types become. That means that for those of us in the profession who are creative and who can play with these boundaries, well, there are certainly more business opportunities that arise, because the material we can work with, consume and produce as professionals becomes much vaster than just the good old text document that it was until now.

[00:23:42] Speaker B: So do you see linguists' jobs becoming more sophisticated, needing more tools, needing more context, even education of a different type? So universities are going to have to change, linguists are going to have to change. How do you see that system evolving?

[00:24:00] Speaker A: That's very interesting, and I would say yes to these questions. So, using more tools, using more technology: yes. Which also raises the question, how do we educate future, I don't know, translators? How do we redesign translation studies at universities so that language professionals are educated with the right understanding of tools?
The question of context is very much to the point as well, I think. To do subtitling, for instance, you need it, and we all know that in text translation there are different levels of context sensitivity. For instance, we would easily argue that between a technical manual and a literary text or poetry, the two latter ones are the ones that require a lot more understanding of human context, of emotions, of the social context of human relations and so forth. What is changing in the job of the professional is taking into account that sort of context, and this is starting to be more important. So what does it mean? It means that some content that didn't used to be translated will simply become affordable to translate, because, well, professionals will get some of their job pre-masticated, pre-done. But then maybe they need to add the social understanding that no algorithm is able to afford or to produce. So they will need to give their human touch to a translation, because as humans they have this level of sensitivity that doesn't fall out of any algorithm. So this is certainly happening and continues to happen. In language technology we have things like competitions to test technology on challenging edge cases, very difficult tasks and so on. And obviously, and still for a long time, I believe literary translation will continue to be one of these challenging tasks, because it contains so many of these situations that are just very difficult for any algorithm to grasp. So there are all these changes underway, and challenges, as I was saying, specifically around the education of language professionals. These are things to be very attentive to as a profession, and certainly for professional bodies, because we don't want to be churning out language professionals whose skills are inadequate, who need to change jobs straight out of university because they haven't learned the right set of things. And it's true that these years this is changing very, very fast, at an unprecedented speed. So we also need to change the curricula at universities very fast. It is important.

[00:27:09] Speaker B: Yeah, I think we are on the same page. We see a relentless pace of innovation, and also adoption and business use cases everywhere today. And this is such a treat, to have this conversation with you. We are going to the kickoff. What's next after the kickoff?

[00:27:27] Speaker A: Okay, so the project is only starting today. The project is meant to last for four years; it's a very long endeavor. We're speaking of finishing in 2029. Differently from research projects we used to do in years past, thinking of ModernMT or even of MateCat, foreseeing the state of technology out there in two, three, four years is nowadays not just difficult, it's impossible. Anybody would get it wrong one way or another if they tried to predict what is going to be the case then. So what's going to happen next for us? Okay, we're going to start parallel streams of fundamental research on multimodality and foundation models. We're going to try and find engineering solutions, for instance for the compute: in language models there's a thing called scaling laws, which tell you how big a model needs to be with respect to the data you have and the amount of compute you can put into its training.
Nobody knows whether these things exist for multimodal models; we're going to work on this too. We're going to have to access high-performance compute infrastructures, so big GPU farms, in order to train the models that we experiment with, which bears its own set of engineering challenges. We're going to work on these, and at the same time we will bring the models, and it's not just going to be one big model, it's going to be a series of models, to a number of use cases. Currently we have 12 defined in DVPS. Some are in language; some, like I was saying earlier, in the environment, things like flood prediction or the interpretation of satellite images; and some in medicine, so things like interpreting cardiology imagery and trying to support physicians in delivering a diagnosis, for instance. So we're going to build all these use cases using the technology that we build. These are essentially going to be the next steps. And all during these four years, I expect that we're going to have to rewrite our roadmap several times. We wrote the first version of that roadmap with the grant proposal, and that was already 12 months ago. And just this morning we were discussing multiple ways in which what we were writing 12 months ago is already obsolete, and we might want to change a number of things. And this is just 12 months. So, you know, we're looking forward to really rewriting these plans as we move along. Some might find it scary; I find it really exciting, because it means literally that, you know, this is a consequence of working at the cutting edge. The cutting edge keeps moving. It's unpredictable, because there are so many talented people working on these issues just now that research outside DVPS and research inside DVPS will put into question some of our beliefs about which models work and which problems are important to work on. And so we're going to have to rewrite these plans, and, you know, who knows what is out there in 2029. One thing we're sure of is that the subject matter, multimodality, bears such richness for applications and, sort of, the greater good, that it's worth doing this effort and putting these things into question multiple times. So it's going to be a lot of hard work. But we know that there's so much happening at the intersection between modalities that it's really worth doing this very big effort.

[00:31:22] Speaker B: Sébastien, I don't want to put you on the spot, but after DeepSeek was launched, there was this conversation about the gap in advancement in Asia, specifically in China, as compared to Europe and North America. In that context, as an overarching conversation, what does this project tell us about that? Is the gap widening? Is this project hoping to close the gap? What can we think about in those terms?

[00:31:52] Speaker A: Oh, there's a lot to be said here. So this is a project of a certain kind, okay. It's a project that represents one of the flagship efforts of the European Commission in terms of collaborative research. So one of its important effects is also to have research centers from across Europe collaborate, to sort of force dialogues, including between, I would say, machine learning people, for instance, and, say, cardiologists, in ways that wouldn't happen otherwise.
So, okay, it's got this initial funding, which is the grant that we got. Then we hope to aggregate some more projects as we move on; we presumably will apply for, and be granted, some more money just for compute infrastructure. So this is going to happen as well, but we deal with the reality that, by the constraints of the project, we're working on a certain type of problem within a certain time frame. And many of us are there, you know, many universities, a couple of industry players. The example you cited with DeepSeek reminded many people that, for one, it's not just the Western world working on these problems; there are many other very clever and talented people, many people also who are good at putting the same problems into different perspectives. Essentially, one amazing thing that DeepSeek did is that they applied all the engineering tricks in the toolkit to make inference, and then inference during training, much, much faster than we ever dreamed of. They applied plenty of other tricks to make much smaller models work to levels of performance that the other, larger models only reached with difficulty. So the reminder, I wouldn't say it's a reminder of a geopolitical nature or something, it's really a reminder that, okay, you can go the big, bulky, monolithic way, but there's also a treasure trove of technical approaches, literally, which work just because they go lightweight, very fast, agile, at niche problems that, when you solve them, actually end up having reproducible solutions. Amazing ways of making progress that we tend to forget when we work in settings where we have a bit of blindness, of tunnel vision, about the way we try and solve our problems. So I think your point is very valid, and I think it serves as a very healthy reminder of all these things: not to forget that the way we pose our problems and the setting in which we try and solve them will determine the kind of solutions that we bring and the success we have eventually.

[00:34:57] Speaker B: Well, Sébastien, thank you so much for this conversation. We've come to a stop here on the bus, and we're going to go to the event. I wish you success for the next four years, and I hope we can get some updates on that schedule and that timeline soon.

[00:35:12] Speaker A: Thank you so much for the conversation, and yes, happy to give you updates sometime down the line.

[00:35:18] Speaker B: All right, bye.

[00:35:19] Speaker A: Bye.
