Sam Charrington: [00:00:00] Welcome to The TWIML AI Podcast. I’m your host, Sam Charrington. Before we jump into the interview, I’d like to take a moment to thank Microsoft for their support of the show and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer without requiring machine learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand and accelerate decision making into your apps.
Visit to learn how customers like Volkswagen, Uber and the BBC have used Azure Cognitive Services to embed services like realtime translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you’re there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure free account. That link again is And now, on to the show.

All right, everyone, I am here with Arul Menezes. Arul is a distinguished engineer at Microsoft. Arul, welcome to the TWIML AI podcast.

Arul Menezes: [00:01:43] Thank you, Sam. I’m delighted to be here.

Sam Charrington: [00:01:45] I’m really looking forward to our chat, which will focus on some of the work you’re doing in the machine translation space. To get us started, I’d love to have you introduce yourself and share a little bit about your background. How did you come to work in NLP and, and translation? And tell us a little bit about your story.

Arul Menezes: [00:02:03] Yeah, so I’ve actually been at Microsoft 30 years at this point.

Sam Charrington: [00:02:07] Wow.

Arul Menezes: [00:02:07] I, yeah, I know. God, it’s a long time. I was actually in a PhD program. I came here for the summer, loved it so much I never went back. So I worked at Microsoft in the various engineering teams for a while, and then eventually I drifted back into research and I joined the natural language processing team in Microsoft Research, and I started the machine translation project, and I’ve been doing that ever since, so I’ve been doing machine translation for, like, 20 years now, and it’s been, it’s been a great ride because it’s just a fascinating field. So many interesting challenges and we have made so much progress from when we started, you know, and we’ve gone through so many evolutions of technology. It’s been, it’s been a great ride, yeah.

Sam Charrington: [00:02:49] Yeah, there are some pretty famous examples of, you know, how the introduction of deep learning has changed machine translation. I’m assuming that your experience there i- is no different.

Arul Menezes: [00:03:04] Yeah.

Sam Charrington: [00:03:04] Can you share a little bit about how the, the evolution that you’ve seen over the years?

Arul Menezes: [00:03:08] Sure. Sure. I mean, historically, you know, machine translation is something people s- tried to do, you know, in the ’50s. It was one of the first things they wanted to do with computers, you know, along with simulating sort of nuclear sort of bombs. But for the longest time, it was very, very hard to make progress, so all the way through, I would say, the late ’90s, early 2000s, we were still in sort of rule based and knowledge sort of engineered approaches, but then the first real breakthrough that came in the late ’90s, well actually starting a little earlier in terms of some papers published at IBM, but really taking off in the late ’90s and early 2000s was statistical machine translation, where for the first time, you know, we were able to take advantage of, like, large amounts of previously translated data, right?
So you take documents and web pages and things that, that have previously been translated by people and you get these parallel texts, which is, let’s say, English and French, and you align documents and sentences, and then eventually words and phrases so you can learn these translations, and so with statistical machine translation, we were learning from data for the very first time, instead of having people hand code it.
And it worked, actually, astonishingly well compared to what we were doing before. But eventually, we ran into the limits of the technology, because while we had the data, we didn’t have the techniques to do a good job of learning what that data was telling us because, you know, the machine learning techniques that we had back then just weren’t good enough at… They were good at memorizing, right? If you said something exactly the way they had seen in the data, they would do a good job of translating it. But they were terrible at generalizing from what they saw in the data, and that’s where neural models come in.
Like, neural models are amazing at generalizing, you know. People always talk about how some of the latest models, you know, you can probe them to figure out what was in their training data and get them to reproduce what was in their training data. But what we forget is it takes work to actually make them do that, because most of the time, they’re generalizing. They’re paraphrasing. They’re not just replicating their training data, and that’s something we were not able to do before.
So if you look at the evolution over the last 20 years of machine translation, we had our statistical machine translation, which did really well for a while, but then eventually plateaued. Then, you know, we had sort of the advent of neural networks, and the first thing that people tried to do was, you know, we did feedforward neural networks. We tried to shoehorn them into the framework we already had and combine feedforward networks and statistical techniques, and that worked okay. You got a few incremental improvements. But it wasn’t until we had the sort of pure neural LSTM models that we, for the first time, were really capturing the power of neural models, right?
So what an LSTM model would do would be, you know, you have this encoder that you feed the source language sentence in, and it basically embeds the meaning of that entire sentence in the LSTM state. And then you feed that through a decoder that is now generating a fluent sentence, sort of based on this very abstracted embedded understanding of what the source language said. And so that’s very different from the way we were doing it, just sort of copying words and phrases that we’d memorized.
So that was the first revolution, and, and it gave us amazing results, actually, compared to what we were doing before. And then, of course, along, after that came transformers, which sort of take that whole encoder/decoder architecture, but take it to the next level. Instead of having the meaning of the entire source sentence be encoded into a single LSTM state, which may work well for short sentences but gets, you know, worse as you get to longer sentences. In a transformer, you know, we have the self attention that’s basically looking at every word in the source and every word in the target, and so you have, like, full context available to the model at any point in time.
So that’s where we stand today is, you know, transformers are the state of the art, but of course there’s lots of really cool, interesting variations and things we’re doing, which I think we’re going to talk about at some point.
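To make the contrast with a single LSTM state concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. The toy embeddings, the function names, and the omission of learned query/key/value projections are simplifying assumptions for illustration; this is not the production model.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each position attends to every position (itself included), so the
    output at each position is a weighted mix of the whole sentence.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Every row of `weights` covers all three positions, which is the "full context available at any point" property described above, in contrast to squeezing the sentence into one fixed state.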

Sam Charrington: [00:07:25] And, and when you talk about transformers being the state of the art, is that what is powering the current kind of production Azure machine translation service? Or is that the state of the art in research and, you know, there’s some combination of the various techniques you mentioned that is powering the live service?

Arul Menezes: [00:07:46] So the live service is very much powered by transformers. We have, you know, all 180 language pairs or something that we support powered by transformers running in production. Now, one thing we do do is that we take advantage of what’s called knowledge distillation, right, to take the knowledge that’s embedded in these very large transformers that we train offline and then condense that or distill that into smaller, still transformers, but smaller, shallower, and narrower models that we use in production, right?
So we typically go through multiple stages of these teacher models before we get to the student, so our pipeline’s actually fairly complex. We take the parallel data, which I mentioned earlier, which is sort of the lifeblood of machine translation. This is the previously translated human text. And we train, like, a first teacher based on that data. Then we typically do what’s called back translation, which is a technique in machine translation to take advantage of monolingual data, so data that’s not parallel, so it’s not translated source and target. It’s just in one language, typically the target language. And what we do there is we want to take advantage of this monolingual data to teach the model more about the syntax and the, you know, semantics of the target language so it gets more fluent.
And the way we incorporate that data into a machine translation model is through something called back translation, where we take the, the target language data, we translate it back to the source using one of our models, and then we use it to train the model in the other direction. So this is a little complicated. So basically, if you’re training an English to French model…

Sam Charrington: [00:09:28] Mm-hmm [affirmative].

Arul Menezes: [00:09:29] … in addition to the parallel English-French data, you also take some French monolingual data, you translate it back to English using your other direction translation system, the French to English system…

Sam Charrington: [00:09:41] Mm-hmm [affirmative].

Arul Menezes: [00:09:41] … and then you put that synthetic data back into training your English-French system.
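The back translation loop described here can be sketched in a few lines. The dictionary-based `toy_fr_to_en` function is a hypothetical stand-in for a real French-to-English system, and the sentence pairs are made up for illustration.

```python
def back_translate(target_monolingual, reverse_translate):
    """Build synthetic (source, target) pairs from target-side text.

    reverse_translate is any target-to-source translator, for example a
    French-to-English system when training English-to-French.
    """
    return [(reverse_translate(t), t) for t in target_monolingual]

# Hypothetical stand-in for a real French-to-English model.
TOY_FR_EN = {"bonjour": "hello", "merci": "thank you"}

def toy_fr_to_en(sentence):
    return TOY_FR_EN.get(sentence, sentence)

parallel = [("good evening", "bonsoir")]  # human-translated pairs
french_only = ["bonjour", "merci"]        # monolingual target-language text
synthetic = back_translate(french_only, toy_fr_to_en)
training_data = parallel + synthetic      # used to train English-to-French
```

Note that the target side of every synthetic pair is genuine French text; only the source side is machine generated, which is why the technique helps target-language fluency.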

Sam Charrington: [00:09:45] Okay.

Arul Menezes: [00:09:45] So, so that’s-

Sam Charrington: [00:09:47] [crosstalk 00:09:49] essentially a, a data augmentation technique?

Arul Menezes: [00:09:51] It is, yeah, it’s a data augmentation technique, and it works, like, incredibly well, actually. Adds several points to our metric. The metric we use in machine translation is called a BLEU score. I mean, there are other metrics and I, I mean, I could talk about that at some point if we want to get into it, but, you know, we get several points of BLEU score out of the back translation.
And then, so that’s our final sort of teacher model, which is typically huge, and then what we do is we use that model to teach the student model. And the way we do that is essentially we run, like, a huge amount of text through this teacher model, and then we take the data generated by the teacher and we train the student on it. And the reason that works is because unlike sort of natural data that we train the teacher on, which can be confusing, contradictory, diverse, the, the data generated by the teacher is very uniform and it’s very standardized, and so you can use a much simpler student model to learn all of that knowledge from the teacher, because it’s a simpler learning problem.
And having done that, that model runs, like, super fast and we can h- host in production and translate, like, trillions of words, you know, so yeah.
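A rough sketch of the teacher-to-student step: the teacher labels a large pool of source text, and the student is trained on those outputs rather than on the original human translations. The `toy_teacher` function and the example sentences are hypothetical; a real teacher would be the large transformer described above.

```python
def distill_dataset(source_sentences, teacher_translate):
    """Label unlabeled source text with the teacher's own translations."""
    return [(s, teacher_translate(s)) for s in source_sentences]

# Hypothetical teacher: always picks one consistent rendering.
def toy_teacher(sentence):
    return {"hi": "bonjour", "hello": "bonjour"}.get(sentence, sentence)

# Human data may render the same input several different ways ...
human_pairs = [("hello", "bonjour"), ("hello", "salut"), ("hello", "allô")]
# ... while the teacher's output is uniform, which is easier to learn.
student_pairs = distill_dataset(["hello", "hi", "hello"], toy_teacher)

distinct_human = {t for _, t in human_pairs}
distinct_teacher = {t for _, t in student_pairs}
```

In this toy case the human data contains three competing targets for one input while the teacher data contains one, which illustrates why a smaller student can fit the teacher's output more easily.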

Sam Charrington: [00:10:59] And so the, the student teacher part of the process is kind of interesting to explore a little bit further. Are you essentially trying to do something… the task, or goal, that you’re trying to achieve with that is model compression?

Arul Menezes: [00:11:13] Right.

Sam Charrington: [00:11:14] Very different approach to it than, like, pruning or…

Arul Menezes: [00:11:18] Right.

Sam Charrington: [00:11:18] … you know, some of the other ways you might approach compression.

Arul Menezes: [00:11:20] Yeah, right. So we do, like, we do a lot of different things for model compression, right? So one of the things we do is we, we do quantization, for example, with all our models in eight bits. We’ve experimented with less than eight bits. It’s not quite as effective, but, you know, we, we do that. We do some other, like, pruning techniques as well, but the biggest one is the knowledge distillation, and what you’re trying to do there is get a smaller model to basically mimic the behavior of the big teacher model, just running a lot cheaper.
And by combining all the techniques, we published a paper last year on this at a workshop, and from our big teacher with all of the knowledge distillation, the compression, the quantization and, and so on, we’re running something like 250 times faster…

Sam Charrington: [00:12:06] Wow.

Arul Menezes: [00:12:06] … on the student than the teacher with, I mean, there is a small loss in quality, right? But we lose maybe half a BLEU point, not too much, and in some cases not even any. We can, like, actually maintain the quality as is, so…
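For readers unfamiliar with the eight-bit quantization mentioned a moment ago, here is a minimal sketch of one common scheme, symmetric per-tensor quantization. The conversation does not say which scheme the team uses, so the scheme, the function names, and the sample weights are all assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.50, -1.27, 0.03, 1.00]   # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is at most half of that scale, which is one way to see why quality loss can stay small.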

Sam Charrington: [00:12:20] The, my next question for you, it, the way you describe the process, and in particular the idea that the teacher is outputting more consistent examples than what is found in the training data…

Arul Menezes: [00:12:35] Mm-hmm [affirmative], right.

Sam Charrington: [00:12:35] My next question was, or the intuition that I had was that that would cause the student to be far less effective at generalizing and would make it perform worse, but it sounds like that’s not the case in practice.

Arul Menezes: [00:12:51] So the key to that is to make sure that the data that you feed through the teacher to teach the student is diverse enough to cover all the situations that you may encounter, right? So the students are a little weird, I mean, and I think you’re sort of hinting at that. We do, for example, overfit the student to the training data, which is something that you typically wouldn’t do in your teacher model, because you, in fact, are trying to make the student match the teacher as much as possible.

Sam Charrington: [00:13:18] Mm-hmm [affirmative].

Arul Menezes: [00:13:18] So some of the things that you do to make the, to the teachers better at generalization, you don’t do in the student. And in fact, if you look at the student distributions, they’re much sharper than the teacher distributions, because they have overfit to the data that they’ve seen. But, you know, there’s a little evidence that you could get into some corner cases that are brittle, like you know there’s this problem of neural hallucination that all of the neural models are subject to where, you know, occasionally they’ll just output something that is completely off the wall, unrelated to anything that they’ve seen. And there’s some evidence that there’s a little bit of amplification of that going on. Like if it’s… You know, the teachers are also subject to hallucination, but maybe at a very, very low frequency, and that maybe that’s being amplified a little bit in the student.
So w- we’re, you know, we’re working on managing that, but yeah, so there’s, there’s, you know, it’s a trade-off. Like, the students have lower capacity, but that’s what enables us to run them, and we, you know, we run them on CPU. We don’t, we, we don’t use GPUs in production inference. We use, of course, all models are trained and, and all the knowledge [inaudible 00:14:26] are done on GPUs, but in, but in production we’re just using CPUs.

Sam Charrington: [00:14:26] And is, is that primarily based on the cost benefit analysis, or is it based on a latency envelope that you have to work with and not needing, not wanting to kind of batch…

Arul Menezes: [00:14:38] Right.

Sam Charrington: [00:14:38] … inference requests?

Arul Menezes: [00:14:39] Yeah, that, that’s exactly right. I mean, you know, latency is a big concern. Our API’s a real- real-time API, and so, you know, latency is the biggest driving factor. And honestly, if you do inference on GPUs, you know, you get some latency benefit, but the big benefit is on large batches. And so unless you have a matched batch translation API, you can’t really take advantage of the full capacity of your, of your GPU, so, you know, in a real-time API.

Sam Charrington: [00:15:05] Mm-hmm [affirmative]. And are both the teacher and the student models transformers for you?

Arul Menezes: [00:15:12] Yeah, they are. They are. Yeah, the, the students are transformer large or a little bit larger, and then the s- sorry, that’s the teachers, and then the students, they’re very highly optimized transformer. I mean, they, we start with transformer base, but then we do a lot of really strange stuff. I would refer you to the paper, actually. [laughs]

Sam Charrington: [00:15:30] Okay. When you were describing the data augmentation technique that you use…

Arul Menezes: [00:15:36] Right.

Sam Charrington: [00:15:37] … it kind of called to mind ideas about incorporating a GAN type of approach where you’re doing the pass-back translation and then, you know, maybe there’s some GAN that is trying to…

Arul Menezes: [00:15:47] Right.

Sam Charrington: [00:15:47] … figure out if the results going backwards…

Arul Menezes: [00:15:50] Right.

Sam Charrington: [00:15:51] … is like a human translation. Is there a role for that kind of technique? Is that something that comes up in the research?

Arul Menezes: [00:15:57] Yeah, so we’ve, we’ve looked at GANs. There were some exciting results, but in the end, I mean, I think we have some okay research results. We haven’t seen much benefit, but more broadly, in terms of data augmentation, we’re using it all over the place, right? So we have the back translation, but there are a lot of phenomena that we want to address in machine translation that are maybe not well represented in the data, and so we use data augmentation pretty heavily to cover those cases, right?
To give you a simple example, when you translate a sentence and you get a particular translation and then you go in and let’s say you remove the period at the end of the sentence, sometimes it changes the translation entirely. They may both be perfectly good translations, right? But they’re different. So one way to look at it is, well, they’re good, both good translations, but people don’t like that. So if you look at our customers, and we’re very sensitive to what our users, the feedback we get from our users. So one of the feedback we got was that, you know, we want a little more stability in our translation. So, you know, just because I lost a period at the end of the sentence, I shouldn’t get a drastically different translation. And so, you know, it’s very easy to augment the data and say, well, you know, stochastically I’m going to, like, delete the period on my sentences, and so then the model learns to basically be robust whether there’s a period or not.
Now, of course, you know, that’s different than a question mark. You definitely want to leave the question mark in because that changes the meaning of the whole…

Sam Charrington: [00:17:22] Mm-hmm [affirmative].

Arul Menezes: [00:17:23] … sentence. But, you know, things like that, punctuation, the period, commas, things like that. Maybe, you know, capitalization, for example. One of the other examples would be like an all caps sentence. You know, you take the whole sentence and you change it to all caps. Well, you get a totally different translation, right? So we, again, generate some synthetic all caps data so that the model learns to do a good job of translating that as well.
And then there’s, you know, there’s all these, like, I, I would call them, you know, long-tail phenomena, and, you know, we feel that data augmentation’s a good way to address some of these, yeah.
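The period-dropping and all-caps augmentations can be sketched as a small stochastic transform over training pairs. The probabilities, the function name, and the choice to apply the same change to both the source and the target side are assumptions made for illustration; the conversation does not specify those details.

```python
import random

def augment(source, target, rng, p_drop_period=0.5, p_upper=0.1):
    """Return a perturbed copy of a (source, target) training pair.

    The probabilities are hypothetical. Only a trailing period is dropped;
    question marks are left alone because they change the meaning.
    """
    if source.endswith(".") and rng.random() < p_drop_period:
        source = source[:-1]
        if target.endswith("."):
            target = target[:-1]
    if rng.random() < p_upper:
        source, target = source.upper(), target.upper()
    return source, target

rng = random.Random(0)
pair = ("The meeting is at noon.", "La réunion est à midi.")
augmented = [augment(*pair, rng) for _ in range(5)]
question = augment("Is it at noon?", "Est-ce à midi ?", rng, p_upper=0.0)
```

Mixing the perturbed pairs with the original data exposes the model to both variants, so it can learn for itself when the difference matters.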

Sam Charrington: [00:17:53] Your examples are really interesting to me because I’m refer- I’m comparing them to, like, your textbook NLP types of examples where the first thing you’re doing is making everything lowercase and getting rid of all of your punctuation.

Arul Menezes: [00:18:05] Yeah.

Sam Charrington: [00:18:05] Sounds like that does not work for translation.

Arul Menezes: [00:18:08] No, because there’s a lot of information in casing and punctuation, right? Like, I mean, if you want to handle names, for example, you need to pay attention to the case of the input. Like, everything in the input has information, and so actually even the punctuation, right? Like, sometimes if you take the period off the end of the sentence, it should change things because it may be a noun phrase rather than an actual sentence, right? So it’s not so much we’re preprocessing the data and trying to be clever. It’s about exposing the model to different variations so that the model can figure things out for itself.

Sam Charrington: [00:18:41] Mm-hmm [affirmative]. One of the questions this prompts is, like, the unit of, you know, work or the unit of thing that you’re trying to translate.

Arul Menezes: [00:18:50] Right.

Sam Charrington: [00:18:50] You know, translating a word being different from translating a sentence…

Arul Menezes: [00:18:54] Sure.

Sam Charrington: [00:18:54] … being different from translating a, an entire document.

Arul Menezes: [00:18:57] Right.

Sam Charrington: [00:18:57] Sounds like most of what we’ve been talking about is kind of phrase by phrase now relative to the word by word that, you know, we were doing 20 years ago. Are you also looking at the entire document? Are you able to get information from a broader context to impact the translations?

Arul Menezes: [00:19:18] Yeah, so that’s, that’s a very good question, Sam. Yeah, so the context matters a lot, right? So one of the reasons why neural models are so great at translating now is because they, they are looking at the whole sentence context and they’re translating the entire conte- the sentence and every w- they, they’re basically sort of figuring out the meaning of every word and phrase in the context of the whole sentence, which is something we couldn’t do with statistical machine translation before.
So now the next step is to expand that context to beyond the sentence, right? So there are a lot of phenomena that it’s impossible to translate well without context beyond the sentence. Like, in many languages, unless you have document-level context or paragraph-level context, you can’t generate the right pronouns because you don’t actually know. The sentence doesn’t have enough clues to let you know what is the gender of the subject or the object or the person you’re talking about in that sentence. Beyond just the pronouns, it’s also, like, you know, the senses of words and, you know, disambiguating those. So, we’re, we’re actually moving towards translating at the whole document level context, or at least, you know, very large, multi-sentence fragments. And then there, we’ll be able to use, you know, the, the context of the entire document to translate each individual sentence.
And we actually have some really great research results based on translating at the document level. Yeah, so we’re pretty excited about that. That model is not in production yet.

Sam Charrington: [00:20:48] Mm-hmm [affirmative].

Arul Menezes: [00:20:48] But it’s something that we’re working on. We did ship a document-level API. I think it’s in public preview right now. Which addresses sort of the other half of the problem, which is, you know, people have documents. They’ve got formatting. You know, it’s in PDF. It’s in Word. It’s in PowerPoint, whatever, and HTML, and it’s a hassle getting all the text out of that…

Sam Charrington: [00:21:11] Yeah.

Arul Menezes: [00:21:11] … getting it translated, and then worse still trying to reassemble the document and reconstruct the formatting of that document on the translated thing. So we’ve made that easy. We just shipped this API. Just give us your PDF. We’ll tear it apart, we’ll do the translation, we’ll put it back together, and we’ll preserve the format. And you know, especially for PDF, that is actually really hard. Doing the format preservation is tricky. But we’re pretty excited about that API.
And so, then, that’s the place where our document level neural model would fit right in, right? Because now we have, the user’s giving us the whole document. We can not only handle all the stuff about the formatting and all that. We can go one better. We can actually use the whole document context to give you better quality translations.

Sam Charrington: [00:21:53] Mm-hmm [affirmative]. Can you give us an overview of some of the techniques that go into looking at the entire document when building the, the model?

Arul Menezes: [00:22:03] Yeah, so there’s, I mean, right now, as I said, we haven’t actually shipped this, so we’re looking at a, a bunch of variations. You know, there’s several things that people have looked at, like mo- you know, there are hierarchical models where you do the, you run transformers at the sentence level, and then you run a second level to sort of, like, collect the sentence level information into, like, a document level context vector, and then you feed that back into translating each sentence.
We’re finding that actually, if you just make it super simple and treat the whole thing as, as if it were a giant sentence, in effect, you get really good results. You do have to deal with the performance issues, right, because transformers are n-squared in the size of the input and the output, so instead of, you know, handling, you know, a 25-word sentence, if we’re now translating a thousand-word para- you know, document or paragraph, then the, you know, you’ve got, like, an n-squared problem in terms of the performance, right? It’s going to be that much more expensive, but we have, we have things that we’re looking at to make that faster as well, so we’re pretty optimistic we can do that, and I think we can do that with just letting the transformer figure it out for itself rather than trying to be very clever about all this hierarchical stuff.
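A quick back-of-the-envelope calculation shows the size of the n-squared effect for the 25-word and thousand-word figures mentioned above. It assumes one token per word and counts only attention comparisons, ignoring the parts of a transformer that scale linearly, so it is a rough illustration rather than a measured cost.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in one full self-attention pass."""
    return num_tokens * num_tokens

sentence_tokens = 25      # sentence length mentioned in the conversation
document_tokens = 1000    # document length mentioned in the conversation

# Attending over the whole document at once.
doc_at_once = attention_pairs(document_tokens)

# Translating the same text as independent 25-token sentences.
sentences_needed = document_tokens // sentence_tokens
sentence_by_sentence = sentences_needed * attention_pairs(sentence_tokens)

ratio = doc_at_once / sentence_by_sentence
print(doc_at_once, sentence_by_sentence, ratio)  # 1000000 25000 40.0
```

Under these assumptions, whole-document attention does about 40 times more comparisons than translating the same text sentence by sentence.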

Sam Charrington: [00:23:10] Nice. Nice. Let’s talk a little bit about the role of different languages. You know, we, we’ve already talked about how you can use back translation to help augment the performance of your translation of a language in, in one direction or the translation between a couple of language pairs.

Arul Menezes: [00:23:27] Right.

Sam Charrington: [00:23:28] Are there ways to take advantage of the other 130 or so languages…

Arul Menezes: [00:23:33] Sure.

Sam Charrington: [00:23:33] … that y- that you support when you’re building the n-plus-1th model for a given language?

Arul Menezes: [00:23:38] Absolutely. Absolutely. That’s been one of the most exciting things, I would say, that came out of sort of transformers and, and neural models in general is the ability to do this sort of transfer learning between languages, right? And the reason we can do that is because transformers or neural models in general are representing the meanings of words and sentences and phrases as embeddings in, you know, the space, and by training on multiple languages together, you can actually get the representations of these languages to merge and have the similar concepts be represented through related points in spa- in that space, right?
So, as a practical matter, we’ve basically found that if we group languages by family, right, and take, so for example we took all our Indian languages and we put them together and we trained one joint model a- across all of the languages and now we’re talking, you know, languages where you have a very different amount of data. You have Hindi, where we have quite a lot of data, and then we have, like, Assamese, which was, I think, the last one that we shipped, that has probably, like, two orders of magnitude less data. And the, the wonderful thing is that by training them jointly, the Assamese model learns from the huge amount of data that we have for Hindi and does, like, dramatically better than if we had just trained on Assamese by itself.
In fact, we have done those experiments and, you know, for the smaller languages, we can get, like, five, 10 BLEU points, which is, like, a crazy level of improvement just from the transfer learning and multilingual.
We also do that with, like, Arabic, all of our Middle Eastern languages. So we’re just, like, grouping more and more language families together and getting huge benefits out of this.

Sam Charrington: [00:25:31] And when you’re grouping the, the language families, have you ever, do you experiment with going across language families and seeing if there’s some improvement…

Arul Menezes: [00:25:41] Yeah.

Sam Charrington: [00:25:42] … improvement there?

Arul Menezes: [00:25:43] Yeah, so we, you know, we’ve trained models that are, like, 50 or 100 languages in them. What you run into is, you know, as you add languages, you have to increase the size of your vocabulary to accommodate all of these languages, and you have to increase the size of the model, because at some point you run into model capacity a little bit. So you can have a model that does a ni- a really nice job of learning from 50 or 100 languages, but it gets to be a really huge model, and so in terms of cost effectiveness, we’ve found that, like, you get, like, almost all of the benefit of the transfer learning at, like, a much reduced cost by just grouping 10, 15 languages at a time. And if they’re related, it’s better. But actually, even if they’re unrelated, it still works. [laughs] It’s quite amazing how well it works even if the languages are not related, yeah.

Sam Charrington: [00:26:32] We may think of it as, like, a computational test of Chomsky’s universal grammar and, you know, these ideas that suggest that all languages have these common elements.

Arul Menezes: [00:26:41] Yeah.

Sam Charrington: [00:26:42] I- if you are able to train these models across languages and im- improve them, that would seem to support those kinds of theories.

Arul Menezes: [00:26:48] I mean, definitely the models do a really good job of bringing related concepts together in the, in the embedding space, right?

Sam Charrington: [00:26:56] Would you consider this, you, you referenced this as, like, multilingual transfer learning. Would you also think of it as a type of multitask learning as well, or is, is that not technically what you’re doing in this task?

Arul Menezes: [00:27:09] So we’re also doing, in addition to multilingual just machine translation, we’re also doing multilingual multitask learning, and what we’re doing there is we are combining the sort of… so there’s… let me back up a bit. There’s been this whole line of research based on models like BERT, right? Pretrained language models where, if you look at BERT, it’s actually the encoder half of a machine translation model, but it’s trained on monolingual data. It’s trained on a, on a single language data on this objective that’s a reconstruction objective where, you know, you’re given a, a, a sentence where you have a couple of words or phrases blanked out. You need to predict that, right?
And then you have multilingual BERT where you take multiple separate monolingual corpora, right, so it’s like a bunch of English text, a bunch of French text, and all of it, and you train them jointly in the same model. And it does a pretty good job of actually pulling the representations of those things together. So that’s one line of research that’s sort of really driven a revolution in, like, the whole natural language understanding field, right? So for example, today if you want to train a named entity tagger, you wouldn’t start from scratch on your ta- on your named entity data. You would start with a pretrained model.
So one of the things that we’re very excited about is we have this project that we call [ZICOR 00:28:43] where we’re bringing the machine translation work and this sort of pretrained language model, BERT-style work together, right? And we train, we’re training this multitask, multilingual model that’s, architecturally, it’s just a machine translation model, right? But in addition to training it on the parallel data, let’s say the English-French data and the English-German data and conversely the German-English data and the French-English data and, you know, 10 or 15 or 50 or 100 other languages. In addition, we have a separate task where we have the BERT tasks, where we take monolingual data and we, we have it reconstruct, you know, the, the missing words. And we also have what’s called a denoising autoencoder task, which is where you give it a scrambled sentence, and then it has to output the unscrambled sentence through the decoder.
And then now you have these three tasks, and we train them in rotation on the same model, so they’re sharing parameters. So the model has to figure out how to use the same parameters to do a good job of the BERT task, to do a good job of the denoising autoencoder task, as well as to do a good job of the machine translation task. And this, we find, leads to, like, much better representations that work for better natural language understanding quality, but also better machine translation quality.
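
As a rough illustration of the three tasks described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`mask_tokens`, `scramble_tokens`, `task_rotation`), the `[MASK]` symbol, and the masking rate are hypothetical and are not taken from the project discussed in the interview. It only shows how training examples for each task could be built and how training steps could alternate between tasks that update one shared model.

```python
import itertools
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a blanked-out word


def mask_tokens(tokens, rate, rng):
    """Masked-language-model example: blank out some tokens; the model must predict them."""
    n_mask = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return inputs, targets


def scramble_tokens(tokens, rng):
    """Denoising-autoencoder example: shuffled input, original sentence as the target."""
    noisy = list(tokens)
    rng.shuffle(noisy)
    return noisy, list(tokens)


def task_rotation(task_names, n_steps):
    """Return task names in round-robin order, one per training step."""
    return list(itertools.islice(itertools.cycle(task_names), n_steps))


rng = random.Random(0)
sentence = "the cat sat on the mat".split()

masked_input, masked_targets = mask_tokens(sentence, rate=0.3, rng=rng)
noisy_input, clean_target = scramble_tokens(sentence, rng)
schedule = task_rotation(["translation", "masked_lm", "denoising"], n_steps=6)

print(masked_input, masked_targets)
print(noisy_input, "->", clean_target)
print(schedule)
```

In a real system each step in `schedule` would draw a batch for that task and apply a gradient update to the same parameters; the sketch leaves the model itself out.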

Sam Charrington: [00:29:43] Nice. And the, the BERT task in this example is within the same language, as opposed to…

Arul Menezes: [00:29:49] Right.

Sam Charrington: [00:29:49] … across the target, to the target language?

Arul Menezes: [00:29:53] Yeah, there’s actually, like, a whole family of tasks, right? I mean, people have come up with, I mean, we’ve experimented with, like, 20, 25 tasks. Like, so you can do a monolingual masked language model task, which is the BERT task, but you can do a cross-lingual masked language model task as well, and you can do the denoising autoencoder task monolingually, where you have to reconstruct the same language, but you can also do that cross-lingually where you have to reconstruct sort of a scrambled foreign language task, so there’s, like, a real, like, sort of stone soup approach where people are just throwing in all kinds of tasks, and they all help a little bit.
But we need to figure out, like, what’s the minimal set that you need? Because, you know, it’s work. It’s computational expense to train these huge models on all these tasks, so if we can find the minimal set that works, that would be ideal. And so far, what we’re working with is, like, a denoising autoencoder, a masked language model, and a machine translation task.

Sam Charrington: [00:30:49] Very, very cool. Very cool. So I think one of the things that, you know, users of these kinds of machine translation services often experience is that, you know, they work great in the general case, but when you start to try to apply them to specific domains, it’s a lot more challenging, you know, kind of technical conversations or translating, you know, medical conversations or, you know, construction or what have you. Is there anything that you’re doing to make the domain-specific performance better for these kinds of systems?

Arul Menezes: [00:31:26] Yeah, definitely. You know, performance in specialized domains is a real challenge. We’re doing several things to get better there, right? So the first thing is that the quality is really determined by the availability of data, right? So in the domains like, let’s say news or web pages where we have a ton of data, you know, we’re doing really, really well. And then if you go into a more specialized domain like, let’s say, medical or legal where we don’t have as much data, we’re maybe not doing quite as well. And so one of the things we’re doing is we’re now taking the same neural models that are good at translation and we’re using them to identify parallel data in these domains that we can find on the web that we maybe weren’t finding before, and we can do that because these models, you know, because the representations are shared in the multilingual models, they are actually very good at identifying potential training data, documents that are translations of each other. So that’s one thing we’re doing.
The other thing we’re doing, of course, is the same kind of transfer learning approach that we’re using cross-lingually applies within domains, as well, right? So if you have a small amount of medical domain data, you don’t want to, like, just train a model that’s based just on that, you know, small data. What we’re doing instead is we’re taking, you know, our huge model that’s trained on a ton of, like, general data across a bunch of domains, and then you fine-tune it for the specific domains that you’re interested in. And we actually have a product called Custom Translator that we have, like, you know, thousands of customers using, where they are using this approach to customize the machine translation to their company or their application needs, right?
So let’s say you’re a car company or something and you have a bunch of data that’s about, like, automotive manuals, right? So you come to our website, you log in, you create your account, etc., you upload this data, and then what we do is we take your small amount of domain-specific data, we take our large model, and then we fine-tune it to that data, and now you have a model that does, like, you know, sometimes dramatically, again, 10, 15, 20 BLEU points better than the baseline because, you know, we’ve learned the vocabulary and the specifics of your domain, but we’re still leveraging, we’re standing on this platform of, like, the broad general domain quality.
So that’s been extremely popular and valuable, actually. We just shipped a new version of that based on transformers a couple of months ago.

Sam Charrington: [00:33:45] And in that case, the user is presumably bringing translated documents so that you’re able to train or fine-tune with both source and target translations?

Arul Menezes: [00:33:56] Yeah, that’s exactly right. I mean, a lot of the companies that we work with have some data, right? Like, let’s say they had a previous version of their vehicle or, you know, whatever, and they had manuals that were translated. In Microsoft’s case, for example, you know, we have, let’s say the manuals for Microsoft Word going back, you know, a couple of decades, and this is the kind of data you can use to customize it so that any new content that you want to translate can have, like, a very consistent, like, vocabulary and tone and so on, yeah.

Sam Charrington: [00:34:28] Mm-hmm [affirmative]. And then in that first example or the first technique that you mentioned, that sounds really interesting. So you’ve got this index of the web and Bing…

Arul Menezes: [00:34:37] Right.

Sam Charrington: [00:34:37] … you know, for example, or maybe you have a separate one, but you have this mechanism to kind of crawl the web and…

Arul Menezes: [00:34:43] Right.

Sam Charrington: [00:34:44] It sounds like the idea is that you can use the model to identify, hey, I’ve got these two documents.

Arul Menezes: [00:34:52] Right.

Sam Charrington: [00:34:52] They look really similar, but there’s a, a high percentage of words that I don’t know…

Arul Menezes: [00:34:57] Yeah, right.

Sam Charrington: [00:34:57] … that occupy similar positions in the same documents.

Arul Menezes: [00:35:00] Yeah.

Sam Charrington: [00:35:00] And then you have someone translate- oh, well actually, then once you know that, you can just align them, so to speak, and you’ve got more domain-specific documents to add to your training set? Is that the general idea?

Arul Menezes: [00:35:13] Yeah. I mean, it, it’s like you’re trying to find two very similar-looking needles in a very, very, very large haystack, right? [laughs]

Sam Charrington: [00:35:20] Mm-hmm [affirmative].

Arul Menezes: [00:35:21] And so, so you have to have a magnet that finds exactly those two needles and rejects everything else. So the, the cross-lingual embedding space is pretty key here, right? So you’re basically, in principle, if you embedded every single sentence or document on the web and then were able to look at every single document and find all of its very similarly close embeddings, you’d be done. But, you know, that’s [laughs] that’s,

Sam Charrington: [00:35:47] Easier said than done?

Arul Menezes: [00:35:48] Easier said than done, right? So that’s the kind of thing that we’re trying to do at scale, right, is, like, you’ve got these, you know, trillions of documents and, you know, we want to find the matching one, so you need to do it efficiently, and so there’s a lot of, like, clever engineering that goes into, like, indexing this stuff and, like, computing the embeddings efficiently. And, of course, also, you know, we’re not really trying to match every page in the web to every other page in the web, because you have, you know, a lot of clues that say, you know, if I have a document here, you know, is it likely I’d have a translated document somewhere? It’s going to be either in the same, like, top-level domain or related sites, things like that. So there are ways to constrain that search.
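
The idea of matching documents by their cross-lingual embeddings while constraining the search can be sketched in a few lines of Python. This is a toy under stated assumptions, not the system described in the interview: the three-dimensional vectors, the URLs, the similarity threshold, and the helper names (`cosine`, `same_site`, `mine_pairs`) are all made up, and "same site" is simplified to an exact host match.

```python
import math
from urllib.parse import urlparse


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_site(url_a, url_b):
    """Cheap constraint (an assumption): only compare pages on the same host."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc


def mine_pairs(source_docs, target_docs, threshold):
    """Return (source_url, target_url, score) for each best match above the threshold."""
    pairs = []
    for s_url, s_vec in source_docs:
        candidates = [
            (cosine(s_vec, t_vec), t_url)
            for t_url, t_vec in target_docs
            if same_site(s_url, t_url)
        ]
        if candidates:
            score, t_url = max(candidates)
            if score >= threshold:
                pairs.append((s_url, t_url, score))
    return pairs


# Made-up embeddings standing in for the output of a multilingual encoder.
english = [
    ("https://example.com/en/manual", [0.90, 0.10, 0.00]),
    ("https://example.org/en/news", [0.00, 1.00, 0.00]),
]
french = [
    ("https://example.com/fr/manuel", [0.88, 0.12, 0.02]),
    ("https://example.com/fr/contact", [0.00, 0.20, 0.98]),
    ("https://other.net/fr/actualites", [0.00, 0.99, 0.05]),
]

pairs = mine_pairs(english, french, threshold=0.95)
print(pairs)
```

The host constraint cuts the number of comparisons sharply, at the cost of missing translations hosted elsewhere (the `other.net` page here is never compared), which is the kind of trade-off a production system would have to tune.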

Sam Charrington: [00:36:27] Mm-hmm [affirmative]. Our conversation thus far has focused primarily on text translation. Are you also involved in voice translation?

Arul Menezes: [00:36:38] Yeah, so we actually have been doing speech translation for a while. Several years ago we shipped a feature for speech translation in Skype called Skype Translator. It was, you know, really well received, super exciting. A lot of people use it even today, right? Especially, you know, people talking to their relatives in another country, and, you know, there’s a lot of interesting challenges in speech translation because it’s not that you just take the output of a speech recognition system and then just pass it to machine translation, right? There’s a real mismatch in what comes out of speech recognition and what is needed to do a good job of translation, because of course translation’s expecting, like, you know, well-formatted text, capitalization, punctuation, sentence breaks, things like that.
So we put a lot of effort into bridging that gap, you know, post-processing the output of speech recognition so that we have, you know, really accurate sentence boundaries. So that matters a lot. I mean, you break the sentence in the middle and you try to translate… Like, if you break a sentence in the middle, the speech recognition there is okay, because as a human reading it, you know, there’s a period in there. You just ignore it and move on. But the machine doesn’t know that, and so when you’re trying to translate, you’ve got these two separate sentences and then it does a terrible job of it. So getting the sentence breaks right, getting punctuation right and so on is really important, and so that’s what we’ve been doing.
We actually have a project going on now with the European Parliament where they are going to be using our technology. Well, there’s three contestants or three bidders in this project, and so there’s an evaluation that will happen in a few months, but we’re hoping that they’ll adopt our technology for live transcription and translation of the European Parliament sessions in all 24 languages of the European Parliament, which is super exciting.

Sam Charrington: [00:38:26] Oh, wow. Wow.

Arul Menezes: [00:38:27] Yeah.

Sam Charrington: [00:38:28] So when you think about kind of where we are with, you know, transformers and some of the innovations that we’ve talked about and, you know, relative to your 20, 30 years in the space, I’m curious what you’re most excited about and where you see it going.

Arul Menezes: [00:38:44] Yeah, I mean, the pace of innovation has just been amazing. There’s so many things that are happening that, like, you know, would be a really dramatic impact, right? So one is just much larger models, right? As we scale up the model, we see continual improvements. And so as the hardware and the, you know, our ability to serve up these larger models keeps growing, the quality will also keep growing, right?
The architecture of these large models also matters, right? Like, it’s not just a matter of taking the smaller model and scaling it up exactly as is, so there are things like mixture of experts models that, for example, allow you to scale the number of parameters without the cost scaling as linearly, right, because you have parts of the model that specialize in different parts of the problem.
And then, you know, multilingual is definitely the future. Pretrained models is definitely the future, right? So, like, if you put that all together, like pretrained, multilingual, multitask trained, maybe with mixture of experts, huge models, and then we would specialize them for individual language pairs or groups of languages and then distill them down to something we can ship. So that’s one area where there’s a lot of innovation happening.
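
The mixture of experts idea mentioned above can be illustrated with a small Python sketch. It is a toy under stated assumptions: the gating weights, the four "experts" (each just scales its input), and the names `moe_layer` and `make_expert` are hypothetical, and real systems use learned neural sub-networks rather than these stand-ins. The point it shows is that the layer holds several experts but only runs `top_k` of them for any given input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_expert(scale):
    """Stand-in expert: multiplies every input value by a fixed scale."""
    return lambda x: [scale * value for value in x]


def moe_layer(x, gate_weights, experts, top_k=1):
    """Route x to the top_k highest-scoring experts and mix their outputs."""
    probs = softmax([dot(w, x) for w in gate_weights])
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:  # only the selected experts are evaluated
        expert_out = experts[i](x)
        for j, value in enumerate(expert_out):
            output[j] += (probs[i] / norm) * value
    return output, chosen


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out_a, chosen_a = moe_layer([2.0, 0.5], gate_weights, experts, top_k=1)
out_b, chosen_b = moe_layer([0.1, 3.0], gate_weights, experts, top_k=1)
print(chosen_a, out_a)
print(chosen_b, out_b)
```

Adding more experts to the list grows the total parameter count, while the work done per input stays tied to `top_k`, which is why cost need not grow in step with model size.
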
The other thing is that, you know, 10 years ago, people were just amazed that translation worked at all, right?

Sam Charrington: [00:40:07] [laughs]

Arul Menezes: [00:40:08] And now we’re doing a really good job and expectations have risen, so you get to the point where a lot of sort of smaller, let’s call them long-tail problems start to matter a lot, right? So if you look at translation of names, we probably get them 99% right, right? But a few years ago it would have been fine to say, “Hey, we’re 97% accurate on names.” But maybe now that’s not good enough, right? Like, screwing up 1% of the names is not acceptable, so, you know, how do we get that last 1% of names? And, you know, I’m just making up the numbers; it may be 99.9%. You’re still going to have upset customers if you get, you know, 0.1% of your names or your numbers wrong. Numbers are even worse, right?

Sam Charrington: [00:40:47] Mm-hmm [affirmative].

Arul Menezes: [00:40:48] Like, if you misstate a number even, like, 0.1% of the time, it could have catastrophic consequences, right? So that’s an important area. I mentioned neural hallucination before. That’s something we see where, again, it may happen only 0.1% of the time, but if you get, like, a completely unrelated sentence that has nothing to do with your input but is really fluent, it’s pretty deceptive, right? Like, especially if I’m just putting my faith in this translation and I don’t understand the source language at all, you’d be like, “Well, sounds okay,” and move on. But maybe it says something completely different from what the source said, right? And so that’s a challenge.

Sam Charrington: [00:41:25] Mm-hmm [affirmative].

Arul Menezes: [00:41:25] Yeah, I mean, there’s lots of really cool things happening in this space.

Sam Charrington: [00:41:30] Awesome. Awesome. Well, Arul, thanks so much for taking some time to share a bit about what you’re up to. Very cool stuff.

Arul Menezes: [00:41:38] Thank you. You’re welcome, Sam.

Sam Charrington: [00:41:40] Thank you.

Arul Menezes: [00:41:40] Happy to be on the show. Take care. Bye.

Sam Charrington: [00:41:43] Thank you. All right, everyone, that’s our show for today. To learn more about today’s guest or the topics mentioned in this interview, visit Of course, if you like what you hear on the podcast, please subscribe, rate, and review the show on your favorite pod catcher. Thanks so much for listening, and catch you next time.