Sam Charrington: [00:00:00] Welcome to The TWIML AI Podcast. I'm your host, Sam Charrington. Before we jump into the interview, I'd like to take a moment to thank Microsoft for their support of the show and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer without requiring machine learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand and accelerate decision making into your apps. Visit aka.ms/cognitive to learn how customers like Volkswagen, Uber and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you're there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure free account. That link again is aka.ms/cognitive. And now, on to the show. All right, everyone, I am here with Arul Menezes. Arul is a distinguished engineer at Microsoft. Arul, welcome to the TWIML AI podcast. Arul Menezes: [00:01:43] Thank you, Sam. I'm delighted to be here. Sam Charrington: [00:01:45] I'm really looking forward to our chat, which will focus on some of the work you're doing in the machine translation space. To get us started, I'd love to have you introduce yourself and share a little bit about your background. How did you come to work in NLP and, and translation? And tell us a little bit about your story. Arul Menezes: [00:02:03] Yeah, so I've actually been at Microsoft 30 years at this point. Sam Charrington: [00:02:07] Wow. Arul Menezes: [00:02:07] I, yeah, I know. God, it's a long time. I was actually in a PhD program. I came here for the summer, loved it so much I never went back. So I worked at Microsoft in the various engineering teams for a while, and then eventually I drifted back into research and I joined the natural language processing team in Microsoft Research, and I started the machine translation project, and I've been doing that ever since, so I've been doing machine translation for, like, 20 years now, and it's been, it's been a great ride because it's just a fascinating field. So many interesting challenges and we have made so much progress from when we started, you know, and we've gone through so many evolutions of technology. It's been, it's been a great ride, yeah. Sam Charrington: [00:02:49] Yeah, there are some pretty famous examples of, you know, how the introduction of deep learning has changed machine translation. I'm assuming that your experience there i- is no different. Arul Menezes: [00:03:04] Yeah. Sam Charrington: [00:03:04] Can you share a little bit about the, the evolution that you've seen over the years? Arul Menezes: [00:03:08] Sure. Sure. I mean, historically, you know, machine translation is something people s- tried to do, you know, in the '50s. It was one of the first things they wanted to do with computers, you know, along with simulating, sort of, nuclear bombs.
But for the longest time, it was very, very hard to make progress, so all the way through, I would say, the late ’90s, early 2000s, we were still in sort of rule based and knowledge sort of engineered approaches, but then the first real breakthrough that came in the late ’90s, well actually starting a little earlier in terms of some papers published at IBM, but really taking off in the late ’90s and early 2000s was statistical machine translation, where for the first time, you know, we were able to take advantage of, like, large amounts of previously translated data, right? So you take documents and web pages and things that, that have previously been translated by people and you get these parallel texts, which is, let’s say, English and French, and you align documents and sentences, and then eventually words and phrases so you can learn these translations, and so with statistical machine translation, we were learning from data for the very first time, instead of having people hand code it. And it worked, actually, astonishingly well compared to what we were doing before. But eventually, we ran into the limits of the technology, because while we had the data, we didn’t have the techniques to do a good job of learning what that data was telling us because, you know, the machine learning techniques that we had back then just weren’t good enough at… They were good at memorizing, right? If you said something exactly the way they had seen in the data, they would do a good job of translating it. But they were terrible at generalizing from what they saw in the data, and that’s where neural models come in. Like, neural models are amazing at generalizing, you know. People always talk about how some of the latest models, you know, you can probe them to figure out what was in their training data and get them to reproduce what was in their training data. But what we forget is it takes work to actually make them do that, because most of the time, they’re generalizing. They’re paraphrasing. They’re not just replicating their training data, and that’s something we were not able to do before. So if you look at the evolution over the last 20 years of machine translation, we had our statistical machine translation, which did really well for a while, but then eventually plateaued. Then, you know, we had sort of the advent of neural networks, and the first thing that people tried to do was, you know, we did feedforward neural networks. We tried to shoehorn them into the framework we already had and combine feedforward networks and statistical techniques, and that worked okay. You got a few incremental improvements. But it wasn’t until we had the sort of pure neural LSTM models that we, for the first time, were really capturing the power of neural models, right? So what an LSTM model would do would be, you know, you have this encoder that you feed the source language sentence in, and it basically embeds the meaning of that entire sentence in the LSTM state. And then you feed that through a decoder that is now generating a fluent sentence, sort of based on this very abstracted embedded understanding of what the source language said. And so that’s very different from the way we were doing it, just sort of copying words and phrases that we’d memorized. So that was the first revolution, and, and it gave us amazing results, actually, compared to what we were doing before. 
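For readers who want to see the encoder/decoder idea in code, here is a minimal, toy-sized LSTM sequence-to-sequence sketch in PyTorch. It only illustrates the architecture described above, not Microsoft's production system; the class name, vocabulary sizes, and dimensions are arbitrary choices for the example.

import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy LSTM encoder-decoder: the encoder compresses the source sentence into its
    final hidden state, and the decoder generates the target conditioned on that state.
    Purely illustrative, not a production translation model."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: the whole source sentence is summarized in the final (h, c) state.
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # Decode: generate target-side representations conditioned on that summary.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary

model = TinySeq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 12)), torch.randint(0, 8000, (2, 14)))
print(logits.shape)  # (batch, target length, target vocabulary)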
And then, of course, along, after that came transformers, which sort of take that whole encoder/decoder architecture, but take it to the next level. Instead of having the meaning of the entire source sentence be encoded into a single LSTM state, which may work well for short sentences but gets, you know, worse as you get to longer sentences. In a transformer, you know, we have the self attention that’s basically looking at every word in the source and every word in the target, and so you have, like, full context available to the model at any point in time. So that’s where we stand today is, you know, transformers are the state of the art, but of course there’s lots of really cool, interesting variations and things we’re doing, which I think we’re going to talk about at some point. Sam Charrington: [00:07:25] And, and when you talk about transformers being the state of the art, is that what is powering the current kind of production Azure machine translation service? Or is that the state of the art in research and, you know, there’s some combination of the various techniques you mentioned that is powering the live service? Arul Menezes: [00:07:46] So the live service is very much powered by transformers. We have, you know, all 180 language pairs or something that we support powered by transformers running in production. Now, one thing we do do is that we take advantage of what’s called knowledge distillation, right, to take the knowledge that’s embedded in these very large transformers that we train offline and then condense that or distill that into smaller, still transformers, but smaller, shallower, and narrower models that we use in production, right? So we typically go through multiple stages of these teacher models before we get to the student, so our pipeline’s actually fairly complex. We take the parallel data, which I mentioned earlier, which is sort of the lifeblood of machine translation. This is the previously translated human text. And we train, like, a first teacher based on that data. Then we typically do what’s called back translation, which is a technique in machine translation to take advantage of monolingual data, so data that’s not parallel, so it’s not translated source and target. It’s just in one language, typically the target language. And what we do there is we want to take advantage of this monolingual data to teach the model more about the syntax and the, you know, semantics of the target language so it gets more fluent. And the way we incorporate that data into a machine translation model is through something called back translation, where we take the, the target language data, we translate it back to the source using one of our models, and then we use it to train the model in the other direction. So this is a little complicated. So basically, if you’re training an English to French model… Sam Charrington: [00:09:28] Mm-hmm [affirmative]. Arul Menezes: [00:09:29] … in addition to the parallel English-French data, you also take some French monolingual data, you translate it back to English using your other direction translation system, the French to English system… Sam Charrington: [00:09:41] Mm-hmm [affirmative]. Arul Menezes: [00:09:41] … and then you put that synthetic data back into training your English-French system. Sam Charrington: [00:09:45] Okay. Arul Menezes: [00:09:45] So, so that’s- Sam Charrington: [00:09:47] [crosstalk 00:09:49] essentially a, a data augmentation technique? 
Arul Menezes: [00:09:51] It is, yeah, it's a data augmentation technique, and it works, like, incredibly well, actually. Adds several points to our metric. The metric we use in machine translation is called a BLEU score. I mean, there are other metrics and I, I mean, I could talk about that at some point if we want to get into it, but, you know, we get several points of BLEU score out of the back translation. And then, so that's our final sort of teacher model, which is typically huge, and then what we do is we use that model to teach the student model. And the way we do that is essentially we run, like, a huge amount of text through this teacher model, and then we take the data generated by the teacher and we train the student on it. And the reason that works is because unlike sort of natural data that we train the teacher on, which can be confusing, contradictory, diverse, the, the data generated by the teacher is very uniform and it's very standardized, and so you can use a much simpler student model to learn all of that knowledge from the teacher, because it's a simpler learning problem. And having done that, that model runs, like, super fast and we can h- host in production and translate, like, trillions of words, you know, so yeah. Sam Charrington: [00:10:59] And so the, the student teacher part of the process is kind of interesting to explore a little bit further. Are you essentially trying to do something your- the task that you're trying to, or goal that you're trying to achieve with that is model compression? Arul Menezes: [00:11:13] Right. Sam Charrington: [00:11:14] Very different approach to it than, like, pruning or… Arul Menezes: [00:11:18] Right. Sam Charrington: [00:11:18] … you know, some of the other ways you might approach compression. Arul Menezes: [00:11:20] Yeah, right. So we do, like, we do a lot of different things for model compression, right? So one of the things we do is we, we do quantization, for example, within all our models in eight bits. We've experimented with less than eight bits. It's not quite as effective, but, you know, we, we do that. We do some other, like, pruning techniques as well, but the biggest one is the knowledge distillation, and what you're trying to do there is get a smaller model to basically mimic the behavior of the big teacher model just running a lot cheaper. And by combining all the techniques, we published a paper last year on this at a workshop, and from our big teacher with all of the knowledge distillation, the compression, the quantization and, and so on, we're running something like 250 times faster… Sam Charrington: [00:12:06] Wow. Arul Menezes: [00:12:06] … on the student than the teacher with, I mean, there is a small loss in quality, right? But we lose maybe half a BLEU point, not too much, and in some cases not even any. We can, like, actually maintain the quality as is, so… Sam Charrington: [00:12:20] The, my next question for you, it, the way you describe the process, and in particular the idea that the teacher is outputting more consistent examples than what is found in the training data… Arul Menezes: [00:12:35] Mm-hmm [affirmative], right. Sam Charrington: [00:12:35] My next question was, or the intuition that I had was that that would cause the student to be far less effective at generalizing and would make it perform worse, but it sounds like that's not the case in practice.
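The two data-generation steps described above, back translation and sequence-level knowledge distillation, can be sketched schematically as follows. The translate helpers are hypothetical stand-ins so the snippet runs on its own; in a real pipeline they would be the reverse-direction system and the large teacher model.

# Schematic sketch only; the translate_* helpers below are hypothetical stand-ins,
# not real models or APIs.

def translate_fr_to_en(sentences):           # stand-in for the French->English system
    return [f"<EN translation of: {s}>" for s in sentences]

def teacher_translate_en_to_fr(sentences):   # stand-in for the big English->French teacher
    return [f"<FR translation of: {s}>" for s in sentences]

# 1) Back translation: pair synthetic English sources with real French targets, so the
#    English->French model sees more genuine target-language text.
french_monolingual = ["Phrase française numéro un.", "Phrase française numéro deux."]
synthetic_english = translate_fr_to_en(french_monolingual)
augmented_parallel = list(zip(synthetic_english, french_monolingual))

# 2) Sequence-level distillation: run a large amount of English text through the teacher
#    and train the small student on the (input, teacher output) pairs.
english_monolingual = ["First English sentence.", "Second English sentence."]
teacher_outputs = teacher_translate_en_to_fr(english_monolingual)
student_training_set = list(zip(english_monolingual, teacher_outputs))

print(len(augmented_parallel), len(student_training_set))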
Arul Menezes: [00:12:51] So the key to that is to make sure that the data that you feed through the teacher to teach the student is diverse enough to cover all the situations that you may encounter, right? So the students are a little weird, I mean, and I think you're sort of hinting at that. We do, for example, over-fit the student to the training data, which is something that you typically wouldn't do in your teacher model, because you, in fact, are trying to make the student match the teacher as much as possible. Sam Charrington: [00:13:18] Mm-hmm [affirmative]. Arul Menezes: [00:13:18] So some of the things that you do to make the, to the teachers better at generalization, you don't do in the student. And in fact, if you look at the student distributions, they're much sharper than the teacher distributions, because they have overfit to the data that they've seen. But, you know, there's a little evidence that you could get into some corner cases that are brittle, like, you know, there's this problem of neural hallucination that all of the neural models are subject to where, you know, occasionally they'll just output something that is completely off the wall, unrelated to anything that they've seen. And there's some evidence that there's a little bit of amplification of that going on. Like if it's… You know, the teachers are also subject to hallucination, but maybe at a very, very low frequency, and that maybe that's being amplified a little bit in the student. So w- we're, you know, we're working on managing that, but yeah, so there's, there's, you know, it's a trade-off. Like, the students have lower capacity, but that's what enables us to run them, and we, you know, we run them on CPU. We don't, we, we don't use GPUs in production inference. We use, of course, all models are trained and, and all the knowledge [inaudible 00:14:26] are done on GPUs, but in, but in production we're just using CPUs. Sam Charrington: [00:14:26] And is, is that primarily based on the cost benefit analysis, or is it based on a latency envelope that you have to work with and not needing, not wanting a kind of batch… Arul Menezes: [00:14:38] Right. Sam Charrington: [00:14:38] … inference requests? Arul Menezes: [00:14:39] Yeah, that, that's exactly right. I mean, you know, latency is a big concern. Our API's a real- real-time API, and so, you know, latency is the biggest driving factor. And honestly, if you do inference on GPUs, you know, you get some latency benefit, but the big benefit is on large batches. And so unless you have a batch translation API, you can't really take advantage of the full capacity of your, of your GPU, so, you know, in a real-time API. Sam Charrington: [00:15:05] Mm-hmm [affirmative]. And are both the teacher and the student models transformers for you? Arul Menezes: [00:15:12] Yeah, they are. They are. Yeah, the, the students are transformer large or a little bit larger, and then the s- sorry, that's the teachers, and then the students, they're very highly optimized transformers. I mean, they, we start with transformer base, but then we do a lot of really strange stuff. I would refer you to the paper, actually. [laughs] Sam Charrington: [00:15:30] Okay. When you were describing the data augmentation technique that you use… Arul Menezes: [00:15:36] Right.
Sam Charrington: [00:15:37] … it kind of called to mind ideas about incorporating a GAN type of approach where you're doing the pass-back translation and then, you know, maybe there's some GAN that is trying to… Arul Menezes: [00:15:47] Right. Sam Charrington: [00:15:47] … figure out if the results going backwards… Arul Menezes: [00:15:50] Right. Sam Charrington: [00:15:51] … is like a human translation. Is there a role for that kind of technique? Is that something that comes up in the research? Arul Menezes: [00:15:57] Yeah, so we've, we've looked at GANs. There were some exciting results, but in the end, I mean, I think we have some okay research results. We haven't seen much benefit, but more broadly, in terms of data augmentation, we're using it all over the place, right? So we have the back translation, but there are a lot of phenomena that we want to address in machine translation that are maybe not well represented in the data, and so we use data augmentation pretty heavily to cover those cases, right? To give you a simple example, when you translate a sentence and you get a particular translation and then you go in and let's say you remove the period at the end of the sentence, sometimes it changes the translation entirely. They may both be perfectly good translations, right? But they're different. So one way to look at it is, well, they're both good translations, but people don't like that. So if you look at our customers, we're very sensitive to the feedback we get from our users. So one piece of feedback we got was that, you know, we want a little more stability in our translation. So, you know, just because I lost a period at the end of the sentence, I shouldn't get a drastically different translation. And so, you know, it's very easy to augment the data and say, well, you know, stochastically I'm going to, like, delete the period on my sentences, and so then the model learns to basically be robust whether there's a period or not. Now, of course, you know, that's different than a question mark. You definitely want to leave the question mark in because that changes the meaning of the whole… Sam Charrington: [00:17:22] Mm-hmm [affirmative]. Arul Menezes: [00:17:23] … sentence. But, you know, things like that, punctuation, the period, commas, things like that. Maybe, you know, capitalization, for example. One of the other examples would be like an all caps sentence. You know, you take the whole sentence and you change it to all caps. Well, you get a totally different translation, right? So we, again, generate some synthetic all caps data so that the model learns to do a good job of translating that as well. And then there's, you know, there's all these, like, I, I would call them, you know, long-tail phenomena, and, you know, we feel that data augmentation's a good way to address some of these, yeah. Sam Charrington: [00:17:53] Your examples are really interesting to me because I'm refer- I'm comparing them to, like, your textbook NLP types of examples where the first thing you're doing is making everything lowercase and getting rid of all of your punctuation. Arul Menezes: [00:18:05] Yeah. Sam Charrington: [00:18:05] Sounds like that does not work for translation. Arul Menezes: [00:18:08] No, because there's a lot of information in casing and punctuation, right? Like, I mean, if you want to handle names, for example, you need to pay attention to the case of the input.
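A minimal sketch of the kind of robustness augmentation just described: stochastically dropping a trailing period and occasionally upper-casing the source while leaving the target (and question marks) untouched. The probabilities, function name, and example sentences are illustrative assumptions, not production settings.

import random

def augment(pair, p_drop_period=0.3, p_all_caps=0.1):
    """Stochastically perturb a (source, target) pair so the model learns to be robust
    to a missing trailing period and to all-caps input. A toy sketch, not the
    production augmentation pipeline; the probabilities are arbitrary."""
    src, tgt = pair
    if src.endswith(".") and random.random() < p_drop_period:
        # Drop only a trailing period; question marks are kept because they change meaning.
        src = src[:-1].rstrip()
    if random.random() < p_all_caps:
        # Pair an all-caps source with the normally cased target so casing alone
        # does not change the translation.
        src = src.upper()
    return src, tgt

pairs = [("The engine must be serviced every year.",
          "Le moteur doit être révisé chaque année.")]
augmented = pairs + [augment(p) for p in pairs]
print(augmented)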
Like, everything in the input has information, and so actually even the punctuation, right? Like, sometimes if you take the period off the end of the sentence, it should change things because it may be a noun phrase rather than an actual sentence, right? So it’s not so much we’re preprocessing the data and trying to be clever. It’s about exposing the model to different variations so that the model can figure things out for itself. Sam Charrington: [00:18:41] Mm-hmm [affirmative]. One of the questions this prompts is, like, the unit of, you know, work or the unit of thing that you’re trying to translate. Arul Menezes: [00:18:50] Right. Sam Charrington: [00:18:50] You know, translating a word being different from translating a sentence… Arul Menezes: [00:18:54] Sure. Sam Charrington: [00:18:54] … being different from translating a, an entire document. Arul Menezes: [00:18:57] Right. Sam Charrington: [00:18:57] Sounds like most of what we’ve been talking about is kind of phrase by phrase now relative to the word by word that, you know, we were doing 20 years ago. Are you also looking at the entire document? Are you able to get information from a broader context to impact the translations? Arul Menezes: [00:19:18] Yeah, so that’s, that’s a very good question, Sam. Yeah, so the context matters a lot, right? So one of the reasons why neural models are so great at translating now is because they, they are looking at the whole sentence context and they’re translating the entire conte- the sentence and every w- they, they’re basically sort of figuring out the meaning of every word and phrase in the context of the whole sentence, which is something we couldn’t do with statistical machine translation before. So now the next step is to expand that context to beyond the sentence, right? So there are a lot of phenomenon that it’s impossible to translate well without context beyond the sentence. Like, in many languages, unless you have document-level context or paragraph-level context, you can’t generate the right pronouns because you don’t actually know. The sentence doesn’t have enough clues to let you know what is the gender of the subject or the object or the person you’re talking about in that sentence. Beyond just the pronouns, it’s also like you know, the senses of words and you know, disambiguating those. So, we’re, we’re actually moving towards translating at the whole document level context, or at least, you know, very large, multi-sentence fragments. And then there, we’ll be able to use, you know, the, the, the, the context of the entire document to translate each individual sentence. And we actually have some really great research results based on translating at the document level. Yeah, so we’re pretty excited about that. That model is not in production yet. Sam Charrington: [00:20:48] Mm-hmm [affirmative]. Arul Menezes: [00:20:48] But it’s something that we’re working on. We did ship a document-level API. I think it’s in public preview right now. Which addresses sort of the other half of the problem, which is, you know, people have documents. They’ve got formatting. You know, it’s in PDF. It’s in Word. It’s in PowerPoint, whatever, and HTML, and it’s a hassle getting all the text out of that… Sam Charrington: [00:21:11] Yeah. Arul Menezes: [00:21:11] … getting it translated, and then worse still trying to reassemble the document and reconstruct the formatting of that document on the translated thing. So we’ve made that easy. We just shipped this API. Just give us your PDF. 
We'll tear it apart, we'll do the translation, we'll put it back together, and we'll preserve the format. And you know, especially for PDF, that is actually really hard. Doing the format preservation is tricky. But we're pretty excited about that API. And so, then, that's the place where our document level neural model would fit right in, right? Because now we have, the user's giving us the whole document. We can not only handle all the stuff about the formatting and all that. We can go one better. We can actually use the whole document context to give you better quality translations. Sam Charrington: [00:21:53] Mm-hmm [affirmative]. Can you give us an overview of some of the techniques that go into looking at the entire document when building the, the model? Arul Menezes: [00:22:03] Yeah, so there's, I mean, right now, as I said, we haven't actually shipped this, so we're looking at a, a bunch of variations. You know, there's several things that people have looked at, like mo- you know, there are hierarchical models where you do the, you run transformers at the sentence level, and then you run a second level to sort of, like, collect the sentence level information into, like, a document level context vector, and then you feed that back into translating each sentence. We're finding that actually, if you just make it super simple and treat the whole thing as, as if it were a giant sentence, in effect, you get really good results. You do have to deal with the performance issues, right, because transformers are n-squared in the size of the input and the output, so instead of, you know, handling, you know, a 25-word sentence, if we're now translating a thousand-word para- you know, document or paragraph, then the, you know, you've got, like, an n-squared problem in terms of the performance, right? It's going to be that much more expensive, but we have, we have things that we're looking at to make that faster as well, so we're pretty optimistic we can do that, and, and I think we can do that with just letting the transformer figure it out for itself rather than trying to be very clever about all this hierarchical stuff. Sam Charrington: [00:23:10] Nice. Nice. Let's talk a little bit about the role of different languages. You know, we, we've already talked about how you can use back translation to help augment the performance of your translation of a language in, in one direction or the translation between a couple of language pairs. Arul Menezes: [00:23:27] Right. Sam Charrington: [00:23:28] Are there ways to take advantage of the other 130 or so languages… Arul Menezes: [00:23:33] Sure. Sam Charrington: [00:23:33] … that y- that you support when you're building the n-plus-1th model for a given language? Arul Menezes: [00:23:38] Absolutely. Absolutely. That's been one of the most exciting things, I would say, that came out of sort of transformers and, and neural models in general is the ability to do this sort of transfer learning between languages, right? And the reason we can do that is because transformers or neural models in general are representing the meanings of words and sentences and phrases as embeddings in, you know, the space, and by training on multiple languages together, you can actually get the representations of these languages to merge and have the similar concepts be represented through relative points in spa- in that space, right?
So, as a practical matter, we've basically found that if we group languages by family, right, and take, so for example we took all our Indian languages and we put them together and we trained one joint model a- across all of the languages and now we're talking, you know, languages where you have a very different amount of data. You have Hindi, where we have quite a lot of data, and then we have, like, Assamese, which was, I think, the last one that we shipped, that has probably, like, two orders of magnitude less data. And the, the wonderful thing is that by training them jointly, the Assamese model learns from the huge amount of data that we have for Hindi and does, like, dramatically better than if we had just trained on Assamese by itself. In fact, we have done those experiments and, you know, for the smaller languages, we can get, like, five, 10 BLEU points, which is, like, a crazy level of improvement just from the transfer learning and multilingual. We also do that with, like, Arabic, all of our Middle Eastern languages. So we're just, like, grouping more and more language families together and getting huge benefits out of this. Sam Charrington: [00:25:31] And when you're grouping the, the language families, have you ever, do you experiment with going across language families and seeing if there's some improvement, Arul Menezes: [00:25:41] Yeah. Sam Charrington: [00:25:42] … improvement there? Arul Menezes: [00:25:43] Yeah, so we, you know, we've trained models that are, like, 50 or 100 languages in them. What you run into is, you know, as you add languages, you have to increase the size of your vocabulary to accommodate all of these languages, and you have to increase the size of the model, because at some point you run into model capacity a little bit. So you can have a model that does a ni- a really nice job of learning from 50 or 100 languages, but it gets to be a really huge model, and so in terms of cost effectiveness, we've found that, like, you get, like, almost all of the benefit of the transfer learning at, like, a much reduced cost by just grouping 10, 15 languages at a time. And if they're related, it's better. But actually, even if they're unrelated, it still works. [laughs] It's quite amazing how well it works even if the languages are not related, yeah. Sam Charrington: [00:26:32] We may think of it as, like, a computational test of Chomsky's universal grammar and, you know, these ideas that suggest that all languages have these common elements. Arul Menezes: [00:26:41] Yeah. Sam Charrington: [00:26:42] I- if you are able to train these models across languages and im- improve them, that would seem to support those kinds of theories. Arul Menezes: [00:26:48] I mean, definitely the models do a really good job of bringing related concepts together in the, in the embedding space, right? Sam Charrington: [00:26:56] Would you consider this, you, you referenced this as, like, multilingual transfer learning. Would you also think of it as a type of multitask learning as well, or is, is that not technically what you're doing in this task? Arul Menezes: [00:27:09] So we're also doing, in addition to multilingual just machine translation, we're also doing multilingual multitask learning, and what we're doing there is we are combining the sort of… so let me back up a bit. There's been this whole line of research based on models like BERT, right?
Pretrained language models where, if you look at BERT, it's actually the encoder half of a machine translation model, but it's trained on monolingual data. It's trained on a, on a single language data on this objective that's a reconstruction objective where, you know, you're given a, a, a sentence where you have a couple of words or phrases blanked out. You need to predict that, right? And then you have multilingual BERT where you take multiple separate monolingual corpora, right, so it's like a bunch of English text, a bunch of French text, and all of it, and you train them jointly in the same model. And it does a pretty good job of actually pulling the representations of those things together. So that's one line of research that's sort of really driven a revolution in, like, the whole natural language understanding field, right? So for example, today if you want to train a named entity tagger, you wouldn't start from scratch on your ta- on your named entity data. You would start with a pretrained model. So one of the things that we're very excited about is we have this project that we call [ZICOR 00:28:43] where we're bringing the machine translation work and this sort of pretrained language model, BERT-style work together, right? And we train, we're training this multitask, multilingual model that's, architecturally, it's just a machine translation model, right? But in addition to training it on the parallel data, let's say the English-French data and the English-German data and conversely the German-English data and the French-English data and, you know, 10 or 15 or 50 or 100 other languages. In addition, we have a separate task where we have the BERT tasks, where we take monolingual data and we, we have it reconstruct, you know, the, the missing words. And we also have what's called a denoising autoencoder task, which is where you give it a scrambled sentence, and then it has to output the unscrambled sentence through the decoder. And then now you have these three tasks, and we train them in rotation on the same model, so they're sharing parameters. So the model has to figure out how to use the same parameters to do a good job of the BERT task, to do a good job of the denoising autoencoder task, as well as to do a good job of the machine translation task. And this, we find, leads to, like, much better representations that work for better natural language understanding quality, but also better machine translation quality. Sam Charrington: [00:29:43] Nice. And the, the BERT task in this example is within the same language, as opposed to… Arul Menezes: [00:29:49] Right. Sam Charrington: [00:29:49] … across the target, to the target language? Arul Menezes: [00:29:53] Yeah, there's actually, like, a whole family of tasks, right? I mean, people have come up with, I mean, we've, we've experimented with, like, 20, 25 tasks. Like, so you can do a monolingual masked language model task, which is the BERT task, but you can do a cross-lingual masked language task as well, and you can do the denoising autoencoder task monolingually, where you have to reconstruct the same language, but you can also do that cross-lingually where you have to reconstruct sort of a scrambled foreign language task, so there's, like, a real, like, sort of stone soup approach where people are just throwing in all kinds of tasks, and they all help a little bit. But we need to figure out, like, what's the minimal set that you need? Because, you know, it's work.
It's computationally expensive to train these huge models on all these tasks, so if we can find the minimal set that works, that would be ideal. And so far, what we're working with is, like, a denoising autoencoder, a masked language model, and a machine translation task. Sam Charrington: [00:30:49] Very, very cool. Very cool. So I think one of the things that, you know, users of these kinds of machine translation services often experience is that, you know, they weren't great in the general case, but when you start to try to apply them to specific domains, it's a lot more challenging, you know, and kind of the technical conversations or translating, you know, medical conversations or, you know, construction or what have you. Is there anything that you're doing to make the domain-specific performance better for these kinds of systems? Arul Menezes: [00:31:26] Yeah, definitely, you know, domain performance in specialized domains is a real challenge. We're doing several things to get better there, right? So the, the first thing is that the quality is really determined by the availability of data, right? So in the domains like, let's say news or web pages where we have a ton of data, you know, we're doing really, really well. And then if you go into a more specialized domain like, let's say, medical or legal where we don't have as much data, we're maybe not doing quite as well. And so one of the things we're doing is we're now taking the same neural models that are good at translation and we're using them to identify parallel data in these domains that we can find on the web that we maybe weren't finding before, and we can do that because these models, you know, because the representations are shared in the multilingual models, they are actually very good at identifying potential training data that, that is translations of each other. So that's one thing we're doing. The other thing we're doing, of course, is the same kind of transfer learning approach that we're using cross-lingually applies within domains, as well, right? So if you have a small amount of medical domain data, you don't want to, like, just train a l- a model that's based just on that, you know, small data. What we're doing instead is we're taking, you know, our huge model that's trained on a ton of, like, general data across a bunch of domains, and then you fine-tune it for the specific domains that you're interested in. And we actually have a product called Custom Translator that we have, like, you know, thousands of customers using, where they are using this approach to customize the machine translation to their company or their application needs, right? So let's say you're a car company or something and you have a bunch of data that's about, like, automotive manuals, right? So you come to our website, you log in, you create your account, etc., you upload this data, and then what we do is we take your small amount of domain-specific data, we take our large model, and then we fine-tune it to that data, and now you have a model that does, like, you know, sometimes dramatically, again, 10, 15, 20 BLEU points better than the baseline because, you know, we've learned the vocabulary and the specifics of your domain, but we're still leveraging, we're standing on this platform of, like, the broad general domain quality. So that's been extremely popular and valuable, actually. We just shipped a new version of that based on transformers a couple of months ago.
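The fine-tuning recipe described here (start from a strong general model and continue training on a small in-domain parallel set) can be roughly sketched as below, using a public OPUS-MT English-French checkpoint as a stand-in for the large general model. The data, learning rate, and epoch count are made up for illustration, and this is not the Custom Translator implementation.

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # public checkpoint standing in for a large general model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# A tiny, made-up in-domain parallel set (automotive-manual flavored).
domain_pairs = [
    ("Check the brake fluid level before every trip.",
     "Vérifiez le niveau de liquide de frein avant chaque trajet."),
    ("Torque the wheel bolts to 120 Nm.",
     "Serrez les boulons de roue à 120 Nm."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):  # a real run would use far more data, batching, and a held-out set
    for src, tgt in domain_pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss  # standard sequence-to-sequence cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
print("fine-tuned on", len(domain_pairs), "in-domain pairs")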
Sam Charrington: [00:33:45] And in that case, the user is presumably bringing translated documents so that, that you’re able to train or fine tune all, with both source and target translations? Arul Menezes: [00:33:56] Yeah, that’s exactly right. I mean, a lot of the companies that we work with have some data, right? Like, let’s say they had a previous version of their vehicle or, you know whatever and they had manuals that were translated. In Microsoft’s case, for example, you know, we have, let’s say the manuals for Microsoft Word going back, you know, a couple of decades, and this is the kind of data you can use to customize it so that anything, any new content that you want to translate can have, like, a very consistent, like, vocabulary and, and tone and so on, yeah. Sam Charrington: [00:34:28] Mm-hmm [affirmative]. And then in that first example or the first technique that you mentioned, that sounds really interesting. So you’ve got this index of the web and Bing… Arul Menezes: [00:34:37] Right. Sam Charrington: [00:34:37] … you know, for example, or maybe you have a separate one, but you go- have this mechanism to kind of crawl the web and… Arul Menezes: [00:34:43] Right. Sam Charrington: [00:34:44] It sounds like the idea is that you can use the model to identify, hey, I’ve got these two documents. Arul Menezes: [00:34:52] Right. Sam Charrington: [00:34:52] They look really similar, but there’s a, a high percentage of words that I don’t know… Arul Menezes: [00:34:57] Yeah, right. Sam Charrington: [00:34:57] … that occupy similar positions in the same documents. Arul Menezes: [00:35:00] Yeah. Sam Charrington: [00:35:00] And then you have someone translate- oh, well actually, then once you know that, you can just align them, so to speak, and you’ve got more domain-specific document to add to your training set? Is that the general idea? Arul Menezes: [00:35:13] Yeah. I mean, it, it’s like you’re trying to find two very similar-looking needles in a very, very, very large haystack, right? [laughs] Sam Charrington: [00:35:20] Mm-hmm [affirmative]. Arul Menezes: [00:35:21] And so, so you have to have a magnet that finds exactly those two needles and rejects everything else. So the, the cross-lingual embedding space is pretty key here, right? So you’re basically, in principle, if you embedded every single sentence or document on the web and then were able to look at every single document and find all of its very similarly close embeddings, you’d be done. But, you know, that’s [laughs] that’s, Sam Charrington: [00:35:47] easier said than done? Arul Menezes: [00:35:48] Easier said than done, right? So that’s the kind of thing that we’re trying to do at scale, right, is, like, you got these, you know, trillions of documents and, you know, we want to find the matching one, so you need to do it efficiently, and so there’s a lot of, like, clever engineering that goes into, like, indexing this stuff and, and, like, computing the embeddings efficiently. And, of course, also, you know, we’re not really trying to match every page in the web to every other page in the web, because you have, you know, a lot of clues that says whe- you know, if I have a document here, you know, is it likely I’d have a translated document somewhere? It’s going to be either in the same, like, top-level domain or, or related sites, things like that. So there are, there are ways to constrain that search. Sam Charrington: [00:36:27] Mm-hmm [affirmative]. Our conversation thus far has focused primarily on text translation. 
Are you also involved in voice translation? Arul Menezes: [00:36:38] Yeah, so we actually have been doing speech translation for a while. Several years ago we shipped a feature for speech translation in Skype called Skype Translator. It was, you know, really well received, super exciting. A lot of people use it even today, right? Especially, you know, people talking to their relatives in another country, and, you know, there's a lot of interesting challenges in speech translation because it's not that you just take the output of a speech recognition system and then just pass it to machine translation, right? There's a, there's a real mismatch in what comes out of speech recognition and what is needed to do a good job of translation, because of course translation's expecting, like, you know, well-formatted text, capitalization, punctuation, br- sentence breaks, things like that. So we put a, we put a lot of effort into bridging that gap, you know, post-processing the output of speech recognition so that we have, you know, h- really accurate sentence boundaries. So that, that matters a lot. I mean, you break the sentence in the middle and you try to translate… Like, if you break a sentence in the middle, the speech recognition there is okay, because as a human reading it, you know, there's a period in there. You just ignore it and move on. But the machine doesn't know that, and so when you're trying to translate, you've got these two separate sentences and then it does a terrible job of it. So doing, getting the sentence breaks right, getting punctuation right and so on is really important, and so, so that's what we've been doing. We actually have a project going on now with the European Parliament where they are going to be using our technology, well, it's, it, there, there's three contestants or three bidders in this project, and so there's an evaluation that will happen in a few months, but we're hoping that they'll adopt our technology for live transcription and translation of the European Parliament sessions in all 24 languages of the European Parliament, which is super exciting. Sam Charrington: [00:38:26] Oh, wow. Wow. Arul Menezes: [00:38:27] Yeah. Sam Charrington: [00:38:28] So when you think about kind of where we are with, you know, transformers and some of the innovations that we've talked about, and, you know, relative to your 20, 30 years in the space, I'm curious what you're most excited about and, and where you see it going. Arul Menezes: [00:38:44] Yeah, I mean, the pace of innovation has just been amazing. There's so many things that are happening that, like, you know, would have a really dramatic impact, right? So one is just much larger models, right? As we scale up the model, we see continual improvements. And so as the hardware and the, you know, our ability to serve up these larger models keeps growing, the quality will also keep growing, right? The architecture of these large models also matters, right? Like, it's not just a matter of taking the smaller model and scaling it up exactly as is, so there are things like mixture of experts models that, for example, allow you to scale the number of parameters without the cost scaling as linearly, right, because you have parts of the model that specialize in different parts of the problem. And then, you know, multilingual is definitely the future. Pretrained models are definitely the future, right?
So, so, like, if you put that all together, like pretrained, multilingual, multitask trained, maybe with mixture of experts, huge models, and then we would specialize them for individual language pairs or groups of languages and then distill them do- down to something we can ship. So that’s one area that there’s a lot of innovation happening. The other thing is that, you know, 10 years ago, people were just amazed that translation worked at all, right? Sam Charrington: [00:40:07] [laughs] Arul Menezes: [00:40:08] And now we’re doing a really good job and expectations have risen, so you get to the point where a lot of sort of smaller, let’s call them long-tail problems start to matter a lot, right? So if you look at translation of names, we probably get them 99% right, right? But a few years ago it would have been fine to say, “Hey, we’re 97% accurate on names.” But maybe now that’s not good enough, right? Like, screwing up 1% of the names is not acceptable, so, you know, how do we get that last 1% of names a- and, you know, I’m just making up the nu- it, it may be 99.9%. You’re still going to have upset customers if you get, you know, 0.1% of your names or your numbers, numbers are even worse, right? Sam Charrington: [00:40:47] Mm-hmm [affirmative]. Arul Menezes: [00:40:48] Like, if you misstate a number even, like, 0.1% of the time, it could have catastrophic consequences, right? So that’s an important area. I mentioned neural hallucination before. That’s something we see where, again, may happen only 0.1% of the time, but if you get, like, a completely unrelated sentence that has nothing to do with your input but is really fluent, it’s pretty deceptive, right? Like, because especially if I’m just putting my faith in this translation that, and I don’t understand the source language at all, you’d be like, “Well, sounds okay,” and move on. But maybe it says something completely different from what the source said, right? And so that’s, that’s a challenge. Sam Charrington: [00:41:25] Mm-hmm [affirmative]. Arul Menezes: [00:41:25] Yeah, I mean, there’s lots of really cool things happening in this space. Sam Charrington: [00:41:30] Awesome. Awesome. Well, Arul, thanks so much for taking some time to share a bit about what you’re up to. Very cool stuff. Arul Menezes: [00:41:38] Thank you. You’re welcome, Sam. Sam Charrington: [00:41:40] Thank you. Arul Menezes: [00:41:40] Happy to be on, on the show. Take care. Bye. Sam Charrington: [00:41:43] Thank you. All right, everyone, that’s our show for today. To learn more about today’s guest or the topics mentioned in this interview, visit TWIMLAI.com. Of course, if you like what you hear on the podcast, please subscribe, rate, and review the show on your favorite pod catcher. Thanks so much for listening, and catch you next time.
Over the past couple weeks I got to sit on the other side of the (proverbial) interview table and take part in a few fantastic podcasts and video conversations about the state of machine learning in the enterprise. We also cover current trends in AI, and some of the exciting plans we have in store for TWIMLcon: AI Platforms. Each of these chats has its own unique flavor and I'm excited to share them with you.
The New Stack Makers Podcast. I had a great chat with my friend, Alex Williams, founder of The New Stack, a popular tech blog focused on DevOps and modern software development. We focused on MLOps and the increasingly significant convergence of software engineering and data science.
Minter Dialogue. I spoke with Minter Dial, host of the popular podcast, Minter Dialogue, and author of the book Heartificial Empathy: Putting Heart into Business and Artificial Intelligence. We had a wide-ranging conversation in which we talked about the future of AI, AI ethics, and the state of AI in businesses.
Datamation. In this video chat with James Maguire for Datamation, we discuss some of the key trends surrounding AI in the enterprise, and the steps businesses are taking to operationalize and productionalize machine learning.
Hope you enjoy the talks! If you're not already registered for TWIMLcon we'd love to have you join us! Register now!
A few weeks ago I had the opportunity to visit Siemens' Spotlight on Innovation event in Orlando, Florida. The event aimed to bring together industry leaders, technologists, local government leaders, and other innovators for a real-world look at the way technologies like AI, cybersecurity, IoT, digital twin, and smart infrastructure are helping businesses and cities unlock their potential. Siemens put together a nice pre-event program the day before the formal conference which included a tour of their Gamesa Wind Turbine Training Center. We got a peek into the way these machines are assembled, repaired, and managed. As expected, wind turbines are increasingly being fitted with sensors that, when coupled with machine learning algorithms, allow the company to optimize their performance and do predictive maintenance. AI figured prominently in the discussions at the main conference, and the highlight for me was Norbert Gaus, head of R&D at Siemens, presenting an overview of the four main AI use cases that the company is interested in:
Generative product design
Automated product planning
Adaptable autonomous machines
Real-time simulation and digital twin
He covered, though not in much detail, examples in each of these areas. (My Industrial AI ebook is a good reference for more on the opportunities, challenges, and use cases in this space.) Gaus also provided an interesting overview of the systems and software tools the company was building for internal and customer/partner use. These spanned AI-enabled hardware, industry-specific algorithms and services, AI development tools and workflows, pretrained AI models and software libraries, and industrial knowledge graphs. I was able to capture a couple of really interesting conversations with Siemens applied AI research engineers about some of the things the company is up to. Over on Twitter you can check out a short video I made with Siemens engineer Ines Ugalde where she demonstrates a computer vision powered robot arm that she worked on that uses the deep learning based YOLO algorithm for object detection and the Dex-Net grasp quality prediction algorithm designed in conjunction with Ken Goldberg's AUTOLAB at UC Berkeley, with all inference running in real time on an Intel Movidius VPU. I also had an opportunity to interview Batu Arisoy for Episode 281 of the podcast. Batu is a research manager with the Vision Technologies & Solutions team at Siemens Corporate Technology. Batu's research focus is solving limited-data computer vision problems. We cover a lot of ground in our conversation, including an interesting use case where simulation and synthetic data are used to recognize spare parts in place, in cases where the part cannot be isolated. "The first way we use simulation is actually to generate synthetic data and one great example use case that we have developed in the past is about spare part recognition. This is a problem if you have a mechanical object that you deploy in the field and you need to perform maintenance and service operations on this mechanical functional object over time. In order to solve this problem what we are working on is using simulation to synthetically generate a training data set for object recognition for large amount of entities. In other words, we synthetically generate images as if these images are collected in real world from an expert and they're annotated from an expert and this actually comes for free using the simulation.
[…]We deployed this for the maintenance applications of trains and the main goal is a service engineer goes to the field, he takes his tablet, he takes a picture, then he draws a rectangle box and the system automatically identifies what is the object of interest that the service engineer would like to replace and in order to make the system reliable we have to take into consideration different lighting conditions, texture, colors, or whatever these parts can look like in a real world environment.” There’s a ton of great detail in this conversation. In particular, we dive into quite a few of the details behind how this works, including a couple of methods that they apply which were published in his group’s recent CVPR papers, including Tell Me Where to Look, which introduced the Guided Attention Inference Network, and Learning Without Memorizing. Definitely check out the full interview! Thanks once again to Siemens for hosting this event and for sponsoring my visit, this post, and my conversation with Batu.
Bits & Bytes
Microsoft open sources Bing vector search. The company published its vector search toolkit, Space Partition Tree and Graph (SPTAG) [Github], which provides tools for building, searching and serving large scale vector indexes.
Intel makes progress toward optical neural networks. A new article on the Intel AI blog (which opens with a reference to TWIML Talk #267 guest Max Welling's 2018 ICML keynote) describes research by Intel and UC Berkeley into new nanophotonic neural network architectures. A fault tolerant architecture is presented, which sacrifices accuracy to achieve greater robustness to manufacturing imprecision.
Microsoft research demonstrates realistic speech with little labeled training data. Researchers have crafted an "almost unsupervised" text-to-speech model that can generate realistic speech using just 200 transcribed voice samples (about 20 minutes' worth), together with additional unpaired speech and text data.
Google deep learning model demonstrates promising results in detecting lung cancer. The system demonstrated the ability to detect lung cancer from low-dose chest computed tomography imagery, outperforming a panel of radiologists. Researchers trained the system on more than 42,000 CT scans. The resulting algorithms turned up 11% fewer false positives and 5% fewer false negatives than their human counterparts.
Facebook open-sources Pythia for multimodal vision and language research. Pythia [Github] [arXiv] is a deep learning framework for vision and language multimodal research that helps researchers build, reproduce, and benchmark models. Pythia is built on PyTorch and designed for Visual Question Answering (VQA) research, and includes support for multitask learning and distributed training.
Facebook unveils what its secretive robotics division is working on. The company outlined some of the focus areas for its robotics research team, which include teaching robots to learn how to walk on their own, using curiosity to learn more effectively, and learning through tactile sensing.
Dollars & Sense
Algorithmia raises $25M Series B for its AI platform
Icometrix, a provider of brain imaging AI solutions, has raised $18M
Quadric, a startup developing a custom-designed chip and software suite for autonomous systems, has raised $15M in funding
Novi Labs, a developer of AI-driven unconventional well planning software, has raised $7M
To receive Bits & Bytes in your inbox, subscribe to our Newsletter.
Bits & Bytes
Google announced a bunch of interesting ML/AI-related news at last week's Next conference. Here are the highlights, along with a few other tidbits.
Google launches new AI-powered contact center solution. The global market for cloud-based contact center solutions is expected to exceed $30B by 2023. It's no surprise that Google wants a piece of this, and to that end launched the Contact Center AI alpha. The new offering combines Google's Dialogflow chat platform with other AI technologies—e.g. agent assist and a conversational topic modeler—to help customers reduce wait times, improve customer satisfaction, and gain greater insights. A full host of technology and services partners were announced as well.
Furthering its edge initiatives, Google releases new Cloud IoT Edge. Cloud IoT Edge includes Edge IoT Core, which facilitates the connection of edge devices to the Google Cloud and simplifies their management, and Edge ML, which supports running pre-trained TensorFlow Lite models on edge hardware. Cloud IoT Edge is designed to take advantage of the newly announced Edge TPU as well (see below).
Google unveils new AI chips for edge machine learning. Google is bringing its TPU accelerator chips from the cloud to the edge with the launch of Edge TPU, currently in early access. Aiming to compete with offerings like the Nvidia Jetson and Intel Movidius product families, Edge TPU brings high-performance ML inference to small, power-constrained devices.
Google adds Natural Language and Translation services to the Cloud AutoML family. I covered the launch of Google Cloud AutoML Vision in the newsletter earlier this year. Last week Google pulled back the covers on new AutoML services for natural language classification and translation. Skip the press releases though and check out Rachel Thomas' great series of posts on these new tools.
For more from Google and Next, check out these roundups of all announcements and analytics/ML announcements.
Dollars & Sense
Snap40, which uses ML/AI for remote patient monitoring, has secured US $8 million in seed financing
Zorroa, which provides a platform for managing visual assets, has closed a $7M funding round
Shanghai-based Wayz.ai, a smart location and mapping start-up (not to be confused with Waze), announced that it has raised a US$80 million series A
Unisound, a Chinese AI solutions provider specialized in voice recognition and language processing, has received RMB600 million ($89 million) in Series C-plus funding
Sign up for our Newsletter to receive the Bits & Bytes weekly in your inbox.
Bits & Bytes

Elon Musk, DeepMind co-founders promise never to make killer robots. The founders have signed on to the Future of Life Institute's pledge to never develop, manufacture, or use killer robots, which was published at the annual International Joint Conference on Artificial Intelligence in Stockholm, Sweden.

Huawei plans AI chips to rival Nvidia, Intel. The company is reportedly developing AI chips for both networking equipment and the datacenter in an effort to strengthen its position in the growing AI market and to compete with the likes of Nvidia and Intel.

Let the sniping continue. Facebook has hired Shahriar Rabii to lead its chip initiative. Rabii previously worked at Google, where he helped lead the team in charge of building the Visual Core chip for the company's Pixel devices. Meanwhile, Apple has appointed former Google AI exec John Giannandrea to lead a new artificial intelligence and machine learning team, which will include the Siri unit.

Interesting projects. Researchers at Nvidia, MIT, and Aalto University presented an approach at ICML to automatically removing noise, grain, and even watermarks from photos. A Google researcher, along with collaborators from academia, has developed a deep learning-based system for identifying protein crystallization, achieving a 94% accuracy rate and potentially improving the drug discovery process by making it easier to map the structures of proteins. Google revealed "Move Mirror," an ML experiment that matches users' poses with images of other people in the same pose.

Dollars & Sense

R4 Technologies, a Ridgefield, Connecticut-based AI startup created by Priceline.com founders and executives, secured $20M in Series B funding
Cambridge-based SWIM.AI, which provides edge intelligence software for IoT applications, announced $10 million in Series B funding
Viz.ai, a company applying AI in healthcare, secured $21 million in Series A funding
Computer vision technology provider AnyVision announced that it has secured $28 million in Series A financing
Salesforce has signed a definitive agreement to acquire Datorama, an AI-powered marketing intelligence platform
Workday announced that it has acquired Stories.bi, which uses AI to automate analytics and generate natural language business stories
Robotic retail inventory specialist Bossa Nova announced the acquisition of AI video surveillance company HawXeye
Self-driving car company Pony.ai raised $102 million, putting it close to a billion-dollar valuation
Box announced that it has acquired Butter.ai, a startup focused on cross-silo enterprise search
DataRobot announced that it has acquired Nexosis, an ML platform company whose founders we interviewed in TWIML Talk #69
Accenture has acquired Kogentix to strengthen Accenture Applied Intelligence's growing data engineering business

Sign up for our Newsletter to receive the Bits & Bytes weekly to your inbox.
We recently ran a series of shows on differential privacy on the podcast. It's an especially salient topic given the rollout of the EU's General Data Protection Regulation (GDPR), which becomes effective this month, not to mention scandals like the Facebook/Cambridge Analytica breach and other attacks on private data. If you haven't heard the term differential privacy before, you're not alone. The field is relatively new, only about ten years old.

Differential privacy attempts to allow data holders to make confidential data available for analysis, or for use via a data product, while simultaneously preserving (actually, guaranteeing) the privacy of the individuals whose data is included in the database or data product.

Differential privacy is often introduced in contrast to data anonymization. While anonymization might seem to be a reasonable way to protect the privacy of those data subjects whose information is included in a data product, that information is vulnerable to numerous types of attack. Consider the frequently cited example of the Netflix Prize. In support of a competition to see if someone could build a better recommendation engine, Netflix made an anonymized movie rating dataset available to the public. A group of researchers, however, discovered a linkage attack that allowed large portions of the data to be de-anonymized by cross-referencing it with publicly available IMDB user data.

But what if we don't want to publish data, but rather use it to create machine learning models that we allow others to query or incorporate into products? It turns out that machine learning models are vulnerable to privacy leakage as well. Consider, for example, a membership inference attack against a machine learning model. In this kind of attack, patterns in the model's output are used to infer whether a particular record was part of the model's training data. These attacks, powered by "shadow" machine learning models, have been shown to be effective against black-box models trained in the cloud with the Google Prediction API and Amazon ML. In another example, an attack called model inversion [pdf] was used to extract recognizable training images (i.e., faces) from cloud-based image recognition APIs. Because these APIs return a confidence score alongside the label of a face submitted for recognition, an adversary can systematically construct an input face that maximizes the API's confidence in a given label.

Differential privacy is an approach that provides mathematically guaranteed privacy bounds; it's not a specific algorithm. For any given problem, there can be many algorithms that provide differential privacy. Aaron Roth provided a great example of a simple differentially private algorithm in our interview. In his example, a polling company wants to collect data about who will vote for Trump in the upcoming election, but is concerned about the privacy of the people it polls. Roth explains that the company could use a simple yet differentially private method of collecting the data. Instead of simply asking for each respondent's voting intention, the company could instruct the individuals to first flip a coin: if the coin comes up heads, answer the question honestly; if it comes up tails, give a random answer decided by a second coin flip. Because the statistical characteristics of the coin flips are known, you can still make inferences about the wider population even though your data collection has been partially corrupted. A minimal sketch of this randomized-response scheme appears below.
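To make the mechanism concrete, here's a small Python sketch of the coin-flip scheme Roth describes, often called randomized response. The function names, the 50/50 coin probabilities, and the simulation are my own illustration rather than anything from the interview:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Collect one respondent's answer with plausible deniability.

    With probability 1/2 the respondent answers honestly; otherwise they
    report the result of a second, independent coin flip.
    """
    if random.random() < 0.5:           # first coin: heads -> answer honestly
        return true_answer
    return random.random() < 0.5        # first coin: tails -> random answer

def estimate_true_rate(responses: list[bool]) -> float:
    """Recover an unbiased estimate of the underlying 'yes' rate.

    E[reported yes rate] = 0.5 * true_rate + 0.25, so we invert that relation.
    """
    reported = sum(responses) / len(responses)
    return min(1.0, max(0.0, 2.0 * (reported - 0.25)))

if __name__ == "__main__":
    # Simulate a population in which 30% would truthfully answer 'yes'.
    truth = [random.random() < 0.30 for _ in range(100_000)]
    collected = [randomized_response(t) for t in truth]
    print(f"Estimated 'yes' rate: {estimate_true_rate(collected):.3f}")  # ~0.30
```

No individual response reveals the respondent's true answer: a reported "yes" is only three times as likely to come from a true "yes" as from a true "no" (0.75 vs. 0.25), and that ratio is what bounds the privacy loss for this particular scheme.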
At the same time, this method ensures that the individuals in your study are protected by plausible deniability: if the data were to be exposed, there's no way of knowing whether a given answer was honest or part of the injected noise.

Some tech companies are already starting to reap the benefits of differential privacy. For example:

Apple. Apple uses differentially private methods of capturing user data to gain insights about user behavior at large scale. It currently applies these methods to applications as diverse as QuickType and emoji suggestions, Lookup Hints in Notes, crash-prone and energy-draining domains in Safari, autoplay intent in Safari, and more.

Google. In addition to using differential privacy to help understand the effectiveness of search query suggestions in its Gboard keyboard, Google, along with other cloud providers, has a huge incentive to explore these methods due to the public nature of many of the machine learning models it offers. Google has published several papers on the topic so far, including "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response" and "Deep Learning with Differential Privacy."

Bluecore. Bluecore offers software and services to help marketers find and retain their best customers through targeted email marketing. The company uses differential privacy techniques to pool data across companies to improve customer outcomes while preventing any individual customer from gaining insights into competitors' data. Be sure to check out my interview with Bluecore director of data science Zahi Karam.

Uber. Uber uses differential privacy to protect sensitive data against internal and external privacy risks. When company data analysts explore average trip distances in a city, for example, their queries go through an internal differential privacy system called Chorus, which rewrites them to ensure differential privacy.

Open-source tools for differential privacy are beginning to emerge from both academic and commercial sources. A few examples include:

Epsilon is a new differential privacy software system offered by Georgian Partners. Epsilon currently works for logistic regression and SVM models. At this time it's only offered to the firm's partners and portfolio companies, but the team behind the project plans to continue expanding the tool's capabilities and availability. For more, check out my interview with Georgian's Chang Liu.

SQL Elastic Privacy is an open source tool from Uber that can be used in an analytics pipeline to determine the level of privacy required by a given SQL query. This becomes a parameter that allows Uber to fine-tune its differential privacy algorithm.

Diffpriv is an R package that aims to make differential privacy easy for data scientists. Diffpriv replaces theoretical sensitivity analysis with sensitivity sampling, helping to automate the creation of privacy-assured statistics, models, and other structures.

ARX is a more comprehensive open-source offering comprising a GUI-based tool and a Java library implementing a variety of approaches to privacy-preserving data analysis, including differential privacy.

As you might imagine, differential privacy continues to be an active research topic.
According to Roth, hot research areas include the use of differential privacy to create and publish synthetic datasets, especially for medical use cases, as well as better understanding the "local" method of differentially private data collection, in which the noise is injected at the time of collection rather than afterwards.

Differential privacy isn't a silver bullet capable of fixing all of our privacy concerns, but it's an important emerging tool for helping organizations work with and publish sensitive data and data products in a privacy-preserving manner.

I really enjoyed producing this series and learned a ton. I'm eager to hear what readers and listeners think about it, so please email or tweet any comments, or comment below.

Sign up for our Newsletter to receive this weekly to your inbox.
Bits & Bytes

Intel open sources nGraph neural network compiler. The newly open-sourced compiler, originally announced last summer and discussed on TWIML Talk #31, provides support for multiple deep learning frameworks while optimizing models for different hardware solutions. It supports six deep learning frameworks: TensorFlow, MXNet, neon, PyTorch, CNTK, and Caffe2.

Google unveils augmented reality microscope. The prototype, which can detect cancer in real time, was unveiled at an event organized by the American Association for Cancer Research. The new tool relays its predictions directly into the user's field of view and can be retrofitted into existing microscopes.

Google extends semantic language capabilities. Building on the hierarchical vector models at the heart of Gmail's Smart Reply feature, the new work extends these ideas by creating vectors for larger chunks of language such as full sentences and small paragraphs. The company published a paper on its Universal Sentence Encoder and launched the Semantic Experiences demonstration site. A pre-trained TensorFlow model was also released.

IBM releases Adversarial Robustness Toolbox. The open-source software library aims to support researchers and developers in defending deep neural nets against adversarial attacks. The software, which currently works with TensorFlow and Keras, can assess a DNN's robustness, increase robustness as needed, and offer runtime detection of potential threats.

MATLAB 2018a adds deep learning features. Many self-taught data scientists were initially exposed to MATLAB via Octave, the open source clone Andrew Ng used in his original Stanford machine learning online course. Well, the commercial software continues to evolve, with its latest version adding a host of new deep learning-related features, including support for regression and bidirectional LSTMs, automatic validation of custom layers, and improved hardware support.

Dollars & Sense

Sword Health, a Portuguese medtech company, raises $4.6 million
LawGeex, a contract review automation business, raises $12 million
XpertSea, applying computer vision to aquaculture, raises $10 million
Konux, a sensor and AI analytics startup, raises $20 million
Citrine, a materials data and AI platform, raises $8 million
Eightfold.ai launches talent intelligence platform, closes $18 million round
Voicera, the AI-powered productivity service, announces acquisition of Wrappup
Adobe announces acquisition of voice technology business Sayspring

Sign up for our Newsletter to receive the Bits & Bytes weekly to your inbox.
In my recent podcast with Facebook AI research scientist Moustapha Cissé, Cissé shared the insightful quote, "you are what you eat and right now we feed our models junk food." Well, just like you can't eat better if you don't know what's in your food, you can't train less biased models if you don't know what's in your training data.

That's why the recent paper Datasheets for Datasets, by Timnit Gebru (see her TWIML podcast and meetup) and her co-authors from Microsoft Research and elsewhere, is so interesting. In this paper, Timnit and company propose the equivalent of food nutrition labeling for datasets.

Given that many machine learning and deep learning model development efforts use public datasets such as ImageNet or COCO, or private datasets produced by others, it's important to be able to convey the context, biases, and other material aspects of a training dataset to those interested in using it. The Datasheets for Datasets paper explores the idea of using standardized datasheets to communicate this information to users of datasets, commercialized APIs, and pre-trained models. In addition to helping to communicate data biases, the authors propose that such datasheets can improve transparency and provide a source of accountability.

Beyond potential ethical issues, hidden data biases can cause unpredictability or failures in deployed systems when models trained on third-party data fail to generalize adequately to different contexts. Of course, the best option is to collect first-party data and use models built and trained by experts with deep domain knowledge. But widely available public datasets, more approachable machine learning tools, and readily accessible AI APIs and pre-built models are democratizing AI and enabling a broader group of developers to incorporate AI into their applications. The authors suggest that datasheets for AI datasets and tools could go a long way toward providing essential information to engineers who might not have domain expertise, and in doing so help mitigate some of the issues associated with dataset misuse.

This perspective echoes similar thoughts from Clare Gollnick in our discussion on the reproducibility crisis in science and AI. She expressed her concern about developers turning first to deeper, more complex models to solve their problems, noting that they often run into generalization issues when those models are moved into production. Rather, she finds that when AI problems are solved by capitalizing on a discovery made through a strong understanding of the domain at hand, the results are much more robust.

Timnit and her co-authors suggest in the paper that AI has yet to undergo the kind of safety regulation adopted by emergent industries of the past, like the automobile, medical, and electrical industries. The paper points out that, "When cars first became available in the United States, there were no speed limits, stop signs, traffic lights, driver education, or regulations pertaining to seat belts or drunk driving. Thus, the early 1900s saw many deaths and injuries due to collisions, speeding, and reckless driving." Over the course of decades, the automobile industry and others iteratively developed regulations meant to protect the public good while still allowing for innovation. The paper suggests that it's not too early to start considering these types of regulations for AI, especially as it begins to be used in high-stakes applications like the health and public sectors.
Such regulation will likely first apply to issues of privacy, bias, ethics, and transparency, and in fact, Europe's impending General Data Protection Regulation (GDPR) takes on just these issues.

The proposed datasheets take their cues from those that accompany electrical components. Every electrical component sold has an accompanying datasheet that lists the component's function, features, operating voltages, physical details, and more. These datasheets have become expected in the industry due to the need to understand a part's behavior before purchase, as well as the liability issues that arise from a part's misuse.

The authors suggest that those offering datasets or APIs should provide a datasheet that addresses a set of standardized questions covering the following topics:

The motivation for dataset creation
The composition of the dataset
The data collection process
The preprocessing of the data
How the dataset is distributed
How the dataset is being maintained
The legal and ethical considerations

For the full breakdown of all of the questions, check out the paper; it goes into a bunch of additional detail and provides an example datasheet for the UMass Labeled Faces in the Wild dataset. It's a thorough and easy-to-use model that has the potential for big impact. (For a rough sense of what such a datasheet might look like in code, see the sketch at the end of this post.)

Datasheets such as this will allow users to understand the strengths and limitations of the data they're using and guard against issues such as bias and overfitting. It can also be argued that simply having datasheets at all forces both dataset producers and consumers to think differently about their data sources and to understand that data is not a de facto source of truth but rather a living, breathing resource that requires careful consideration and maintenance.

Maybe it's the electrical engineer in me, but I think this is a really interesting idea. What do you think? Do you think datasheets could help address the issues of bias and accountability in AI? Are there instances where you would have found this useful in your own work? Let me know via email or via the TWIML slack channel.

Sign up for our Newsletter to receive this weekly to your inbox.
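As a rough postscript: here's what a minimal, machine-readable skeleton of such a datasheet might look like in Python. The field names loosely mirror the topics listed above, and the example values are my own paraphrase of the Labeled Faces in the Wild dataset, not text taken from the paper's example datasheet.

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """A minimal, illustrative datasheet skeleton (not the paper's actual format)."""
    motivation: str               # Why and by whom was the dataset created?
    composition: str              # What do instances represent? Any sensitive attributes?
    collection_process: str       # How was the data acquired and sampled?
    preprocessing: str            # Cleaning, labeling, or filtering applied to the raw data
    distribution: str             # How is the dataset shared, and under what terms?
    maintenance: str              # Who maintains it, and how are errata handled?
    legal_and_ethical: str        # Consent, privacy, and regulatory considerations
    known_limitations: list[str] = field(default_factory=list)

# Illustrative values, loosely describing Labeled Faces in the Wild (LFW).
lfw_sketch = Datasheet(
    motivation="Benchmark face verification in unconstrained, 'in the wild' settings",
    composition="Web-collected images of public figures, labeled with identities",
    collection_process="Images gathered from online news sources; faces detected automatically",
    preprocessing="Faces detected, aligned, and cropped",
    distribution="Freely downloadable for research use",
    maintenance="Maintained by the original academic authors",
    legal_and_ethical="Subjects are public figures photographed in public contexts; no consent collected",
    known_limitations=["Demographic skew toward individuals prominent in Western news media"],
)
```

Even this toy structure makes the point: the questions force whoever assembles a dataset to write down context that downstream users would otherwise never see.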
Bits and Bytes

Apple hires Google's AI head; Google forms A.I. business unit. In the latest round of the AI talent wars, John Giannandrea, previously Google's chief of search and AI, was hired to run Apple's "machine learning and A.I. strategy." It's an important victory for Apple, which has lagged behind in AI. Google took the change as an opportunity to put AI into its own business unit under recent TWIML guest Jeff Dean. As the AI "arms race" intensifies, larger players are putting ever more resources into solidifying their positions. Last week we shared a similar story from Microsoft on its own reorg to better focus on AI.

Researchers at the MIT-IBM Watson AI Lab train models to recognize dynamic events. It's easy for humans to recognize dynamic events, for example, opening a door, a book, or a bottle. MIT-IBM researchers hope to train models to recognize these types of dynamic events. They've released a Moments in Time dataset and are hosting a Moments in Time competition at CVPR. Note: I recently discussed similar work from the University of Montreal and startup Twenty Billion Neurons with its chief scientist Roland Memisevic.

GridGain's newest release includes a continuous learning framework. The company's in-memory computing framework, based on Apache Ignite, now includes machine learning and a multilayer perceptron (MLP) neural network, enabling companies to run ML and deep learning algorithms against petabyte-scale operational datasets in real time.

Amazon SageMaker update. Amazon has added support for more instance sizes and open sourced its MXNet and TensorFlow containers. The updated containers can be downloaded to support local development.

Data scientist uses cloud ML to classify bowls of ramen. Never mind hot dog/not hot dog... Data scientist Kenji Doi used Google Cloud AutoML Vision to successfully identify the exact shop each bowl of ramen was made at. A very impressive feat when you consider how similar the bowls of ramen actually look.

Dollars and Sense

Insider, an AI-enabled growth marketing platform, raises $11 million
Comet.ml, a platform for managing AI projects, raises $2.3 million
Audioburst, an AI-enabled audio search platform, raises $4.6 million from Samsung
Conga to acquire Counselytics, the contract discovery and analytics company, to bolster its AI strategy and document automation capabilities

Sign up for our Newsletter to receive the Bits & Bytes weekly to your inbox.
Bits and Bytes

Google open sources exoplanet discovery AI. The project came out of a collaboration between Google Brain software engineer Chris Shallue and astrophysicist Andrew Vanderburg. The team was able to discover several new exoplanets and has now open-sourced the project to the public. I got a chance to talk with Chris Shallue about his work not too long ago; check out the show to learn more.

Microsoft matches human performance translating news from Chinese to English. The research incorporated novel methods of training translation models, including dual learning, deliberation, joint training, and agreement regularization.

Google's NSynth Super is an AI synth made of Raspberry Pis. The tool comes out of Magenta, Google's creative AI applications project. The synthesizer uses open source AI software to generate new sounds. I talked with Doug Eck, the Magenta project lead, about his work on generative AI for music a little while back; give it a listen.

Gluon models now deployable to AWS DeepLens. Gluon is an open source deep learning interface developed by AWS and Microsoft. It's now deployable to AWS DeepLens instances for computer vision applications.

Google open-sources the AI-powered tool behind portrait mode on its Pixel devices. The tool uses semantic image segmentation to identify optimal focal areas, or areas that need higher or lower exposure.

Dollars & Sense

SambaNova Systems, a start-up building computer processors and software for AI, raises $56 million in funding led by Alphabet
Voci Technologies Incorporated, a provider of enterprise speech-to-text transcription and analytics, raises $8M in Series B funding
Percipient.ai, a provider of analytics for national security and now corporate security missions, raises $14.7M in Series A funding
Airspace Systems, Inc., a manufacturer of comprehensive drone defense systems, raises $20M in Series A funding
TaoData, a Chinese fintech startup, raises $15.8 million in a Series B round
Fractal Analytics, an AI solutions and analytics company, announces the acquisition of behavioral AI company Final Mile
L’Oréal announces the acquisition of ModiFace, an augmented reality and AI-powered beauty company
Avaya Holdings Corp. announces its acquisition of Spoken Communications, a Contact Center as a Service application built on conversational artificial intelligence

Sign up for our Newsletter to receive the Bits & Bytes weekly to your inbox.
Bits and Bytes

Interesting tidbits from recent news:

Microsoft develops AI-powered sketch artist. The new bot, based on recent GAN research, is capable of generating "drawings" from caption-like text descriptions. Applications for this technology include the arts, design, and perhaps at some point, police sketches. Overall very cool.

IBM and Salesforce announce Watson + Einstein collaboration. The two tech giants are teaming up to integrate their two eponymously named, over-marketed, poorly understood machine learning products. Oh boy! Although it's not immediately obvious in what ways Watson and Einstein are "combining," Salesforce and IBM are making it clear that they are prioritizing AI and fleshing out their offerings. #SnarkLevelHigh

Baidu grows AI research team. The new hires are Dr. Kenneth Church, a pioneer in natural language processing; Dr. Jun Huan, a big data and data mining expert; and Dr. Hui Xiong, who specializes in data and knowledge engineering.

Dating services firm Lunch Actually to launch ICO for Viola.AI. The dating service aims to not only match couples but also track their relationships, suggest date venues, remind them of new dates, and advise them on relationship problems. Potentially a very interesting AI application, but one with tons of potential privacy implications.

UC Berkeley & Facebook introduce House3D for reinforcement learning. The two teamed up to enable more robust intelligent agents by publishing a new dataset called "House3D." House3D contains 45,622 3D scenes of houses, ranging from single-room studios to multi-storied houses, all equipped with fully labeled 3D objects. In doing so, the groups aim to push RL research toward tasks that are more easily applicable to the real world.

App claims to predict whether an image will "go viral." ParallelDots released the app with an open API that allows users to upload images and receive a "virality" score. It's no secret that viral sharing is the dream of many marketers, so it'll be interesting to see if this type of service can provide useful insights when planning ad campaigns.

Amazon launches SageMaker BlazingText. BlazingText is an unsupervised learning algorithm for generating word2vec embeddings (see TWIML Talk #48) and is the latest addition to Amazon SageMaker's suite of built-in algorithms.

Deal Flow

There seemed to be an abundance of deals last week:

Smartphone maker Coolpad has raised $300 million from Chinese property mogul Chen Hua-backed Power Sun Ventures to enhance its artificial intelligence capabilities
Understand.ai, a Karlsruhe, Germany-based machine learning startup for training and validation data in autonomous vehicles, raised $2.8 million in seed funding
C3 IoT, a provider whose software offerings include AI-for-IoT tools, announced a new $100 million round of financing
Data Nerds, a Canada-based developer of data products, raised $3 million in Series A funding
Techcyte, Inc. closed a $4.3 million funding round to commercialize its digital pathology platform
Babblabs, a fresh start-up in advanced speech processing, announced a $4 million seed investment
Owkin, a NYC-based predictive analytics company that utilizes transfer learning to accelerate drug discovery and development, raised $11 million in Series A funding
Pony.ai, a year-old California-based self-driving car startup, announced it recently completed a $112 million Series A funding round
Smartsheet, which builds software for corporate process management, acquires business automation chatbot startup Converse.AI
Workday, the cloud HR and financials SaaS provider, buys SkipFlag to bolster its machine learning capabilities

Sign up for our Newsletter to receive the Bits & Bytes weekly to your inbox.
Yippee, newsletter number three!

Wrangling with ethical AI

Last week I spent a day at the Wrangle Conference in San Francisco as a guest of the team at Cloudera, which organizes the event. This was the third Wrangle conference ever, and the second I've been able to attend.

Wrangle is a pretty interesting conference. It aims to bring a diverse community of data scientists together in an intimate and informal setting (think cowboy hats and BBQ) to discuss real data science projects and issues. It does a nice job at that. But what it does a GREAT job at is surfacing some of the ethical issues surrounding data science, machine learning & AI. A standout example of this from last year's event was Abe Gong's "Ethics for Powerful Algorithms" talk, which suggested that attendees perform "ethics reviews" of their algorithms and provided a framework for doing so.

This time around, the talks that moved me most were along the same lines. In particular, Drew Conway's talk on the interplay between our cognitive biases and our roles as data scientists, strategists, and consumers was great. Also great was Tyler Schnoebelen's talk on "The Ethics of Everybody Else," which argued that to be ethical, systems that classify human beings must consider the goals of the people affected by the system, not just those of their builders.

The whole area of ethical AI is an important one that I'm looking forward to exploring further on the podcast. Please give me a shout if you'd like to hear more on this topic, or if there are particular people you'd like to hear from. And stay tuned for interviews with Drew, Sharath Rao, and Erin Shellman. (Only Drew's talk touched on ethical issues; Sharath's and Erin's talks were about building data products and pipelines, respectively.)

Sign up for our Newsletter to receive this weekly to your inbox.
Hi everyone! Woohoo, newsletter number two!

Comings and Goings

July's been a busy month. Last week I was in NYC for the launch of Intel's Xeon Scalable platform. It probably goes without saying, but AI figured very prominently in their launch. In particular, they touted 113x increased training performance and 2.4x improved inference throughput relative to prior-generation Xeon chips. Since many AI workloads end up running on Intel CPUs, particularly for "casual" users, this should improve productivity for a lot of people. The new chips may already be available in a cloud near you, as Intel has delivered more than half a million of them to AWS and Google via its early-ship program.

This morning I'm off to San Francisco for the Wrangle Conference on Wednesday. I've mentioned the event on the podcast a few times, and I'm looking forward to it. I've got a few interesting interviews scheduled, and the conference agenda looks great, with data folk from companies like Airbnb, Facebook, and Netflix presenting, among others. Let me know if you'll be around; I'd love to connect!

Next week, two of my favorite people are collaborating on an event in Toronto that the podcast is sponsoring. Charlie Oliver's Tech2025 is hosting a workshop by Integrate.ai's Kathryn Hume called "Explain It Like I'm 5: What's the Difference Between AI, Machine Learning, NLP, and Deep Learning?" Given that Kathryn's last talk explored the future of AI and Aristotelian causality via a fictional biography of a child, this one should be interesting, to say the least! Any TWIML listeners in the 416?

Sign up for our Newsletter to receive this weekly to your inbox.