I studied physics at the University of Technology, Munich, at the Università degli Studi di Pavia, and at AT&T Research in Holmdel. During this time I was a member of the Maximilianeum München and the Collegio Ghislieri in Pavia. In 1996 I received my Master's degree from the University of Technology, Munich, and in 1998 my doctoral degree in computer science from the University of Technology Berlin. Until 1999 I was a researcher in the IDA Group of the GMD Institute for Software Engineering and Computer Architecture in Berlin (now part of the Fraunhofer Gesellschaft). After that, I worked as a Researcher and Group Leader at the Research School for Information Sciences and Engineering of the Australian National University. From 2004 onwards I worked as a Senior Principal Researcher and Program Leader of the Statistical Machine Learning Program at NICTA. From 2008 to 2012 I worked at Yahoo Research. In the spring of 2012 I moved to Google Research to spend a wonderful year in Mountain View, and I continued working there until the end of 2014. From 2013 to 2017 I was a professor at Carnegie Mellon University. I co-founded Marianas Labs in early 2015. In July 2016 I moved to Amazon Web Services to help build AI and machine learning tools for everyone.
There are few things I love more than cuddling up with an exciting new book. There are always more things I want to learn than time I have in the day, and I think books are such a fun, long-form way of engaging (one where I won't be tempted to check Twitter partway through). This book roundup is a selection from the last few years of TWIML guests, counting only the ones related to ML/AI published in the past 10 years. We hope that some of their insights are useful to you! If you liked their book or want to hear more about them before taking the leap into longform writing, check out the accompanying podcast episode (linked on the guest's name). (Note: These links are affiliate links, which means that ordering through them helps support our show!)

Adversarial ML
- Generative Adversarial Learning: Architectures and Applications (2022), Jürgen Schmidhuber

AI Ethics
- Sex, Race, and Robots: How to Be Human in the Age of AI (2019), Ayanna Howard
- Ethics and Data Science (2018), Hilary Mason

AI Sci-Fi
- AI 2041: Ten Visions for Our Future (2021), Kai-Fu Lee

AI Analysis
- AI Superpowers: China, Silicon Valley, And The New World Order (2018), Kai-Fu Lee
- Rebooting AI: Building Artificial Intelligence We Can Trust (2019), Gary Marcus
- Artificial Unintelligence: How Computers Misunderstand the World (The MIT Press) (2019), Meredith Broussard
- Complexity: A Guided Tour (2011), Melanie Mitchell
- Artificial Intelligence: A Guide for Thinking Humans (2019), Melanie Mitchell

Career Insights
- My Journey into AI (2018), Kai-Fu Lee
- Build a Career in Data Science (2020), Jacqueline Nolis

Computational Neuroscience
- The Computational Brain (2016), Terrence Sejnowski

Computer Vision
- Large-Scale Visual Geo-Localization (Advances in Computer Vision and Pattern Recognition) (2016), Amir Zamir
- Image Understanding using Sparse Representations (2014), Pavan Turaga
- Visual Attributes (Advances in Computer Vision and Pattern Recognition) (2017), Devi Parikh
- Crowdsourcing in Computer Vision (Foundations and Trends® in Computer Graphics and Vision) (2016), Adriana Kovashka
- Riemannian Computing in Computer Vision (2015), Pavan Turaga

Databases
- Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases (2021), Xin Luna Dong
- Big Data Integration (Synthesis Lectures on Data Management) (2015), Xin Luna Dong

Deep Learning
- The Deep Learning Revolution (2016), Terrence Sejnowski
- Dive into Deep Learning (2021), Zachary Lipton

Introduction to Machine Learning
- A Course in Machine Learning (2020), Hal Daume III
- Approaching (Almost) Any Machine Learning Problem (2020), Abhishek Thakur
- Building Machine Learning Powered Applications: Going from Idea to Product (2020), Emmanuel Ameisen

ML Organization
- Data Driven (2015), Hilary Mason
- The AI Organization: Learn from Real Companies and Microsoft's Journey How to Redefine Your Organization with AI (2019), David Carmona

MLOps
- Effective Data Science Infrastructure: How to make data scientists productive (2022), Ville Tuulos

Model Specifics
- An Introduction to Variational Autoencoders (Foundations and Trends® in Machine Learning) (2019), Max Welling

NLP
- Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics (2013), Emily M. Bender

Robotics
- What to Expect When You're Expecting Robots (2021), Julie Shah
- The New Breed: What Our History with Animals Reveals about Our Future with Robots (2021), Kate Darling

Software How To
- Kernel-based Approximation Methods Using Matlab (2015), Michael McCourt
Sam Charrington: Hey, what's up everyone! We are just a week away from kicking off TWIMLfest, and I'm super excited to share a rundown of what we've got in store for week 1. On deck are the Codenames Bot Competition kickoff, an Accessibility and Computer Vision panel, the first of our Wellness Wednesdays sessions featuring meditation and yoga, as well as the first block of our Unconference Sessions proposed and delivered by folks like you. The leaderboard currently includes sessions on Sampling vs Profiling for Data Logging, Deep Learning for Time Series in Industry, and Machine Learning for Sustainable Agriculture. You can check out and vote on the current proposals or submit your own by visiting https://twimlai.com/twimlfest/vote/. And of course, we'll have a couple of amazing keynote interviews that we'll be unveiling shortly! As if great content isn't reason enough to get registered for TWIMLfest, by popular demand we are extending our TWIMLfest SWAG BAG giveaway by just a few more days! Everyone who registers for TWIMLfest between now and Wednesday, October 7th will be automatically entered into a drawing for one of five TWIMLfest SWAG BAGs, including a mug, t-shirt, and stickers. Registration and all the action takes place at twimlfest.com, so if you have not registered yet, be sure to jump over and do it now! We'll wait here for you. Before we jump into the interview, I'd like to take a moment to thank Microsoft for their support for the show, and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer, without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand, and accelerate decision-making into your apps. Visit aka.ms/cognitive to learn how customers like Volkswagen, Uber, and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you're there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure Free Account. That link again is aka.ms/cognitive. And now, on to the show! Sam Charrington: [00:03:14] All right, everyone. I am here with Cha Zhang. Cha is a Partner Engineering Manager with Microsoft Cloud and AI. Cha, welcome to the TWIML AI Podcast. Cha Zhang: [00:03:25] Thank you, Sam. Nice to meet you. Sam Charrington: [00:03:27] Great to meet you as well. Before we dive in, I'd love to learn a little bit about your background. Tell us how you came to work in computer vision. Cha Zhang: [00:03:38] Sure. Sure. I actually have been at Microsoft for 16 years. I joined Microsoft originally as a researcher at Microsoft Research. I was there for 12 years. My research was primarily applying machine learning to image, audio, video; all of these different applications. In 2016, I joined the product side, and currently I'm working as an Engineering Manager, and my primary focus is on document understanding. Sam Charrington: [00:04:11] Awesome. Awesome. So, we will be focusing quite a bit on OCR and some of your work in that space, and, you know, I think people often think of OCR as a, you know, a solved problem, right?
It's, you know, we've been scanning documents and extracting text out of those documents for a long time. Obviously the advent of deep learning, you know, changes things, but I'd love to get the conversation started by having you share a little bit about, you know, what's new and interesting in the space. How has it changed over the past few years? Cha Zhang: [00:04:50] Sure. Actually, it wasn't very long ago that when people talked about OCR, what came to mind was mostly scanned documents. In many people's eyes, OCR for scanned documents is sort of a solved problem. More recently, I think there have been two major developments. One is that with a mobile-first kind of world, where everybody now has a mobile phone and takes pictures everywhere, there's a lot of demand to do text recognition on images in the wild, and that certainly is a much more challenging problem than scanned documents. And then technically, because of the advances in deep learning, we have realized that with deep learning we can do OCR at a different level. We can make it a lot more accurate than before, and we can solve the OCR problem in this kind of image-in-the-wild scenario. So I think it started in the early 2010s; there have been a lot of big advances in this area, and now we're seeing OCR become something that really works. You know, people don't need to worry about quality, etcetera; it just mostly works. Sam Charrington: [00:06:08] Can you talk a little bit more about the challenges that arise when you're trying to do OCR in the wild? Cha Zhang: [00:06:16] Of course. For documents, usually it's a white background and black text, but for images in the wild, essentially it's a photo. So in the photo, there's a lot of variation in the text. First there's a huge scale variation: if you capture a picture of a street, there might be some store name that is super big, and then there is some tiny text that's hard to see. So there's a big variation in the scale of the text, and the aspect ratio of these texts can be really long, because a text string can be very long compared to regular objects like a cat or a dog. Because of the mobile capture scenario, it's usually difficult to enclose these texts with axis-aligned rectangles; for example, there might be perspective distortions of the text when the camera sees it. The background in an image in the wild is much more complicated than the typical white background you see in scanned documents, and some of these backgrounds, such as fences, bricks, and stripes, even though they appear quite simple to human beings, create problems: think of how a fence can be a bunch of vertical bars, you know, sitting there on the street, and they look very similar to characters. So those create additional challenges, and I think one of the biggest ones, technically, for OCR is localization accuracy. Typically in object detection, localization accuracy is measured by intersection over union, and if that criterion is bigger than 0.5, people think it is good enough. But for OCR, if the intersection is only half of the union, a lot of the characters will be missing. So usually OCR will need a 0.9 or 0.95 level of accuracy in order to recognize all the characters properly. So… Sam Charrington: [00:08:31] Can you explain that in more detail? What is intersection over union and how is that used in object detection?
Cha Zhang: [00:08:39] So, in order to measure the accuracy of a particular detection algorithm, you need to ground-truth label the data. Typically what people do is create a bounding box for the object to be detected, and then you use an automatic algorithm to figure out where the object is, and that will also create a bounding box. Now you have two bounding boxes, and the question is how you measure how well these two boxes align. A common measure is to take the intersection of these two bounding boxes and the union of these two bounding boxes, so you get two areas. You can imagine that if the two bounding boxes are very close to each other, overlapping a lot, then that intersection over union will be very high, but if they are offset by quite a bit, then, you know, the number is low. So that's kind of the academic standard for how people measure detection accuracy with this criterion. Sam Charrington: [00:09:46] Got it. And so, you were saying that the threshold that you need in the case of text is higher because of what? Cha Zhang: [00:09:58] Because of… Let's just think about it: you have a ground truth text, let's say, "Hello world," and it's an elongated rectangle, and I have a text detection algorithm that also creates a bounding box, but with an intersection over union of, let's say, roughly 0.5. What that means is that the intersection area divided by the union of the two bounding boxes is 50%. So very likely the detected bounding box will miss a few characters because, you know, the overlap is not there. You might mistake a D for an N, and all of this will cause the OCR to produce wrong results. And so that's the main challenge here. Sam Charrington: [00:10:48] So in the case of a traditional object detection scenario, you may miss half of the face but you can tell that there's a face there. In the case of OCR, you're just missing letters and it makes it a lot more difficult for the algorithm to guess what was there. Cha Zhang: [00:11:07] Yes, exactly. Sam Charrington: [00:11:08] Got it, and maybe taking a step back to the problem as a whole, granted mobile is driving, you know, this transition to these in-the-wild pictures and people trying to OCR them, but what are the high-value use cases there? I'm thinking of some interesting ones, like when it's in conjunction with translation: maybe I'm in another country and, I've done this, you're taking pictures of words in another character set to try to read the menu or something like that. I've also done things like scan documents on a phone, and you want to OCR those, but that's kind of back to the traditional OCR problem in a lot of ways. What are some of the other use cases that are common? Cha Zhang: [00:11:58] If you look at the business opportunities, I still think the traditional scanned document is a big one: some traditional kinds of OCR problems like, for example, receipts, which people would scan in the old days, but nowadays people mostly do reimbursement by snapping a photo. So in terms of the market, the revenue, I think that's still quite a big one. There are a few others. There's the one that you mentioned: you have a phone, you go to a foreign country, you snap a photo and you want to translate it. There's also a lot of applications in digital asset management.
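To make the intersection-over-union measure from the exchange above concrete, here is a minimal sketch in plain Python. The (x1, y1, x2, y2) box format and the example numbers are illustrative, not taken from the interview: two axis-aligned boxes are scored by dividing the area they share by the area they jointly cover.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detected box shifted along a long "Hello world" text line: IoU is about 0.5,
# good enough for generic object detection but loose enough to drop characters in OCR.
print(iou((0, 0, 200, 20), (66, 0, 266, 20)))  # ~0.50
```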
So this is when either you are a big company or you are an individual, and you have some big store of photos that you want to organize. We have shown that with OCR capability, you can increase the accuracy of processing and retrieving these photos. As a matter of fact, you know, for the big search engines like Google and Bing, when they search images, OCR is an integral part of that as well, because the text content can help a lot in finding the best images. Sam Charrington: [00:13:22] Okay. And so, you were mentioning some of the technical challenges, and localization of the text in these images is one of those challenges. How do you go about it? Is it the case that, you know, deep learning is so powerful off the shelf that deep learning techniques just solve it for you, or do you, you know, re-engineer the whole pipeline? How do you approach that? Cha Zhang: [00:13:53] So in text detection, usually the pipeline is different from traditional object detection. What's been most popular for OCR on images in the wild today is something called anchor-free detection. So the idea… Anchor-free. In typical object detection, the most well-known detectors, like Fast R-CNN and Faster R-CNN, etcetera, basically create these anchors and then regress the actual bounding box of the objects. The challenge of using that kind of approach is that these anchors need to be preset, and so typically for normal object detection, you set a certain density, and then you set a certain set of aspect ratios, like your anchor boxes are one to two, one to three, one to one. Typically you go about there, but text, some of the text, can go like 20 to one, so it would be a huge computational cost to go with an anchor-based approach. So these days for OCR, we go anchor-free, and the high-level concept is essentially that by using convolutional neural networks, you almost make a per-pixel-level decision or classification saying, well, the region near this particular pixel looks like part of a text. So there is a text/non-text classification at almost a per-pixel level. Then you rely on a few algorithms to group these into text lines by looking at how similar two regions are to each other, and you can decide, well, these two look like the same texture and color, and maybe they should be connected. In this regard, there are quite a few well-known algorithms to do this connection. In earlier days, people used relatively rule-based approaches where they link based on some features, but it's kind of rule-based. More recently, people have started looking to neural networks like relation networks, which estimate the relation between the features of two regions, and based on that decide whether these two should be connected or not. So that way you go kind of bottom up: you start with per-pixel classification, then you do grouping, and you come out with these text lines. It's a very powerful approach. It can not only detect straight lines; even curved lines can be handled pretty well with those approaches. Sam Charrington: [00:16:44] So it sounds like you're describing a pipeline. It's not like an end-to-end trained single neural network that you give images and train on labeled data, and it tells you what the text is, but rather a bunch of independent steps. Cha Zhang: [00:17:04] Yes, that's a very good observation.
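As a rough illustration of the bottom-up, anchor-free idea described above, the sketch below thresholds a per-pixel text/non-text score map and groups neighboring text pixels into candidate regions. The score map is assumed to come from a fully convolutional network that isn't shown, and the simple connected-component grouping stands in for the rule-based or relation-network linking mentioned in the conversation.

```python
import numpy as np
from scipy import ndimage

def group_text_pixels(score_map, threshold=0.5):
    """Group per-pixel text scores into candidate text-region boxes.

    score_map: H x W array of text/non-text probabilities from a
    fully convolutional network (not shown here).
    Returns a list of (x1, y1, x2, y2) boxes, one per connected region.
    """
    mask = score_map > threshold                 # per-pixel text / non-text decision
    labeled, num_regions = ndimage.label(mask)   # connect neighboring text pixels
    boxes = []
    for region_slice in ndimage.find_objects(labeled):
        rows, cols = region_slice
        boxes.append((cols.start, rows.start, cols.stop, rows.stop))
    return boxes
```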
Actually, for OCR, detection is only the first step. After detection, we typically run a character model: you take the detected text lines, normalize them into a straight line with a fixed height, and then run a character model to actually decode the image into a list of characters. A lot of the approaches are actually similar to speech, where, you know, speech goes from acoustics to text; here we're going from image to text. A lot of the approaches that we use, like LSTMs and language modeling, are very similar. Now your question is certainly valid, because in speech today, you know, people do end-to-end training. They start from audio and they can go directly to text. For OCR, we are not there yet. I think the main challenge, well, first is how much data you have. In speech, you can collect a lot more data compared with OCR. OCR data is usually very expensive to collect and label, and so going stage by stage at this point is more economically doable than, you know, doing end-to-end training. Sam Charrington: [00:18:25] Why is that? It seems that we have tons of pictures with words in them that we know. Is it just the in-the-wild examples where we don't have the labeled data, or is it also the document use cases? Because I'm imagining Microsoft has probably labeled a ton of receipts and business cards and that kind of thing. Cha Zhang: [00:18:50] Yeah. I think certainly labeling is very, very expensive. For Microsoft, we are a company paying a lot of attention to privacy, you know, those kinds of issues, and collecting OCR data has been a major, I would say, blocking issue for going to this kind of end-to-end approach, because if you think about it, a lot of the documents that we actually deal with, like invoices, receipts, and business cards, all contain personal information. Those are data extremely difficult to obtain, and we follow very strict guidelines on how we can collect them and how we can label them. So in some ways we are limited by these privacy restrictions, but we do respect those a lot. So, as a result, you know, we are not going end to end at this point. Sam Charrington: [00:19:48] Got it, got it. It makes me think a little bit about some of the issues with neural networks remembering data. So for example, there are cases where you train a CNN and there are some attacks that you can do that will reproduce, to some degree or another, some of the images the model was trained on. Likewise, with these very large language models, you can start to see some of the text that the models were trained on come out in the output. I would imagine if you were training end to end, that becomes an issue as well, and maybe more so than in the case of images. What's your intuition there? Would it be worse or better than images? Cha Zhang: [00:20:39] I would imagine it will be similar, I would say. After all, you know, with OCR you go from image to text, but during the learning of this OCR process, a language model is actually very helpful for improving OCR accuracy. So, for example, during decoding of these text lines into text, we use some of these very popular language modeling schemes, like LSTMs. Certainly it remembers the contextual information of the language in order to help the OCR recognize these texts properly.
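For readers who want to see the shape of the character-decoding step Cha describes (height-normalized text-line image in, character sequence out), here is a minimal PyTorch sketch of a CNN-plus-bidirectional-LSTM recognizer. The layer sizes are illustrative assumptions, not Microsoft's production model; in practice it would be trained with a CTC-style loss and decoded with help from a language model, as discussed.

```python
import torch
import torch.nn as nn

class TextLineRecognizer(nn.Module):
    """Toy CNN + bidirectional LSTM over a height-normalized text-line image."""

    def __init__(self, num_classes):  # num_classes includes the CTC blank symbol
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, line_image):                        # (batch, 1, 32, width)
        features = self.cnn(line_image)                   # (batch, 128, 8, width / 4)
        features = features.permute(0, 3, 1, 2).flatten(2)  # one step per image column
        sequence, _ = self.rnn(features)
        return self.classifier(sequence)                  # per-column character logits

# Training would pair this with nn.CTCLoss; decoding uses greedy or beam search,
# optionally rescored with a language model.
```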
So, I think when you go end to end, the amount of data that you use for training is humongous. It's difficult for me to imagine, you know, that we'll have a similar level of data for training as BERT models or GPT models. Those use huge, huge amounts of data, but still you will learn something from the text and it might leak into the model as well. Sam Charrington: [00:21:51] Along those lines, what enabled BERT and many of the recent innovations around language models is a shift from supervised to a semi-supervised way of framing the task. Is there a semi-supervised framing for the OCR task that makes sense? Cha Zhang: [00:22:13] Actually for OCR today, we are not doing that, although I think it's definitely a very interesting research problem. I think BERT is a super nice framework for transfer learning. You know, you go from a pre-trained model and then, you know, fine-tune with supervision… In the image world, I think transfer learning probably existed earlier than in language. So in earlier days when we had ImageNet, we trained something like a ResNet, and those were already being used for transfer learning. Unsupervised kind of image learning is also, I think, still ongoing; there are a lot of interesting projects going on. I think for OCR right now, we're not there yet. One of the main issues for building a product like OCR using some of these pre-trained models is the computational cost. I think this happens in language as well: BERT models, the GPT-3 model, with multiple billions of parameters, are very difficult to turn into a product. For OCR, you know, we have the same problem. Computational cost is very sensitive. We need to make it fast, and so we're using relatively small models and normally we train from scratch. Transfer learning does show some benefit, but when the data reaches a certain amount, we found training from scratch is perfectly fine. Sam Charrington: [00:23:49] When you have a certain amount of data to train from? Cha Zhang: [00:23:53] Yeah. In the very early days when we started doing deep learning OCR, we actually relied a lot on distillation – that's teacher-student learning, where we first train a big model, and then we gradually use teacher-student learning to create a small model so that it can run efficiently. Nowadays, we have figured out that you can train these models from scratch. The amount of data that we have, on the order of, you know, hundreds of thousands to millions of images, is sufficient to train a smaller model from scratch and reach about the same accuracy. Sam Charrington: [00:24:31] Can you elaborate a little bit on that? Are you saying that you need more data to train smaller models? Cha Zhang: [00:24:37] No, I'm saying that… Take BERT as an example. BERT is super beneficial for transfer learning because it has seen so many documents. So given any new language task, presumably your data is not much, there's not much data that you have to train this new task, and therefore leveraging BERT, which has seen so many documents, will help through transfer learning to transfer some of the knowledge that BERT has learned from this huge set of documents to the small task, so that it can reduce the amount of documents required to train the smaller task. The same thing happens in ImageNet transfer learning where, you know, if it's a ResNet trained on ImageNet, you learn a lot of visual information from the ImageNet dataset.
Then if you have a tiny detection task, like detecting a helmet, let's say, you can do transfer learning and you can use a very small dataset to actually train a very good helmet detector. What I was saying just now was that for the problem of OCR, which is certainly a very important computer vision problem, every company that invests in OCR tends to collect quite a bit of data, not to the level of, you know, billions, but hundreds of thousands or millions, and that amount of data is sufficient that you do not need to go to transfer learning. You can train the model from scratch and you get very good results. Sam Charrington: [00:26:19] Got it. Got it. So whether you were using transfer learning with models based on ImageNet, you know, along the lines of ResNet and others, or whether… Okay. Let's see… the smaller models that you're training, are they, you know, some of the traditional architectures that we've already brought up, or are you building out new architectures for the models themselves for this specific problem? Cha Zhang: [00:26:53] Right now we're using some of the traditional models. There is some active research going on regarding searching for the best, most effective architecture for OCR. We haven't seen convincing results yet, but I think that's a very active research area that we're still looking into, particularly when we try to make it smaller and smaller, you know, faster and faster. Sam Charrington: [00:27:20] When you say searching for the best architecture for OCR, are you using the word searching generally, like you have researchers looking at different models and trying to find the best one for OCR, or are you suggesting a domain-specific neural architecture search kind of…? Cha Zhang: [00:27:38] I mean neural architecture search. That certainly can be applied to OCR and we are still exploring it, but I think that's a very promising direction. Sam Charrington: [00:27:49] Okay. Interesting. Interesting. Earlier in the conversation you talked about how one of the big use cases is this semi-structured data that we want to extract information out of – an invoice is one example. There was a recent demonstration, or I guess that's actually a product now, of the mobile version of Excel or something. You can take a picture of a grid, grid-like data, and it will, you know, both extract the text and organize it into a spreadsheet. Talk a little bit about the product that you're working on, Form Recognizer, which is doing something similar. Cha Zhang: [00:28:35] Yeah, of course. So OCR certainly is pretty low level. For some of the applications I mentioned earlier, like digital asset management, photo management, you know, translation, you can directly use OCR, but for many customers, what they want is not just OCR. They want to extract information from documents. Think about, you know, I need to process millions of invoices: I want to extract the vendor name, the date, the total amount. Or it's an expense system where you want to process all the receipts, and it can be for verification purposes, for example: okay, how do I make sure employees are not putting in random numbers that don't match the receipts that are actually filed? It sounds kind of silly, but you know, today a lot of companies do this verification manually. Because of the huge amount of manual effort needed, they often can only do sampling.
So you sample like 5% of these receipts to validate, but you miss a huge chunk that you never even look at. So we are looking at this space and we're trying to build essentially two categories of product. One is a prebuilt set of products, and these are solutions that work out of the box. For example, it can be a prebuilt receipt, prebuilt business card, or prebuilt invoice model. Basically, you send in an image or PDF file and it will extract all the fields that you'll be interested in. Another big category that we think is super important is customization because, you know, the prebuilt models may never fit every need. So we have a solution called custom form where we allow the customer to basically send us a few sample images. You can either label them or even, you know, not do any labeling, and we will be able to extract key value pairs out of these documents. Again, we see this as much closer to what the customers need, and that's how Form Recognizer is positioned. Sam Charrington: [00:30:54] So we've talked about a bunch of the interesting technical challenges at the lower level of OCR. At the form level, you know, is it kind of a packaging of OCR? Does it have its own technical challenges to overcome…? Cha Zhang: [00:31:13] Actually it has a lot of very interesting challenges. One piece of work recently coming out of Microsoft Research is, you know, targeting exactly this problem. Just think about it. Parsing these invoices and receipts is essentially sort of a language problem, because you have these texts there. The challenge here is that these are images, so you run OCR on them, but unlike a typical language dataset that you've scraped from the internet, you know, Wikipedia, which basically has the ordering of these words already, if the data comes from an image, you can detect these text lines, but it's actually very difficult to define the reading order of these text lines, and ordering these text lines by itself is a very challenging problem. When you have images in the wild, the paper can be curved, you know, can be crumpled, can be rotated, there's perspective, you know, all kinds of issues. They can have background text, you know, all of these. So the particular approach that MSRA came out with is called LayoutLM. It's actually a modified BERT model. It's also a language model, but in addition to the language, we also embed 2D information, like the X, Y position of the bounding box of the text. With that information, this can also be trained without supervision; it's unsupervised pre-training. We are able to learn this kind of spatial relationship in these invoices without coming up with an explicit reading order. With that, we actually can do a lot of this key value extraction really well. There's also quite a lot of advanced research looking into, say, relation networks where you see two text lines near each other and you can predict the relationship. Again, this is similar to the OCR case where you have these bottom-up pixel-level classifications that you want to group; here you want to group key and value pairs. There's also a lot of advanced research into graph convolution networks, where you do convolution over a graph, and the graph is defined by connecting nearby text lines. Again, this is an approach that doesn't require reading order, but just looks at the spatial relationships.
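The core LayoutLM idea described above, adding 2D position to a BERT-style language model, can be sketched as an embedding layer that sums word embeddings with learned embeddings of the bounding-box coordinates. The 0-1000 coordinate grid and the exact combination below are illustrative assumptions rather than the published model's details.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Word embedding plus learned 2D bounding-box embeddings, LayoutLM-style."""

    def __init__(self, vocab_size, hidden_size=768, max_coordinate=1001):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, hidden_size)
        self.x_embedding = nn.Embedding(max_coordinate, hidden_size)  # left/right edges
        self.y_embedding = nn.Embedding(max_coordinate, hidden_size)  # top/bottom edges

    def forward(self, token_ids, bboxes):
        # bboxes: (batch, seq_len, 4) as (x0, y0, x1, y1), normalized to a 0-1000 grid
        x0, y0, x1, y1 = bboxes.unbind(-1)
        return (self.word_embedding(token_ids)
                + self.x_embedding(x0) + self.x_embedding(x1)
                + self.y_embedding(y0) + self.y_embedding(y1))

# The summed embeddings feed a standard transformer encoder, which can be
# pre-trained with masked-word prediction and fine-tuned for key value extraction.
```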
So these are all actually very exciting extensions of language models, but also using visual information to help parse this vertical data more accurately. Sam Charrington: [00:34:09] Interesting. Yeah, I think it's… At a quick thought I would've imagined that, you know, maybe the top part of the stack is more rule-based while the bottom part of the stack is, you know, more machine learning based, but it sounds like they're even, I don't know, relatively, but there are a bunch of really interesting… Cha Zhang: [00:34:33] We are doing a lot of machine learning stuff on the top as well. Sam Charrington: [00:34:37] I'm imagining, you know, when you talk about relation networks, for example, on an invoice you could have a date label, and then the date, you know, horizontally next to it, or you can have the label and then the date beneath it. Cha Zhang: [00:34:50] Yes. Sam Charrington: [00:34:50] You may have an address box and then a bunch of text that comes beneath it. It would be nice to know that, you know, we're talking about the address here. That's part of the idea of the structured text extraction. So, in that you mentioned relation networks and graph CNNs. Are those two approaches to solving the same problem or are they solving different aspects of the problem? Cha Zhang: [00:35:13] They solve different aspects of the problem, and they can also be used to solve the same one. I mean, right now, the main focus for us for them is extracting key value pairs. This covers both prebuilt and customization. Think about it: if it's an invoice and you want a vendor name, it's a name. Certainly, you know, you use the text information, because it looks like a vendor name, so this probably is a vendor name, and some invoices don't even have the key in the invoice. Sam Charrington: [00:35:48] Right. Cha Zhang: [00:35:49] You don't even have the words 'vendor name' there, so how do you figure out this thing is still the vendor name? There, you rely on information from the language and also on how the document is laid out. Like, okay, the font size may matter. You know, the position of the name may matter. So we are looking into combining all this information to come up with a better decision on those fields. Sam Charrington: [00:36:21] So, how does a graphical representation or way of thinking about the document get you to a solution to these kinds of problems? You know, for example, the unlabeled vendor name? Cha Zhang: [00:36:33] The graphical kind of approach is basically… you've got a bunch of text lines detected by the OCR and you connect these text lines with their neighbors. You basically define how strong these connections are. Actually, it's not defined; you actually learn these relationships by looking at the text, looking at their relative positions, looking at their font similarity. Like one issue that you actually just mentioned was the address, because you have multiple lines of an address. How do you know they actually belong to the same address? Right? So all this side information could be very helpful in determining that they should be grouped together. In the convolutional kind of graphical model, you learn a convolutional network by computing from all the neighboring nodes, where each node is a text line, to aggregate basically at the center node.
So basically, the model learns by not only looking at the current text line that's in focus, but also looking at all the nearby text lines and deciding, well, given all this contextual information, it does look like this is a vendor name. I guess that's a very high-level conceptual description of why it would work, but it's data-driven machine learning so that the model [inaudible]. Sam Charrington: [00:38:06] As you're solving problems like this, are you often needing to re-label your dataset? For example, I'm imagining early on in developing an algorithm like this, you have a bunch of invoices, and you draw a bounding box around the addresses and you say, this is the address. Then you say, 'Oh, well the font information is a whole new dataset you have to label.' Are you going in and having people label Helvetica versus Arial? That seems a bit fine-grained and hard to actually get experts to label, or is it more abstract than that? Cha Zhang: [00:38:48] We usually only label the end goal, which is the field that you're going to extract. So, for example, if you want to extract a vendor name, vendor address, and total, you basically draw bounding boxes around those regions and use that as the ground truth data. Sam Charrington: [00:39:06] Got it. I think we're going to the same place. When you say font… Cha Zhang: [00:39:11] When I say font, it's actually in some way implicit, in the sense that we're taking these bounding boxes and we're extracting image information. Right? So think of it as, let's say, running a convolution network to extract a feature for that part of the text region, the text line. This feature is essentially all the visual information that can be helpful in deciding or determining the relationship between text lines. So if the features are similar, it probably means they are a similar font, they are a similar size, you know, those kinds of things… So, yeah, I think that seems to be sufficient. Sam Charrington: [00:39:55] So you're not trying to featurize your underlying images into these distinct things, because that's what I inferred when you said font. Do you look at, you know… is there an analogy to looking at the layers of the network, like when we do this with CNNs and see textures and things like that? Is there some analogy that you've seen in looking at the layers of the network that says, 'Oh, this layer is like identifying fonts'? Cha Zhang: [00:40:32] No, we haven't been going there yet. Well, I guess it's certainly interesting to look at. My take is that most likely, font is just one attribute. I believe there are many other things. Yeah, I think it'll be interesting to look at these features visually. Yeah.
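The graph-convolution idea in this exchange, where each text line's decision is informed by the features of its neighbors, can be sketched as a single aggregation step. The mean-over-neighbors update below is one simple choice of aggregation, not the specific architecture used in the product.

```python
import torch
import torch.nn as nn

class TextLineGraphConv(nn.Module):
    """One graph-convolution step over detected text lines.

    Each node holds the combined text/position/visual features of one OCR line;
    updating it with an average of its neighbors lets a classifier use
    surrounding context (e.g. the lines around a candidate vendor name).
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.projection = nn.Linear(in_features, out_features)

    def forward(self, node_features, adjacency):
        # node_features: (num_lines, in_features)
        # adjacency: (num_lines, num_lines), 1.0 where two lines are neighbors
        # (self-loops included so each node keeps its own information).
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        aggregated = (adjacency / degree) @ node_features
        return torch.relu(self.projection(aggregated))
```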
So, all of the latest state of the art in language modeling, we definitely want to leverage. The thing I mentioned earlier, LayoutLM, is one way to leverage it: you use the language model but also embed additional visual information, and hopefully solve these problems effectively, because the input is really different, right? You know, previously you take text as the input; here we're taking a bunch of text lines with their locations and bounding boxes as inputs, and the algorithm can naturally solve these problems. Sam Charrington: [00:42:30] And is it also trying to do the traditional language model task of predicting the next character or word or span of text? Cha Zhang: [00:42:38] Yeah, the way we train them is very similar: basically, you mask some words and try to predict them. Certainly you can use a lot of other objectives. I think, you know, I know recently people use translation targets; you can use autoencoder kinds of targets. This is a really active research area at this point. I think we're still just scratching the surface, although we're already seeing very, very promising results. So we definitely want to look deeper into this and see how well this really can push the state of the art. Sam Charrington: [00:43:21] Continuing on that thread of the active research areas and what the future holds in this area, what are you most excited about in this domain of OCR and, in general, extracting text from documents, vertical applications and the like? Cha Zhang: [00:43:42] Yeah, I think we have been working on this problem for quite a while, but I think there are still a lot of interesting problems. Only when we start to work with customers do we realize, you know, there are problems we haven't been able to solve. I can just name one: for example, table extraction sounds trivial, but when you actually look at all the existing tables in the world, the simplest ones are those with explicit cell borders where you have straight lines, but in reality these tables can have no cell boundaries at all. They can be mixed with other content on top, you know, all these things that make the problem extremely hard. So that's just one that is extremely challenging, but we want to solve it. Another thing that I briefly mentioned earlier was the customization part of these verticals. How do you customize to a customer's own data instead of having these pre-built models? Because inevitably, you will have data that doesn't work with the pre-built models. How do you allow customers to have a way to build their own models that still work? That by itself is a very challenging problem, because asking customers to label a lot of data is painful. They don't want to go there. So either we go unsupervised or we go with a very, very limited amount of supervision data. In such a case, how do we adapt our model so that it can work on the documents where the customer realizes the pre-built model has failed? That's also a very interesting research problem that we are looking into. In language this would be envisioned as low-shot learning; it's definitely applicable to the problem here as well. Sam Charrington: [00:45:50] In the case of some of the productized vision offerings, Azure does this as well. The user is able to upload their own set of labeled data and the results for object detection are kind of fine-tuned against the user's data set. Cha Zhang: [00:46:13] Yeah.
Sam Charrington: [00:46:14] Do the OCR and form recognition offerings, are they providing something similar? Like, can I upload my own invoices? Are you doing some kind of transfer learning? If you are, what are you doing to take advantage of what the user's providing? Cha Zhang: [00:46:33] So we do have a product called custom form which allows the customer to upload a few samples. We usually say a minimum of five samples. So, say you have an invoice that doesn't work with existing models and you want to solve the problem: you upload five invoices that are similar, from the same vendor or similar in structure, and we can figure out these key value pairs and extract them, either unsupervised or supervised. Right? Unsupervised means the customer doesn't need to label anything. So you upload the five documents. The information we're gaining by looking at these five documents is, well, these documents are supposed to be similar, and therefore there are going to be a bunch of words that are common across these documents. This commonality helps us tell, well, this is probably part of the empty form, or the template of the form, while the things that vary across forms must be information the customer has filled in, since they differ from sample to sample. So with that information, we can actually extract key value pairs without any supervision. All you need to do is upload five similar documents. Of course that works to a certain degree, but if you're still not happy with the accuracy, we provide a way for you to label your key value pairs. So here we have a UX where you can go and label the fields you care about by essentially highlighting the OCR text lines where you think this is the value I want to extract. Then we actually learn a model out of the five samples and produce a model that can be used by the customer to extract these values. The accuracy is actually normally pretty high, in the 90 to 95 percent range, actually. Sam Charrington: [00:48:38] So when the customer does this, is this process entirely learned or is there a human-in-the-loop kind of exception handling element to it? Cha Zhang: [00:48:50] I guess I should probably take a step back. OCR today has made significant advances, but if you actually care about the numbers, think about the invoice, right? If your total is wrong, it's really that bad. So what we definitely recommend is that people have an agent backup. For all of the products we offer, we give people confidence scores, right? So, how confident we are about the extraction of a particular value, and different customers can choose their own threshold and have an agent look at them. But at today's accuracy, we don't recommend straight-through processing, unless you are handling certain specific applications. I can give you an example. For example, if you're verifying a receipt image against employee-entered data, there you can go automatic, right? 'Cause if the OCR produces a different number than the employee, well, you will need somebody to look at them anyway, but if they actually match, well, that probably means it's okay. Sam Charrington: [00:50:08] Right. Cha Zhang: [00:50:08] So for that application, you can automate more. Sam Charrington: [00:50:13] Got it.
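The "words common across the five samples are probably template, the rest are filled-in values" intuition described above can be illustrated with a toy sketch. Real systems would also compare positions and layout, not just word identity; this is only meant to show the idea.

```python
from collections import Counter

def split_template_and_values(sample_docs, min_fraction=0.8):
    """Toy split of OCR words into likely template words vs. filled-in values.

    sample_docs: list of word lists, one per uploaded sample document.
    Words that recur in most samples are treated as part of the blank form;
    everything else is assumed to be customer-filled content.
    """
    doc_counts = Counter(word for doc in sample_docs for word in set(doc))
    cutoff = min_fraction * len(sample_docs)
    template_words = {word for word, count in doc_counts.items() if count >= cutoff}
    values_per_doc = [[w for w in doc if w not in template_words] for doc in sample_docs]
    return template_words, values_per_doc

# Example: a word like "Invoice" that appears on every sample lands in the template;
# a one-off total like "$431.20" stays in that document's values.
```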
So, the question that I was asking is slightly different though. Say you've got someone using automated form recognition and they have their five examples that they haven't been happy with, and they submit those through some website or API. Is someone at Microsoft taking those and manually going through some process to try to figure out why they're not working, or are they thrown into some training job and then the customer's result gets better? Cha Zhang: [00:50:48] No, we don't look at the customer's data. This is a fully automated product, meaning, you know, the customer basically labels these files, they call an API to train a model, and the whole process is automated. Sam Charrington: [00:51:04] So under the covers, are they kind of forking off their own model? Are the last few layers getting cut off and fine-tuned, or is it more elaborate than that, or…? Cha Zhang: [00:51:17] It's more elaborate than that. Under the hood, there are multiple steps. We leverage a lot of information in these sample documents. For example, as I mentioned earlier, there will be words common across these samples. Those are very strong indicators that this might be part of the empty part of the form, which you probably think is not so interesting to the customer. Transfer learning is certainly one way of doing this. Right now we actually train these models without transfer learning, so the model is trained from scratch. With very few samples we're able to do this because of some very interesting work that we have done to basically augment this data, to make sure that you have sufficient data to still be able to train a model out of only five samples. This can be a feedback loop as well. So, if the customer's not happy with a model trained on five samples, you can upload more and we just train a new model for you. So every time you train, you just get a new model; that way, it's a feedback loop where the customer can keep improving their model until it reaches a stage where it's really performing for the customer. Sam Charrington: [00:52:53] So when you say augmenting the five that they're providing, are we talking about data augmentation in the sense of a transformation pipeline that changes things, adds noise, rotates, that kind of thing? Or are we talking about some other dataset that you're adding to their five and training on that aggregate dataset, and that's how you're producing a better model? Cha Zhang: [00:53:21] Both. Although I think the latter one matters more, because when customers label this data, we ask them to provide some additional information. For example, they label, this is a date, so we know it's a date. In this way we can artificially create more data to fill the form, so that we can produce more data to train the model. Also, we use very robust machine learning algorithms that are robust to very few examples. That way we can learn with this limitation. Yeah. Normally, if you look at many of the other offerings that people provide, you have to train with hundreds of examples. Here, we're pushing it really down to five, and we hope to push it even lower in the future. Sam Charrington: [00:54:11] So I'm assuming that this is a stacked problem and you've got some low-level OCR models, for example, that are trained with many, many documents. What you're doing with this form recognizer custom data is more at the top end of that stack.
Is the off-the-shelf model that I'm using without the five-example customization also trained on relatively few examples? Cha Zhang: [00:54:44] What do you mean? Sam Charrington: [00:54:45] I guess maybe I'll jump ahead to the conclusion that I'm drawing. What's confusing me is how you are getting better results with few examples if you're not using any kind of transfer? I heard in your explanation that you're not doing any kind of transfer. Cha Zhang: [00:55:03] So right now custom forms support training models, and these models are usually… each model is geared towards one particular form type. So in some way you can think of this problem as actually restricted. It's actually an easier problem. It's not like the pre-built invoice model where essentially you want to handle all invoices. Here we're handling one particular invoice coming from, I would say, one particular vendor, and they usually use this template. Sam Charrington: [00:55:37] Got it. So does the customer then call a unique API to resolve invoices of this type? Or is that then ensembled, and there's something that decides whether it's of the type that you've built the new model for? Cha Zhang: [00:55:55] Yeah. So here's kind of the recommendation that we give to customers, right? You maybe start with the pre-built model, and the pre-built model may work, and then your job is done. If you're happy, go. Then say you have a lot of invoices and out of a thousand, 10 of them don't work. So we offer the customer the ability to take these invoices and train specific models for these 10 different invoices; you might need to train more than one special model because these invoices may look very different. So imagine you train like 10 different custom models for this. We actually also offer kind of automatic invoice classification. There's an API called model compose where we can compose these 10 small models into one, so all you need to do is call that one. By calling into that one, we also provide you a confidence, because during testing, when the customer sends an invoice in, we don't really know whether it's one that doesn't work with the pre-built model or whether it works well with the pre-built model. So you send this invoice first to the customized version of the model, and we will tell you, 'Hey, it doesn't look like any of the 10 you have trained.' In that case, you revert back and say, okay, now I'm calling the pre-built invoice model, because you sort of know that the pre-built model actually works well for that. So that's what we recommend customers do. Sam Charrington: [00:57:34] Okay. I dug into a little bit of the detail there, but it's interesting to see how the end-to-end problem is put together. In a case like this, the ends of that problem are on the customer side, not just the service that you're offering, and so seeing how the pieces are put together is kind of interesting. Awesome! Well, Cha, thanks so much for taking the time and walking us through some of the interesting things that are happening in these domains. Cha Zhang: [00:58:12] Thank you for having me. Sam Charrington: [00:58:14] Great! Thank you.
In this month's community segment we chatted about explainability, Carlos Guestrin's LIME paper, Europe's attempt to ban "untrustworthy" AI systems, and finally, community member Nicolas Teague's blog post entitled "A Sight for Obscured Eye, Adversary, Optics, and Illusions," which explores the parallels between computer vision adversarial examples and human vision optical illusions. In our presentation segment, Philosophie Group Inc. Director of AI, Chris Butler, joins us to discuss Trust in AI. Chris gives us an overview of a number of papers on the topic, including:
- Humans and Automation: Use, Misuse, Disuse, Abuse
- Trust in Automation: Integrating Empirical Evidence on Factors That Influence Trust
- Some Observations on Mental Models
- Overtrust of Robots in Emergency Evacuation Scenarios
For links to the papers mentioned above and more information on this and previous meetups, or to get registered for upcoming meetups, visit twimlai.com/meetup!
On the heels of last week’s $200 million acquisition by Apple of Turi, Intel announced on Tuesday yet another acquisition in the machine learning and AI space, this time with the $400 million acquisition of deep learning cloud startup Nervana Systems. This is another exciting acquisition; let’s take a minute to unpack it. First of all, for those not familiar with the company, Nervana, spelled N-E-R-vana, is a two year old company developing software, hardware and cloud services for deep learning. The company was originally founded to build hardware for speeding up deep learning, and it’s this focus that made it so attractive to Intel. The company’s first hardware product, due next year, is a custom deep learning chip called the Nervana Engine. The ASIC chip is similar in focus to the Google Tensor Processing Unit or TPU which we highlighted in the very first episode of This Week in Machine Learning & AI back in May. The company has also released a software product called Neon, and operates the Nervana Cloud. Neon is an open source deep learning framework like TensorFlow, Caffe or Theano. Relative to those others, which you hear about here on the show pretty much every week, Neon is known for being particularly fast, especially on NVIDIA GPUs. This is due to some clever optimization work the team did with the GPU firmware. Neon doesn’t have quite the popularity of some of these other frameworks, in part because it was initially a proprietary product, only recently open sourced back in May. The company’s cloud offering is tuned for running deep learning, and will eventually incorporate the company’s own chips. This is a great deal for the company’s founders and investors. With $24.4 million in funding to date, and a price reported to be as high as $408 million, Nervana returned nearly 17x to investors, which is home run territory for most VCs. At the same time, if you’ll allow me to Monday Morning Quarterback, I’m a little surprised that they’ve decided to sell so early in the game. The company is extremely well positioned in really two hot spaces, deep learning and cloud, and the team has only been at it for a couple of years. Projecting out a couple of years, it’s easy to see Nervana with a billion dollar valuation, assuming they continued to execute. This makes me wonder what the team saw in the market that said that now was the time to sell. Of course, it’s certainly the case that Intel brings a lot more to the table here than cash. The company obviously has vast resources and expertise in the chip-making arena and they could certainly help accelerate Nervana’s plans. It’s also the case though that the company faces stiff and growing competition. Google for example, offers everything Nervana does. Google’s TensorFlow, released about 8 months ago, is by most measures the most popular deep learning framework. (You’ll recall we discussed Francois Chollet’s analysis of the landscape back on the July 15 show.) Google also sees TensorFlow as becoming an on-ramp to the Google Compute Platform. And GCP has TPUs, which I just mentioned and which the company announced back in May. So perhaps the Nervana team and investors looked at the long slog ahead and decided to take the money off the table. I do wonder if the lack of an upside in terms of options makes hiring top talent more difficult for the company. So that’s the Nervana side of things, what about Intel’s side? Well, while this is a pretty small acquisition for Intel, I think it’s a smart move on their part. 
That’s because, despite numerous investments in the space, as recently as its investment in Nervana competitor CognitiveScale last week, Intel has been struggling to tell a story around machine and deep learning. The problem it faces is that NVIDIA is eating its lunch when it comes to chips for deep learning applications. In fact, NVIDIA also made news this week when it announced record revenues and a more aggressive sales outlook. The reason for the improved outlook? Quoting CEO Jen-Hsun Huang: “One particular dynamic sticks out, and it’s a very significant growth driver of where we have an extraordinary position in and it’s deep learning,” Huang told analysts in a conference call that lasted almost 80 minutes. “The last five years, we’ve quietly invested in deep learning because we believe that the future of deep learning is so impactful to the entire software industry, the entire computer industry that we, if you will, pushed it all in.” NVIDIA’s lead in deep learning has been a sore spot for Intel of late, to the point that several articles commented on interviews with Intel data center chief Diane Bryant in which she became ruffled at the mention of Intel’s lack of presence in the machine learning market. Now, Intel and Bryant are quick to shrug this off, since machine learning is a relatively nascent market. According to the MIT Technology Review, market research firm Tractica pegs the market for AI-related chips at under $1 billion, growing to $2.4 billion by 2024, a small figure compared to Intel’s 2015 revenue of $56 billion. But Intel missed the boat on mobile, PC chip sales are declining, and there’s weakness in data center and IoT revenue growth as well. So while machine learning and AI are an emerging market just at the beginning of its growth cycle, Intel can’t afford to sit this one out. This deal gives the company a much-needed story around deep learning and, if the companies are able to execute, a foot in the door of this nascent market. Moving forward, this poses some of the same challenges I mentioned in the context of Apple/Turi, namely executive focus, but I also think it plays to several of Intel’s strengths. In particular, while I’ve seen the company struggle trying to independently build and sell enterprise software, it does a good job of building and selling through reference architectures. If Nervana ultimately becomes a reference for how to build out a deep learning cloud using new and traditional Intel hardware combined with open source software, this could drive significant future adoption for Intel and begin to turn the tide. There are also a good number of possible tie-ins to take advantage of here. One is with Intel’s open source project, the Trusted Analytics Platform. Intel also has significant stakes in big data company Cloudera and cloud builder Mirantis. This is getting a bit ahead of ourselves, sure, but there could be some pretty interesting collaborations between these projects and companies over time. Subscribe: iTunes / Youtube / Spotify / RSS
Autonomous driving startup Comma.ai released a small dataset that lets you try your hand at building your own models for controlling a self-driving vehicle. The dataset consists of 10 video clips recorded at 20 Hz from a camera mounted on the windshield of a 2016 Acura ILX. There are about 7 hours of video total, captured mostly during highway driving. Alongside the video files are a set of sensor logs recording measurements such as velocity, acceleration, steering angle, GPS location and gyroscope angles. The dataset is a 45 GB compressed zip file that explodes to 80 GB when uncompressed. That is, if you can get it to uncompress: when I tried it, after a fairly long download, unzip complained that the file was corrupt. The project’s GitHub repo includes a script to download the data from archive.org, as well as some simple models built in Keras and TensorFlow for predicting steering angle and creating simulated road images with generative models. They’ve also included a paper on the latter topic. The idea is that since it’s pretty expensive to train a self-driving car on real roads, you typically want to train your algorithms in a simulator. To do that, you can either hand-code a simulator or use a generative model to create one. The paper describes the use of variational autoencoders, generative adversarial networks and an RNN to create simulated road images. You can start by running their existing models, but if you manage to do amazing things with the data, let Comma know; they’re hiring and want to meet you. Subscribe: iTunes / Youtube / Spotify / RSS
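To give a sense of what a first pass at the Comma.ai data might look like, here is a minimal sketch of a small convolutional network that regresses steering angle from a single camera frame. To be clear, this is not Comma’s model; the frame size, layer sizes, and training setup are assumptions made purely for illustration.

```python
# A minimal sketch, NOT Comma.ai's actual model: a small convolutional network
# that regresses steering angle from a single camera frame. The frame size,
# normalization, and layer sizes are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

def build_model(input_shape=(160, 320, 3)):
    model = models.Sequential([
        layers.Conv2D(16, (5, 5), strides=2, activation="relu", input_shape=input_shape),
        layers.Conv2D(32, (5, 5), strides=2, activation="relu"),
        layers.Conv2D(64, (3, 3), strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1),  # predicted steering angle (regression target)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

if __name__ == "__main__":
    # Stand-in data: in practice you would decode frames from the video files
    # and align each frame with the steering-angle entries in the sensor logs.
    frames = np.random.rand(32, 160, 320, 3).astype("float32")  # already scaled to [0, 1]
    angles = np.random.uniform(-1.0, 1.0, size=(32, 1)).astype("float32")
    model = build_model()
    model.fit(frames, angles, epochs=1, batch_size=8)
```

In practice you would feed frames in temporal order, and probably predict from a short window of frames rather than a single image, but the basic regression setup looks roughly like this.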
This week we discuss Intel’s latest deep learning acquisition, AI in the Olympics, image completion with deep learning in TensorFlow, and how you can win a free ticket to the O’Reilly AI Conference in New York City, plus a bunch more. Here are the notes for this week’s podcast:

O’Reilly AI Conference Giveaway
I’m excited to be partnered with the O’Reilly Artificial Intelligence Conference to give away a free ticket to the event, which will be held September 26 – 27, 2016 in New York City. There are three ways to enter the giveaway:
1. (Preferred) Follow @twimlai on Twitter and retweet this tweet: Win a FREE ticket to the @OReillyAI Conference. To enter, follow @twimlai + RT. https://t.co/ReYqwqp538 for details. pic.twitter.com/9pLrzHIX9d — TWIML (@twimlai) August 15, 2016
2. Sign up for the TWIML&AI Newsletter and add a note “please enter me” in the comments field.
3. Use this site’s contact form to send me a message and use “AI contest” as the subject.
A winner will be chosen at random and announced on the 9/2 podcast. Ticket is non-transferable. Good luck, and hope to see you in New York! If you’d like to buy a ticket, register using the code PCTWIML for 20% off! And don’t forget to get your free early access ebook: Mastering Feature Engineering

Intel Buys Deep Learning Startup Nervana
- Intel Buys a Startup to Catch Up in Deep Learning
- Deep Learning Chip Upstart Takes GPUs to Task
- Nvidia’s bet on deep learning and autonomous cars drives stock to record highs – MarketWatch

AI Bot Joins Team Washington Post at the Rio Olympics
- The Washington Post experiments with automated storytelling to help power 2016 Rio Olympics coverage – The Washington Post

Technology
- Fujitsu Software to Accelerate Deep Learning Workloads
- DetectNet: Deep Neural Network for Object Detection in DIGITS | Parallel Forall
- Google Research Blog: Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities

Image Completion with Deep Learning
- Image Completion with Deep Learning in TensorFlow
- bamos/dcgan-completion.tensorflow: Image Completion with Deep Learning in TensorFlow
- [1607.07539] Semantic Image Inpainting with Perceptual and Contextual Losses
- [1511.06434] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
I recently reported on the launch of the new NVIDIA TITAN X. At the time it wasn’t in the hands of any users, so any thoughts on relative performance were either vendor-provided or speculative. Well, a couple of researchers on the MXNet team were among the lucky folks who have their hands on the GPU at this point, and they published an initial benchmark this week following the DeepMark deep learning benchmarking protocol. In a nutshell, they confirmed the speculation. The Pascal Titan X is about 30% faster than the GTX 1080, and its larger memory supports larger batch sizes for models like VGG and ResNet. Relative to the older Maxwell-based Titan X, the new GPU is 40-60% faster. If a single GPU isn’t enough for you, you might be interested in the new prototype announced by Orange Silicon Valley and CocoLink Corp, which they’re calling the “world’s highest density Deep Learning Supercomputer in a box.” The machine loads 20 overclocked GPUs into a single 4U rack unit, offering 57,600 cores and delivering 100 teraFLOPS. The team at Orange reports that an ImageNet training job that used to take one and a half days with a single NVIDIA K40 GPU can now be done in 3.5 hours using 8 GTX 1080s. The largest they’ve been able to scale a training job to is 16 GPUs, and they’re continuing to work on scaling this to the full 20 GPUs. Also in GPU news, Microsoft announced yesterday that Azure N-Series virtual machines are now available in preview. These VMs use Tesla K80 GPUs, and the company claims they offer the fastest computational GPU performance in the public cloud. Moreover, unlike other cloud providers, these VMs expose the GPUs via Discrete Device Assignment (DDA), resulting in near bare-metal performance. 6-, 12- and 24-core flavors are available in the NC series of VMs, which is optimized for computational workloads. An NV series that focuses more on visualization is also available, based on the Tesla M60 GPUs. Subscribe: iTunes / Youtube / Spotify / RSS
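If you’re curious what a DeepMark-style throughput measurement boils down to, the essence is simple: time forward and backward passes of a standard model at a fixed batch size and report images per second. The sketch below uses Keras’s bundled VGG16 purely as a stand-in; it is not the MXNet team’s benchmark code, and the batch size and iteration counts are arbitrary choices.

```python
# Rough sketch of a throughput benchmark in the spirit of DeepMark: time
# forward + backward passes of a standard model and report images/second.
# Not the MXNet team's benchmark code; model, batch size, and iteration
# counts are illustrative choices.
import time
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

def benchmark(batch_size=16, warmup=3, iters=10):
    base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
    model = models.Sequential([base, layers.Flatten(), layers.Dense(1000, activation="softmax")])
    model.compile(optimizer="sgd", loss="categorical_crossentropy")

    images = np.random.rand(batch_size, 224, 224, 3).astype("float32")
    labels = np.eye(1000, dtype="float32")[np.random.randint(0, 1000, batch_size)]

    for _ in range(warmup):   # let the framework finish graph setup before timing
        model.train_on_batch(images, labels)

    start = time.time()
    for _ in range(iters):    # timed forward + backward passes
        model.train_on_batch(images, labels)
    elapsed = time.time() - start
    print(f"{batch_size * iters / elapsed:.1f} images/sec at batch size {batch_size}")

if __name__ == "__main__":
    benchmark()
```

Running the same script at progressively larger batch sizes is where the new card’s 12 GB of memory comes into play: batch sizes that exhaust a smaller-memory card like the GTX 1080 can still fit.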
Each year, computer security conferences host a high tech version of the kids’ game “capture the flag,” so that teams of hackers and security researchers can demonstrate their hacking prowess. The game requires teams to secure a computer system by identifying intentional and unintentional vulnerabilities in various software modules while launching and defending against threats from competing teams. This week, DARPA, the Defense Advanced Research Projects Agency, hosted a version of a capture the flag contest where the teams were autonomous bots. The event, held Thursday in Las Vegas as part of the Defcon security conference, was the final competition of the agency’s Cyber Grand Challenge, a $55 million hacking contest designed to spur innovation in the area of autonomous cyber warfare. Seven teams of researchers from across the country fielded bot systems that competed with one another to autonomously identify and patch software vulnerabilities that were planted in their systems by DARPA, while deflecting attacks from competing bots and launching their own attacks against the computer systems those bots were protecting. Teams’ bots were scored on their ability to secure their own software and services, ensure their continued availability, and take advantage of vulnerabilities in competing teams’ systems. From the looks of it, DARPA constructed a pretty elaborate physical environment for the contest, complete with an “air gap” to ensure that each system was acting totally on its own. Announcers followed along with the 96 rounds of action and provided a live play-by-play for onlookers, while referees ensured that each team played by the rules. With each round, DARPA deployed a new set of software for the bots to both defend and attack. I watched segments of the 4+ hour video from the final competition and found it pretty fascinating, but I failed in my brief attempt to find any details on how the various bot systems work. Cade Metz’s coverage of the competition for Wired painted an interesting picture of the different strategies each bot pursued in the contest. One bot, Rubeus, built by federal contractor Raytheon, took an aggressive tack, going after vulnerabilities in the other systems from the get-go. Another bot, Mech.Phish, didn’t perform as well overall, but it did have a knack for finding and exploiting complex and subtle bugs in the challenge code. Mayhem, a bot fielded by a team from Carnegie Mellon spin-out ForAllSecure and the eventual winner of the $2M first prize, seemed rather focused on patching its own systems and keeping them up and running. The bot reportedly used statistical analyses throughout the game to weigh the costs and benefits of patching vulnerabilities (patching has inherent risks and demands service downtime), and would only patch those holes where doing so made sense based on this analysis. Cybersecurity is an important and rapidly evolving use case for ML & AI, and there’s been quite a bit of commercial activity in the area in addition to innovation and research activities like the CGC. This week startup Distil Networks closed a $21 million series C funding round to help enterprise customers separate good bots from bad ones, and keep the latter off of their web sites. Note that we’re not talking about chatbots here, but rather the kind of web bots that abuse APIs, scrape web sites, and probe them for vulnerabilities.
The company uses machine learning techniques to detect when a bot is trying to cloak its activity by spoofing multiple user accounts, browsers, and locations. And last month, another cybersecurity startup, Darktrace Ltd., raised a $64 million series C to help enterprises identify and defend against a variety of networked threats. Subscribe: iTunes / Youtube / Spotify / RSS
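Returning for a moment to Mayhem’s strategy in the Cyber Grand Challenge above: the cost-benefit patching decision described there is easy to caricature in a few lines of code, as in the toy sketch below. Estimate the expected loss from leaving a hole open, estimate the cost of the downtime a patch requires, and patch only when the first outweighs the second. All numbers and structure here are invented for illustration; ForAllSecure’s actual scoring model isn’t public.

```python
# Toy illustration of an expected-value patch decision, in the spirit of the
# strategy attributed to Mayhem above. All probabilities and costs are
# invented for illustration; the real system's analysis is not public.
from dataclasses import dataclass

@dataclass
class Vulnerability:
    name: str
    exploit_probability: float   # chance an opponent exploits it this round
    exploit_penalty: float       # points lost if it is exploited
    patch_downtime_cost: float   # points lost to availability checks while patching

def should_patch(v: Vulnerability) -> bool:
    expected_loss_if_unpatched = v.exploit_probability * v.exploit_penalty
    return expected_loss_if_unpatched > v.patch_downtime_cost

vulns = [
    Vulnerability("subtle-heap-overflow", exploit_probability=0.05, exploit_penalty=40, patch_downtime_cost=10),
    Vulnerability("obvious-format-string", exploit_probability=0.70, exploit_penalty=40, patch_downtime_cost=10),
]

for v in vulns:
    decision = "patch" if should_patch(v) else "leave unpatched"
    print(f"{v.name}: {decision}")
```

The interesting part of the reported strategy is exactly this asymmetry: a hole that opponents are unlikely to find may not be worth the downtime required to patch it.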
News broke late last week of Apple’s acquisition of Seattle-based machine learning startup Turi, for a reported $200 million. Actually, I haven’t seen any definitive confirmation of the acquisition at the time of my initial research, but neither have there been any denials. You’ll recall we spoke about Turi just a few weeks ago, in the context of the Data Science Summit the company hosted in San Francisco, shortly after changing its name from Dato due to a legal dispute. The company, which was originally called GraphLab, was one of the first companies I started following in the machine learning platform space, and I’m pretty excited for founders Carlos Guestrin and Danny Bickson. At face value this is a great deal for both companies. As we’ve discussed, Apple needs all the help it can get in machine learning and AI, and the company has over $230 billion, with a B, sitting around in cash, so it can definitely afford it. And from Turi’s perspective, the purchase price is about 4x invested capital, so it’s a solid exit for a team of first-time founders from academia in a space in which many of their contemporaries have struggled. But the question remains as to what happens next. This acquisition doesn’t really make sense if Turi is to remain an independent company: Apple needs the help internally fighting the “AI culture war,” and the company hasn’t had much success as an enterprise software player. On the other hand, in Turi CEO Carlos Guestrin, Apple could have a great ML standard bearer. Carlos is not only a business leader and a respected machine learning researcher but also a great teacher, with a popular machine learning course series on Coursera. So it’s likely that, as TechCrunch suggests, Turi discontinues offering its existing products and is reborn as Apple’s new machine learning and AI development center. As a result, in addition to Apple and the Turi team, winners in this deal include Seattle, which has been gaining recognition as a cloud computing and machine learning hotspot and will also see a new influx of wealth as a result of this deal. Also, Turi’s competitors in the machine learning platform space, folks like H2O, upstart DataRobot, and the French firm Dataiku, have one less competitor to worry about and a solid exit to point to as a comparable. Dataiku, for its part, announced an update to its product, Dataiku Data Science Studio (DSS) 3.1, earlier in the week. The update adds new support for HPE Vertica, H2O Sparkling Water, Spark MLlib, Scikit-Learn and XGBoost from within the DSS visual analysis tool, as well as integration with IBM Netezza, SAP Hana and Google BigQuery on the backend. It will be interesting to see how this one plays out and I’ll keep you posted. Subscribe: iTunes / Youtube / Spotify / RSS
Last week, at a Machine Learning meetup at Stanford University, NVIDIA CEO Jen-Hsun Huang unveiled the company’s new flagship GPU, the NVIDIA TITAN X, and gifted the first device off the assembly line to famed ML researcher Andrew Ng. The new TITAN X, which carries the same name as the previous version of the device, is based on the company’s new Pascal graphics architecture, which was unveiled back in May. The company is so excited about the card that its blog post introducing it threw around a ton of superlatives and adjectives like Biggest, Ultimate, Irresponsible, Crazy, and Reckless. It also threw around a bunch of numbers, including these:
- 11 trillion 32-bit floating point ops per second
- 44 trillion INT8 ops per second
- 12 billion transistors
- 3,584 CUDA cores running at 1.53 GHz
- 12 GB of GDDR5X memory with 480 GB/s of bandwidth
The other number it tossed out there was 1,200, which is the price of the card in US dollars. Now, not everyone is as excited about this card as NVIDIA is. Indeed, for gamers, what NVIDIA is offering with the TITAN X is a GPU that’s about 25% faster than the company’s standby offering, the GTX 1080, but at double the cost. That could be because the company is targeting deep learning researchers rather than gamers with the TITAN X. (In fact, CEO Jen-Hsun Huang said as much at the product launch.) For people working on deep learning, the specs of the TITAN X should allow it to increase model training performance by 30-60%, which can save a researcher weeks of time and computing costs. The best technical preview I’ve found of the new card, which comes out on August 2nd, is over on AnandTech. Of course I’ll be dropping a link to this article and all the other ones I mention on the show into the show notes, available at twimlai.com.
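As a sanity check on those TITAN X numbers, the quoted single-precision figure follows directly from the core count and clock, under the usual convention that each CUDA core retires one fused multiply-add (counted as two floating point ops) per cycle. A couple of lines of arithmetic reproduce it:

```python
# Back-of-the-envelope check of the quoted FP32 throughput: CUDA cores x
# boost clock x 2 ops per fused multiply-add. The "2 ops per core per cycle"
# figure is the standard convention behind peak-FLOPS marketing numbers.
cuda_cores = 3584
clock_hz = 1.53e9          # 1.53 GHz boost clock
flops_per_core_cycle = 2   # one fused multiply-add counts as two ops

peak_fp32 = cuda_cores * clock_hz * flops_per_core_cycle
print(f"peak FP32 ~= {peak_fp32 / 1e12:.2f} TFLOPS")  # ~10.97, i.e. the quoted ~11 TFLOPS
```

Real training throughput will of course land well below this theoretical peak, but it is a useful way to compare cards on paper.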
This week’s show covers the White House’s AI Now workshop, tuning your AI BS meter, research on predatory robots, an AI that writes Python code, plus acquisitions, financing, technology updates and a bunch more.

The Big Picture
- Home :: AI Now
- Jason Furman’s speech
- I need an AI BS-Meter — Gab41
- Smerity.com: It’s ML, not magic: simple questions you should ask to help reduce AI hype
- You Can Now Drink Beer Brewed By Artificial Intelligence – Forbes
- On the importance of democratizing Artificial Intelligence

Business
- Google buys machine learning startup Moodstocks to help your phone’s camera identify objects | VentureBeat | Business | by Chris O’Brien
- News discovery app SmartNews nabs another $38M, now valued at $500M-$600M | TechCrunch
- General Catalyst’s Phil Libin invests in 2 more chatbot startups, Growbot and Butter.ai | VentureBeat | Bots | by Ken Yeung
- Exclusive: Why Microsoft is betting its future on AI | The Verge

Research
- Google’s DeepMind AI to use 1 million NHS eye scans to spot diseases earlier | Ars Technica
- Artificial Intelligence May Aid in Alzheimer’s Diagnosis – Neuroscience News
- Application of Machine Learning to Arterial Spin Labeling in Mild Cognitive Impairment and Alzheimer Disease
- Steering a Predator Robot using a Mixed Frame/Event-Driven Convolutional Neural Network
- Super-intelligent predator robot is taught to hunt down prey in chilling experiment | Daily Mail Online

Technology
- Release of IPython 5.0
- Skype chatbots now work in group chats | VentureBeat | Bots | by Khari Johnson
- Microsoft’s Project Malmo AI platform goes open source | ZDNet

Projects
- Teaching an AI to write Python code with Python code
- Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow – WildML

Specials
- Data Science Summit – JULY 12-13 in SAN FRANCISCO / Use code TWIML20 for 20% off registration
- FREE O’Reilly Early Access Ebook: Mastering Feature Engineering
This week’s show covers the first fatal Tesla autopilot crash, a new EU law that could prohibit machine learning, the AI that shot down a human fighter pilot, the 2016 CVPR conference, 10 hot AI startups, the business implications of machine learning, cool chatbot projects and, if you can believe it, even more. Here are the notes for this week’s podcast:

Tesla Autopilot Crash
- A Tragic Loss | Tesla Motors
- Ex-Navy SEAL becomes first to die in self-driving car after Tesla crash | Daily Mail Online
- Tesla’s ‘Autopilot’ Flew Under Regulators’ Oversight – WSJ
- The technology behind the Tesla crash, explained – The Washington Post

EU Legislation Impacts Machine Learning Use
- EU regulations on algorithmic decision-making and a “right to explanation”
- Artificial Intelligence Has a ‘Sea of Dudes’ Problem – Bloomberg
- Why We Should Expect Algorithms to Be Biased
- To study possibly racist algorithms, professors have to sue the US | Ars Technica

Business
- The Most Well-Funded Startups Developing Core Artificial Intelligence Tech
- Doodle acquires chatbot Meekan to integrate its A.I. scheduling assistant | VentureBeat | Bots | by Chris O’Brien
- Meet Articoolo, the robot writer with content for brains | TechCrunch
- The Business Implications of Machine Learning — Medium
- How Amazon Triggered a Robot Arms Race – Bloomberg

IEEE Computer Vision & Pattern Recognition Conference
- CVPR 2016
- CVPR 2016 Open Access Repository
- Zeeshan Zia’s answer to What are the most interesting CVPR 2016 papers and why? – Quora
- All Your Questions Answered — CVPR Day 1 — Gab41
- Jordi Pont-Tuset’s site – CVPR 2016: Deep learning takes over again?

AI Fighter Pilot Beats Human Expert
- AI bests Air Force combat tactics experts in simulated dogfights | Ars Technica
- Genetic Fuzzy based Artificial Intelligence for Unmanned Combat Aerial Vehicle Control in Simulated Air Combat Missions

Projects & Hands-On
- IBM Watson A.I. XPRIZE
- Changelog – Messenger Platform
- A Natural Language User Interface is just a User Interface — The Startup — Medium
- Build a Chatbot w/ an API – ML for Hackers #9 – YouTube
- Is that a Time Machine? Some Design Patterns for Real World Machine L…

Data Science Summit
- Data Science Summit
- Use code TWIML20 for a 20% discount on registration!

Image: Tesla Motors
This week’s show covers the International Conference on Machine Learning (ICML 2016), “dueling architectures” for reinforcement learning, AI safety goals for robots, plus top AI business deals, tech announcements, projects and more.

ICML 2016
- Accepted Papers | ICML New York City
- Which companies had accepted papers at #icml2016?
- Best Paper Awards:
  - [1511.06581] Dueling Network Architectures for Deep Reinforcement Learning
  - [1601.06759] Pixel Recurrent Neural Networks
  - [1602.07415] Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
- My winner in the best name category: Extended and Unscented Kitchen Sinks
- Demystifying Deep Reinforcement Learning

Research
- Google Research Blog: Bringing Precision to the AI Safety Discussion
- OpenAI Blog: Concrete AI safety problems
- Paper: 1606.06565.pdf
- OpenAI technical goals
- Artificial intelligence achieves near-human performance in diagnosing breast cancer — ScienceDaily
- Paper: 1606.05718.pdf

Business
- Twitter pays up to $150M for Magic Pony Technology, which uses neural networks to improve images | TechCrunch
- Increasing our Investment in Machine Learning | Twitter Blogs
- Artificial Intelligence Explodes: New Deal Activity Record For AI
- DARPA is looking to make huge strides in machine learning | PCWorld
- Data-Driven Discovery of Models (D3M) – Federal Business Opportunities: Opportunities

AI Culture Wars in Silicon Valley
- How Siri Started — and Lost — the Assistant Race
- How Google is Remaking Itself as a “Machine Learning First” Company — Backchannel
- AI, Apple and Google

Technology
- Lighting the way to deep machine learning | Engineering Blog | Facebook Code
- Intel Launches ‘Knights Landing’ Phi Family for HPC, Machine Learning
- The Toronto Raptors Are Using IBM’s Watson to Draft A Winning Team | Motherboard

Projects
- Hello, TensorFlow!
- How to read: Character level deep learning
- GitXiv: Collaborative Open Computer Science
- Machine Learning Yearning
- Mastering Feature Engineering – O’Reilly Media

Bonus (I didn’t have time to cover): The Stanford Question Answering Dataset
This week’s podcast looks at new research on intrinsic motivation for AI systems, a kill-switch for intelligent agents, “knu” chips for machine learning, a screenplay made by a neural net, and more. Here are the notes for this week’s show:

Intrinsically Motivated AI
- Playing Montezuma’s Revenge with Intrinsic Motivation
- Unifying Count-Based Exploration and Intrinsic Motivation
- Intrinsically Motivated Machines
- Implementation of DEvelopmentAl Learning

Safely Interruptible Agents
- What if robots decide they want to take control?
- New paper: “Safely interruptible agents”
- Safely Interruptible Agents

Open Source Project Updates
- TensorFlow 0.9
- Apache Spark 2.0 Preview: Machine Learning Model Persistence

A “Knu” Chip for Machine Learning
- Former NASA Exec Brings Stealth Machine Learning Chip to Light

CrowdFlower’s AI Push
- Solving Million (not Billion) Dollar Business Problems with AI

Vi: An AI Personal Trainer
- Meet Vi

Recurrent Neural Net Writes Sci-Fi Movie
- Movie Written by Algorithm Turns out to be Hilarious and Intense
- Adventures in Narrated Reality, Part II
- Understanding LSTMs
- The Unreasonable Effectiveness of Recurrent Neural Networks

Teaching Robots to Feel
- Teaching Robots to Feel: Emoji & Deep Learning

ML for Hackers: Build a Chatbot
- ML for Hackers: Build a Chatbot
- Siraj Raval on Twitter

Image Credit: LifeBEAM
This week’s show looks at Facebook’s new DeepText engine, creating art with deep learning and Google Magenta, how to build artificial assistants and bots, and applying economics to machine learning models. Here are the notes for this week’s show:

DeepText: Facebook’s Text Understanding Engine
- Introducing DeepText: Facebook’s Text Understanding Engine
- FBLearner Flow
- Research: Text Understanding from Scratch
- Natural Language Processing (almost) from Scratch

Machine Learning and Art
- Google Magenta
- Neural Art
- A Neural Algorithm of Artistic Style
- Neural Art in TensorFlow
- Autoencoding Blade Runner
- Courses: NYU’s Machine Learning for Artists; Goldsmiths, University of London

The Latest TensorFlow Paper
- TensorFlow: A system for large-scale machine learning

Business of ML & AI
- Microsoft Confirms Microsoft Ventures VC Arm
- Intel Acquires Computer Vision for IOT, Automotive
- Lumiata Closes $10 Million Series B Financing with Intel Capital
- Findo raises $3M to help you find files and documents through natural language queries

More Bots, and How to Build Artificial Assistants
- Motion AI lets anyone easily build a bot
- Sequel lets you create a ‘Me’ bot, beats Google to the punch
- Hybrid Intelligence: How Artificial Assistants Work

The Economics of Machine Learning Models
- The preoccupation with test error in applied machine learning
- Towards Cost-Optimized Artificial Intelligence

More Cool Deep Learning Posts
- Deep Reinforcement Learning: Pong from Pixels
- A Survey of Deep Learning Techniques Applied to Trading

Just for Fun
- Building an IoT Magic Mirror
- Magic Mirror on GitHub

Image Credit: Microsoft