I am the Henry Salvatori Professor of Computer and Cognitive Science at the University of Pennsylvania computer science department. I also hold a secondary appointment at the Department of Statistics and Data Science at the Wharton School, and I am associated with the theory group, PRiML (Penn Research in Machine Learning) the Warren Center for Network and Data Sciences, and am co-director of our program in Networked and Social Systems Engineering. I am also affiliated with the AMCS program (Applied Mathematics and Computational Science). I spent a year as a postdoc at Microsoft Research New England. Before that, I received my PhD from Carnegie Mellon University, where I was fortunate to have been advised by Avrim Blum. My main interests are in algorithms and machine learning, and specifically in the areas of private data analysis, fairness in machine learning, game theory and mechanism design, and learning theory. I am the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE), an Alfred P. Sloan Research Fellowship, an NSF CAREER award, a Google Faculty Research Award, an Amazon Research Award, and a Yahoo Academic Career Enhancement award. I am also an Amazon Scholar at Amazon Web Services (AWS). Previously, I was involved in advisory and consulting work related to differential privacy, algorithmic fairness, and machine learning, including with Apple and Facebook. I was also a scientific advisor for Leapyear and Spectrum Labs. For more information, see my CV and Research Statement. My lovely wife Cathy just got her PhD in math at MIT. At her insistence, I link to her website
Sergey Levine received a BS and MS in Computer Science from Stanford University in 2009, and a Ph.D. in Computer Science from Stanford University in 2014. He joined the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley in fall 2016. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms. Applications of his work include autonomous robots and vehicles, as well as applications in other decision-making domains. His research includes developing algorithms for end-to-end training of deep neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, deep reinforcement learning algorithms, and more.
There are few things I love more than cuddling up with an exciting new book. There are always more things I want to learn than time I have in the day, and I think books are such a fun, long-form way of engaging (one where I won’t be tempted to check Twitter partway through). This book roundup is a selection from the last few years of TWIML guests, counting only the ones related to ML/AI published in the past 10 years. We hope that some of their insights are useful to you! If you liked their book or want to hear more about them before taking the leap into longform writing, check out the accompanying podcast episode (linked on the guest’s name). (Note: These links are affiliate links, which means that ordering through them helps support our show!) Adversarial ML Generative Adversarial Learning: Architectures and Applications (2022), Jürgen Schmidhuber AI Ethics Sex, Race, and Robots: How to Be Human in the Age of AI (2019), Ayanna Howard Ethics and Data Science (2018), Hilary Mason AI Sci-Fi AI 2041: Ten Visions for Our Future (2021), Kai-Fu Lee AI Analysis AI Superpowers: China, Silicon Valley, And The New World Order (2018), Kai-Fu Lee Rebooting AI: Building Artificial Intelligence We Can Trust (2019), Gary Marcus Artificial Unintelligence: How Computers Misunderstand the World (The MIT Press) (2019), Meredith Broussard Complexity: A Guided Tour (2011), Melanie Mitchell Artificial Intelligence: A Guide for Thinking Humans (2019), Melanie Mitchell Career Insights My Journey into AI (2018), Kai-Fu Lee Build a Career in Data Science (2020), Jacqueline Nolis Computational Neuroscience The Computational Brain (2016), Terrence Sejnowski Computer Vision Large-Scale Visual Geo-Localization (Advances in Computer Vision and Pattern Recognition) (2016), Amir Zamir Image Understanding using Sparse Representations (2014), Pavan Turaga Visual Attributes (Advances in Computer Vision and Pattern Recognition) (2017), Devi Parikh Crowdsourcing in Computer Vision (Foundations and 
Trends(r) in Computer Graphics and Vision) (2016), Adriana Kovashka Riemannian Computing in Computer Vision (2015), Pavan Turaga Databases Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases (2021), Xin Luna Dong Big Data Integration (Synthesis Lectures on Data Management) (2015), Xin Luna Dong Deep Learning The Deep Learning Revolution (2016), Terrence Sejnowski Dive into Deep Learning (2021), Zachary Lipton Introduction to Machine Learning A Course in Machine Learning (2020), Hal Daume III Approaching (Almost) Any Machine Learning Problem (2020), Abhishek Thakur Building Machine Learning Powered Applications: Going from Idea to Product (2020), Emmanuel Ameisen ML Organization Data Driven (2015), Hilary Mason The AI Organization: Learn from Real Companies and Microsoft’s Journey How to Redefine Your Organization with AI (2019), David Carmona MLOps Effective Data Science Infrastructure: How to make data scientists productive (2022), Ville Tuulos Model Specifics An Introduction to Variational Autoencoders (Foundations and Trends(r) in Machine Learning) (2019), Max Welling NLP Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics (2013), Emily M. Bender Robotics What to Expect When You’re Expecting Robots (2021), Julie Shah The New Breed: What Our History with Animals Reveals about Our Future with Robots (2021), Kate Darling Software How To Kernel-based Approximation Methods Using Matlab (2015), Michael McCourt
Sam Charrington: Hey, what’s up everyone! We are just a week away from kicking off TWIMLfest, and I’m super excited to share a rundown of what we’ve got in store for week 1. On deck are the Codenames Bot Competition kickoff, an Accessibility and Computer Vision panel, the first of our Wellness Wednesdays sessions featuring meditation and yoga, as well as the first block of our Unconference Sessions proposed and delivered by folks like you. The leaderboard currently includes sessions on Sampling vs Profiling for Data Logging, Deep Learning for Time Series in Industry, and Machine Learning for Sustainable Agriculture. You can check out and vote on the current proposals or submit your own by visiting https://twimlai.com/twimlfest/vote/. And of course, we’ll have a couple of amazing keynote interviews that we’ll be unveiling shortly! As if great content isn’t reason enough to get registered for TWIMLcon, by popular demand we are extending our TWIMLfest SWAG BAG giveaway by just a few more days! Everyone who registers for TWIMLfest between now and Wednesday October 7th, will be automatically entered into a drawing for one of five TWIMLfest SWAG BAGs, including a mug, t-shirt, and stickers. Registration and all the action takes place at twimlfest.com, so if you have not registered yet, be sure to jump over and do it now! We’ll wait here for you. Before we jump into the interview, I’d like to take a moment to thank Microsoft for their support for the show, and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer—without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand, and accelerate decision-making into your apps. 
Visit aka.ms/cognitive to learn how customers like Volkswagen, Uber, and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you’re there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure Free Account. That link again is aka.ms/cognitive. And now, on to the show!
Sam Charrington: [00:03:14] All right, everyone. I am here with Cha Zhang. Cha is a Partner Engineering Manager with Microsoft Cloud and AI. Cha, welcome to the TWIML AI podcast.
Cha Zhang: [00:03:25] Thank you, Sam. Nice to meet you.
Sam Charrington: [00:03:27] Great to meet you as well. Before we dive in, I’d love to learn a little bit about your background. Tell us how you came to work in computer vision.
Cha Zhang: [00:03:38] Sure. Sure. I actually have been at Microsoft for 16 years. I joined Microsoft originally as a researcher at Microsoft Research. I was there for 12 years. My research was primarily applying machine learning to image, audio, video; all of these different applications. Starting in 2016, I joined the product side, and currently I’m working as an Engineering Manager, and my primary focus is on document understanding.
Sam Charrington: [00:04:11] Awesome. Awesome. So, we will be focusing quite a bit on OCR and some of your work in that space, and, you know, I think people often think of OCR as a, you know, a solved problem, right? You know, we’ve been scanning documents and extracting text out of those documents for a long time. Obviously the advent of deep learning, you know, changes things, but I’d love to get the conversation started by having you share a little bit about, you know, what’s new and interesting in the space. How has it changed over the past few years?
Cha Zhang: [00:04:50] Sure.
Actually, it wasn’t very long ago that, when people talked about OCR, what came to mind was first of all scanned documents. In many people’s eyes, OCR for scanned documents is sort of a solved problem. More recently, I think there have been two major developments. One is the mobile-first kind of world, where everybody now has a mobile phone and takes pictures everywhere. So there’s a lot of demand to do text recognition out of images in the wild, and that certainly is a much more challenging problem than scanned documents. And then technically, because of the advances in deep learning, we have realized that with deep learning, we can do OCR at a different level. We can make it a lot more accurate than before, and we can solve the OCR problem in kind of the images-in-the-wild scenario. So I think it started in the early 2010s or so. I think there have been a lot of big advances in this area, and now we’re seeing basically OCR become something that really works. You know, people don’t need to worry about quality, etcetera; it just mostly works.
Sam Charrington: [00:06:08] Can you talk a little bit more about the challenges that arise when you’re trying to do OCR in the wild?
Cha Zhang: [00:06:16] Of course. I think for documents, usually it’s a white background and black text, but for images in the wild, essentially it’s a photo. So in the photo, there’s a lot of variation in the text. First there’s a huge scale variation. If you capture a picture of a street, there might be some store name that is super big, and then there is some tiny text that’s hard to see. So there’s a big variation in the scale of the text, and the aspect ratio of the text can be really extreme, because a text string can be very long compared to regular objects, like a cat or a dog. Because of the mobile capture scenario, usually it’s difficult to tightly enclose these texts with axis-aligned rectangles. For example, there might be perspective distortions of the text when the camera sees them.
The background of an image in the wild is much more complicated than the typical white background you see in scanned documents, and some of these backgrounds, such as fences, bricks, and stripes, can be hard even though they appear quite simple to human beings. Think of fences: they can be a perfect bunch of ones, you know, sitting there on the street, and they look very similar to characters. So those create additional challenges, and I think one of the biggest ones that’s technically challenging for OCR is the localization accuracy. So, typically in object detection, the localization accuracy is measured by intersection over union, and if that criterion is bigger than 0.5, people think this is good enough. But for OCR, if the intersection is only half of the union, a lot of the characters will be missing. So, usually OCR will need a 0.9, 0.95 level kind of accuracy in order to recognize all the characters properly. So…
Sam Charrington: [00:08:31] Can you explain that in more detail? What is intersection over union and how is that used in object detection?
Cha Zhang: [00:08:39] So, in order to measure the accuracy of a particular detection algorithm, you need to label the data with ground truth. Typically what people do is create a bounding box of the object to be detected, and then you use an automatic algorithm to figure out where the object is, and that will also create a bounding box. Now you have two bounding boxes, and the question is how you measure how well these two boxes align. A common measure is to take the intersection of these two bounding boxes and the union of these two bounding boxes, so you get two areas, and you divide the first by the second. You can imagine that if the two bounding boxes are very close to each other, overlapping a lot, then that intersection over union will be very high, but if they’re offset by quite a bit, then, you know, the number is low.
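To make the intersection-over-union measure described here concrete, the following is a minimal Python sketch. The `Box` type, the `iou` function, and the pixel coordinates are illustrative assumptions, not anything from the interview or from a Microsoft product.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by its top-left and bottom-right corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    # Inverted boxes (no overlap) count as empty.
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2))
    inter_area = area(inter)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


# Hypothetical ground-truth box for a line of text: 300 px wide, 30 px tall.
truth = Box(0, 0, 300, 30)
# A detection of the same size, shifted 100 px to the right.
shifted = Box(100, 0, 400, 30)

print(iou(truth, truth))    # 1.0
print(iou(truth, shifted))  # 0.5
```

In this made-up case an IoU of 0.5 means the detection has lost the left third of the text line, which would be acceptable for many object detection benchmarks but would drop characters in OCR.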
So that’s kind of the academic standard, how people measure detection accuracy with this criterion.
Sam Charrington: [00:09:46] Got it. And so, you were saying that the threshold that you need in the case of text is higher because of what?
Cha Zhang: [00:09:58] Because of… Let’s just think about it. You know, you have a ground truth text, let’s say, “Hello world,” and it’s an elongated rectangle, and say I have a text detection algorithm that also creates a bounding box, but with an intersection over union of, let’s say, roughly 0.5. What that means is that the intersection area divided by the union of the two bounding boxes is 50%. So very likely the detected bounding box will miss a few characters because, you know, the overlap is not there. So, you might miss the “D” at the end, and all this will cause the OCR to produce wrong results. And so that’s the main challenge here.
Sam Charrington: [00:10:48] So in the case of a traditional object detection scenario, you may miss half of the face but you can tell that there’s a face there. In the case of OCR, you’re just missing letters and it makes it a lot more difficult for the algorithm to guess what was there.
Cha Zhang: [00:11:07] Yes, exactly.
Sam Charrington: [00:11:08] Got it, and maybe taking a step back just to the problem as a whole: granted, mobile is driving, you know, this transition to these in-the-wild pictures and people trying to OCR them, but what are the high-value use cases there? I’m thinking of some interesting ones, like when it’s in conjunction with translation. You know, maybe I’m in another country, and I’ve done this, you’re taking pictures of words in another character set to try to read the menu or something like that. I’ve also done things like scan documents on a phone, and you’ll want to OCR those, but that’s kind of back to the traditional OCR problem in a lot of ways.
What are some of the other use cases that are common?
Cha Zhang: [00:11:58] If you look at the business opportunities, I still think the traditional document, you know, the scanned document, is a big one. Take some traditional kinds of OCR problems, for example, receipts, which people would scan in the old days; nowadays people mostly do reimbursement by snapping a photo. So in terms of the market, the revenue, I think that’s still quite a big one. There are a few others. The one that you mentioned, where you have a phone, you go to a foreign country, you snap a photo and you want to translate it, is one. There are also a lot of applications in digital asset management. This is when either you are a big company or you are a person with some big store of photos, and you want to organize these photos. We have shown that with OCR capability, you can increase the accuracy of processing and retrieving these photos. As a matter of fact, you know, for the big search engines like Google and Bing, when they search images, OCR is an integral part of that as well, because the OCR content can help a lot in getting the best images.
Sam Charrington: [00:13:22] Okay. And so, you were mentioning some of the technical challenges, and localization of the text in these images is one of those challenges. How do you go about it? Is it the case that, you know, deep learning is so powerful that off-the-shelf deep learning techniques just solve it for you, or do you, you know, reengineer the whole pipeline? How do you approach that?
Cha Zhang: [00:13:53] So in text detection, usually the detection pipeline is different from traditional object detection. What’s been most popular for OCR for images in the wild today is something called anchor-free detection. So the idea… Anchor free. In typical object detection, usually the most well-known ones are anchor based, like Fast R-CNN and Faster R-CNN, etcetera.
They basically create these anchors and then they regress the actual bounding box of the objects. The challenge of using that kind of approach is that these anchors need to be preset, and so typically for normal object detection, you set a certain density, and then you set a certain set of aspect ratios. Like, your anchor boxes are one to two, one to three, one to one. Typically you stay about there, but some text can go like 20 to one, so really it would be a huge computational cost to go with an anchor-based approach. So nowadays for OCR, we go anchor free, and the high-level concept is essentially that, by using convolutional neural networks, you almost do kind of a per-pixel-level decision or classification, saying, well, this region near this particular pixel looks like part of text. So there is a text/non-text classification at almost kind of a per-pixel level. Then you rely on a few algorithms to group these into text lines by looking at, for example, how similar two text regions are to each other, and you can decide, well, these two look like the same texture and color, and maybe they should be connected. In this regard, there are quite a few well-known algorithms to do this connection. In earlier days, people used a relatively rule-based approach like stable link, where they link based on some features, but it’s kind of rule-based. More recently, people have started looking into neural networks like relation networks, which estimate the relation of two regions from their features, and based on that decide whether these two should be connected or not. So that way you start kind of bottom up: you start with per-pixel kind of classification, and then you do grouping, and you come out with these text lines. Very powerful approach. It can detect not only straight lines; even curved lines you can handle pretty well with those approaches.
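The bottom-up idea in this answer (score each pixel as text or non-text, then group the text pixels) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the score map is made up rather than produced by a convolutional network, and the grouping step is plain 4-connected component labeling, a stand-in for the feature-based or learned linking described in the interview. The function name `group_text_pixels` is hypothetical.

```python
from collections import deque


def group_text_pixels(scores, threshold=0.5):
    """Group pixels scored as text into 4-connected regions.

    Returns one inclusive bounding box (left, top, right, bottom) per region.
    """
    height = len(scores)
    width = len(scores[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or scores[y][x] < threshold:
                continue
            # Breadth-first search over neighbouring text pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Made-up per-pixel text probabilities for a tiny 3x6 image.
scores = [
    [0.9, 0.8, 0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.9, 0.8],
]
print(group_text_pixels(scores))  # [(0, 0, 2, 0), (3, 2, 5, 2)]
```

Because nothing here depends on preset box shapes, a very long region is found just as easily as a square one, which is the practical appeal of going anchor free for text.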
Sam Charrington: [00:16:44] So it sounds like you’re describing a pipeline that’s not like an end-to-end trained single neural network that you give images and train on labeled data, and it’s telling you what the text is, but rather a bunch of independent steps.
Cha Zhang: [00:17:04] Yes, that’s a very good observation. Actually, for OCR, detection is only the first step, and after detection, we typically run a character model, where you take the detected text lines, you normalize them into a straight line with a fixed height, and then you run a character model to actually decode the image into a list of characters. A lot of the approaches are actually similar to speech where, you know, speech is going from acoustic signal to text. Here we’re going from image to text, but a lot of the approaches that we use, like LSTMs and language modeling, are very similar. Now your question is certainly valid because in speech today, you know, people do end-to-end training. They start from audio and they can directly go to text. For OCR, we are not there yet. I think the main challenge, well, first is how much data you have. I think for speech, you can collect a lot more data compared with OCR. OCR data are usually very expensive to collect and label, and so going stage by stage at this point is more economically doable than, you know, doing end-to-end training.
Sam Charrington: [00:18:25] Why is that? It seems that we have tons of pictures with words in them that we know. Is it just the in-the-wild examples where we don’t have the labeled data, or is it also these document use cases? Because I’m imagining Microsoft has probably labeled a ton of receipts and business cards and that kind of thing.
Cha Zhang: [00:18:50] Yeah. I think certainly labeling is very, very expensive.
For Microsoft, we are a company that pays a lot of attention to privacy, you know, those kinds of issues, and collecting OCR data has been a major, I would say, blocking issue for going with this kind of end-to-end approach, because if you think about it, a lot of the documents that we actually care about, like if you talk about invoices, receipts, business cards, they all contain PII. Those data are extremely difficult to obtain, and we follow very strict guidelines on how we can collect them and how we can label them. So in some way we are limited by these privacy restrictions, but we do respect those a lot. So, as a result, you know, we are not going end to end at this point.
Sam Charrington: [00:19:48] Got it, got it. It makes me think a little bit about some of the issues with neural networks remembering data. So for example, there are examples where you train a CNN and there are some attacks that you can do that will reproduce, to some degree or another, some of the images that the model was trained on. Likewise, with these very large language models, you can start to see some of the text that the models were trained on come out in the output. I would imagine if you were training end to end, then that becomes an issue as well, and maybe more so than in the case of images. What’s your intuition there? Would it be worse or better than images?
Cha Zhang: [00:20:39] I would imagine it will be similar, I would say. After all, you know, in OCR you go from image to text, but during the learning of this OCR process, a language model is actually very helpful for improving the OCR accuracy. So, for example, during decoding of these text lines into text, we use, like, LSTMs or, you know, basically these very popular language modeling schemes. Certainly it remembers the contextual information of the language in order to help the OCR recognize the text properly.
So, I think when you go end to end, when the amount of data that you use for training is humongous… it’s difficult for me to imagine, you know, that we’ll have a similar level of data for training as for BERT models or GPT models. Those are huge, huge amounts of data. But still, you will learn something from the text, and it might leak into the model as well.
Sam Charrington: [00:21:51] Along those lines, what enabled BERT and many of the recent innovations around language models is a shift from a supervised to a semi-supervised way of framing the task. Is there a semi-supervised framing for the OCR task that makes sense?
Cha Zhang: [00:22:13] Actually, for OCR today, we are not doing that, although I think it’s definitely a very interesting research problem. I think BERT is a super nice framework for transfer learning. You know, you go from a pre-trained model and then, you know, on a supervised task, you can… In the image world, I think transfer learning probably existed earlier than in language. In the earlier days, when we had ImageNet, we trained models like ResNet, and those were already being used for transfer learning. Unsupervised kind of image learning is also, I think, still ongoing. There are a lot of interesting projects going on. I think for OCR right now, we’re not there yet. One of the main issues with using some of these pre-trained models to build a product like OCR is the computational cost. I think this happens in language as well: a BERT model or the GPT-3 model, with, like, you know, multiple billions of parameters, is very difficult to turn into a product. For OCR, you know, we have the same problem. We are very sensitive to computational cost. We need to make it fast, and so we’re using relatively small models, and normally we train from scratch. Transfer learning does show some benefit, but when the data reaches a certain amount, we found training from scratch is perfectly fine.
Sam Charrington: [00:23:49] When you have a certain amount of data to train from?
Cha Zhang: [00:23:53] Yeah. In the very early days when we started doing deep learning OCR, we actually relied a lot on distillation, that’s teacher-student learning, where we first train a big model, and then we gradually use teacher-student learning to create a small model so that it can run efficiently. Nowadays, we have figured out that you can train these models from scratch. The amount of data that we have, on the order of, you know, hundreds of thousands to millions of images, is sufficient to train a smaller model from scratch and reach about the same accuracy.
Sam Charrington: [00:24:31] Can you elaborate a little bit on that? Are you saying that you need more data to train smaller models?
Cha Zhang: [00:24:37] No, I’m saying that… Take BERT as an example. BERT is super beneficial for transfer learning because it has seen so many documents. So given any new language task, presumably there’s not much data that you have to train this new task, and therefore leveraging BERT, which has seen so many documents, will help through transfer learning to transfer some of the knowledge that BERT has learned from this huge set of documents to the small task, so that it can reduce the amount of documents required to train the smaller task. The same thing happens in ImageNet transfer learning where, you know, if it’s a ResNet trained on ImageNet, you learn a lot of visual information from the ImageNet dataset. Then if you have a tiny detection task, like detecting a helmet, let’s say, you can do transfer learning and use a very small dataset to actually train a very good helmet detector. What I was saying just now was about the problem of OCR which, you know, is certainly a very important computer vision problem.
Every company that invests in OCR tends to collect quite a bit of data, not to the level of, you know, billions, but hundreds of thousands to millions. At that level, the amount of data is sufficient that you do not need to go with transfer learning. You can train the model from scratch and you get very good results.
Sam Charrington: [00:26:19] Got it. Got it. So when you were using transfer learning, were you using models based on ImageNet, you know, along the lines of ResNet and others, or whether… Okay. Let’s see… so the smaller models that you’re training, are they, you know, some of the traditional architectures that we’ve already brought up, or are you building out new architectures for the models themselves for this specific problem?
Cha Zhang: [00:26:53] Right now we’re using some of the traditional models. There is some active research going on regarding searching for the most effective architecture for OCR. We haven’t seen convincing results yet, but I think that’s a very active research area that we’re still kind of looking into, particularly when we try to make it smaller and smaller, you know, faster and faster.
Sam Charrington: [00:27:20] When you say searching for the best architecture for OCR, are you using the word searching generally, like you have researchers looking at different models and trying to find the best one for OCR, or are you suggesting a domain-specific neural architecture search kind of…?
Cha Zhang: [00:27:38] I mean neural architecture search. That certainly can be applied to OCR, and we are still exploring it, but I think that’s a very promising direction.
Sam Charrington: [00:27:49] Okay. Interesting. Interesting. Earlier in the conversation you talked about how one of the big use cases is some of this semi-structured data that we want to extract information out of; invoices are one example. There was a recent demonstration, or I guess that’s actually a product now, of the mobile version of Excel or something.
You can take a picture of grid-like data, and that will, you know, both extract the text and organize it into a spreadsheet. Talk a little bit about the product that you’re working on, Form Recognizer, which is doing something similar.
Cha Zhang: [00:28:35] Yeah, of course. So OCR certainly is pretty low level. For some of the applications I mentioned earlier, like digital asset management and photo management, you know, translation, you can directly use OCR, but for many customers, what they want is not just OCR. They want to extract information from documents. Think about, you know, I need to process millions of invoices. I want to extract the vendor name and the date, the total amount. Or it’s an MS expense system where you want to process all the receipts, and it can be for a verification purpose, for example, like, okay, how do I make sure employees are not putting in random numbers that don’t match the receipts that are actually filed? It sounds kind of silly, but you know, today a lot of companies do this verification manually. Because of the huge amount of manual effort needed, they often can only do sampling. So you sample like 5% of these receipts to validate, but you miss a huge chunk that you never even look at. So we are looking at this space and we’re trying to build essentially two categories of product. One is a pre-built set of products, and these are solutions that work out of the box. For example, it can be a pre-built receipt, pre-built business card, pre-built invoice. So basically you send in an image or PDF file, and it will extract all the fields that you’ll be interested in. Another big category that we think is super important is customization because, you know, the pre-built may never fit every need. So we have a solution called the custom form, where we allow the customer to basically send us a few sample images.
You can either label them or even, you know, not do any labeling, and we will be able to extract key-value pairs out of these documents. Again, we see this as much closer to what the customers need, and that’s what Form Recognizer is positioned as.
Sam Charrington: [00:30:54] So we’ve talked about a bunch of the interesting technical challenges at the lower level, at OCR. At the form level, you know, is that a kind of packaging of OCR? Does it have its own technical challenges to overcome…?
Cha Zhang: [00:31:13] Actually it has a lot of very interesting challenges. One piece of work that came out recently from Microsoft Research is, you know, targeting exactly this problem. Just think about it. Parsing these invoices and receipts is essentially sort of a language problem, because you have these texts there. The challenge here is that these are images, so you run OCR on them, but unlike a typical language dataset that you’ve scraped from the internet, you know, Wikipedia, where you basically have the ordering of the words already, if the data comes from images, you can detect the text lines, but it’s actually very difficult to define the reading order of these text lines, and ordering these text lines is by itself a very challenging problem. When you have images in the wild, paper can be curved, you know, can be crumpled, can be rotated, there’s the perspective, you know, all kinds of issues. They can have background text, you know, all these. So the particular approach that MSRA came out with is called LayoutLM. It’s actually a modified BERT model. It’s also a language model, but in addition to the language, we also embed 2D information, like what is the X, Y position of the bounding box of the text? With that information, actually, this can also all be trained without supervision. It’s unsupervised pre-training.
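As a rough illustration of what embedding 2D information alongside the language can mean, the sketch below adds lookup vectors for a word and for the bucketed corners of its bounding box. It shows the general idea only and is not the LayoutLM implementation: the table sizes, the bucketing scheme, the tiny vocabulary, and the names `embed` and `bucket` are all assumptions, and the tables hold random numbers where a real model would hold learned weights.

```python
import random

random.seed(0)   # fixed seed so the illustrative tables are reproducible
DIM = 4          # embedding width, kept tiny for readability
GRID = 10        # number of coordinate buckets per axis (an assumption)


def make_table(rows, dim=DIM):
    # Random stand-in for an embedding table that would normally be learned.
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


vocab = {"Invoice": 0, "Total": 1, "42.00": 2}   # hypothetical vocabulary
word_table = make_table(len(vocab))
x_table = make_table(GRID)
y_table = make_table(GRID)


def bucket(value, page_size):
    # Map a pixel coordinate to one of GRID buckets.
    return min(GRID - 1, int(value / page_size * GRID))


def embed(word, box, page_w, page_h):
    """Sum the word embedding with embeddings of the box corners."""
    x0, y0, x1, y1 = box
    parts = [
        word_table[vocab[word]],
        x_table[bucket(x0, page_w)],
        y_table[bucket(y0, page_h)],
        x_table[bucket(x1, page_w)],
        y_table[bucket(y1, page_h)],
    ]
    return [sum(values) for values in zip(*parts)]


# The same word at the top and at the bottom of a 100x100 page.
top = embed("Total", (10, 20, 60, 30), 100, 100)
bottom = embed("Total", (10, 80, 60, 90), 100, 100)
print(top != bottom)
```

The point of the sketch is that the model input for a word changes with where the word sits on the page, so spatial patterns can be picked up without first deciding a reading order.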
We are able to learn this kind of spatial relationship in these invoices without coming up with an explicit read order. With that, we actually can do a lot of these key-value extractions really well. There’s also quite a lot of advanced research looking into, say, relation networks where you see two text lines nearby each other and you can predict the relationship. Again, this is similar to the OCR where you have these bottom-up pixel-level classifications and you want to group them. Here, you want to group key and value pairs. There’s also a lot of advanced research in these graph convolutional networks where you do convolution over a graph, where the graph is defined by connecting nearby text lines. Again, this is an approach without requiring reading order, but just looking at the spatial relationship. So these are all actually very exciting kinds of extensions of language, but also using visual information to help parse these vertical data more accurately. Sam Charrington: [00:34:09] Interesting. Yeah, I think it’s… At a quick thought I would’ve imagined that, you know, maybe the top part of the stack is more rule-based and the bottom part of the stack is, you know, more machine learning based, but it sounds like there are even, I don’t know, relatively, but there are a bunch of really interesting… Cha Zhang: [00:34:33] We are doing a lot of machine learning stuff on the top as well. Sam Charrington: [00:34:37] I’m imagining the, you know, when you talk about relation net, for example, on an invoice you could have date, and then the date, you know, horizontally next to it, or you can have date and then the date beneath it. Cha Zhang: [00:34:50] Yes. Sam Charrington: [00:34:50] You may have an address box and then a bunch of text that comes beneath it. It would be nice to know that, you know, we’re talking about the address here. That’s part of the idea of the structured text extraction. So in that you mentioned relation net and graphical CNNs. 
Are those two approaches to solving the same problem or are they solving different aspects of the problem? Cha Zhang: [00:35:13] They solve different aspects of the problem, and they can be also used to solve the same. I mean, like right now, the main focus for us, for them for extracting key value pairs. This is both kind of pre-build and the customization. Think about, if it’s an invoice and you want a vendor name, so it’s a name. Certainly, you know, the text information because you see it looks like a vendor name. This probably is a vendor name and some invoice doesn’t even have the key in the invoice. Sam Charrington: [00:35:48] Right. Cha Zhang: [00:35:49] You don’t even have the word vendor name there, so how do you figure out this thing is still vendor name? So, there, you rely on information that’s language and that’s also kind of how the document is laid out. Like, okay, the font size may matter. You know, the position of the same may matter. So we are looking into combining all this information to come out with a better decision on those fields. Sam Charrington: [00:36:21] So, how does a graphical representation or way of thinking about the document gets you to a solution to these kinds of problems? You know, for example, the unlabeled vendor name? Cha Zhang: [00:36:33] The graphical kind of approach is basically… so you’ve got a bunch of text lines detected by the OCR and you connect to these texts lines with their neighbors. You define basically how strong these connections are. Actually it’s not defined. You actually learn these relationships by looking at the texts, looking at their relative positions, looking at their font similarity. Like one issue that you actually just mentioned was like address as you connect ’cause you have multiple lines of addresses. How do you know they actually belong to the same address? Right? So there’s this kind of, all these side information could be very helpful in determining that they should be grouped together. 
In the convolutional kind of graphical model, you learn a convolutional network by computing from all the neighboring nodes, where each node is a text line, to aggregate basically at the center node. So basically, the model learns by not only looking at the current text line that’s in focus, but also looking at all the nearby text lines and deciding, well, given all this contextual information, it does look like this is a vendor name. I guess that’s a very high-level conceptual description of why it would work, but it’s data-driven machine learning so that the model [inaudible]. Sam Charrington: [00:38:06] As you’re solving problems like this, are you often needing to re-label your dataset? For example, imagining early on in developing an algorithm like this, you have a bunch of invoices, and you draw a bounding box around the addresses and you say, this is the address. Then you say, ‘Oh, well the font information is a whole new dataset,’ you have to label, well, this is… Are you going in and having people label Helvetica versus Arial? That seems a bit fine-grained and hard to actually get experts to label, or is it more abstract than that? Cha Zhang: [00:38:48] We usually only label the end goal, which is the field that you’re going to extract. So, for example, you want to extract a vendor name, vendor address, total tax, you basically draw a bounding box around those regions and use that as ground truth data. Sam Charrington: [00:39:06] Got it. I think we’re going to the same place. When you say font… Cha Zhang: [00:39:11] When I say font, actually it’s in some way implicit in the sense that we’re taking these bounding boxes, we’re extracting image information. Right? So think of it as, let’s say, we run a convolutional network to extract a feature of that part of the text region, the text line. So, this feature is essentially all the visual information that can be helpful in deciding or determining the relationship between text lines. 
So if features are similar, it probably means they are a similar font, they are a similar size, you know, so those kinds of… So, yeah, I think that seems to be sufficient. Sam Charrington: [00:39:55] So you’re not trying to kind of featurize your underlying images into these distinct things, which is what I inferred when you said font. Do you look at the, you know, is there an analogy to kind of looking at the layers of the network, and when we do this with CNNs, you see, like, textures and things like that, is there some analogy that you’ve seen in looking at the layers of the network that says, ‘Oh, this layer is like identifying fonts’? Cha Zhang: [00:40:32] No, we haven’t been going there yet. Well, I guess it’s certainly interesting to look at it. My take is, most likely, font is just one attribute. I believe there are many other things. Yeah, I think it’ll be interesting to look at these features visually. Yeah. Sam Charrington: [00:40:54] We’ve talked throughout the discussion about kind of the ways that OCR and this form recognition problem kind of blend the vision domain and the NLP domain, and language models have come up quite a bit. Is there a little bit more kind of depth we can go into there? Some of the ways that you see NLP, and particularly the advances in NLP over the past few years, kind of influencing the problem and the way you solve it? Cha Zhang: [00:41:32] Yeah. I see NLP playing a very important role in these verticals. After all, these invoices, receipts, business cards, these are all human artifacts. They’re kind of language artifacts in some way. Right? So, all of the kind of latest state of the art in language modeling, we definitely want to leverage. The thing I mentioned earlier, LayoutLM, is one way to leverage them by using the language model, but also embedding additional visual information, and hopefully solving these problems effectively, because the input is really different, right? 
You know, previously you take text as input. Here, we’re taking a bunch of text lines together with their locations and bounding boxes as inputs, and the algorithm can naturally kind of solve these problems. Sam Charrington: [00:42:30] And, is it also trying to do the traditional language model task of predicting the next character or word or set of text? Cha Zhang: [00:42:38] Yeah, the way we train them is very similar, basically, mask text: you mask some words and try to predict them. Certainly you can use a lot of others. I think, you know, like I know recently people use translation targets. You can use autoencoder kinds of targets. This is a really active research area at this point. I think we’re still just scratching the surface, although we are already seeing very, very promising results. So we definitely want to look deeper into this and see how well this really can push the state of the art. Sam Charrington: [00:43:21] Kind of continuing on that thread of the active research areas and what the future holds in this area, what are you most excited about in this domain of OCR and, in general, extracting text from documents, vertical applications and the like? Cha Zhang: [00:43:42] Yeah, I think, we have been working on this problem for quite a while, but I think there are still a lot of interesting problems. Only when we start to work with customers do we realize, you know, there are problems we haven’t been able to solve. I can just name one, for example, like table extraction. It sounds trivial, but when you actually look at all the existing tables in the world, the simplest ones are those with explicit cell borders where you have straight lines, but in reality, these tables can have no cell boundaries at all. It can be mixed on top with STEM, you know, all these things that are kind of making the problem extremely hard. So that’s just another one that is extremely challenging, but we want to solve. 
Another thing that I sort of briefly mentioned earlier was the customization part of these verticals. How do you customize to the customer’s own data instead of having these pre-built models, ’cause inevitably, you will have data that doesn’t work with these pre-built models. How do you allow customers to have a way to build their own models that still work? That by itself is a very challenging problem because asking customers to label a lot of data is painful. They don’t want to go there. So either we go unsupervised or we go with a very, very limited amount of supervised data. In such a case, how do we adapt our model so that it can work on the documents where the customer realizes that the pre-built model has failed? That’s also a very interesting kind of research problem that we are looking into. In vision and in language there’s few-shot learning. It’s definitely applicable to the problem here as well. Sam Charrington: [00:45:50] In the case of some of the productized vision offerings, Azure does this as well. The user is able to upload their own set of labeled data and kind of the results for object detection are kind of fine-tuned against the user’s data set. Cha Zhang: [00:46:13] Yeah. Sam Charrington: [00:46:14] Do the OCR and form recognition offerings, are they providing something similar? Like, can I upload my own invoices? Are you doing some kind of transfer learning or, well, if you are, what are you doing to take advantage of what the user’s providing? Cha Zhang: [00:46:33] So we do have a product called custom form which allows customers to upload a few samples. We usually say a minimum of five samples. So, say you have an invoice that doesn’t work with existing models, and you want to solve the problem. You upload five similar invoices; these are from the same vendor or kind of look similar in structure, and we can figure out these key-value pairs and extract them, either unsupervised or supervised. Right? 
Unsupervised means the customer doesn’t need to label anything. So you upload the five documents. The information we’re gaining by looking at these five documents is, well, these documents are supposed to be similar and therefore, there are going to be a bunch of words in these documents that are actually common across them. This commonality helps us to tell, well, this is probably part of the empty form or the template of the form, while the things that vary across forms, these must be information the customer has filled in, as they are different from sample to sample. So with that information, we can actually extract key-value pairs without any supervision. All you need is to upload five similar documents. Of course that works to a certain degree, but if you’re still not happy with the accuracy, we provide a way for you to label your key-value pairs. So here we have a UX where you can go and label the fields you care about by essentially highlighting the OCR text lines where you think, this is the value I want to extract. Then we actually learn a model out of five samples and produce a model that can be used by the customer to extract these values. The accuracy is actually normally pretty high, in the 90 to 95 percent range, actually. Sam Charrington: [00:48:38] So when the customer does this, is this process entirely learned or is there a human-in-the-loop kind of exception handling element to it? Cha Zhang: [00:48:50] I guess this is probably kind of taking a step back. I think for all the products, the OCR process today, OCR has made a significant advance, but if you actually care about the numbers, think about the invoice. Right? If your total is wrong, it’s really bad. So, what we recommend is definitely for people to have agent backup. For all of the products we offer, we give people confidence, right? 
So how confident we are about the extraction of a particular value, and different customers can choose their own threshold and have an agent look at them. But I think, with today’s accuracy, we don’t recommend kind of straight-through processing, unless you are handling certain specific applications. I can give you an example. For example, if you’re verifying a receipt image against employee-entered data, there you can go automatic, right? ’Cause if the OCR produces a different number than the employee, well, you will need somebody to look at them anyway, but if they actually match, well, that probably means it’s okay. Sam Charrington: [00:50:08] Right. Cha Zhang: [00:50:08] So in that application, you can automate it more. Sam Charrington: [00:50:13] Got it. So, the question that I was asking is slightly different though, and you know, so say you’ve got someone using automated form recognition and they have their five examples that they haven’t been happy with, and they submit those in through some website or API. Is someone at Microsoft taking those, and taking them manually through some process to try to figure out why they’re not working, or are they thrown into some training job and then the customer’s result gets better? Cha Zhang: [00:50:48] Okay. No, we don’t look at the customer’s data. So this is a fully automated product, meaning, you know, the customer basically labels these files. They call an API to train a model. The whole process is automated. Sam Charrington: [00:51:04] So under the covers, are they kind of forking off their own model? The last few layers are getting cut off and it’s fine-tuning, or is it more elaborate than that, or…? Cha Zhang: [00:51:17] It’s more elaborate than that. Underneath the hood, there are multiple steps. We leverage a lot of information in these sample documents. For example, as I mentioned earlier, there will be words common across these samples. 
Those are very strong indicators that this might be part of the empty form, the part of the form where you probably think these are not so interesting to the customer. Transfer learning is certainly one way of doing that. Right now we are actually training these models without transfer learning. So the model is actually trained from scratch from very few samples. We’re able to do this because of some very interesting work that we have done to basically augment this data to make sure that you have sufficient data to still be able to train a model out of five samples only. This can be a feedback loop as well. So, if the customer’s not happy with a model trained on five samples, you can upload more and we just train a new model for you. So every time you train, you just get a new model. That way, it’s a feedback loop where the customer can keep improving their model until it gets to a certain stage where it’s really performing for the customer. Sam Charrington: [00:52:53] So when you say augmenting the five that they’re providing, are we talking about data augmentation in the sense of a transformation pipeline that kind of changes, adds noise, rotates, that kind of thing? Or are we talking about, you’ve got some other data set that you’re adding to their five and training on that aggregate data set, and that’s how you’re producing a better model? Cha Zhang: [00:53:21] Both. Although I think the latter one is more, because actually, when customers label these data, we ask them to provide some additional information. For example, they label, this is a date. We know it’s a date. So in this way we can artificially create more data to fill the form so that we can produce more data to train the model. Also, we use machine learning algorithms that are very robust to having very few examples. So, that way we can learn with this limitation. Yeah. Normally, if you look at many of the other offerings that people provide. 
You have to train with hundreds of examples. Here, we’re pushing it really down to five and we hope to push it even lower in the future. Sam Charrington: [00:54:11] So I’m assuming that this is a stacked problem and you’ve got some low-level OCR models, for example, that are trained with many, many documents. What you’re doing with this Form Recognizer custom data is more at the top end of that stack. Is the off-the-shelf model that I’m using without the five-example customization, is that also trained on relatively few examples? Cha Zhang: [00:54:44] What do you mean? Sam Charrington: [00:54:45] I guess maybe I’ll jump ahead to the conclusion that I’m drawing. What’s confusing me is how are you getting better results with few examples if you’re not using any kind of transfer? I guess I heard in your explanation that you’re not doing any kind of transfer. Cha Zhang: [00:55:03] So right now the custom form supports training models, and these models are usually… each model is geared towards one particular form type. So in some way you can think of this problem as actually restricted. It’s actually an easier problem. It’s not like a pre-built invoice where essentially you want to handle all your invoices. Here we’re handling one particular invoice coming from, I would say, one particular vendor. Say they usually use this template. Sam Charrington: [00:55:37] Got it. So the customer then, do they call a unique API to resolve invoices of this type? Or is that then ensembled, and then there’s something that decides whether it’s of the type that you’ve built the new model for? Cha Zhang: [00:55:55] Yeah. So here’s kind of the recommendation that we give to customers, right? So you maybe start with the pre-built model, and the pre-built model may work and then your job is done. If you’re happy, go. Then, say you have a lot of invoices and out of a thousand, 10 of them don’t work. 
So what we offer the customer is, well, take these invoices and you can train specific models for these 10 different invoices. You might need to train more than one special model because these invoices may look very different. So imagine you can train like 10 different custom models for this. We actually also offer kind of automatic invoice classification. So there is an API called model compose where we can compose these 10 small models into one. So, all you need is just to call into that one. By calling into that one, we also provide you a confidence, because during testing, when the customer sends the invoice in, we don’t really know whether it’s one that doesn’t work with the pre-built one or whether it works well with the pre-built. So you send this invoice first to the customized version of the model, and we will tell you, ‘Hey, it doesn’t look like any of the 10 you have trained.’ So in this case, you will revert back and say, okay, now I’m calling the pre-built invoice, ’cause you sort of know that pre-built actually works well for that. So that’s what we recommend customers do. Sam Charrington: [00:57:34] Okay. I dug into a little bit of the detail there, but it’s interesting to see kind of how the end-to-end problem is put together. In a case like this, the ends of that problem are on the customer side, not just the service that you’re offering, and so seeing how the pieces are put together is kind of interesting. Awesome! Well, Cha, thanks so much for taking the time and walking us through some of the interesting things that are happening in these domains. Cha Zhang: [00:58:12] Thank you for having me. Sam Charrington: [00:58:14] Great! Thank you.
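The routing Cha recommends at the end of the interview (send the invoice to the composed custom model first, and fall back to the pre-built invoice model when the composed model reports that the document does not look like any trained template) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Extraction`, `route_invoice`, the two toy models, and the 0.5 threshold are hypothetical names and values invented for the example, not the Form Recognizer API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Extraction:
    """Result of running one model on one document (hypothetical shape)."""
    model_name: str
    confidence: float
    fields: Dict[str, str]


# A "model" here is just a function from document text to an Extraction.
Model = Callable[[str], Extraction]


def route_invoice(document: str,
                  composed_custom: Model,
                  prebuilt: Model,
                  min_confidence: float = 0.5) -> Extraction:
    """Try the composed custom model first; fall back to the pre-built model.

    The threshold is an assumption: each customer would pick their own.
    """
    result = composed_custom(document)
    if result.confidence >= min_confidence:
        return result
    # The composed model says "this looks like none of my templates".
    return prebuilt(document)


def toy_composed_model(document: str) -> Extraction:
    """Stand-in for a composed model that knows two vendor templates."""
    for vendor in ("acme", "globex"):
        if vendor in document.lower():
            return Extraction(f"custom:{vendor}", 0.9, {"vendor": vendor})
    return Extraction("custom:none", 0.1, {})


def toy_prebuilt_model(document: str) -> Extraction:
    """Stand-in for the general pre-built invoice model."""
    return Extraction("prebuilt-invoice", 0.7, {"vendor": "unknown"})


if __name__ == "__main__":
    for doc in ("ACME Corp invoice 42", "Initech invoice 7"):
        chosen = route_invoice(doc, toy_composed_model, toy_prebuilt_model)
        print(doc, "->", chosen.model_name)
```

In this sketch the fallback decision rests entirely on the confidence returned by the composed model, which mirrors the description above; a real deployment would presumably also send low-confidence fields to a human reviewer, as discussed earlier in the conversation.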
As you may have heard on the podcast, I’m trying the newsletter thing again. I’m not sure what it’ll evolve into, but my goals are to make it personal, informative and brief/skimmable. I hope you’ll come along for the ride. As always, please let me know what you think! O'Really? On Monday we dropped five shows in our O’Reilly AI series for your binge listening pleasure. I’d name my favorite but they’re all my favorite! Really, the series offers something for everyone. I cut straight to the chase with Intel’s AI czar Naveen Rao, wax creative with Google’s Project Magenta lead Doug Eck, go full Nerd Alert with Ben Vigoda on Bayesian program synthesis, chat about scaling video object detection with Reza Zadeh, and learn how Rana el Kaliouby’s company uses emotional AI to help brands craft the customer experience. Check it out! Over the river & through the woods I just got back from a great trip to Europe. The bulk of my time was spent in Berlin, where I got to explore the city and tech scene, deliver an intro to AI workshop, and meet with TWIML listeners. Before heading to Germany though, I ventured into the Swiss hinterland to interview an impressive (and, in some circles, controversial) figure in modern AI: Jürgen Schmidhuber, co-creator of the LSTM neural network architecture. We had a great time and a great discussion, which will be posted soon on the podcast! Reading is fundamental In a recent show, I thought out loud about starting a paper-reading group for TWIML listeners. The idea seems to have resonated with folks. If you’d like to join in, jump over to the meetup page to express your interest and help plan the details. Join me at the next AI Conference Apparently, the O’Reilly AI conference is being renamed “The AI Conference.” (Hubris anyone?) As usual, we’ve got a free ticket to give away, and we want to give it to YOU! Just comment or share your favorite quote from any of the shows in our O’Reilly AI series to enter. 
Commenting/sharing for each show gets you five entries! More details at the series page. Sign up for our Newsletter to receive this weekly to your inbox.
This week we discuss Intel’s latest deep learning acquisition, AI in the Olympics, image completion with deep learning in TensorFlow, and how you can win a free ticket to the O’Reilly AI Conference in New York City, plus a bunch more. Here are the notes for this week’s podcast: O’Reilly AI Conference Giveaway I’m excited to be partnered with the O’Reilly Artificial Intelligence Conference, to give away a free ticket to the event, which will be held September 26 – 27, 2016 in New York City. There are three ways to enter the giveaway: 1. (Preferred) Follow @twimlai on Twitter and retweet this tweet: Win a FREE ticket to the @OReillyAI Conference. To enter, follow @twimlai + RT. https://t.co/ReYqwqp538 for details. pic.twitter.com/9pLrzHIX9d — TWIML (@twimlai) August 15, 2016 2. Sign up for the TWIML&AI Newsletter and add a note “please enter me” in the comments field. 3. Use this site’s contact form to send me a message and use “AI contest” as the subject. A winner will be chosen at random and announced on the 9/2 podcast. Ticket is non-transferrable. Good luck, and hope to see you in New York! If you’d like to buy a ticket, register using the code PCTWIML for 20% off! 
And don’t forget to get your free early access ebook: Mastering Feature Engineering Intel Buys Deep Learning Startup Nervana Intel Buys a Startup to Catch Up in Deep Learning Deep Learning Chip Upstart Takes GPUs to Task Nvidia’s bet on deep learning and autonomous cars drives stock to record highs – MarketWatch AI Bot Joins Team Washington Post at the Rio Olympics The Washington Post experiments with automated storytelling to help power 2016 Rio Olympics coverage – The Washington Post Technology Fujitsu Software to Accelerate Deep Learning Workloads DetectNet: Deep Neural Network for Object Detection in DIGITS | Parallel Forall Google Research Blog: Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities Image Completion with Deep Learning Image Completion with Deep Learning in TensorFlow bamos/dcgan-completion.tensorflow: Image Completion with Deep Learning in TensorFlow [1607.07539] Semantic Image Inpainting with Perceptual and Contextual Losses [1511.06434] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
We’ve talked fairly extensively about the use of Deep Learning in medicine in previous shows. Breast cancer and eye disease were a couple of the use cases we discussed, with both of these sharing the common feature that they’re based on image analysis. Well, this week a team of researchers from Princeton University published a paper outlining their work applying machine learning to the challenge of identifying genetic causes of autism. The genetic causes of autism, or autism spectrum disorder (ASD), have been difficult for researchers to track down. The autism research community has identified 65 genes associated with autism risk so far, mostly through sequencing, but it’s believed that those are but a fraction of the 400-1,000 genes likely to be involved in the disease. To try to identify the additional genetic actors in autism susceptibility, the Princeton team used what they call a brain-specific functional interaction network, which was developed in previous research. This brain-specific network is a functional map of the brain, expressed as a probabilistic graph of how genes function together in pathways in the brain. They then used machine learning to train a classifier based on the connectivity patterns of the known ASD genes in the brain-specific network, and then used this classifier to predict the level of potential ASD association for every gene in the genome. Specifically, they used an SVM classifier, and used the connectivity of the known ASD genes to the other genes in the brain-specific network as its features. I’m somewhat trivializing the ideas around the brain-specific network and how it translates into features, mostly because I don’t really understand it. But this is a great example and reminder that most of the magic in ML is in the feature engineering. 
Based on their method, the team was able to identify a number of candidate genes with no prior genetic evidence of ASD association, and has since gone on to validate many of these candidate genes through sequencing. Their results can thus be used as the basis for further analysis into the genetic causes of autism. Super interesting stuff. Check it out if you’ve got a background or interest in the medical applications of ML. A couple of other interesting research papers caught my eye this week: Researchers from security research firm ZeroFOX published a paper, “Weaponizing data science for social engineering: Automated E2E spear phishing on Twitter.” Spear phishing, if you haven’t heard the term, is like phishing, but is targeted at a particular user. You’re typically trying to get a user to click a link that will trick them into giving up some credentials. What the ZeroFOX team did was create a tool called SNAP_R that first rates a list of Twitter users based on their likely susceptibility to a spear phishing attack, and then uses a neural network to produce effective spear phishing tweets. If you heard that and immediately thought, oh it’s probably an LSTM RNN, then woo hoo, you’re catching on! At least that’s how I felt when I read that that’s exactly what they did. I love this next paper. It’s basically a Twitter sarcasm detector created by researchers at the University of Lisbon in Portugal and UT Austin. It works based on embeddings, a type of word vector, which come up all the time and which I’d like to learn more about. These embeddings are fed into a CNN model and trained on tweets that are self-identified as sarcastic by their use of the #sarcasm hashtag. The researchers use embeddings in a unique way in this paper, coupling them to the different social media users, and as a result are able to outperform another recently published state-of-the-art model for sarcasm detection by over 2%.
Each year, computer security conferences host a high tech version of the kids’ game “capture the flag,” so that teams of hackers and security researchers can demonstrate their hacking prowess. The game requires teams to secure a computer system by identifying intentional and unintentional vulnerabilities in various software modules while launching and defending against threats from competing teams. This week, DARPA, the Defense Advanced Research Projects Agency, hosted a version of a capture the flag contest where the teams were autonomous bots. The event, held Thursday in Las Vegas as part of the Defcon security conference, was the final competition of the agency’s Cyber Grand Challenge, a $55 million hacking contest designed to spur innovation in the area of autonomous cyber warfare. Seven teams of researchers from across the country fielded bot systems that competed with one another to autonomously identify and patch software vulnerabilities that were planted in their systems by DARPA, while deflecting attacks from competing bots and launching their own attacks against the computer systems those bots were protecting. Teams’ bots were scored on their ability to secure their own software and services, ensure their continued availability and take advantage of vulnerabilities in competing teams’ systems. From the looks of it, DARPA constructed a pretty elaborate physical environment for the contest, complete with an “air gap” to ensure that each system was acting totally on its own. Announcers followed along with the 96 rounds of action and provided a live play-by-play for onlookers, while referees ensured that each team played by the rules. With each round, DARPA deployed a new set of software for the bots to both defend and attack. I watched segments of the 4+ hour video from the final competition and found it pretty fascinating, but I failed in my brief attempt to find any details on how the various bot systems work. 
Cade Metz’ coverage of the competition for Wired painted an interesting picture of the different strategies each bot pursued in the contest. One bot, Rubeus, built by federal contractor Raytheon, took an aggressive tack, going after vulnerabilities in the other systems from the get-go. Another bot, Mech.Phish, didn’t perform as well overall, but it did have a knack for finding and exploiting complex and subtle bugs in the challenge code. Mayhem, a bot fielded by a team from Carnegie Mellon spin-out ForAllSecure, and the eventual winner of the $2M first prize, seemed rather focused on patching its own systems and keeping them up and running. The bot reportedly used statistical analyses throughout the game to weigh the costs and benefits of patching vulnerabilities (which has inherent risks and demands service downtime), and would only decide to patch those holes that made sense based on this analysis. Cybersecurity is an important and rapidly evolving use case for ML & AI, and there’s been quite a bit of commercial activity in the area in addition to innovation and research activities like the CGC. This week startup Distil Networks closed a $21 million series C funding round to help enterprise customers separate good bots from bad ones, and keep the latter off of their web sites. Note that we’re not talking about chatbots here, but rather the kind of web bots that abuse APIs, scrape web sites, and probe them for vulnerabilities. The company uses machine learning techniques to detect when a bot is trying to cloak its activity by spoofing multiple user accounts, browsers, and locations. And last month, another cyber security startup, Darktrace Ltd., raised a $64 million series C to help enterprises identify and defend against a variety of networked threats.
In this post I want to revisit some comments that I made last week while discussing the news that Google DeepMind was granted access to a collection of 1,000,000 eye scan images by the British National Health Service. If you’ll recall, I asked whether this data, which was collected by a government-funded public health organization, should be made publicly available to all researchers rather than handed over exclusively to a single research organization. Well, I wasn’t the only person thinking this thought. This week I came across a really interesting article by Natasha Lomas over on TechCrunch that takes this question a few steps further. While the focus of my question was on data accessibility, a key underlying issue, which Natasha very nicely articulates, is the issue of data value. To be clear, the issue here is that while Google DeepMind says it will be publishing the results of its research, and if you’re a regular listener here you know that this is very likely the case, they haven’t committed to sharing, via open source or otherwise, the models they create as a result of the work. As an example of a likely outcome, Google could turn around and license their models, which are based on public data, to one of the vendors of the eye scanners that are used by physicians. Sure, they created the models, but they were given quite a head start with exclusive access to the data. An article on the topic in New Scientist Magazine paraphrases a University of Pittsburgh eye doctor as saying: [DeepMind may get free access to valuable patient data – but the alternative is to keep potential insight locked up in the Moorfields dataset, inaccessible to human analysis.] You can imagine the NHS saying the same thing, but this is obviously a false dichotomy. Who’s to say that if the data were public, another research organization, such as a public university, wouldn’t take up the challenge?
Natasha asks a few good questions in her piece, namely:
- Why do governments and public bodies fail to see the value locked up in publicly funded data sets?
- Why aren’t they coming up with ways to maintain public ownership of public assets?
- How could they do so in such a way as to distribute benefits equally, rather than disproportionately rewarding the company with the slickest sales pitch?
Natasha compares the NHS DeepMind arrangement to other transactions involving the privatization of public resources, suggesting that these amount to a transfer of wealth from citizens to corporate interests. She suggests that “we, the public, really need to get our act together and demand a debate about who should own the value locked up in our data. And preferably do so before we’ve handed over any more sets of keys.” What occurred to me in thinking about this a bit more is that perhaps one piece of the puzzle is a new type of licensing model for data. Something viral like the GPL, but whose virality applies to derivative works, which in this case means models created by training on the data. So, if you used data licensed under such a license to train a model, you would need to publish the source code for the model should you choose to release it via services or executables. I’m just thinking aloud here. Let me know what you think in the comments, or on Twitter.
This week’s show covers the White House’s AI Now workshop, tuning your AI BS meter, research on predatory robots, an AI that writes Python code, plus acquisitions, financing, technology updates and a bunch more. The Big Picture Home :: AI Now Jason Furman’s speech I need an AI BS-Meter — Gab41 Smerity.com: It’s ML, not magic: simple questions you should ask to help reduce AI hype You Can Now Drink Beer Brewed By Artificial Intelligence – Forbes On the importance of democratizing Artificial Intelligence Business Google buys machine learning startup Moodstocks to help your phone’s camera identify objects | VentureBeat | Business | by Chris O’Brien News discovery app SmartNews nabs another $38M, now valued at $500M-$600M | TechCrunch General Catalyst’s Phil Libin invests in 2 more chatbot startups, Growbot and Butter.ai | VentureBeat | Bots | by Ken Yeung Exclusive: Why Microsoft is betting its future on AI | The Verge Research Google’s DeepMind AI to use 1 million NHS eye scans to spot diseases earlier | Ars Technica Artificial Intelligence May Aid in Alzheimer’s Diagnosis – Neuroscience News Application of Machine Learning to Arterial Spin Labeling in Mild Cognitive Impairment and Alzheimer Disease Steering a Predator Robot using a Mixed Frame/Event-Driven Convolutional Neural Network Super-intelligent predator robot is taught to hunt down prey in chilling experiment | Daily Mail Online Technology Release of IPython 5.0 Skype chatbots now work in group chats | VentureBeat | Bots | by Khari Johnson Microsoft’s Project Malmo AI platform goes open source | ZDNet Projects Teaching an AI to write Python code with Python code Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow – WildML Specials Data Science Summit – JULY 12-13 in SAN FRANCISCO / Use code TWIML20 for 20% off registration FREE O’Reilly Early Access Ebook: Mastering Feature Engineering
This week’s show covers the International Conference on Machine Learning (ICML 2016), “dueling architectures” for reinforcement learning, AI safety goals for robots, plus top AI business deals, tech announcements, projects and more. ICML 2016 –Accepted Papers | ICML New York City – Which companies had accepted papers at #icml2016 ? Best Paper Awards – [1511.06581] Dueling Network Architectures for Deep Reinforcement Learning – [1601.06759] Pixel Recurrent Neural Networks – [1602.07415] Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling – My winner in the best name category: Extended and Unscented Kitchen Sinks – Demystifying Deep Reinforcement Learning Research Google Research Blog: Bringing Precision to the AI Safety Discussion OpenAI Blog: Concrete AI safety problems Paper: 1606.06565.pdf OpenAI technical goals Artificial intelligence achieves near-human performance in diagnosing breast cancer — ScienceDaily Paper: 1606.05718.pdf Business Twitter pays up to $150M for Magic Pony Technology, which uses neural networks to improve images | TechCrunch Increasing our Investment in Machine Learning | Twitter Blogs Artificial Intelligence Explodes: New Deal Activity Record For AI DARPA is looking to make huge strides in machine learning | PCWorld Data-Driven Discovery of Models (D3M) – Federal Business Opportunities: Opportunities AI Culture Wars in Silicon Valley How Siri Started — and Lost — the Assistant Race How Google is Remaking Itself as a “Machine Learning First” Company — Backchannel AI, Apple and Google Technology Lighting the way to deep machine learning | Engineering Blog | Facebook Code Intel Launches ‘Knights Landing’ Phi Family for HPC, Machine Learning The Toronto Raptors Are Using IBM’s Watson to Draft A Winning Team | Motherboard Projects Hello, TensorFlow!
How to read: Character level deep learning GitXiv: Collaborative Open Computer Science Machine Learning Yearning Mastering Feature Engineering – O’Reilly Media Bonus I didn’t have time to cover: The Stanford Question Answering Dataset
This week’s show looks at Facebook’s new DeepText engine, creating art with deep learning and Google Magenta, how to build artificial assistants and bots, and applying economics to machine learning models. Here are the notes for this week’s show: DeepText: Facebook’s Text Understanding Engine Introducing DeepText: Facebook’s Text Understanding Engine FBLearner Flow Research: Text Understanding from Scratch Natural Language Processing (almost) from Scratch Machine Learning and Art Google Magenta Neural Art A Neural Algorithm of Artistic Style Neural Art in TensorFlow Autoencoding Blade Runner Courses: NYU’s Machine Learning for Artists Goldsmiths, University of London The Latest TensorFlow Paper TensorFlow: A system for large-scale machine learning Business of ML & AI Microsoft Confirms Microsoft Ventures VC Arm Intel Acquires Computer Vision for IOT, Automotive Lumiata Closes $10 Million Series B Financing with Intel Capital Findo raises $3M to help you find files and documents through natural language queries More Bots, and How to Build Artificial Assistants Motion AI lets anyone easily build a bot Sequel lets you create a ‘Me’ bot, beats Google to the punch Hybrid Intelligence: How Artificial Assistants Work The Economics of Machine Learning models The preoccupation with test error in applied machine learning Towards Cost-Optimized Artificial Intelligence More Cool Deep Learning posts Deep Reinforcement Learning: Pong from Pixels A Survey of Deep Learning Techniques Applied to Trading Just for Fun Building an IoT Magic Mirror Magic Mirror on GitHub Image Credit: Microsoft