Since September 2016, I have been a University Lecturer (equivalent to US Assistant Professor) in Machine Learning at the Department of Engineering of the University of Cambridge, UK. Before that, I was a postdoctoral fellow in the Harvard Intelligent Probabilistic Systems group at the School of Engineering and Applied Sciences of Harvard University, working with the group leader, Prof. Ryan Adams. This position was funded through a postdoctoral fellowship awarded by the Rafael del Pino Foundation. Earlier, I was a postdoctoral research associate in the Machine Learning Group at the Department of Engineering of the University of Cambridge (UK) from June 2011 to August 2014, working with Prof. Zoubin Ghahramani. During my first two years in Cambridge, I worked on a collaboration project with the Indian multinational company Infosys Technologies. I also spent two weeks giving lectures on Bayesian machine learning at Charles University in Prague (Czech Republic). From December 2010 to May 2011, I was a teaching assistant at the Computer Science Department of Universidad Autónoma de Madrid (Spain), where I completed my Ph.D. and M.Phil. in Computer Science in December 2010 and June 2007, respectively. I also obtained a B.Sc. in Computer Science from the same institution in June 2004, with a special prize for the best academic record at graduation. My research revolves around model-based machine learning, with a focus on probabilistic learning techniques and a particular interest in Bayesian optimization, matrix factorization methods, copulas, Gaussian processes, and sparse linear models. A general feature of my work is an emphasis on fast methods for approximate Bayesian inference that scale to large datasets. The results of my research have been published in top machine learning journals (Journal of Machine Learning Research) and conferences (NIPS and ICML).
There are few things I love more than cuddling up with an exciting new book. There are always more things I want to learn than time I have in the day, and I think books are such a fun, long-form way of engaging (one where I won’t be tempted to check Twitter partway through). This book roundup is a selection from the last few years of TWIML guests, counting only the books related to ML/AI published in the past 10 years. We hope that some of their insights are useful to you! If you liked their book or want to hear more about them before taking the leap into longform writing, check out the accompanying podcast episode (linked on the guest’s name). (Note: These links are affiliate links, which means that ordering through them helps support our show!)

Adversarial ML
- Generative Adversarial Learning: Architectures and Applications (2022), Jürgen Schmidhuber

AI Ethics
- Sex, Race, and Robots: How to Be Human in the Age of AI (2019), Ayanna Howard
- Ethics and Data Science (2018), Hilary Mason

AI Sci-Fi
- AI 2041: Ten Visions for Our Future (2021), Kai-Fu Lee

AI Analysis
- AI Superpowers: China, Silicon Valley, and the New World Order (2018), Kai-Fu Lee
- Rebooting AI: Building Artificial Intelligence We Can Trust (2019), Gary Marcus
- Artificial Unintelligence: How Computers Misunderstand the World (The MIT Press) (2019), Meredith Broussard
- Complexity: A Guided Tour (2011), Melanie Mitchell
- Artificial Intelligence: A Guide for Thinking Humans (2019), Melanie Mitchell

Career Insights
- My Journey into AI (2018), Kai-Fu Lee
- Build a Career in Data Science (2020), Jacqueline Nolis

Computational Neuroscience
- The Computational Brain (2016), Terrence Sejnowski

Computer Vision
- Large-Scale Visual Geo-Localization (Advances in Computer Vision and Pattern Recognition) (2016), Amir Zamir
- Image Understanding Using Sparse Representations (2014), Pavan Turaga
- Visual Attributes (Advances in Computer Vision and Pattern Recognition) (2017), Devi Parikh
- Crowdsourcing in Computer Vision (Foundations and Trends® in Computer Graphics and Vision) (2016), Adriana Kovashka
- Riemannian Computing in Computer Vision (2015), Pavan Turaga

Databases
- Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases (2021), Xin Luna Dong
- Big Data Integration (Synthesis Lectures on Data Management) (2015), Xin Luna Dong

Deep Learning
- The Deep Learning Revolution (2016), Terrence Sejnowski
- Dive into Deep Learning (2021), Zachary Lipton

Introduction to Machine Learning
- A Course in Machine Learning (2020), Hal Daumé III
- Approaching (Almost) Any Machine Learning Problem (2020), Abhishek Thakur
- Building Machine Learning Powered Applications: Going from Idea to Product (2020), Emmanuel Ameisen

ML Organization
- Data Driven (2015), Hilary Mason
- The AI Organization: Learn from Real Companies and Microsoft’s Journey How to Redefine Your Organization with AI (2019), David Carmona

MLOps
- Effective Data Science Infrastructure: How to Make Data Scientists Productive (2022), Ville Tuulos

Model Specifics
- An Introduction to Variational Autoencoders (Foundations and Trends® in Machine Learning) (2019), Max Welling

NLP
- Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics (2013), Emily M. Bender

Robotics
- What to Expect When You’re Expecting Robots (2021), Julie Shah
- The New Breed: What Our History with Animals Reveals about Our Future with Robots (2021), Kate Darling

Software How-To
- Kernel-Based Approximation Methods Using Matlab (2015), Michael McCourt
“Speech is the most natural way to communicate, so using voice to interact with machines has always been one of the top scenarios associated with AI.”

Getting to Know Li Jiang

Li Jiang, now a distinguished engineer at Microsoft Speech, was introduced to speech recognition while doing research as an undergraduate student in the mid-1980s. He fell in love with the problem space while building a speech recognition system on a rudimentary Apple computer. “It was simply magic to see [the] computer respond to voice and recognize what is said.” Halfway through his PhD program, Li started to focus on speech recognition during an internship with Microsoft Research in 1994. Li loved the company and the experience so much that he stayed on and never actually returned to school! Over the past 27 years, Li has worked in different roles across both the research and engineering teams, eventually returning to the field of speech. He currently leads the audio and speech technology department under Azure Cognitive Services. “I was fortunate to witness not only the dramatic advancement in technology in the past few decades, but how the technologies are enabling people to do more and to improve their productivity. It has been a great ride and I loved every moment of it.”

Progression of Speech Technologies

Early speech technologies were pattern recognition and rule-based expert systems. When Li got started, simple pattern matching tools were among the early leaders in the space. One of the better-known technologies at the time was Dynamic Time Warping (DTW). These systems essentially tried to match an incoming speech sequence against stored sequence templates, and would recognize the speech if the sequence matched a template. However, there was always some kind of restriction: these systems worked best with isolated speech, small vocabularies, or single-speaker audio. The introduction of Hidden Markov Models (HMMs) served as a foundation for modern speech recognition systems, enabling accurate recognition of large-vocabulary, speaker-independent, continuous speech. Around 2010, deep learning studies started showing promising results for speech recognition. The LSTM model was found to be very well suited to speech, and it has since served as the foundation for the current generation of improved models. More recently, transformer models, which originated in natural language processing research, have shown promising improvements across different speech tasks. In 2016, Microsoft actually reached human parity on the challenging Switchboard task, thanks to deep learning technologies.
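To make the template-matching idea concrete, here is a minimal sketch of the classic DTW dynamic-programming recurrence. The feature sequences and distance function are toy placeholders, not the systems Li describes.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping: minimal alignment cost between two feature
    sequences that may differ in length and speaking rate."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
            # Extend the cheapest of: diagonal match, insertion, deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

# Toy template matching: the same "word" spoken at a different rate still
# aligns with low cost, so the nearest stored template wins.
template = np.sin(np.linspace(0, 3, 40))[:, None]
utterance = np.sin(np.linspace(0, 3, 55))[:, None]
print(dtw_distance(utterance, template))
```

A recognizer of this era would store one template per word and pick the template with the smallest alignment cost, which is exactly why vocabulary size and speaker variation were hard limits.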
The Switchboard Task & Human Parity

The Switchboard task is kind of like the Turing Test for the speech community, in that it’s the benchmark of human equivalence in system performance. In it, two people select a common topic and carry on a free-form conversation that the systems transcribe. Initially, systems had a very high error rate on this task, in the range of 20%. In October 2016, professional transcribers reported an error rate of 5.9% on the Switchboard task, while the Microsoft system using deep learning was able to achieve 5.8%. This was the first time human parity had been achieved in the speech recognition space.

Pros and Cons of Different Architectures

The traditional speech recognition architecture is a hybrid model made up of three parts: an acoustic model, a language model, and a pronunciation model. The acoustic model tries to model acoustic events, essentially telling which particular sounds are produced in a sequence; it is trained on many hours of speech. The language model, trained on large amounts of text, captures which word sequences are likely. The pronunciation model connects the acoustic sounds and the words together. The biggest benefit of this hybrid model is that it’s easily customizable: since it has such a robust collection of sounds and speech, it’s easy to feed it a new word or new sound and have it integrated into the system. The downside of this hybrid model is that the memory footprint is huge. Even in its highly compressed binary form, it still takes multiple gigabytes.

End-to-end models have received a lot of attention recently, and rightly so: they have progressed immensely over the last few years. An end-to-end model is a single model that can take in speech and output text, essentially jointly modeling the acoustic and language aspects. End-to-end models can more or less match a hybrid model in accuracy while being much more compact: an end-to-end model is small enough to fit on a smartphone or an IoT device. The downside of the end-to-end model’s smaller size is that the model is much more dependent on the speech labels and data it’s trained on, and not as adaptable to new language. It’s much harder for an end-to-end model to incorporate new vocabulary, and the community is working on ways to make end-to-end models more flexible.

Integrating Specialized Vocabulary

Even though a generic system performs pretty well, there are many domain-specific terms that systems can struggle with if not trained on them specifically. For this reason, Microsoft allows customers to bring their own specialized data and customize their language models. This is especially necessary for specialized domains like medicine, which have a lot of specific terminology and require a much deeper data investment. Microsoft recently acquired Nuance, a leader in the medical speech recognition space, to help make this process even smoother. Li believes that it’s important to continue to improve both generic model capability and domain-specific models. The more data an algorithm is given, the more effective its training becomes. Li hopes that eventually a model will be good enough to handle almost all domain-specific scenarios, but until then, we have to take a pragmatic approach and ask how we can make this technology work for different domains.

Specific Use Cases at Microsoft

A major challenge at Microsoft is figuring out how to stay current with technical innovations while still maintaining a short research-to-market cycle and keeping costs economical for customers. Li mentioned that there’s a lot of work being done to make inference faster, models smaller, and latency lower. For most customers, it only takes a few hours of speech data to get a really high-quality voice. Microsoft uses this technology internally, too: Li mentioned he spent about 30 minutes building a personal voice font for himself. “It’s really interesting to hear your own voice and read your own email, that’s a very interesting experience.” For large and widely spoken languages, like English and Chinese, there’s a ton of data to train models on. It’s more challenging when it comes to smaller languages that have less data to train on. To accommodate this, Li’s team is using transfer learning on a pre-trained base model, then adding language-specific data. This approach has been working really well!
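That low-resource recipe, pre-training a shared acoustic encoder on data-rich languages and fine-tuning only a small language-specific head, might look roughly like the sketch below. The architecture, dimensions, and data here are hypothetical placeholders, not Microsoft's actual pipeline.

```python
import torch
from torch import nn

# Hypothetical pre-trained multilingual acoustic encoder (placeholder layers).
encoder = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, padding=1),  # 80-dim filterbank frames in
    nn.ReLU(),
)
# New head sized to the low-resource language's symbol set (say, 48 phones).
head = nn.Conv1d(256, 48, kernel_size=1)

# Freeze the shared encoder so the small dataset only adapts the head.
for p in encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step: (batch, mel_bins, frames) -> per-frame labels.
features = torch.randn(4, 80, 100)
labels = torch.randint(0, 48, (4, 100))
logits = head(encoder(features))   # shape (4, 48, 100)
loss = loss_fn(logits, labels)     # framewise classification loss
loss.backward()
opt.step()
```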
Emotional Encoding & Deepfakes

The Microsoft Speech team is also working on encoding emotional styles into text-to-speech software, which can be differentiated across vertical domains. For a news anchor, for example, the voice tone is programmed to be calm and reputable, whereas for a personal assistant it’s more warm and cheerful. In order to prevent the use of the technology for malicious deepfakes, a top priority at Microsoft is making sure text-to-speech software is being used responsibly. The company has a dedicated responsible AI team, and a thorough review process that ensures a customer’s voice is their own and not someone else’s. They are also working on a feature that could add a unique watermark so that audio can be detected as generated by text-to-speech software.

The Future of Speech

Li hopes to continue improving the technology itself going forward. He looks forward to having speech recognition systems learn abbreviations and be able to “code switch,” recognizing the same voice even across different languages. Li hopes to make the system more robust and more portable, easier to apply to different applications, and able to make fewer recognition errors. He said he’s always learning about areas where the system struggles and ways to keep improving its capability to help Microsoft better serve its customers. To hear more about the evolution of speech technology, you can listen to the full episode here!
Sam Charrington: Hey, what’s up everyone! We are just a week away from kicking off TWIMLfest, and I’m super excited to share a rundown of what we’ve got in store for week 1. On deck are the Codenames Bot Competition kickoff, an Accessibility and Computer Vision panel, the first of our Wellness Wednesdays sessions featuring meditation and yoga, as well as the first block of our Unconference Sessions proposed and delivered by folks like you. The leaderboard currently includes sessions on Sampling vs Profiling for Data Logging, Deep Learning for Time Series in Industry, and Machine Learning for Sustainable Agriculture. You can check out and vote on the current proposals or submit your own by visiting https://twimlai.com/twimlfest/vote/. And of course, we’ll have a couple of amazing keynote interviews that we’ll be unveiling shortly! As if great content isn’t reason enough to get registered for TWIMLfest, by popular demand we are extending our TWIMLfest SWAG BAG giveaway by just a few more days! Everyone who registers for TWIMLfest between now and Wednesday, October 7th will be automatically entered into a drawing for one of five TWIMLfest SWAG BAGs, including a mug, t-shirt, and stickers. Registration and all the action takes place at twimlfest.com, so if you have not registered yet, be sure to jump over and do it now! We’ll wait here for you.

Before we jump into the interview, I’d like to take a moment to thank Microsoft for their support for the show, and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer—without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand, and accelerate decision-making into your apps. Visit aka.ms/cognitive to learn how customers like Volkswagen, Uber, and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you’re there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure Free Account. That link again is aka.ms/cognitive. And now, on to the show!

Sam Charrington: [00:03:14] All right, everyone. I am here with Cha Zhang. Cha is a Partner Engineering Manager with Microsoft Cloud and AI. Cha, welcome to the TWIML AI Podcast.

Cha Zhang: [00:03:25] Thank you, Sam. Nice to meet you.

Sam Charrington: [00:03:27] Great to meet you as well. Before we dive in, I’d love to learn a little bit about your background. Tell us how you came to work in computer vision.

Cha Zhang: [00:03:38] Sure. I actually have been at Microsoft for 16 years. I joined Microsoft originally as a researcher at Microsoft Research. I was there for 12 years. My research was primarily applying machine learning to image, audio, video; all of these different applications. In 2016, I joined the product side, and currently I’m working as an engineering manager, and my primary focus is on document understanding.

Sam Charrington: [00:04:11] Awesome. Awesome. So, we will be focusing quite a bit on OCR and some of your work in that space, and, you know, I think people often think of OCR as a solved problem, right?
It’s, you know, we’ve been scanning documents and extracting text out of those documents for a long time. Obviously the advent of deep learning changes things, but I’d love to get the conversation started by having you share a little bit about what’s new and interesting in the space. How has it changed over the past few years?

Cha Zhang: [00:04:50] Sure. Actually, it wasn’t very long ago that when people talked about OCR, what came to mind was mostly scanned documents. In many people’s eyes, OCR for scanned documents is sort of a solved problem. More recently, I think there have been two major developments. One is that, with a mobile-first kind of world where everybody now has a mobile phone and takes pictures everywhere, there’s a lot of demand for text recognition on images in the wild, and that certainly is a much more challenging problem than scanned documents. And then, technically, because of the advances in deep learning, we have realized that with deep learning we can do OCR at a different level. We can make it a lot more accurate than before, and we can solve the OCR problem in this images-in-the-wild scenario. So I think, starting in the early 2010s, there have been a lot of big advances in this area, and now we’re seeing basically OCR become something that really works. You know, people don’t need to worry about quality, etcetera; it mostly just works.

Sam Charrington: [00:06:08] Can you talk a little bit more about the challenges that arise when you’re trying to do OCR in the wild?

Cha Zhang: [00:06:16] Of course. I think for documents, usually it’s white background and black text, but for images in the wild, essentially it’s a photo. So in the photo, there’s a lot of variation in the text. First, there’s a huge scale variation: if you capture a picture of a street, there might be some store name that is super big, and then there is some tiny text that’s hard to see. So there’s a big variation in the scale of the text, and the aspect ratio of the text can be really long, because a text string can be very long compared to regular objects like a cat or a dog. Because of the mobile capture scenario, it’s usually difficult to enclose these texts with axis-aligned rectangles; for example, there might be perspective distortions of the text when the camera sees it. The background in an image in the wild is much more complicated than the typical white background you see in scanned documents, and some of these backgrounds, such as fences, bricks, and stripes, appear quite simple to human beings, but think of how a fence can look like a bunch of ones, you know, sitting there on the street; they look very similar to characters. So those create additional challenges, and I think one of the biggest ones technically for OCR is localization accuracy. Typically in object detection, localization accuracy is measured by intersection over union, and if that criterion is bigger than 0.5, people think it is good enough. But for OCR, if the intersection is only half of the union, a lot of the characters will be missing. So usually OCR needs a 0.9, 0.95 level of accuracy in order to recognize all the characters properly. So…

Sam Charrington: [00:08:31] Can you explain that in more detail? What is intersection over union, and how is that used in object detection?
Cha Zhang: [00:08:39] So, in order to measure the accuracy of a particular detection algorithm, you need ground-truth labeled data, and so, typically what people do is they create a bounding box of the object to be detected, and then you use an automatic algorithm to figure out where the object is, which will also create a bounding box. Now you have two bounding boxes, and the question is how you measure how well these two boxes align. A common measure is to take the intersection of these two bounding boxes and the union of these two bounding boxes, so you get two areas. You can imagine, if the two bounding boxes are very close to each other, overlapping a lot, then the intersection over union will be very high, but if they’re offset by quite a bit, then the number is low. So that’s the academic standard for how people measure detection accuracy with this criterion.

Sam Charrington: [00:09:46] Got it. And so, you were saying that the threshold that you need in the case of text is higher because of what?

Cha Zhang: [00:09:58] Because of… Let’s just think about it. You have a ground-truth text, let’s say, “Hello world,” and it’s an elongated rectangle. Say I have a text detection algorithm that also creates a bounding box, but with an intersection over union of, let’s say, roughly 0.5. What that means is that the intersection area divided by the union of the two bounding boxes is 50%. So very likely the detected bounding box will miss a few characters, because the overlap is not there. You might miss a D as an N, and all of this will cause the OCR to produce wrong results. And so that’s the main challenge here.

Sam Charrington: [00:10:48] So in the case of a traditional object detection scenario, you may miss half of the face, but you can tell that there’s a face there. In the case of OCR, you’re just missing letters, and it makes it a lot more difficult for the algorithm to guess what was there.

Cha Zhang: [00:11:07] Yes, exactly.
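To pin the metric down, intersection over union for two axis-aligned boxes can be computed as in the sketch below. It is generic, not tied to any Microsoft system, and the numbers are made up to show how a detection can score an acceptable-looking IoU of 0.5 while cutting off half the characters of a text line.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Ground truth: a long, thin text line, e.g. "Hello world" spanning x = 0..200.
truth = (0, 0, 200, 20)
# A detection covering only the right half scores IoU = 0.5, which passes the
# usual object-detection bar, yet half the characters are gone.
detected = (100, 0, 200, 20)
print(iou(truth, detected))  # 0.5
```

Requiring IoU at the 0.9 or 0.95 level instead forces the detector to cover nearly the whole line, which is the stricter standard described here for OCR.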
Sam Charrington: [00:11:08] Got it. And maybe taking a step back to the problem as a whole, granted mobile is driving this transition to these in-the-wild pictures and people trying to OCR them, but what are the high-value use cases there? I’m thinking of some interesting ones, like when it’s in conjunction with translation. You know, maybe I’m in another country and, I’ve done this, you’re taking pictures of words in another character set to try to read the menu or something like that. I’ve also done things like scan documents on a phone, and you want to OCR those, but that’s kind of back to the traditional OCR problem in a lot of ways. What are some of the other use cases that are common?

Cha Zhang: [00:11:58] If you look at the business opportunities, I still think the traditional scanned document is big. Some traditional kinds of OCR problems, like, for example, receipts, which people scanned in the old days, are nowadays mostly handled by snapping a photo for reimbursement. So in terms of the market and the revenue, I think that’s still quite a big one. There are a few others. The one that you mentioned, where you have a phone, you go to a foreign country, you snap a photo, and you want it translated, is one. There are also a lot of applications in digital asset management. So this is when, whether you’re a big company or an individual, you have some big store of photos that you want to organize. We have shown that with OCR capability, you can increase the accuracy of processing and retrieving these photos. As a matter of fact, when the big search engines like Google and Bing search images, OCR is an integral part of that as well, because the text content can help a lot in getting the best images.

Sam Charrington: [00:13:22] Okay. And so, you were mentioning some of the technical challenges, and localization of the text in these images is one of those challenges. How do you go about it? Is it the case that deep learning is so powerful that off-the-shelf deep learning techniques just solve it for you, or do you re-engineer the whole pipeline? How do you approach that?

Cha Zhang: [00:13:53] So in text detection, usually the pipeline is different from traditional object detection. What’s been most popular for OCR on images in the wild today is something called anchor-free detection. So the idea… Anchor-free. In typical object detection, most well-known detectors, like Fast R-CNN and Faster R-CNN, etcetera, basically create these anchors and then regress the actual bounding box of the objects. The challenge of using that kind of approach is that these anchors need to be preset, so typically for normal object detection, you set anchors at a certain density and with a certain set of aspect ratios, like one to two, one to three, one to one. Typically you go about there. But some text can go like twenty to one, so really it would be a huge computational cost to go with an anchor-based approach. So in modern days, for OCR, we go anchor-free, and the high-level concept is essentially that, using convolutional neural networks, you make almost a per-pixel-level decision or classification saying, well, the region near this particular pixel looks like part of a text. So there is a text/non-text classification at almost the per-pixel level. Then you rely on a few algorithms to group these into text lines, by looking at how similar two text regions are to each other; you can decide, well, these two look like the same texture and color, and maybe they should be connected. In this regard, there are quite a few well-known algorithms to do this connection. In earlier days, people used relatively rule-based approaches, like stable link, where they link based on some features, but it’s kind of rule-based. More recently, people have started looking into neural networks like relation networks, which estimate the relation of two regions’ features and, based on that, decide whether these two should be connected or not. So that way you go bottom-up: you start with per-pixel classification, then you do grouping, and you come out with these text lines. It’s a very powerful approach. Not only can it detect straight lines, but even curved lines can be handled pretty well with those approaches.
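A toy version of that bottom-up pipeline, score each pixel as text or non-text, threshold, then group neighboring text pixels into candidate lines, is sketched below. The score map is faked with hand-placed blocks and the grouping is plain connected components from scipy; real systems use learned CNN scores and smarter grouping such as the relation networks mentioned above.

```python
import numpy as np
from scipy import ndimage

# Stand-in for a CNN's per-pixel text/non-text score map (values in [0, 1]).
rng = np.random.default_rng(0)
score_map = rng.random((32, 32)) * 0.3
score_map[10:14, 2:30] = 0.9   # a long horizontal stroke of "text" pixels
score_map[20:24, 6:16] = 0.85  # a second, shorter line

# Step 1: per-pixel classification by thresholding the score map.
text_mask = score_map > 0.5

# Step 2: group neighboring text pixels into candidate text lines.
labels, num_lines = ndimage.label(text_mask)
for line_id in range(1, num_lines + 1):
    ys, xs = np.where(labels == line_id)
    # Bounding box of each grouped region; aspect ratios can be extreme,
    # which is exactly why preset anchor shapes struggle with text.
    print(f"line {line_id}: x={xs.min()}..{xs.max()}, y={ys.min()}..{ys.max()}")
```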
Sam Charrington: [00:16:44] So it sounds like you’re describing a pipeline. It’s not an end-to-end trained single neural network where you give it images, train it on labeled data, and it tells you what the text is, but rather a bunch of independent steps.

Cha Zhang: [00:17:04] Yes, that’s a very good observation. Actually, for OCR, detection is only the first step. After detection, we typically run a character model, where you take the detected text lines, normalize them into a straight line with a fixed height, and then decode the image into a list of characters. A lot of the approaches are actually similar to speech, where you go from acoustics to text; here we’re going from image to text. But a lot of the techniques we use, like LSTMs and language modeling, are very similar. Now, your question is certainly valid, because in speech today, people do end-to-end training: they start from audio and can go directly to text. For OCR, we are not there yet. I think the main challenge, well, first, is how much data you have. In speech, you can collect a lot more data compared with OCR. OCR data is usually very expensive to collect and label, and so going stage by stage at this point is more economically doable than doing end-to-end training.

Sam Charrington: [00:18:25] Why is that? It seems like we have tons of pictures with words in them. Is it just the in-the-wild examples where we don’t have the labeled data, or is it also the document use cases? Because I’m imagining Microsoft has probably labeled a ton of receipts and business cards and that kind of thing.

Cha Zhang: [00:18:50] Yeah. Certainly, labeling is very, very expensive. Microsoft is a company paying a lot of attention to privacy and those kinds of issues, and collecting OCR data has been a major, I would say, blocking issue for going with this kind of end-to-end approach, because if you think about it, a lot of the documents that we actually care about, like invoices, receipts, business cards, all contain PII. That data is extremely difficult to obtain, and we follow very strict guidelines on how we can collect it and how we can label it. So in some ways we are limited by these privacy restrictions, but we do respect them a lot. As a result, we are not going end to end at this point.

Sam Charrington: [00:19:48] Got it, got it. It makes me think a little bit about some of the issues with neural networks remembering data. So, for example, you train a CNN and there are some attacks that will reproduce, to some degree or another, some of the images the model was trained on. Likewise, with these very large language models, you can start to see some of the text that the models were trained on come out in the output. I would imagine if you were training end to end, that becomes an issue as well, and maybe more so than in the case of images. What’s your intuition there? Would it be worse or better than images?

Cha Zhang: [00:20:39] I would imagine it would be similar, I would say. After all, in OCR you go from image to text, but during the learning of this OCR process, a language model is actually very helpful for improving OCR accuracy. For example, during the decoding of these text lines into text, we use some of these very popular language modeling schemes, like LSTMs. Certainly, it remembers the contextual information of the language in order to help the OCR recognize the text properly.
So, I think when you go end to end, when the amount of data you use for training is humongous… it’s difficult for me to imagine we’ll have a similar level of data as for training BERT models or GPT models. Those use huge, huge amounts of data. But still, you will learn something from the text, and it might leak into the model as well.

Sam Charrington: [00:21:51] Along those lines, what enabled BERT and many of the recent innovations around language models is a shift from supervised to a semi-supervised way of framing the task. Is there a semi-supervised framing for the OCR task that makes sense?

Cha Zhang: [00:22:13] Actually, for OCR today, we’re not doing that, although I think it’s definitely a very interesting research problem. I think BERT is a super nice framework for transfer learning: you start from a pre-trained model and then fine-tune. In the image world, I think transfer learning probably existed earlier than in language. In the earlier days, when we had ImageNet, we trained something like a ResNet, and those were already being used for transfer learning. Unsupervised image learning is also, I think, still ongoing; there are a lot of interesting projects going on. I think for OCR right now, we’re not there yet. One of the main issues for building a product like OCR using some of these pre-trained models is the computational cost. I think this happens in language as well: BERT models, the GPT-3 model, with multiple billions of parameters, are very difficult to turn into a product. For OCR, we have the same problem. Computational cost is very sensitive; we need to make it fast, and so we use relatively small models, and normally we train from scratch. Transfer learning does show some benefit, but when the data reaches a certain amount, we found that training from scratch is perfectly fine.

Sam Charrington: [00:23:49] When you have a certain amount of data to train from?

Cha Zhang: [00:23:53] Yeah. In the very early days, when we started doing deep learning OCR, we actually relied a lot on distillation, that is, teacher-student learning, where we first train a big model, and then we gradually use teacher-student learning to create a small model so that it can run efficiently. Nowadays, we have figured out that you can train these models from scratch. The amount of data that we have, on the order of hundreds of thousands to millions of images, is sufficient to train a smaller model from scratch and reach about the same accuracy.

Sam Charrington: [00:24:31] Can you elaborate a little bit on that? Are you saying that you need more data to train smaller models?

Cha Zhang: [00:24:37] No. Take BERT as an example. BERT is super beneficial for transfer learning because it has seen so many documents. So given any new language task, where presumably you don’t have much data to train this new task, leveraging BERT, which has seen so many documents, helps transfer some of the knowledge BERT has learned from that huge set of documents to the small task, reducing the amount of data required to train the smaller task. The same thing happens in ImageNet transfer learning, where, with a ResNet trained on ImageNet, you learn a lot of visual information from the ImageNet dataset.
Then, if you have a tiny detection task, like detecting a helmet, let’s say, you can do transfer learning and use a very small dataset to train a very good helmet detector. What I was saying just now was that for the problem of OCR, which is certainly a very important computer vision problem, every company that invests in OCR tends to collect quite a bit of data, not to the level of billions, but to the level of hundreds of thousands or millions. That amount of data is sufficient that you do not need to go with transfer learning. You can train the model from scratch and get very good results.

Sam Charrington: [00:26:19] Got it. Got it. So that was when you were using transfer learning with models based on ImageNet, along the lines of ResNet and others. Okay, let’s see… so the smaller models that you’re training, are they some of the traditional architectures that we’ve already brought up, or are you building out new architectures for the models themselves for this specific problem?

Cha Zhang: [00:26:53] Right now we’re using some of the traditional models. There is some active research going on regarding searching for the most effective architecture for OCR. We haven’t seen convincing results yet, but I think that’s a very active research area that we’re still looking into, particularly as we try to make the models smaller and smaller, faster and faster.

Sam Charrington: [00:27:20] When you say searching for the best architecture for OCR, are you using the word searching generally, like you have researchers looking at different models and trying to find the best one for OCR, or are you suggesting a domain-specific neural architecture search kind of thing?

Cha Zhang: [00:27:38] I mean neural architecture search. That certainly can be applied to OCR, and we’re still exploring it, but I think that’s a very promising direction.

Sam Charrington: [00:27:49] Okay. Interesting. Interesting. Earlier in the conversation you talked about one of the big use cases being some of this semi-structured data that we want to extract information out of; an invoice is one example. There was a recent demonstration, or I guess it’s actually a product now, of the mobile version of Excel or something, where you can take a picture of grid-like data, and it will both extract the text and organize it into a spreadsheet. Talk a little bit about the product that you’re working on, Form Recognizer, which is doing something similar.

Cha Zhang: [00:28:35] Yeah, of course. So OCR certainly is pretty low-level. Beyond some of the applications I mentioned earlier, like digital asset management, photo management, and translation, where you can directly use OCR, for many customers what they want is not just OCR. They want to extract information from documents. Think about: I need to process millions of invoices, and I want to extract the vendor name, the date, the total amount. Or it’s an expense system where you want to process all the receipts, and it can be for verification purposes, for example: how do I make sure employees are not putting in random numbers that don’t match the receipts that are actually filed? It sounds kind of silly, but today a lot of companies do this verification manually. Because of the huge amount of manual effort needed, they often can only do sampling.
So you sample like 5% of these receipts to validate, but you miss a huge chunk that you never even look at. So we are looking at this space, and we’re trying to build essentially two categories of product. One is a prebuilt set of products, and these are solutions that work out of the box. For example, it can be a prebuilt receipt model, a prebuilt business card model, a prebuilt invoice model. Basically, you send in an image or PDF file, and it will extract all the fields that you’ll be interested in. Another big category that we think is super important is customization, because the prebuilt models may never fit every need. So we have a solution called custom form, where we allow customers to send us a few sample images. You can label them, or even do no labeling at all, and we will be able to extract key-value pairs out of these documents. Again, we see this as much closer to what customers need, and that’s how Form Recognizer is positioned.

Sam Charrington: [00:30:54] So we’ve talked about a bunch of the interesting technical challenges at the lower level, at OCR. At the form level, is that kind of a packaging of OCR, or does it have its own technical challenges to overcome?

Cha Zhang: [00:31:13] Actually, it has a lot of very interesting challenges. Some recent work coming out of Microsoft Research is targeting exactly this problem. Just think about it: parsing these invoices and receipts is essentially sort of a language problem, because you have these texts there. The challenge here is that these are images, so you run OCR on them, but unlike a typical language dataset scraped from the internet, you know, Wikipedia, where you basically have the ordering of the words already, if the data comes from an image, you can detect the text lines, but it’s actually very difficult to define the reading order of these text lines, and that ordering by itself is a very challenging problem. When you have images in the wild, paper can be curved, can be crumpled, can be rotated; there’s perspective, all kinds of issues. There can be background text, all of these. So the particular approach that MSRA came out with is called LayoutLM. It’s actually a modified BERT model. It’s also a language model, but in addition to the language, we also embed 2D information, like the x, y position of the bounding box of the text. And this can all be trained without supervision; it’s unsupervised pre-training. We are able to learn these kinds of spatial relationships in these invoices without coming up with an explicit reading order. With that, we can actually do a lot of this key-value extraction really well.
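The core trick described here, adding bounding-box position embeddings to ordinary token embeddings so the model sees where each word sits on the page, can be sketched in a few lines. The vocabulary size, coordinate buckets, and dimensions below are arbitrary placeholders, not LayoutLM's actual configuration.

```python
import torch
from torch import nn

vocab_size, hidden = 30000, 128  # placeholder sizes, not LayoutLM's real config
coord_buckets = 1024             # page coordinates normalized to 0..1023

tok_emb = nn.Embedding(vocab_size, hidden)
x_emb = nn.Embedding(coord_buckets, hidden)  # embeddings for x coordinates
y_emb = nn.Embedding(coord_buckets, hidden)  # embeddings for y coordinates

def embed(token_ids, boxes):
    """token_ids: (seq,); boxes: (seq, 4) as (x0, y0, x1, y1) in 0..1023.
    Each token's vector is its word embedding plus embeddings of its box
    coordinates, so spatial layout enters the model without a reading order."""
    x0, y0, x1, y1 = boxes.unbind(-1)
    return tok_emb(token_ids) + x_emb(x0) + y_emb(y0) + x_emb(x1) + y_emb(y1)

tokens = torch.tensor([17, 905, 23])             # hypothetical token ids
boxes = torch.tensor([[40, 30, 200, 60],         # e.g. a header line
                      [40, 500, 150, 530],       # a "Total" label
                      [700, 500, 740, 530]])     # the amount, far to the right
print(embed(tokens, boxes).shape)                # torch.Size([3, 128])
```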
There’s also quite a lot of advanced research looking into, say, relation networks, where you see two text lines near each other and you can predict the relationship. Again, this is similar to the OCR case, where you have this bottom-up pixel-level classification and you want to group them; here you want to group key and value pairs. There’s also a lot of advanced research into graph convolutional networks, where you do convolution over a graph, and the graph is defined by connecting nearby text lines. Again, this is an approach that doesn’t require a reading order but just looks at the spatial relationships. So these are all actually very exciting extensions of language models, also using visual information to help parse this vertical data more accurately.

Sam Charrington: [00:34:09] Interesting. Yeah, at a quick thought I would’ve imagined that the top part of the stack is more rule-based and the bottom part of the stack is more machine learning based, but it sounds like there are a bunch of really interesting…

Cha Zhang: [00:34:33] We are doing a lot of machine learning stuff at the top as well.

Sam Charrington: [00:34:37] I’m imagining, when you talk about relation networks, for example, on an invoice you could have “date,” and then the date value horizontally next to it, or you can have “date” and then the date value beneath it.

Cha Zhang: [00:34:50] Yes.

Sam Charrington: [00:34:50] You may have an address box and then a bunch of text that comes beneath it. It would be nice to know that we’re talking about the address here. That’s part of the idea of the structured text extraction. So with that, you mentioned relation networks and graphical CNNs. Are those two approaches to solving the same problem, or are they solving different aspects of the problem?

Cha Zhang: [00:35:13] They solve different aspects of the problem, and they can also be used to solve the same one. Right now, the main focus for us is using them for extracting key-value pairs. This covers both the prebuilt models and the customization. Think about an invoice where you want the vendor name. Certainly there’s the text information: you see it looks like a vendor name, so this probably is a vendor name. And some invoices don’t even have the key in the invoice.

Sam Charrington: [00:35:48] Right.

Cha Zhang: [00:35:49] You don’t even have the words “vendor name” there, so how do you figure out this thing is still the vendor name? There, you rely on information from the language and also from how the document is laid out. Like, the font size may matter; the position of the name may matter. So we are looking into combining all this information to come out with a better decision on those fields.

Sam Charrington: [00:36:21] So, how does a graphical representation or way of thinking about the document get you to a solution to these kinds of problems? For example, the unlabeled vendor name?

Cha Zhang: [00:36:33] In the graphical approach, basically, you’ve got a bunch of text lines detected by the OCR, and you connect these text lines with their neighbors. You define, basically, how strong these connections are. Actually, it’s not hand-defined: you learn these relationships by looking at the texts, their relative positions, their font similarity. One issue that you just mentioned was addresses: you have multiple lines of an address, so how do you know they actually belong to the same address? All this side information can be very helpful in determining that they should be grouped together. In the graph convolutional model, each node is a text line, and you learn a convolutional network that aggregates, at the center node, information computed from all the neighboring nodes.
So basically, the model learns by looking not only at the current text line that’s in focus, but also at all the nearby text lines, and decides, well, given all this contextual information, it does look like this is a vendor name. I guess that’s a very high-level conceptual description of why it would work, but it’s data-driven machine learning, so the model [inaudible].
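A toy rendering of that neighbor-aggregation idea: each text line is a node carrying a feature vector, and one graph-convolution layer updates a node by averaging over itself and its neighbors before a learned projection. This is a generic sketch with hand-picked edges, not the production model.

```python
import torch
from torch import nn

num_lines, feat_dim = 5, 16

# One feature vector per detected text line (position, size, text and
# visual features would be concatenated here in a real system).
node_feats = torch.randn(num_lines, feat_dim)

# Adjacency: connect each text line to its spatial neighbors (hand-picked).
adj = torch.zeros(num_lines, num_lines)
adj[0, 1] = adj[1, 0] = 1.0   # e.g. a "Date" label next to its value
adj[2, 3] = adj[3, 2] = 1.0   # two lines of the same address block
adj += torch.eye(num_lines)   # self-loops so a node keeps its own features

# Row-normalize so each node averages over itself and its neighbors.
norm_adj = adj / adj.sum(dim=1, keepdim=True)

# One graph-convolution layer: aggregate neighbors, then a learned projection.
proj = nn.Linear(feat_dim, feat_dim)
updated = torch.relu(proj(norm_adj @ node_feats))

# updated[i] now encodes line i *and* its spatial context, which is what
# lets a classifier decide "this looks like a vendor name" from neighbors.
print(updated.shape)  # torch.Size([5, 16])
```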
Sam Charrington: [00:38:06] As you’re solving problems like this, do you often need to re-label your dataset? For example, I’m imagining early on in developing an algorithm like this, you have a bunch of invoices, and you draw a bounding box around the addresses and say, this is the address. Then you say, ‘Oh, well, the font information is a whole new dataset you have to label.’ Are you going in and having people label Helvetica versus Arial? That seems a bit fine-grained and hard to actually get experts to label, or is it more abstract than that?

Cha Zhang: [00:38:48] We usually only label the end goal, which is the field that you’re going to extract. So, for example, you want to extract the vendor name, the vendor address, the total: you basically draw bounding boxes on those regions and use that as the ground-truth data.

Sam Charrington: [00:39:06] Got it. I think we’re going to the same place. When you say font…

Cha Zhang: [00:39:11] When I say font, it’s actually in some way implicit, in the sense that we take these bounding boxes and extract image information. Think of it as, let’s say, running a convolutional network to extract a feature for that part of the text region, the text line. This feature is essentially all the visual information that can be helpful in determining the relationship between text lines. So if features are similar, it probably means they are a similar font, a similar size, those kinds of things. Yeah, I think that seems to be sufficient.

Sam Charrington: [00:39:55] So you’re not trying to featurize your underlying images into these distinct things, which is what I inferred when you said font. Is there an analogy to looking at the layers of the network, like when we do this with CNNs we see textures and things like that? Is there some analogy that you’ve seen in looking at the layers of the network that says, ‘Oh, this layer is like identifying fonts’?

Cha Zhang: [00:40:32] No, we haven’t gone there yet, though I guess it’s certainly interesting to look at. My take is that, most likely, font is just one attribute; I believe there are many other things. Yeah, I think it’ll be interesting to look at these features visually.

Sam Charrington: [00:40:54] We’ve talked throughout the discussion about the ways that OCR and this form recognition problem blend the vision domain and the NLP domain, and language models have come up quite a bit. Is there a little bit more depth we can go into there? Some of the ways that you see NLP, and particularly the advances in NLP over the past few years, influencing the problem and the way you solve it?

Cha Zhang: [00:41:32] Yeah. As I said, I see NLP playing a very important role in these verticals. After all, these invoices, receipts, business cards are all human artifacts. They’re kind of language artifacts in some way, right? So we definitely want to leverage all the latest state of the art in language modeling. The thing I mentioned earlier, LayoutLM, is one way to leverage it: using the language model but also embedding additional visual information, and hopefully solving these problems effectively, because the input is really different, right? The prior models take text as input; here we’re taking a bunch of text lines with their locations and bounding boxes as inputs, and the algorithm can naturally solve these problems.

Sam Charrington: [00:42:30] And is it also trying to do the traditional language model thing, predicting the next character or word or span of text?

Cha Zhang: [00:42:38] Yeah, the way we train them is very similar: basically masked text, where you mask some words and try to predict them. Certainly you can use a lot of other objectives. I know recently people have used translation targets; you can use autoencoder kinds of targets. This is a really active research area at this point. I think we’re still just scratching the surface, although we’re already seeing very, very promising results. So we definitely want to look deeper into this and see how far this can really push the state of the art.

Sam Charrington: [00:43:21] Continuing on that thread of active research areas and what the future holds in this area, what are you most excited about in this domain of OCR and, in general, extracting text from documents, vertical applications, and the like?

Cha Zhang: [00:43:42] Yeah, I think we have been working on this problem for quite a while, but there are still a lot of interesting problems. Only when we started to work with customers did we realize there are problems we haven’t been able to solve. I can name one: table extraction sounds trivial, but when you actually look at all the existing tables in the world, the simplest ones are those with explicit cell borders, where you have straight lines. In reality, these tables can have no cell boundaries at all, and can be mixed up with other content, all these things that make the problem extremely hard. So that’s one that is extremely challenging but that we want to solve. Another thing that I briefly mentioned earlier is the customization part of these verticals. How do you customize to a customer’s own data instead of relying on the prebuilt models? Because inevitably you will have data that doesn’t work with the prebuilt models. How do you give customers a way to build their own models that still work? That by itself is a very challenging problem, because asking customers to label a lot of data is painful; they don’t want to go there. So either we go unsupervised or we go with a very, very limited amount of supervision data. In such a case, how do we adapt our model so that it can work on the documents where the customer has realized the prebuilt model fails? That’s also a very interesting research problem that we are looking into. In language, I think it’s known as low-shot learning; it’s definitely applicable to the problem here as well.

Sam Charrington: [00:45:50] In the case of some of the productized vision offerings, Azure does this as well. The user is able to upload their own set of labeled data, and the results for object detection are fine-tuned against the user’s dataset.

Cha Zhang: [00:46:13] Yeah.
Sam Charrington: [00:46:14] The OCR and form recognition offerings, are they providing something similar? Like, can I upload my own invoices? Are you doing some kind of transfer learning? And if you are, what are you doing to take advantage of what the user is providing?

Cha Zhang: [00:46:33] So we do have a product called custom form, which allows customers to upload a few samples; we usually say a minimum of five samples. So, say you have an invoice that doesn’t work with the existing models, and you want to solve that problem. You upload five invoices that are similar, from the same vendor or similar in structure, and we can figure out the key-value pairs and extract them, either unsupervised or supervised. Unsupervised means the customer doesn’t need to label anything: you just upload the five documents. The information we gain by looking at these five documents is that they are supposed to be similar, and therefore there are going to be a bunch of words that are common across these documents. This commonality helps us tell that something is probably part of the empty form, the template of the form, while the things that vary across forms must be information the customer has filled in, different from sample to sample. With that information, we can actually extract key-value pairs without any supervision. All you need to do is upload five similar documents.
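That unsupervised signal, words recurring across all five samples are probably the form template while words that vary are probably filled-in values, can be illustrated in a few lines. The real product works on OCR output with positions; this sketch just uses bags of words from hypothetical documents.

```python
from collections import Counter

# Five hypothetical invoices from the same vendor, already OCR'd to words.
docs = [
    "Invoice Date 2020-03-01 Total 98.20 Contoso Ltd".split(),
    "Invoice Date 2020-03-15 Total 12.50 Contoso Ltd".split(),
    "Invoice Date 2020-04-02 Total 310.00 Contoso Ltd".split(),
    "Invoice Date 2020-04-19 Total 55.75 Contoso Ltd".split(),
    "Invoice Date 2020-05-07 Total 7.99 Contoso Ltd".split(),
]

# Count how many documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

# Words present in every sample are likely part of the empty form (template);
# words that vary across samples are likely filled-in values.
template = {w for w, n in doc_freq.items() if n == len(docs)}
values = [[w for w in doc if w not in template] for doc in docs]

print(sorted(template))  # ['Contoso', 'Date', 'Invoice', 'Ltd', 'Total']
print(values[0])         # ['2020-03-01', '98.20']
```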
Of course, that works to a certain degree, but if you’re still not happy with the accuracy, we provide a way for you to label your key-value pairs. We have a UX where you can go and label the fields you care about by essentially highlighting the OCR text lines that contain the values you want to extract. Then we actually learn a model out of the five samples and produce a model that the customer can use to extract these values. The accuracy is normally pretty high, actually in the 90 to 95 percent range.

Sam Charrington: [00:48:38] So when the customer does this, is this process entirely learned, or is there a human-in-the-loop kind of exception-handling element to it?

Cha Zhang: [00:48:50] Let me take a step back. Across all of these products, OCR has made significant advances, but if you actually care about the numbers, think about the invoice, right? If your total is wrong, it’s really that bad. So what we definitely recommend is that people have agent backup. For all of the products we offer, we give people confidence scores: how confident we are about the extraction of a particular value. Different customers can choose their own threshold and have an agent look at the results. But with today’s accuracy, we don’t recommend fully straight-through processing unless you are handling certain specific applications. I can give you an example: if you’re verifying a receipt image against employee-entered data, there you can go automatic, right? Because if the OCR produces a different number than the employee entered, you will need somebody to look at it anyway, but if they actually match, that probably means it’s okay.

Sam Charrington: [00:50:08] Right.

Cha Zhang: [00:50:08] So that application you can automate more.

Sam Charrington: [00:50:13] Got it. So, the question that I was asking is slightly different, though. Say you’ve got someone using automated form recognition, and they have their five examples that they haven’t been happy with, and they submit them through some website or API. Is someone at Microsoft taking those and manually going through some process to try to figure out why they’re not working, or are they thrown into some training job, after which the customer’s results get better?

Cha Zhang: [00:50:48] No, we don’t look at the customer’s data. This is a fully automated product, meaning the customer basically labels these files, they call an API to train a model, and the whole process is automated.

Sam Charrington: [00:51:04] So under the covers, are they kind of forking off their own model? Are the last few layers getting cut off and fine-tuned, or is it more elaborate than that?

Cha Zhang: [00:51:17] It’s more elaborate than that. Underneath the hood, there are multiple steps. We leverage a lot of information in these sample documents. For example, as I mentioned earlier, there will be words common across these samples. Those are very strong indicators that something might be part of the empty form, parts that are probably not so interesting to the customer. Transfer learning is certainly one way of doing this. Right now we actually train these models without transfer learning, so the model is trained from scratch. We’re able to do this with very few samples because of some very interesting work we have done to basically augment the data, to make sure there is sufficient data to still be able to train a model out of only five samples. This can be a feedback loop as well: if the customer’s not happy with a model trained on five samples, they can upload more, and we just train a new model. Every time you train, you get a new model, so it’s a feedback loop where the customer can keep improving their model until it reaches a stage where it’s really performing for them.

Sam Charrington: [00:52:53] So when you say augmenting the five that they’re providing, are we talking about data augmentation in the sense of a transformation pipeline that changes things, adds noise, rotates, that kind of thing? Or are we talking about some other dataset that you’re adding to their five and training on that aggregate dataset, and that’s how you’re producing a better model?

Cha Zhang: [00:53:21] Both, although I think the latter one matters more, because when customers label these data, we ask them to provide some additional information. For example, they label: this is a date. We know it’s a date, so we can artificially create more data to fill the form, producing more data to train the model. Also, we use machine learning algorithms that are robust to very few examples, so we can learn within this limitation. Normally, if you look at many of the other offerings out there, you have to train with hundreds of examples. Here we’re pushing it down to five, and we hope to push it even lower in the future.

Sam Charrington: [00:54:11] So I’m assuming that this is a stacked problem, and you’ve got some low-level OCR models, for example, that are trained with many, many documents. What you’re doing with this Form Recognizer custom data is more at the top-end node of that stack.
Is the off-the-shelf model that I'm using, without the five-example customization, also trained on relatively few examples?

Cha Zhang: [00:54:44] What do you mean?

Sam Charrington: [00:54:45] I guess maybe I'll jump ahead to the conclusion that I'm drawing. What's confusing me is how you're getting better results with few examples if you're not using any kind of transfer. I heard in your explanation that you're not doing any kind of transfer.

Cha Zhang: [00:55:03] So right now, custom forms support training models where each model is geared towards one particular form type. So in some ways, this problem is restricted. It's actually an easier problem. It's not like the pre-built invoice model, where essentially you want to handle all invoices. Here we're handling one particular invoice type coming from, I would say, one particular vendor that usually uses the same template.

Sam Charrington: [00:55:37] Got it. So does the customer then call a unique API to resolve invoices of this type? Or is that ensembled, with something that decides whether it's of the type that you've built the new model for?

Cha Zhang: [00:55:55] Yeah. Here's the recommendation that we give to customers, right? You maybe start with the pre-built model, and the pre-built model may work, and then your job is done. If you're happy, go. Then say you have a lot of invoices, and out of a thousand, 10 of them don't work. We offer the customer the ability to take those invoices and train specific models for those 10 invoice types. You might need to train more than one special model, because these invoices may look very different from each other, so imagine you train 10 different custom models for this. We also offer automatic invoice classification: an API called Model Compose, where we can compose these 10 small models into one, so all you need to do is call into that one. When you call into it, we also provide you a confidence, because at test time, when the customer sends an invoice in, we don't really know whether it's one that doesn't work with the pre-built model or one that works well with it. So you send the invoice first to the customized, composed model, and we will tell you, 'Hey, it doesn't look like any of the 10 you have trained.' In that case, you revert back and say, okay, now I'm calling the pre-built invoice model, because you sort of know the pre-built model actually works well for those. That's what we recommend customers do.

Sam Charrington: [00:57:34] Okay. I dug into a bit of the detail there, but it's interesting to see how the end-to-end problem is put together. In a case like this, the ends of that problem are on the customer side, not just in the service that you're offering, so seeing how the pieces are put together is kind of interesting. Awesome! Well, Cha, thanks so much for taking the time and walking us through some of the interesting things that are happening in these domains.

Cha Zhang: [00:58:12] Thank you for having me.

Sam Charrington: [00:58:14] Great! Thank you.
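The routing Cha describes, trying the composed custom model first and falling back to the pre-built one on low confidence, amounts to a few lines of client-side logic. Here's a minimal sketch; the model callables and the threshold are hypothetical stand-ins, not the actual Form Recognizer client API.

```python
CONFIDENCE_THRESHOLD = 0.7   # customers pick their own threshold

def extract_invoice(document, composed_model, prebuilt_model):
    """Try the composed custom model first; fall back to the pre-built
    invoice model when none of the custom types match confidently.

    The two model arguments are hypothetical callables returning
    (fields, confidence); a real client would call the service's
    REST API instead.
    """
    fields, confidence = composed_model(document)
    if confidence >= CONFIDENCE_THRESHOLD:
        return fields, "custom"
    fields, _ = prebuilt_model(document)
    return fields, "prebuilt"

# Toy stand-ins: the composed model is unsure, so we fall back.
composed = lambda doc: ({"Total": "$12.00"}, 0.35)
prebuilt = lambda doc: ({"Total": "$12.00"}, 0.90)
print(extract_invoice("invoice.pdf", composed, prebuilt))  # (..., 'prebuilt')
```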
What a start to week two of TWIMLcon 2021! Today’s sessions featured speakers from Wikimedia, Prosus Group, Palo Alto Networks, Clorox, Dataiku, Janssen Pharmaceutical Companies, iRobot, Algorithmia, and ClearML sharing their thoughts on building and running data science and ML platforms. We also got an overview of major themes and trends in machine learning for 2021. Without further ado, let’s review.

The day started with a chat with Chris Albon, Director of Machine Learning at Wikimedia, the foundation responsible for Wikipedia. Chris shared a lot about what it’s like to build an ML team that supports one of the largest websites in the world, and to do so completely open-source and transparently. He had a lot of great advice on choosing open source technologies with vibrant communities of developers who have thought hard about the problem space and the solutions required by their customers. He had some thoughts about how we might want to adjust how we think about models: “Now it’s about productization and treating models less like a crystal chandelier and more like a disposable coffee cup. If you find a better one, use it and throw away the old one.” He also touched on one of the themes that has been emerging through the conference, that of the “full-stack engineer” vs. specialization, and made the case that full-stack may be in vogue but that he really valued specialization.

Next up, we heard from Paul van der Boor, Senior Director of Data Science at Prosus Group. Paul jokingly refers to Prosus as the biggest internet company that you’ve never heard of. They are a global consumer internet group and one of the largest technology investors in the world. Their portfolio of companies serves more than 1.5 billion people in more than 80 countries and covers classifieds, payments/fintech, food, and education. Paul shared the ML platform architectures of three of their portfolio companies: OLX, iFood, and Swiggy. These architecture reviews were extremely informative; check out the replay for full details. From looking at and working with all these companies and more, he and his team have extracted a set of general principles that they recommend when making platform decisions:

- Architect for change: the user interface should be a separate layer of abstraction from the infrastructure below it, so you have the flexibility to change the underlying infrastructure and components. Use multiple components in parallel; try new things under the covers and see what works.
- Don’t reinvent the wheel: if there is an off-the-shelf component that works, consider using it vs. building it. There are many good components, tools, platforms, and services available now, and there is no reason not to use them as long as they’re abstracted from the user interface (point one above).
- Use tools that will scale to the degree you need them to scale.
- Take the MLOps perspective and build for the long haul.

What great advice.

Continuing the theme of AI operationalization, we had a panel discussion with Rasool Tahmasbi (Lead Data Scientist, Palo Alto Networks), Sarah Cullem (Director, Head of DTC Analytics & Data Science, Clorox), and Mike Becker (Data Scientist, The Janssen Pharmaceutical Companies of Johnson & Johnson), led by Conor Jensen (Director of AI Consulting and Data Science, Dataiku). Some highlights from the panel: Sarah made the case that simplification of tools and technologies allows your team to focus on solving the real business problems.
She shared three keys to success: set realistic expectations; clearly define shared terms; and be clear whether the output will be used only by machines or by machines AND humans, so that you can deliver what’s needed to the business. Conor Jensen outlined patterns of success that he has seen while working with customers across many industries: create a Center of Excellence (CoE) that sets up processes and tools; use prototypes to help get buy-in from leadership; and recognize that the UI that people use to interact with your model might be as important as the model itself. Rasool and Mike agreed on using POCs (proofs of concept) as a means to get buy-in and emphasized going for “small early quick wins.”

Next up was one of our “Team Teardowns,” this time featuring a discussion with iRobot ML team members Danielle Dean (Technical Director of Machine Learning), Mathew Salvaris (Lead Principal Machine Learning Scientist), and Mohan Muppidi (ML Cloud Architect). We asked them their thoughts on structure vs. flexibility and found that they were VERY strong proponents of structure. They are very prescriptive in their tools, infrastructure, and processes, and they shared a pretty strong belief that standardization simplifies collaboration and speeds up development. One interesting tip that Danielle raised, when we discussed the issue of maintaining team alignment, was their “hub-and-spoke” model. In this approach, they have a central ops team setting standards, but they also have somebody with operational skills in each major product team, so that the two can work together to maintain alignment to standards and practices.

The last session of the day before the networking and workshop was a talk by Diego Oppenheimer, CEO of Algorithmia. He shared the results of their third annual ML survey, titled 2021 Enterprise Trends in Machine Learning. In this survey they found 10 key trends across four main themes, covering budgets, use cases, model counts, governance, technology integration, organizational alignment, model deployment times, deployment challenges, and the costs of build vs. buy. We won’t steal his thunder here; we suggest that everybody go grab the full report. You can find it at: https://tinyurl.com/twiml. Diego’s passion for ML and MLOps was clear and he had a few great quotes:

Scaling the greatest technology: “This is the greatest technology of our lifetime, now it’s about getting the tools to be able to do it at scale”

MLOps is about speed: “I like to think about it as building a high speed highway. The existence of the highway doesn’t mean there aren’t controls (tolls, highway patrol) but it allows cars to move faster between destinations.”

Focus on the business impact, not on DIY: “We are builders - building is exciting. But what's even more exciting is moving the needle for a business, so focusing on that is the best way to increase focus (and funding) for ML efforts.”

Following a brief networking session, Ariel Biller from ClearML wrapped up our day by providing his thoughts on the state of ML and MLOps, stating that “ML/DL research is inherently messy. MLOps (automation, orchestration, reproducibility, and workflow integration) is the missing element.” Before launching into a thorough demo of ClearML, he shared the 4-step process that he believes most ML teams go through when trying to architect and build their ML platform. We think it was only half in jest: 1) We don’t need one; 2) We’ll build it ourselves; 3) We’ll build it again, but right this time; 4) Let’s go get one that’s already built.
Huge shout-out to Chris, Paul, Rasool, Sarah, Conor, Mike, Mathew, Mohan, Danielle, Diego, and Ariel for their insights and humor in today’s sessions. If you missed today’s sessions, it’s not too late to register for TWIMLcon! There are still two more days of sessions as well as an unconference, and registering now gets you access to all the conference session replays.
In this month's community segment we chatted about explainability, Carlos Guestrin’s LIME paper, Europe’s attempt to ban “untrustworthy” AI systems, and finally a blog post by community member Nicolas Teague entitled “A Sight for Obscured Eye, Adversary, Optics, and Illusions,” which explores the parallels between computer vision adversarial examples and human vision optical illusions.

In our presentation segment, Philosophie Group Inc. Director of AI, Chris Butler, joins us to discuss trust in AI. Chris gives us an overview of a number of papers on the topic, including:

- Humans and Automation: Use, Misuse, Disuse, Abuse
- Trust in Automation: Integrating Empirical Evidence on Factors That Influence Trust
- Some Observations on Mental Models
- Overtrust of Robots in Emergency Evacuation Scenarios

For links to the papers mentioned above and more information on this and previous meetups, or to get registered for upcoming meetups, visit twimlai.com/meetup!
Autonomous driving startup Comma.ai released a small dataset that lets you try your hand at building your own models for controlling a self-driving vehicle. The dataset consists of 10 video clips recorded at 20 Hz from a camera mounted on the windshield of a 2016 Acura ILX, about 7 hours of video in total, captured mostly during highway driving. Alongside the video files are sensor logs recording measurements such as velocity, acceleration, steering angle, GPS location and gyroscope angles. The dataset is a 45 GB compressed zip file that explodes to 80 GB when uncompressed. That is, if you can get it to uncompress: when I tried, after a fairly long download, unzip complained that the file was corrupt.

The project’s GitHub repo includes a script to download the data from archive.org, as well as some simple models built in Keras and TensorFlow for predicting steering angle and creating simulated road images using generative AI. (For a flavor of the steering-angle task, see the sketch below.) They’ve also included a paper on the latter topic. The idea is that since it’s pretty expensive to train a self-driving car on real roads, you typically want to train your algorithms in a simulator. To do that, you can either hand-code a simulator or use generative models to create one. The paper describes the use of variational autoencoders, generative adversarial networks and an RNN to create simulated road images.

You can start by running their existing models, but if you manage to do amazing things with the data, let Comma know—they’re hiring and want to meet you.
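To give a flavor of what "predicting steering angle" looks like in Keras, here's a toy regressor. This is not comma.ai's architecture or training pipeline (their repo has the real thing); the layer sizes, input resolution, and dummy data are all made up.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_model(h=160, w=320):
    """A tiny CNN that maps a single RGB frame to a steering angle."""
    return models.Sequential([
        layers.Input(shape=(h, w, 3)),
        layers.Conv2D(16, 8, strides=4, activation="relu"),
        layers.Conv2D(32, 5, strides=2, activation="relu"),
        layers.Conv2D(64, 5, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(1),            # regression output: steering angle
    ])

model = build_model()
model.compile(optimizer="adam", loss="mse")

# Stand-ins for camera frames (normalized to [0, 1]) and logged angles.
frames = np.random.rand(8, 160, 320, 3).astype("float32")
angles = np.random.rand(8).astype("float32")
model.fit(frames, angles, epochs=1, verbose=0)
```

Subscribe: iTunes / Youtube / Spotify / RSS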
I recently reported on the launch of the new NVIDIA TITAN X. At the time it wasn’t in the hands of any users, so any thoughts on relative performance were either vendor-provided or speculative. Well, a couple of researchers on the MXNet team were among the lucky folks who have their hands on the GPU at this point, and they published an initial benchmark this week following the deepmark deep learning benchmarking protocol. In a nutshell, they confirmed the speculation: the Pascal TITAN X is about 30% faster than the GTX 1080, and its larger memory supports larger batch sizes for models like VGG and ResNet. Relative to the older Maxwell-based TITAN X, the new GPU is 40-60% faster.

If a single GPU isn’t enough for you, you might be interested in the new prototype announced by Orange Silicon Valley and CocoLink Corp, which they’re calling the “world’s highest density Deep Learning Supercomputer in a box.” The machine loads 20 overclocked GPUs into a single 4U rack unit, offering 57,600 cores delivering 100 teraFLOPS. The team at Orange reports that an ImageNet training job that used to take a day and a half on a single NVIDIA K40 GPU can now be done in 3.5 hours using 8 GTX 1080s. The largest they’ve been able to scale a training job to is 16 GPUs, and they’re continuing to work on scaling to the full 20.

Also in GPU news, Microsoft announced yesterday that Azure N-Series virtual machines are now available in preview. These VMs use Tesla K80 GPUs, and the company claims they offer the fastest computational GPU performance in the public cloud. Moreover, unlike other cloud providers, these VMs expose the GPUs via Discrete Device Assignment (DDA), resulting in near bare-metal performance. 6-, 12- and 24-core flavors are available in the NC series of VMs, which is optimized for computational workloads. An NV series focused more on visualization is also available, based on Tesla M60 GPUs.

Subscribe: iTunes / Youtube / Spotify / RSS
Good morning,

First off, thanks everyone for your interest in the podcast. If you haven’t listened to the latest show, it’s a bit different than the previous ones. It’s the first in a series of interviews with folks doing interesting things in the machine learning and AI arena. I hope you find it interesting! This week the interview took the place of the regular news show, mostly because I didn’t have time to put the latter together. The news show is a ton of work, with each show taking about 24 hours to produce (down from 30+ when I started), and they can’t, by definition, be done in advance. All that said, I really believe in the format (creating it was scratching my own itch), so I’m working on ways to ensure it can continue uninterrupted, even when I’m traveling late in the week (as was the case this week), have other projects to attend to, or my wife gets tired of me dedicating the weekends to it (I'm starting to get that look). A couple of things I’m working on to this end are to (a) find some regular sponsors for the show and (b) find or hire someone, or a small team of someones, who can help me produce the show. Of course, (a) makes (b) possible, but I’m pursuing both in parallel as of now. You can help by continuing to share the podcast with your friends, review it on iTunes, post it, tweet it, etc. Ok, enough of the “inside baseball.” Here’s a quick rundown of the interesting ML and AI news for the week.

Business

We saw a few interesting business and product announcements this week:

Shopping and travel bot startup Mezi raised $9 million in a series A financing round closed this week. Investors in this round include previous investor Nexus Venture Partners and new investors Saama Capital and American Express Ventures. They've also brought on new individual investors Amit Singhal, former SVP and Head of Google Search, and Gokul Rajaram, Product Engineering Lead at Square.

B12 (like the vitamin, I suppose) raised a $12.4 million series A. They're not the first to talk about applying AI to website development; see The Grid for an earlier example. Like Mezi, they're also highlighting their use of hybrid AI in delivering their solution. We'll see a lot more of this type of business in the near future: startups taking traditional service-oriented businesses and sprinkling on some AI in the form of tools or automation under the covers, perhaps even just a bit to get started with.

Prospera has raised $7 million to commercialize one of the many applications of AI to agricultural data. Prospera is developing a system based on computer vision and deep learning technologies that will determine when, where and how much water to deliver to crops to improve yields while conserving resources.

Google introduces ML-based bid automation tools with AdWords Smart Bidding. Smart Bidding takes millions of signals into account to help users determine the best bid for a given ad unit, and it automatically refines conversion performance models to optimize deployment of customers' advertising budgets.

Office 365 adds Researcher and Editor, new intelligent services to aid users writing reports and other documents in Word. Researcher is a sidebar that pulls up related articles from encyclopedias and the web based on what the user has written, and Editor is a smarter evolution of Word's spelling and grammar checkers. We've seen research sidebars in Word before and they've never proven useful, so it will be interesting to see how this one performs.
Editor, on the other hand, I'd expect to be really useful, and to eventually replace the standard editorial tools in Word. Last, but certainly not least, Prisma, the app we talked about last time for bringing artistic style transfer with neural networks to the iPhone, is now available on Android. I've played with it and it's pretty cool.

Research

OpenAI is hiring. Elon Musk-founded OpenAI is hiring researchers to work on a few "special projects". The specific research areas are: 1. Detecting if someone is using a covert breakthrough AI system in the world. 2. Building an agent to win online programming competitions. 3. Cyber-security defense. 4. Creating a complex simulation with many long-lived agents. Call me crazy, but as much as Musk says he fears AI, the research areas here seem to be right out of an apocalyptic AI movie.

Neural network from Matroid leads in Princeton competition. This is an interesting post describing Matroid's entry into the Princeton ModelNet competition for classifying 3D CAD models. Their application of convolutional neural networks (CNNs) to this problem is interesting, and they've published a paper on their approach on arXiv.

If you haven't seen the sample images from the DeepWarp Project around this week, you should check them out. A team of researchers from the Skolkovo Institute of Science and Technology in Russia developed a deep learning model for creating photorealistic images from a base image in which the eyes look in an arbitrary direction. I'd like to dig deeper into this paper at some point.

Projects

The Charades dataset is an interesting dataset composed of nearly 10,000 videos of daily indoor activities, collected by the Allen Institute for AI using Amazon Mechanical Turk. The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos.

Language modeling a billion words. An interesting project to create a generative natural language AI using LSTM RNNs trained on the Google Billion Words dataset, with a good discussion of the techniques used to achieve scale, including the use of multiple GPUs.

Bonus: Yann LeCun on Quora

Yann LeCun, director of AI research at Facebook and NYU professor, did an AMA over on Quora the other day. Here are some of the responses I found interesting:

- What are the likely AI advancements in the next 5 to 10 years? - Quora
- Who is leading in AI research among big players like Google, Facebook, Apple and Microsoft? - Quora
- What is a plausible path (if any) towards AI becoming a threat to humanity? - Quora
- What are some recent and potentially upcoming breakthroughs in deep learning? - Quora

Sign up for our Newsletter to receive this weekly in your inbox.
Last week, at a Machine Learning meetup at Stanford University, NVIDIA CEO Jen-Hsun Huang unveiled the company’s new flagship GPU, the NVIDIA TITAN X, and gifted the first device off of the assembly line to famed ML researcher Andrew Ng. The new TITAN X, which holds the same name as the previous version of the device, is based on the company’s new Pascal graphics architecture, unveiled back in May.

The company is so excited about the card that its blog post introducing it threw around a ton of superlatives and adjectives like Biggest, Ultimate, Irresponsible, Crazy, and Reckless. It also threw a bunch of numbers around, including these: 11 trillion 32-bit floating point ops per second; 44 trillion INT8 ops per second; 12 billion transistors; 3,584 CUDA cores running at 1.53 GHz; and 12 GB of GDDR5X memory with 480 GB/s of bandwidth. The other number it tossed out there was 1,200, the price of the card in US dollars.

Now, not everyone is as excited about this card as NVIDIA. Indeed, for gamers, what NVIDIA’s offering with the TITAN X is a GPU that’s about 25% faster than the company’s standby offering, the GTX 1080, but at double the cost. But that could be because the company is targeting deep learning researchers instead of gamers with the TITAN X. (In fact, CEO Jen-Hsun said as much at the product launch.) For people working on deep learning, the specs of the TITAN X should allow it to increase model training performance by 30-60%, which can save a researcher weeks of time and computing costs.

The best technical preview I’ve found of the new card, which comes out on August 2nd, is over on AnandTech. Of course I’ll be dropping a link to this article and all the other ones I mention on the show into the show notes, available at twimlai.com.
What do you do if you’re an NVIDIA employee and you’re tired of your neighbor’s cats hanging out on your front lawn? Well, if you’re Bob Bond, you build a deep-learning-based controller for your sprinkler system and train it to recognize cats! Bob’s project uses an IP camera to feed images to a Caffe-based deep learning model, pretrained on ImageNet data, running on an NVIDIA Jetson TX1 system. This talks to his sprinkler system to turn on the water when an object identified as a cat or a dog makes its way onto his lawn. Very cool. He’s got a great write-up on his blog and the code is up on GitHub. (A toy version of the core loop appears at the end of this post.)

Next up is a cool project that shows you how to control a small Raspberry Pi-based robot called a GoPiGo with TensorFlow, to enable simple autonomous driving. There’s a neat video up on YouTube showing the GoPiGo autonomously navigating a simple course, and the code that does it is dead simple. In my view, this really shows off the power of deep learning and TensorFlow in particular.

Finally, if you’re looking for more project or research ideas and you’re interested in natural language processing, check out the project reports written by students in Richard Socher’s Deep Learning for NLP course at Stanford, also known as CS224d, which were posted this week. There are some pretty interesting projects, including several on sentiment analysis, political bias detection, playlist creation, video annotation, and much more. There appear to be around 100 reports total, so there’s sure to be something for everyone.
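Circling back to Bob's sprinkler: the core loop is simple enough to sketch. Bob's setup used Caffe on a Jetson TX1; the toy below instead uses Keras's pretrained ResNet50 ImageNet classifier, and trigger_sprinkler() and the camera frame are hypothetical stand-ins for the real actuator and camera feed.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)

model = ResNet50(weights="imagenet")

def looks_like_cat_or_dog(frame, threshold=0.5):
    """frame: 224x224x3 RGB array from the IP camera.

    Returns True when any of the top-5 ImageNet predictions mentions a
    cat or dog breed with a score above the threshold. Crude, but it's
    the same basic idea: pretrained classifier + confidence cutoff.
    """
    x = preprocess_input(frame[np.newaxis].astype("float32"))
    preds = decode_predictions(model.predict(x), top=5)[0]
    return any(("cat" in label or "dog" in label) and score > threshold
               for _, label, score in preds)

def trigger_sprinkler():
    print("Sprinkler on!")   # stand-in for the real sprinkler control

frame = np.random.rand(224, 224, 3) * 255   # stand-in for a camera frame
if looks_like_cat_or_dog(frame):
    trigger_sprinkler()
```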
One of the papers I’ve been meaning to look into is the Wide and Deep Learning paper published by Google Research a couple of weeks ago. It turns out that the paper is both short and very much on the applied side of the spectrum, so it’s relatively easy reading. There’s also a lot of supporting material, between the Google Research blog, the TensorFlow docs and the video they created, though I found that reading the paper helped me understand the video, as opposed to the other way around!

The background here is that a team from Google Research developed a recommender model that combines the best aspects of logistic regression and neural nets, and found that it outperformed either approach individually by a small but significant percentage. The basic idea is that linear models are easy to use, easy to scale and easy to understand. They’re also pretty good at “memorizing” the relationships between individual features when you use some simple feature engineering to capture those relationships. This feature engineering, which is very commonly used, results in a lot of derived features, and so the linear model that uses them is called “wide” learning in this paper. What the linear models aren’t really good at is “generalizing” across different features, because they can’t see those relationships unless you feed in a set of higher-order derived features that capture them, and doing so is labor-intensive. This is where neural nets, or so-called “deep” models, come into play. They are better at generalizing and rooting out unexpected feature combinations that have predictive value. But they’re also prone to over-generalization and don’t do a good job of “memorizing” specific feature combinations that are infrequently seen in the training data.

So this paper proposes a jointly trained model that combines both wide and deep learning. By jointly trained, we mean that this isn’t an ensemble model, where we train a linear model and a neural net separately and then choose the best prediction between the two. That doesn’t help us here because for an ensemble to work, we need both models to be independently accurate, which would mean doing all the feature engineering we’re trying to avoid for the linear model. Rather, by training the wide and deep models together, each can do what it’s best at while keeping the overall model complexity low.

It’s actually pretty surprising how much system-level implementation detail this paper packs into 4 pages. I was left feeling like I have a pretty good understanding of how the recommendation system for the Google Play store was designed so as to make recommendations against a 1 million item app catalog, using over 500 billion training examples, to serve each request in about 10 ms under a peak load of 10 million app scoring requests per second. In addition to publishing the paper, Google also open sourced their TensorFlow implementation of the model, with a high-level API for Wide & Deep models called a “DNN Linear Combined Classifier.” (A minimal code sketch of the wide-and-deep idea appears at the end of this post.)

Alright, I hope you enjoyed learning about this paper as much as I enjoyed reading it. Before we jump over to Projects, a few quick notes: In recent weeks we’ve talked about the ICML and CVPR conferences. This week Leo Tam posted his impressions of both, along with his top 10 picks from each. Check it out for a concise look into what you missed at these conferences. Next, this week was the IJCAI conference, the International Joint Conference on AI.
I haven’t seen much by way of summaries or highlight posts, so I don’t have much to say about it, but if you see anything good, send it my way to share. Finally, if you’re looking for a contextualized view into a bunch of interesting and important research papers and how they all fit together, you’ll like Xavier Amatriain’s presentation from last week’s Data Science Summit. The focus of the talk is reminding the audience of all the problems for which traditional ML is still state of the art relative to the new hotness, deep learning, and he cites the relevant papers for each area. The slides are up on SlideShare and are highly recommended.
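As promised above, here's a minimal sketch of the jointly trained wide-and-deep idea. This isn't Google's open-sourced estimator (that's the "DNN Linear Combined Classifier" mentioned earlier); it's a toy written against the Keras functional API, with made-up feature sizes, just to show how a linear path and a deep path can share a single logit and a single training loop.

```python
# One linear ("wide") path over sparse/crossed features and one deep path
# over dense features, summed into a single logit and trained together in
# one backward pass -- jointly trained, not ensembled.
import numpy as np
from tensorflow.keras import layers, models

n_wide, n_deep = 1000, 20   # assumed feature dimensions

wide_in = layers.Input(shape=(n_wide,), name="wide")   # e.g. hashed crosses
deep_in = layers.Input(shape=(n_deep,), name="deep")   # e.g. dense features

wide_logit = layers.Dense(1, use_bias=False)(wide_in)  # the linear model
h = layers.Dense(64, activation="relu")(deep_in)       # the deep model
h = layers.Dense(32, activation="relu")(h)
deep_logit = layers.Dense(1, use_bias=False)(h)

logit = layers.Add()([wide_logit, deep_logit])
prob = layers.Activation("sigmoid")(logit)

model = models.Model([wide_in, deep_in], prob)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy data; both paths get gradients from the same loss.
X_wide = np.random.rand(32, n_wide).astype("float32")
X_deep = np.random.rand(32, n_deep).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit([X_wide, X_deep], y, epochs=1, verbose=0)
```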
This week’s show covers the White House’s AI Now workshop, tuning your AI BS meter, research on predatory robots, an AI that writes Python code, plus acquisitions, financing, technology updates and a bunch more.

The Big Picture
- Home :: AI Now
- Jason Furman’s speech
- I need an AI BS-Meter — Gab41
- Smerity.com: It’s ML, not magic: simple questions you should ask to help reduce AI hype
- You Can Now Drink Beer Brewed By Artificial Intelligence – Forbes
- On the importance of democratizing Artificial Intelligence

Business
- Google buys machine learning startup Moodstocks to help your phone’s camera identify objects | VentureBeat | Business | by Chris O’Brien
- News discovery app SmartNews nabs another $38M, now valued at $500M-$600M | TechCrunch
- General Catalyst’s Phil Libin invests in 2 more chatbot startups, Growbot and Butter.ai | VentureBeat | Bots | by Ken Yeung
- Exclusive: Why Microsoft is betting its future on AI | The Verge

Research
- Google’s DeepMind AI to use 1 million NHS eye scans to spot diseases earlier | Ars Technica
- Artificial Intelligence May Aid in Alzheimer’s Diagnosis – Neuroscience News
- Application of Machine Learning to Arterial Spin Labeling in Mild Cognitive Impairment and Alzheimer Disease
- Steering a Predator Robot using a Mixed Frame/Event-Driven Convolutional Neural Network
- Super-intelligent predator robot is taught to hunt down prey in chilling experiment | Daily Mail Online

Technology
- Release of IPython 5.0
- Skype chatbots now work in group chats | VentureBeat | Bots | by Khari Johnson
- Microsoft’s Project Malmo AI platform goes open source | ZDNet

Projects
- Teaching an AI to write Python code with Python code
- Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow – WildML

Specials
- Data Science Summit – JULY 12-13 in SAN FRANCISCO / Use code TWIML20 for 20% off registration
- FREE O’Reilly Early Access Ebook: Mastering Feature Engineering
This week’s show covers the first fatal Tesla autopilot crash, a new EU law that could prohibit machine learning, the AI that shot down a human fighter pilot, the 2016 CVPR conference, 10 hot AI startups, the business implications of machine learning, cool chatbot projects and, if you can believe it, even more. Here are the notes for this week’s podcast:

Tesla Autopilot Crash
- A Tragic Loss | Tesla Motors
- Ex-Navy SEAL becomes first to die in self-driving car after Tesla crash | Daily Mail Online
- Tesla’s ‘Autopilot’ Flew Under Regulators’ Oversight – WSJ
- The technology behind the Tesla crash, explained – The Washington Post

EU Legislation Impacts Machine Learning Use
- EU regulations on algorithmic decision-making and a “right to explanation”
- Artificial Intelligence Has a ‘Sea of Dudes’ Problem – Bloomberg
- Why We Should Expect Algorithms to Be Biased
- To study possibly racist algorithms, professors have to sue the US | Ars Technica

Business
- The Most Well-Funded Startups Developing Core Artificial Intelligence Tech
- Doodle acquires chatbot Meekan to integrate its A.I. scheduling assistant | VentureBeat | Bots | by Chris O’Brien
- Meet Articoolo, the robot writer with content for brains | TechCrunch
- The Business Implications of Machine Learning — Medium
- How Amazon Triggered a Robot Arms Race – Bloomberg

IEEE Computer Vision & Pattern Recognition Conference
- CVPR 2016
- CVPR 2016 Open Access Repository
- Zeeshan Zia’s answer to What are the most interesting CVPR 2016 papers and why? – Quora
- All Your Questions Answered — CVPR Day 1 — Gab41
- Jordi Pont-Tuset’s site – CVPR 2016: Deep learning takes over again?

AI Fighter Pilot Beats Human Expert
- AI bests Air Force combat tactics experts in simulated dogfights | Ars Technica
- Genetic Fuzzy based Artificial Intelligence for Unmanned Combat Aerial Vehicle Control in Simulated Air Combat Missions

Projects & Hands-On
- IBM Watson A.I. XPRIZE
- Changelog – Messenger Platform
- A Natural Language User Interface is just a User Interface — The Startup — Medium
- Build a Chatbot w/ an API – ML for Hackers #9 – YouTube
- Is that a Time Machine? Some Design Patterns for Real World Machine L…

Data Science Summit
- Data Science Summit
- Use code TWIML20 for a 20% discount on registration!

Image: Tesla Motors
This week’s show covers the International Conference on Machine Learning (ICML 2016), “dueling architectures” for reinforcement learning, AI safety goals for robots, plus top AI business deals, tech announcements, projects and more.

ICML 2016
- Accepted Papers | ICML New York City
- Which companies had accepted papers at #icml2016?
- Best Paper Awards:
  - [1511.06581] Dueling Network Architectures for Deep Reinforcement Learning
  - [1601.06759] Pixel Recurrent Neural Networks
  - [1602.07415] Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
- My winner in the best name category: Extended and Unscented Kitchen Sinks
- Demystifying Deep Reinforcement Learning

Research
- Google Research Blog: Bringing Precision to the AI Safety Discussion
- OpenAI Blog: Concrete AI safety problems
- Paper: 1606.06565.pdf
- OpenAI technical goals
- Artificial intelligence achieves near-human performance in diagnosing breast cancer — ScienceDaily
- Paper: 1606.05718.pdf

Business
- Twitter pays up to $150M for Magic Pony Technology, which uses neural networks to improve images | TechCrunch
- Increasing our Investment in Machine Learning | Twitter Blogs
- Artificial Intelligence Explodes: New Deal Activity Record For AI
- DARPA is looking to make huge strides in machine learning | PCWorld
- Data-Driven Discovery of Models (D3M) – Federal Business Opportunities: Opportunities

AI Culture Wars in Silicon Valley
- How Siri Started — and Lost — the Assistant Race
- How Google is Remaking Itself as a “Machine Learning First” Company — Backchannel
- AI, Apple and Google

Technology
- Lighting the way to deep machine learning | Engineering Blog | Facebook Code
- Intel Launches ‘Knights Landing’ Phi Family for HPC, Machine Learning
- The Toronto Raptors Are Using IBM’s Watson to Draft A Winning Team | Motherboard

Projects
- Hello, TensorFlow!
- How to read: Character level deep learning
- GitXiv: Collaborative Open Computer Science
- Machine Learning Yearning
- Mastering Feature Engineering – O’Reilly Media

Bonus
- I didn’t have time to cover: The Stanford Question Answering Dataset
This week’s podcast digs into Apple’s ML and AI announcements at WWDC, looks at the new Deep Thunder offering by IBM and The Weather Company, and discusses exciting new deep learning research from MIT, OpenAI and Google. Here are the notes for this week’s show:

Does Apple Bring the ML & AI Goods at WWDC?
- Apple Faces an Artificial Intelligence Challenge
- Apple struggles with the idea of intelligent life outside Cupertino
- Basic Neural Network Subroutines

IBM’s Deep Thunder
- Deep Thunder
- Deep Thunder Can Forecast the Weather—Down to a City Block

Big Players Making Moves
- Announcing Google Research, Europe
- Microsoft acquires Wand Labs
- Amazon Hires Carnegie Mellon Machine-Learning Expert as Google Expands its Own AI Initiatives

Exciting New Deep Learning and Neural Nets Research
- AI Produces Realistic Sounds That Fool Humans
- Generative Models
- Learning to Learn by Gradient Descent by Gradient Descent
- Smart Reply: Automated Response Suggestion for Email

Deep Learning Architecture
- In Deep Learning, Architecture Engineering is the New Feature Engineering

Image Credit: Apple, Inc.
This week’s podcast looks at new research on intrinsic motivation for AI systems, a kill-switch for intelligent agents, “knu” chips for machine learning, a screenplay made by a neural net, and more. Here are the notes for this week’s show:

Intrinsically Motivated AI
- Playing Montezuma’s Revenge with Intrinsic Motivation
- Unifying Count-Based Exploration and Intrinsic Motivation
- Intrinsically Motivated Machines
- Implementation of DEvelopmentAl Learning

Safely Interruptible Agents
- What if robots decide they want to take control?
- New paper: “Safely interruptible agents”
- Safely Interruptible Agents

Open Source Project Updates
- TensorFlow 0.9
- Apache Spark 2.0 Preview: Machine Learning Model Persistence

A “Knu” Chip for Machine Learning
- Former NASA Exec Brings Stealth Machine Learning Chip to Light

CrowdFlower’s AI Push
- Solving Million (not Billion) Dollar Business Problems with AI

Vi: An AI Personal Trainer
- Meet Vi

Recurrent Neural Net Writes Sci-Fi Movie
- Movie Written by Algorithm Turns out to be Hilarious and Intense
- Adventures in Narrated Reality, Part II
- Understanding LSTMs
- The Unreasonable Effectiveness of Recurrent Neural Networks

Teaching Robots to Feel
- Teaching Robots to Feel: Emoji & Deep Learning

ML for Hackers: Build a Chatbot
- ML for Hackers: Build a Chatbot
- Siraj Raval on Twitter

Image Credit: LifeBEAM
This week’s show looks at Facebook’s new DeepText engine, creating art with deep learning and Google Magenta, how to build artificial assistants and bots, and applying economics to machine learning models. Here are the notes for this week’s show:

DeepText: Facebook’s Text Understanding Engine
- Introducing DeepText: Facebook’s Text Understanding Engine
- FBLearner Flow
- Research: Text Understanding from Scratch
- Natural Language Processing (almost) from Scratch

Machine Learning and Art
- Google Magenta
- Neural Art
- A Neural Algorithm of Artistic Style
- Neural Art in TensorFlow
- Autoencoding Blade Runner
- Courses: NYU’s Machine Learning for Artists; Goldsmiths, University of London

The Latest TensorFlow Paper
- TensorFlow: A system for large-scale machine learning

Business of ML & AI
- Microsoft Confirms Microsoft Ventures VC Arm
- Intel Acquires Computer Vision for IOT, Automotive
- Lumiata Closes $10 Million Series B Financing with Intel Capital
- Findo raises $3M to help you find files and documents through natural language queries

More Bots, and How to Build Artificial Assistants
- Motion AI lets anyone easily build a bot
- Sequel lets you create a ‘Me’ bot, beats Google to the punch
- Hybrid Intelligence: How Artificial Assistants Work

The Economics of Machine Learning Models
- The preoccupation with test error in applied machine learning
- Towards Cost-Optimized Artificial Intelligence

More Cool Deep Learning Posts
- Deep Reinforcement Learning: Pong from Pixels
- A Survey of Deep Learning Techniques Applied to Trading

Just for Fun
- Building an IoT Magic Mirror
- Magic Mirror on GitHub

Image Credit: Microsoft
Every week I end the week with close to 100 tabs filled with stories—some good, some not so good—spanning all corners of the cloud computing, big data, machine learning and AI web. I thought it would be useful to bring you the best of these stories in a weekly podcast. I have no idea whether this will be sustainable or not—this first episode took a lot of work—but let’s run with it and see what happens. Here are the notes for this week’s stories:

AI Tech Front and Center at Google I/O
- Google I/O 2016 Keynote
- I/O: Building the next evolution of Google
- Google supercharges machine learning tasks with TPU custom chip
- Nvidia creates a 15B-transistor chip for deep learning

Deep Learning Part of Amazon’s Destiny
- Amazon open-sources its own deep learning software, DSSTNE
- https://github.com/amznlabs/amazon-dsstne
- TensorFlow

Uber’s Autonomous Boom Box Takes to the Streets of Pittsburgh
- Steel City’s New Wheels
- Jeff Schneider Interview at Structure Data
- Artificial Intelligence for Robotics: Programming a Robotic Car

Scanse’s Sweep: Scanning LIDAR for Everyone

AI by the Bay Conference
- AI by the Bay / Data by the Bay

What’s Up with NLP at Quora?
- Applications of NLP at Quora

Conference Calls: AI’s Killer App?
- How this guy used Watson to tune out of conference calls
- https://github.com/joshnewlan/say_what