Sam Charrington: Hey, what’s up everyone!
We are just a week away from kicking off TWIMLfest, and I’m super excited to share a rundown of what we’ve got in store for week 1. On deck are the Codenames Bot Competition kickoff, an Accessibility and Computer Vision panel, the first of our Wellness Wednesdays sessions featuring meditation and yoga, as well as the first block of our Unconference Sessions proposed and delivered by folks like you. The leaderboard currently includes sessions on Sampling vs Profiling for Data Logging, Deep Learning for Time Series in Industry, and Machine Learning for Sustainable Agriculture. You can check out and vote on the current proposals or submit your own by visiting And of course, we’ll have a couple of amazing keynote interviews that we’ll be unveiling shortly! As if great content isn’t reason enough to get registered for TWIMLcon, by popular demand we are extending our TWIMLfest SWAG BAG giveaway by just a few more days! Everyone who registers for TWIMLfest between now and Wednesday October 7th, will be automatically entered into a drawing for one of five TWIMLfest SWAG BAGs, including a mug, t-shirt, and stickers.
Registration and all the action takes place at, so if you have not registered yet, be sure to jump over and do it now! We’ll wait here for you.

Before we jump into the interview, I’d like to take a moment to thank Microsoft for their support for the show, and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer—without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand, and accelerate decision-making into your apps. Visit to learn how customers like Volkswagen, Uber, and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you’re there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure Free Account. That link again is

And now, on to the show!

Sam Charrington: [00:03:14] All right, everyone. I am here with Cha Zhang. Cha is a partner Engineering Manager with Microsoft Cloud and AI. Cha, welcome to the TWIML AI podcast.

Cha Zhang: [00:03:25] Thank you, Sam. Nice to meet you.

Sam Charrington: [00:03:27] Great to meet you as well. Before we dive in, I’d love to learn a little bit about your background. Tell us how you came to work in computer vision.

Cha Zhang: [00:03:38] Sure. Sure. I actually have been at Microsoft for 16 years. I joined Microsoft originally as a researcher at Microsoft Research. I was there for 12 years. My research was primarily applying machine learning to image, audio, video; all of these different applications. Starting in 2016, I joined the product side, and currently I’m working as an Engineering Manager, and my primary focus is on document understanding.

Sam Charrington: [00:04:11] Awesome. Awesome. So, we will be focusing quite a bit on OCR and some of your work in that space, and, you know, I think people often think of OCR as, you know, a solved problem, right? You know, we’ve been scanning documents and extracting text out of those documents for a long time.

Obviously the advent of deep learning, you know, changes things, but I’d love to get the conversation started by having you share a little bit about, you know, what’s new and interesting in the space. How has it changed over the past few years?

Cha Zhang: [00:04:50] Sure. Actually, it wasn’t very long ago that, when people talked about OCR, what came to mind was primarily scanned documents. In many people’s eyes, OCR for scanned documents is sort of a solved problem. More recently, I think there have been two major developments. One is the mobile-first kind of world, where everybody now has a mobile phone and takes pictures everywhere. So there’s a lot of demand to do text recognition on images in the wild, and that certainly is a much more challenging problem than scanned documents. And then technically, because of the advances in deep learning, we have realized that with deep learning we can do OCR at a different level. We can make it a lot more accurate than before, and we can solve the OCR problem in the images-in-the-wild kind of scenario. So I think it started in the early 2010s or so. I think there have been a lot of big advances in this area, and now we’re seeing OCR basically become something that really works. You know, people don’t need to worry about quality, etcetera; it just mostly works.

Sam Charrington: [00:06:08] Can you talk a little bit more about the challenges that arise when you’re trying to do OCR in the wild?

Cha Zhang: [00:06:16] Of course.
I think for documents, usually it’s white background and black text, but for images in the wild, essentially it’s a photo. So in the photo, there’s a lot of variation in the text. First, there’s a huge scale variation. If you capture a picture of a street, there might be some store names that are super big, and then there is some tiny text that’s hard to see. So there’s a big variation in the scale of the text, and the aspect ratio of the text can be really extreme, because a text string can be very long compared to regular objects, like a cat or a dog. Because of the mobile capture scenario, it’s usually difficult to enclose these texts with axis-aligned rectangles.
For example, there might be perspective distortions of the text when the camera sees it. The background of an image in the wild is much more complicated than the typical white background you see in scanned documents, and some of these backgrounds, such as fences, bricks, and stripes, appear quite simple to human beings but are hard for the algorithm. Think of fences: they can be a bunch of ones, you know, sitting there on the street, and they look very similar to characters. So those create additional challenges. And I think one of the biggest ones, technically, that’s challenging for OCR is the localization accuracy. Typically in object detection, the localization accuracy is measured by intersection over union, and if that criterion is bigger than 0.5, people think this is good enough. But for OCR, if the intersection is only half of the union, a lot of the characters will be missing. So, usually OCR will need a 0.9 or 0.95 level of accuracy in order to recognize all the characters properly. So…

Sam Charrington: [00:08:31] Can you explain that in more detail? What is intersection over union and how is that used in object detection?

Cha Zhang: [00:08:39] So, in order to measure the accuracy of a particular detection algorithm, you need to label the data with ground truth. Typically what people do is create a bounding box around the object to be detected, and then you use an automatic algorithm to figure out where the object is, and that will also create a bounding box. Now you have two bounding boxes, and the question is how you measure how well these two boxes align. A common measure is to take the intersection of these two bounding boxes and the union of these two bounding boxes, so you get two areas, and you divide the intersection area by the union area. You can imagine that if the two bounding boxes are very close to each other, overlapping a lot, then the intersection over union will be very high, but if they are offset by quite a bit, then, you know, the number is low. So that’s kind of the academic standard for how people measure detection accuracy.
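
The measure described here can be written down in a few lines. The following is a minimal Python sketch, not something from the conversation; it assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples, and the function name `iou` and the sample coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union area = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 20)   # labeled box (hypothetical coordinates)
detected = (10, 0, 110, 20)      # algorithm output, shifted 10 units to the right
print(iou(ground_truth, detected))
```

With these sample boxes the intersection is 1800 and the union is 2200, so the score is about 0.82: comfortably above the 0.5 threshold mentioned for object detection, but below the 0.9 level quoted for OCR.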

Sam Charrington: [00:09:46] Got it.
And so, you were saying that the threshold that you need in the case of texts is higher because of what?

Cha Zhang: [00:09:58] Because of…
Let’s just think about it. You know, you have a ground truth text, let’s say, “Hello world,” and it’s an elongated rectangle. And say I have a text detection algorithm that also creates a bounding box, but with an intersection over union of, let’s say, roughly 0.5. What that means is that the intersection area divided by the union of the two bounding boxes is 50%. So very likely the detected bounding box will miss a few characters because, you know, the overlap is not there. You might miss a D or an N, and all of this will cause the OCR to produce wrong results. And so that’s the main challenge here.
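
To make the “Hello world” example concrete, here is a small Python sketch under strong simplifying assumptions that are not from the conversation: every character is equally wide, and the detected box has the same height as the ground truth and is nested inside it, starting at the left edge. In that special case the IoU is simply the ratio of the two widths. The helper name `fully_covered_chars` is hypothetical.

```python
def fully_covered_chars(text, char_width, det_x1, det_x2):
    """Return the characters whose horizontal extent lies entirely inside [det_x1, det_x2]."""
    kept = []
    for i, ch in enumerate(text):
        left, right = i * char_width, (i + 1) * char_width
        if left >= det_x1 and right <= det_x2:
            kept.append(ch)
    return "".join(kept)


text = "Hello world"
char_width = 10                       # assumed fixed-width characters
gt_width = len(text) * char_width     # ground truth box spans the whole string

for target_iou in (0.5, 0.9, 1.0):
    # Detected box nested inside the ground truth with the same height,
    # so IoU equals the ratio of the widths.
    det_x2 = gt_width * target_iou
    print(target_iou, repr(fully_covered_chars(text, char_width, 0, det_x2)))
```

Under these assumptions, an IoU of 0.5 keeps only “Hello”, and even 0.9 still clips the last two characters, which is one way to see why text detection needs much tighter localization than generic object detection.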

Sam Charrington: [00:10:48] So in the case of a traditional object detection scenario, you may miss half of the face but you can tell that there’s a face there. In the case of OCR, you’re just missing letters, and it makes it a lot more difficult for the algorithm to guess what was there.

Cha Zhang: [00:11:07] Yes, exactly.

Sam Charrington: [00:11:08] Got it. And maybe taking a step back to the problem as a whole: granted, mobile is driving this transition to in-the-wild pictures and people trying to OCR them, but what are the high-value use cases there? I’m thinking of some interesting ones, like when it’s in conjunction with translation. You know, maybe I’m in another country, and I’ve done this, you’re taking pictures of words in another character set to try to read the menu or something like that. I’ve also done things like scan documents on a phone, and you want to OCR those, but that’s kind of back to the traditional OCR problem in a lot of ways. What are some of the other use cases that are common?

Cha Zhang: [00:11:58] If you look at the business opportunities, I still think the traditional document, you know, the scanned document, and some traditional kinds of OCR problems are big. For example, receipts: people would scan them in the old days, but nowadays people mostly get reimbursed by snapping a photo.
So I think in terms of the market, the revenue, that’s still quite a big one. There are a few others. The one that you mentioned, where you have a phone, you go to a foreign country, you snap a photo, and you want to translate it, is one. There are also a lot of applications in digital asset management.
So this is when either you are a big company or you personally have some big storage of photos, and you want to organize these photos. We have shown that with OCR capability, you can increase the accuracy of processing and retrieving these photos. As a matter of fact, you know, for the big search engines like Google and Bing, when they search images, OCR is an integral part of that as well, because the OCR content can help a lot in getting the best images.

Sam Charrington: [00:13:22] Okay. And so, you were mentioning some of the technical challenges, and localization of the text in these images is one of those challenges. How do you go about it? Is it the case that, you know, deep learning is so powerful that off-the-shelf deep learning techniques just solve it for you, or do you, you know, reengineer the whole pipeline? How do you approach that?

Cha Zhang: [00:13:53] So in text detection, usually the detection pipeline is different from traditional object detection. What’s been most popular for OCR for images in the wild today is something called anchor-free detection.
So the idea… Anchor free. In typical object detection, the most well-known approaches are anchor based, like Fast R-CNN, Faster R-CNN, etcetera. They basically create these anchors and then they regress the actual bounding box of the objects. The challenge of using that kind of approach is that these anchors need to be preset. Typically for normal object detection, you set them at a certain density, and then you set a certain set of aspect ratios. Like, your anchor boxes are one to two, one to three, one to one. Typically you go about that far, but some text can go like 20 to one, so really there would be a huge computational cost to go with an anchor-based approach.
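
The cost argument can be illustrated with a back-of-the-envelope count. This Python sketch assumes the common convention of one anchor per feature-map location, scale, and aspect ratio; the feature-map size, scales, and ratio lists are made-up illustrative values, not figures from the conversation.

```python
def anchor_count(feature_h, feature_w, scales, aspect_ratios):
    """Number of preset anchor boxes: one per (location, scale, aspect ratio)."""
    return feature_h * feature_w * len(scales) * len(aspect_ratios)


scales = [32, 64, 128]                      # illustrative anchor sizes in pixels
object_ratios = [1, 2, 3, 1 / 2, 1 / 3]     # roughly the "one to two, one to three" range
# Covering long text lines up to about 20:1 needs extra, more extreme ratios.
text_ratios = object_ratios + [5, 8, 12, 16, 20]

print(anchor_count(64, 64, scales, object_ratios))  # 61440
print(anchor_count(64, 64, scales, text_ratios))    # 122880
```

Every added ratio multiplies across all locations and scales, and the sketch still ignores rotated or curved text, which is part of why a per-pixel, anchor-free formulation is attractive for text.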
So these days for OCR, we go anchor free, and the high-level concept is essentially that, using convolutional neural networks, you do kind of a per-pixel-level decision or classification saying, well, the region near this particular pixel looks like part of text. So there is a text/non-text classification at almost a per-pixel level.
Then you rely on a few algorithms to group these into text lines by looking at how similar, for example, two text regions are to each other, and you can decide, well, these two look like the same texture and color, and maybe they should be connected. In this regard, there are quite a few well-known algorithms to do this connection.
In earlier days, people used a relatively rule-based approach like SegLink, where they link based on some features, but it’s kind of rule based. More recently, people have started looking into neural networks like relation networks. So you’re kind of estimating the relation of two regions from their features, and based on that you decide, well, whether these two should be connected or not.
So that way you work kind of bottom up: you start with per-pixel kind of classification, then you do grouping, and you come out with these text lines. It’s a very powerful approach. It can detect not only straight lines but even curved lines; you can handle them pretty well with those approaches.
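
The bottom-up idea (classify pixels, then group them) can be sketched in plain Python. This is a deliberately simplified stand-in, not the method described in the conversation: the per-pixel scores are hard-coded instead of coming from a convolutional network, and the grouping step uses plain 4-connected components rather than a learned linking or relation model. The function name `group_text_pixels` is hypothetical.

```python
from collections import deque


def group_text_pixels(score_map, threshold=0.5):
    """Threshold per-pixel text scores, then group touching text pixels into regions.

    Returns one bounding box (row_min, col_min, row_max, col_max) per region.
    """
    rows, cols = len(score_map), len(score_map[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or score_map[r][c] < threshold:
                continue
            # Breadth-first search over 4-connected text pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and score_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy "text probability" map standing in for a network's per-pixel output.
scores = [
    [0.1, 0.9, 0.8, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1, 0.7, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.8, 0.9],
]
print(group_text_pixels(scores))  # [(0, 1, 1, 2), (1, 4, 2, 5)]
```

Because nothing here assumes a preset box shape, the same grouping works for long or irregular regions; a production system would replace the connectivity rule with the learned linking discussed above.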

Sam Charrington: [00:16:44] So it sounds like you’re describing a pipeline that’s not, like, a single neural network trained end to end, where you give it images, train it on labeled data, and it tells you what the text is, but rather a bunch of independent steps.

Cha Zhang: [00:17:04] Yes, that’s a very good observation. For OCR, detection is only the first step. After detection, you take the detected text lines, you normalize them into a straight line with a fixed height, and then you run a character model to actually decode the image into a list of characters.
A lot of the approaches are actually similar to speech. You know, speech is going from the acoustic signal to text, whereas here we’re going from image to text. But a lot of the approaches that we use, like LSTMs and language modeling, are very similar.
Now, your question is certainly valid, because in speech today, you know, people do end-to-end training. They start from audio and go directly to text. For OCR, we are not there yet. I think the main challenge, first, is how much data you have. In speech, you can collect a lot more data compared with OCR. OCR data is usually very expensive to collect and label, and so going stage by stage at this point is more economically doable than, you know, doing end-to-end training.
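
As a rough illustration of the fixed-height normalization step, here is a minimal Python sketch. It only covers the resizing part and assumes the crop is already straight; straightening curved or perspective-distorted lines is not shown. The target height of 32 and the function name `normalize_height` are assumptions for the example, not values from the conversation.

```python
def normalize_height(crop, target_height=32):
    """Resize a 2D grayscale crop (list of rows) to a fixed height.

    Uses nearest-neighbor sampling and scales the width by the same factor,
    so the aspect ratio of the text line is preserved.
    """
    src_h, src_w = len(crop), len(crop[0])
    scale = target_height / src_h
    target_width = max(1, round(src_w * scale))
    return [
        [crop[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
         for x in range(target_width)]
        for y in range(target_height)
    ]


# Synthetic 16 x 200 crop standing in for a detected text line.
crop = [[(x + y) % 256 for x in range(200)] for y in range(16)]
normalized = normalize_height(crop)
print(len(normalized), len(normalized[0]))  # 32 400
```

Keeping the height fixed while letting the width vary is what allows a sequence model to read the line from left to right regardless of how large the text was in the original photo.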

Sam Charrington: [00:18:25] Why is that? It seems that we have tons of pictures with words in them that we know. Is it particularly the in-the-wild examples where we don’t have the labeled data, or is it also these document use cases? Because I’m imagining Microsoft has probably labeled a ton of receipts and business cards and that kind of thing.

Cha Zhang: [00:18:50] Yeah. Certainly labeling is very, very expensive. Microsoft is a company that pays a lot of attention to privacy, you know, those kinds of issues, and collecting OCR data has been a major, I would say, blocking issue for going to this kind of end-to-end approach. Because if you think about it, a lot of the documents that we actually care about, like invoices, receipts, business cards, they all contain PII. That data is extremely difficult to obtain, and we follow very strict guidelines on how we can collect it and how we can label it. So in some way we are limited by these privacy restrictions, but we do respect those a lot. So, as a result, you know, we are not going end to end at this point.

Sam Charrington: [00:19:48] Got it, got it. It makes me think a little bit about some of the issues with neural networks remembering data. So for example, there are cases where you train a CNN and there are some attacks that will reproduce, to some degree or another, some of the images that the model was trained on. Likewise, with these very large language models, you can start to see some of the text that the models were trained on come out in the output.
I would imagine that if you were training end to end, then that becomes an issue as well, and maybe more so than in the case of images. What’s your intuition there? Would it be worse or better than images?

Cha Zhang: [00:20:39] I would imagine it will be similar, I would say. After all, you know, with OCR you go from image to text, but during the learning of this OCR process, a language model is actually very helpful for improving the OCR accuracy.
So, for example, during decoding of these text lines into text, we use some of the, like, LSTMs or, you know, basically these very popular language modeling schemes. Certainly the model remembers the contextual information of the language in order to help the OCR recognize the text properly. So when you go end to end, and the amount of data that you use for training is humongous... it’s difficult for me to imagine, you know, that we’ll have a similar level of training data to BERT models or GPT models. Those use a huge, huge amount of data. But still, you will learn something from the text, and it might leak into the model as well.

Sam Charrington: [00:21:51] Along those lines, what enabled BERT and many of the recent innovations around language models is a shift from supervised to a semi-supervised way of framing the task. Is there a semi-supervised framing for the OCR task that makes sense?

Cha Zhang: [00:22:13] Actually for OCR today, we are not, although I think it’s definitely a very interesting research problem.
I think BERT is a super nice framework for transfer learning. You know, you go from a pre-trained model and then, you know, with supervision, you can… In the image world, I think transfer learning probably existed earlier than in language. So in earlier days when we had ImageNet, we trained models like ResNet, and those were already being used for transfer learning.
Unsupervised kind of image learning is also, I think, still ongoing. There are a lot of interesting projects going on. I think for OCR right now, we’re not there yet. One of the main issues with using some of these pre-trained models to build a product like OCR is the computational cost.
I think this happens in language as well. With the BERT model, or GPT-3, with, like, you know, multiple billions of parameters, it’s very difficult to turn them into a product. For OCR, you know, we have the same problem. We are very sensitive to computational cost. We need to make it fast, and so we’re using relatively small models, and normally we train from scratch. Transfer learning does show some benefit, but when the data reaches a certain amount, we found training from scratch is perfectly fine.

Sam Charrington: [00:23:49] When you have a certain amount of data to train from?

Cha Zhang: [00:23:53] Yeah. In the very early days when we started doing deep learning OCR, we actually relied a lot on distillation, that is, teacher-student learning, where we first train a big model, and then we gradually use teacher-student learning to create a small model so that it can run efficiently. Nowadays, we have figured out that you can train these models from scratch. The amount of data that we have, on the order of, you know, hundreds of thousands to millions of images, is sufficient to train a smaller model from scratch and reach about the same accuracy.
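
The conversation does not say which form of teacher-student learning was used. As one common formulation, offered here only as an assumption, the student can be trained to match the teacher’s temperature-softened output distribution. The sketch below computes that matching term in plain Python; the logits, the temperature of 2.0, and the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]                           # big model's scores for three characters
print(distillation_loss(teacher, [3.5, 1.2, 0.4]))  # student that roughly agrees: small loss
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # student that disagrees: large loss
```

In practice this term is usually combined with the ordinary supervised loss on labeled data and minimized with gradient descent in a deep learning framework; only the loss itself is shown here.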

Sam Charrington: [00:24:31] Can you elaborate a little bit on that? Are you saying that you need more data to train smaller models?

Cha Zhang: [00:24:37] No, I’m saying that… Take BERT as an example. BERT is super beneficial for transfer learning because it has seen so many documents. So given any new language task, presumably there’s not much data that you have to train this new task, and therefore leveraging BERT, which has seen so many documents, will help, through transfer learning, to transfer some of the knowledge that BERT has learned from this huge set of documents to the small task, so that it reduces the amount of data required to train the smaller task.
The same thing happens in ImageNet transfer learning where, you know, if it’s a ResNet trained on ImageNet, it learns a lot of visual information from the ImageNet dataset. Then if you have a tiny detection task, like detecting a helmet, let’s say, you can do transfer learning and use a very small dataset to actually train a very good helmet detector. What I was saying just now was about the problem of OCR, which, you know, is certainly a very important computer vision problem. Every company that invests in OCR tends to collect quite a bit of data; not to the level of, you know, billions, but hundreds of thousands to millions. That amount of data is sufficient that you do not need transfer learning. You can train the model from scratch and get very good results.

Sam Charrington: [00:26:19] Got it. Got it.
So when you were using transfer learning, were you using models based on ImageNet, you know, along the lines of ResNet and others, or whether… Okay.
Let’s see… so the smaller models that you’re training, are they, you know, some of the traditional architectures that we’ve already brought up, or are you building out new architectures for the models themselves for this specific problem?

Cha Zhang: [00:26:53] Right now we’re using some of the traditional models. There is some active research going on into searching for the most effective architecture for OCR. We haven’t seen convincing results yet, but I think that’s a very active research area that we’re still kind of looking into, particularly when we try to make it smaller and smaller, you know, faster and faster.

Sam Charrington: [00:27:20] When you say searching for the best architecture for OCR, are you using the word searching generally, like you have researchers who are looking at different models and trying to find the best one for OCR, or are you suggesting a domain-specific neural architecture search kind of…?

Cha Zhang: [00:27:38] I mean neural architecture search.
So that certainly can be applied to OCR, and we are still exploring it, but I think that’s a very promising direction.

Sam Charrington: [00:27:49] Okay. Interesting. Interesting.
Earlier in the conversation you said that one of the big use cases is some of this semi-structured data that we want to extract information out of; invoices are one example.
There was a recent demonstration, or I guess that’s actually a product now, in the mobile version of Excel or something. You can take a picture of grid-like data, and it will, you know, both extract the text and organize it into a spreadsheet. Talk a little bit about the product that you’re working on, Form Recognizer, which is doing something similar.

Cha Zhang: [00:28:35] Yeah, of course. So OCR certainly is pretty low level. For some of the applications I mentioned earlier, like digital asset management, photo management, you know, translation, you can directly use OCR, but for many customers, what they want is not just OCR. They want to extract information from documents.
Think about it: you know, I need to process millions of invoices, and I want to extract the vendor name, the date, the total amount. Or it’s an MS expense system where you want to process all the receipts, and it can be for a verification purpose, for example: okay, how do I make sure employees are not putting in random numbers that don’t match the receipts that are actually filed? It sounds kind of silly, but you know, today a lot of companies do this verification manually. Because of the huge amount of manual effort needed, they often can only do sampling. So you sample, like, 5% of these receipts to validate, but you miss a huge chunk that you never even look at.
So we are looking at this space and we’re trying to build essentially two categories of product. One is a prebuilt set of products, and these are solutions that work out of the box. For example, it can be prebuilt receipt, prebuilt business card, prebuilt invoice. Basically, you send in an image or PDF file, and it will extract all the fields that you’ll be interested in. Another big category that we think is super important is customization because, you know, the prebuilt models may never fit every need. So we have a solution called custom form, where we allow customers to basically send us a few sample images.
You can either label them or even, you know, not do any labeling, and we will be able to extract key-value pairs out of these documents. Again, we see this as much closer to what the customers need, and that’s what Form Recognizer is positioned as.
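
Once totals can be extracted automatically, checking every receipt instead of a 5% sample becomes a simple comparison. The Python sketch below is purely illustrative: the dictionaries stand in for an expense system’s claims and for totals produced by some extraction step, and the names `find_mismatches`, `claims`, and `extracted_totals` are hypothetical rather than part of any product API.

```python
def find_mismatches(claims, extracted_totals, tolerance=0.01):
    """Return IDs of claims whose amount differs from the total read off the receipt.

    A claim with no extracted total at all is also flagged for review.
    """
    mismatched = []
    for claim_id, claimed_amount in claims.items():
        receipt_total = extracted_totals.get(claim_id)
        if receipt_total is None or abs(claimed_amount - receipt_total) > tolerance:
            mismatched.append(claim_id)
    return mismatched


claims = {"r1": 42.50, "r2": 18.00, "r3": 120.00}   # amounts typed in by employees
extracted_totals = {"r1": 42.50, "r2": 8.00}        # totals read from receipt images
print(find_mismatches(claims, extracted_totals))    # ['r2', 'r3']
```

The manual effort then shifts from reading every receipt to reviewing only the flagged ones.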

Sam Charrington: [00:30:54] So we’ve talked about a bunch of the interesting technical challenges at the lower level, at OCR.
At the form level, you know, is that kind of a packaging of OCR, or does it have its own technical challenges to overcome?

Cha Zhang: [00:31:13] Actually it has a lot of very interesting challenges. One piece of work that recently came out of Microsoft Research is, you know, targeting exactly this problem. Just think about it. Parsing these invoices and receipts is essentially sort of a language problem, because you have this text there. The challenge here is that these are images, so you run OCR on them. In a typical language dataset that you’ve scraped from the internet, you know, Wikipedia, you basically have the ordering of the words already. But if the data comes from an image, you can detect these text lines, but it’s actually very difficult to define the reading order of the text lines, and ordering them is by itself a very challenging problem. When you have images in the wild, paper can be curved, you know, can be crumpled, can be rotated, there’s perspective, you know, all kinds of issues. They can have background text, all of these. So the particular approach that MSRA came out with is called LayoutLM. It’s actually a modified BERT model. It’s also a language model, but in addition to the language, we also embed 2D information, like what is the X, Y position of the bounding box of the text.
With that information, and actually this can all be trained without supervision, it’s unsupervised pre-training, we are able to learn this kind of spatial relationship in these invoices without coming up with an explicit reading order. With that, we can actually do a lot of this key-value extraction really well.
There’s also quite a lot of advanced research looking into, say, relation networks, where you see two text lines near each other and you can predict the relationship. Again, this is similar to OCR, where you have this bottom-up pixel-level classification and you want to group the pixels. Here, you want to group key and value pairs.
There’s also a lot of advanced research in graph convolutional networks, where you do convolution over a graph, and the graph is defined by connecting nearby text lines. Again, this is an approach that does not require a reading order but just looks at the spatial relationships. So these are all actually very exciting extensions of language that also use visual information to help parse this vertical data more accurately.
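
The core idea of embedding 2D position alongside the word can be shown with a toy Python sketch. It is not the LayoutLM implementation: the random tables stand in for learned parameters, the tiny vocabulary and dimension are made up, and coordinates are assumed to be integers normalized to the range 0 to 1000. It only illustrates that the same word at two different places on the page receives two different input vectors.

```python
import random

random.seed(0)
DIM = 8             # toy embedding size
MAX_COORD = 1001    # coordinates assumed normalized to integers in 0..1000


def make_table(rows, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


vocab = {"Invoice": 0, "Total": 1, "$42.50": 2}
word_emb = make_table(len(vocab), DIM)
x_emb = make_table(MAX_COORD, DIM)   # shared table for left and right edges
y_emb = make_table(MAX_COORD, DIM)   # shared table for top and bottom edges


def embed(token, box):
    """Sum the word embedding with embeddings of the box corners (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    parts = [word_emb[vocab[token]], x_emb[x0], y_emb[y0], x_emb[x1], y_emb[y1]]
    return [sum(values) for values in zip(*parts)]


a = embed("Total", (600, 900, 680, 920))   # "Total" near the bottom right of the page
b = embed("Total", (50, 40, 130, 60))      # the same word near the top left
print(a != b)  # True: position changes the representation
```

Because position enters through the input vectors, the model can pick up layout regularities without ever being told an explicit reading order.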

Sam Charrington: [00:34:09] Interesting. Yeah, I think… At a quick thought, I would’ve imagined that, you know, maybe the top part of the stack is more rule based and the bottom part of the stack is more machine learning based, but it sounds like they’re even... I don’t know, relatively, but there are a bunch of really interesting…

Cha Zhang: [00:34:33] We are doing a lot of machine learning stuff on the top as well.

Sam Charrington: [00:34:37] I’m imagining the, you know, when you talk about relation net, for example, on an invoice you could have date, and then the date, you know, horizontally next to it, or you can have date and then the date beneath it.

Cha Zhang: [00:34:50] Yes.

Sam Charrington: [00:34:50] You may have an address box and then a bunch of text that comes beneath it. It would be nice to know that, you know, we’re talking about the address here. That’s part of the idea of the structured text extraction.
So in that, you mentioned relation nets and graph CNNs. Are those two approaches to solving the same problem, or are they solving different aspects of the problem?

Cha Zhang: [00:35:13] They solve different aspects of the problem, and they can also be used to solve the same problem. I mean, right now, the main focus for us is using them for extracting key-value pairs. This is for both prebuilt and customization. Think about it: if it’s an invoice and you want the vendor name, well, it’s a name. Certainly, you know, there’s the text information, because you see it looks like a vendor name, so this probably is a vendor name. But some invoices don’t even have the key in the invoice.

Sam Charrington: [00:35:48] Right.

Cha Zhang: [00:35:49] You don’t even have the word vendor name there, so how do you figure out this thing is still vendor name?
So, there, you rely on information that’s language, and also on how the document is laid out. Like, okay, the font size may matter. You know, the position of the text may matter. So we are looking into combining all this information to come up with a better decision on those fields.

Sam Charrington: [00:36:21] So, how does a graphical representation or way of thinking about the document get you to a solution to these kinds of problems? You know, for example, the unlabeled vendor name?

Cha Zhang: [00:36:33] The graph kind of approach is basically… so you’ve got a bunch of text lines detected by the OCR, and you connect these text lines with their neighbors. You define basically how strong these connections are. Actually, it’s not defined. You learn these relationships by looking at the text, looking at the relative positions, looking at the font similarity. Like, one issue that you just mentioned was the address, because you have multiple lines of address. How do you know they actually belong to the same address? Right? So all this side information could be very helpful in determining that they should be grouped together. In the graph convolutional kind of model, you learn a convolutional network that aggregates, basically, at the center node by computing over all the neighboring nodes, where each node is a text line. So basically, the model learns by looking not only at the current text line that’s in focus but also at all the nearby text lines, and decides, well, given all this contextual information, it does look like this is a vendor name.
I guess that’s a very high-level conceptual description of why it would work, but it’s data-driven machine learning, so that the model [inaudible].
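
The neighbor-aggregation idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the model discussed here: the graph connects each text line to its k spatially nearest lines, the aggregation is an unweighted average, and there are no learned weights or nonlinearities as a real graph convolutional layer would have. The coordinates, features, and function names are made up.

```python
import math


def nearest_neighbors(centers, k):
    """For each text line, indices of the k spatially closest other lines."""
    graph = []
    for i, (xi, yi) in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.hypot(centers[j][0] - xi, centers[j][1] - yi),
        )
        graph.append(others[:k])
    return graph


def aggregate(features, graph):
    """One message-passing step: average each node's feature with its neighbors'."""
    updated = []
    for i, neighbors in enumerate(graph):
        group = [features[i]] + [features[j] for j in neighbors]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


# Three stacked address lines and one distant line (hypothetical box centers).
centers = [(10, 10), (10, 30), (10, 50), (300, 400)]
# Toy 2-D features, e.g. produced by a visual or text encoder.
features = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

graph = nearest_neighbors(centers, k=2)
print(graph)                       # [[1, 2], [0, 2], [1, 0], [2, 1]]
print(aggregate(features, graph))
```

After aggregation, each text line’s vector carries information about its surroundings, which is the context a classifier can use to decide that an unlabeled line is, for example, a vendor name or part of a multi-line address.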

Sam Charrington: [00:38:06] As you’re solving problems like this, are you often needing to re-label your dataset? For example, I’m imagining that early on in developing an algorithm like this, you have a bunch of invoices, and you draw a bounding box around the addresses and you say, this is the address.
Then you say, ‘Oh, well, the font information is a whole new dataset you have to label.’ Well, this is… Are you going in and having people label Helvetica versus Arial? That seems a bit fine grained and hard to actually get experts to label. Or is it more abstract than that?

Cha Zhang: [00:38:48] We usually only label the end goal, which is the field that you’re going to extract. So, for example, you want to extract a vendor name, vendor address, total text, you basically draw a bounding box in those regions and use that as ground truth data.

Sam Charrington: [00:39:06] Got it. I think we’re going to the same place. When you say font…

Cha Zhang: [00:39:11] When I say font, actually it’s in some way implicit, in the sense that we’re taking these bounding boxes, we’re extracting image information. Right? So think of it as, let’s say, running a convolutional network to extract a feature of that part of the text region, the text line. So, this feature is essentially all the visual information that can be helpful in deciding or determining the relationship between text lines. So if features are similar, it probably means they are a similar font, they are a similar size, you know, so those kinds of… So, yeah, I think that seems to be sufficient.

Sam Charrington: [00:39:55] So you’re not trying to kind of featurize your underlying images into these distinct things, which is what I inferred when you said font. Do you look at the, you know, is there an analogy to kind of looking at the layers of the network, and when we do this with CNNs, you see, like, textures and things like that, is there some analogy that you’ve seen in looking at the layers of the network that says, ‘Oh, this layer is like identifying fonts’?

Cha Zhang: [00:40:32] No, we haven’t been going there yet. Well, I guess it’s certainly interesting to look at it. My take is most likely, font is just one attribute. I believe there are many other things. Yeah, I think it’ll be interesting to look at these features visually. Yeah.

Sam Charrington: [00:40:54] We’ve talked throughout the discussion about kind of the ways that OCR and this form recognition problem kind of blend the vision domain and NLP domain, and language models have come up quite a bit. Is there a little bit more kind of depth we can go into there? Some of the ways that you see NLP, and particularly the advances in NLP over the past few years, kind of influencing the problem and the way you solve it?

Cha Zhang: [00:41:32] Yeah. As I said, I see NLP plays a very important role in these verticals. After all, these invoices, receipts, business cards, these are all human artifacts. They’re kind of language artifacts in some way. Right? So, all of the kind of latest state of the art in language modeling, we definitely want to leverage. The thing I mentioned earlier, like the layout approach, is one way to leverage them: by using the language model, but also embedding additional visual information, and hopefully solving these problems effectively, because the input is really different, right? You know, previously you take text as input. Here, we’re taking a bunch of text lines with their locations and bounding boxes as inputs, and the algorithm can naturally kind of solve these problems.

Sam Charrington: [00:42:30] And, is it also trying to do the traditional language model predicting the next character or word or set of texts?

Cha Zhang: [00:42:38] Yeah, the way we train them is very similar. Basically, you mask text, you mask some words and try to predict them. Certainly you can use a lot of others. I think, you know, like I know recently people use translation targets. You can use autoencoder kind of targets.
This is a really active research area at this point. I think we’re still just scratching the surface, although we’re already seeing very, very promising results. So we definitely want to look deeper into this and see how well this really can push the state of the art.

Sam Charrington: [00:43:21] Kind of continuing on that thread of the active research areas and what the future holds in this area, what are you most excited about in this domain of OCR and in general, extracting text from documents, vertical applications and the like.

Cha Zhang: [00:43:42] Yeah, I think, we have been working on this problem for quite a while, but I think there’s still a lot of interesting problems.
Only when we start to work with customers do we realize, you know, there are problems we haven’t been able to solve. I can just name one, for example: table extraction sounds trivial, but when you actually look at all the existing tables in the world, the simplest ones are those with explicit cell borders where you have straight lines, but in reality, these tables can have no cell boundaries at all. It can be mixed on top with STEM, you know, all these things that are kind of making the problem extremely hard. So that’s just another one that is extremely challenging, but we want to solve.
Another thing that I sort of briefly mentioned earlier was the customization part of these verticals. How do you customize to the customer’s own data instead of having these pre-built models? ’Cause inevitably, you will have data that doesn’t work with these pre-built models. How do you allow customers to have a way to build their own models that still work? That by itself is a very challenging problem, because asking customers to label a lot of data is painful. They don’t want to go there. So either we go unsupervised or we go with a very, very limited amount of supervision data. In such a case, how do we adapt our model so that it can work on the documents where the customer realizes that the pre-built model has failed? That’s also a very interesting kind of research problem that we are looking into. In vision and in language, there is low-shot learning. It’s definitely applicable to the problem here as well.

Sam Charrington: [00:45:50] In the case of some of the productized vision offerings, and Azure does this as well, the user is able to upload their own set of labeled data, and the results for object detection are kind of fine-tuned against the user’s data set.

Cha Zhang: [00:46:13] Yeah.

Sam Charrington: [00:46:14] Do the OCR and form recognition offerings, are they providing something similar? Like, can you upload it? Can I upload my own invoices? You’re doing some kind of transfer learning or, well. If you are, what are you doing to take advantage of what the user’s providing?

Cha Zhang: [00:46:33] So we do have a product called Custom Form, which allows customers to upload a few samples. We usually say a minimum of five samples. So, say you have an invoice that doesn’t work with existing models, and you want to solve the problem. Then you upload five invoices that are similar, meaning these are from the same vendor or kind of look similar in structure, and we can figure out these key-value pairs and extract them, either unsupervised or supervised. Right? Unsupervised means the customer doesn’t need to label anything.
So you upload the five documents. The information we’re gaining by looking at these five documents is, well, these documents are supposed to be similar and therefore, there are going to be a bunch of words in these documents that are actually common across them. This commonality helps us to tell, well, this is probably part of the empty form or the template of the form, while the things that vary across forms must be information the customer has filled in, as it’s different from sample to sample. So with that information, we can actually extract key-value pairs without any supervision. All you need is to upload five similar documents. Of course that works to a certain degree, but if you’re still not happy with the accuracy, we provide a way for you to label your key-value pairs.
So here we have a UX where you can go and label the fields you care about by essentially highlighting the OCR text lines where you think, this is the value I want to extract. Then we actually learn a model out of five samples and produce a model that can be used by the customer to extract these values.
The accuracy is actually normally pretty high, in the 90 to 95 percent range, actually.
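The unsupervised idea described above (words shared by all samples are probably template, words that vary are probably filled-in values) can be sketched in a few lines. This is a toy illustration under strong assumptions, not the product's algorithm: the documents are invented, tokens are split on whitespace, and the pairing rule (a value belongs to the nearest preceding template word) is a hypothetical simplification.

```python
from collections import Counter

# Five invented documents that share one layout.
docs = [
    "Invoice Number 1001 Date 2020-01-05 Total 42.00",
    "Invoice Number 1002 Date 2020-02-11 Total 17.50",
    "Invoice Number 1003 Date 2020-03-20 Total 99.99",
    "Invoice Number 1004 Date 2020-04-02 Total 250.00",
    "Invoice Number 1005 Date 2020-05-30 Total 8.75",
]

def find_template_tokens(docs, min_share=1.0):
    """Tokens that appear in at least `min_share` of the documents are
    treated as part of the empty form (the template)."""
    token_sets = [set(d.split()) for d in docs]
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    return {tok for tok, n in doc_freq.items() if n / len(docs) >= min_share}

def pair_keys_with_values(doc, template):
    """Naive pairing: each varying token becomes a value of the nearest
    preceding template token."""
    pairs, key = {}, None
    for tok in doc.split():
        if tok in template:
            key = tok
        elif key is not None:
            pairs.setdefault(key, []).append(tok)
    return pairs

template = find_template_tokens(docs)
pairs = pair_keys_with_values(docs[0], template)
```

A real system would work on OCR text lines with positions rather than a flat token stream, which is where the labeling step mentioned above helps when this kind of heuristic falls short.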

Sam Charrington: [00:48:38] So when the customer does this, is this process entirely learned or is there a human in the loop kind of exception handling element to it?

Cha Zhang: [00:48:50] I guess this is probably where I take a step back. I think for all the products, the OCR process today, OCR has made a significant advance, but if you actually care about the numbers, think about the invoice. Right? If your total is wrong, it’s really bad. So, what we recommend is definitely for people to have agent backup.
For all of the products we offer, we give people a confidence, right? So how confident we are about the extraction of a particular value, and different customers can choose their own threshold and have an agent look at them. But I think, with today’s accuracy, we don’t recommend kind of straight-through processing, unless you are handling certain specific applications.
I can give you an example. For example, if you’re verifying a receipt image against employee-entered data, there you can go automatic, right? ’Cause if the OCR produces a different number than the employee, well, you will need somebody to look at it anyway, but if they actually match, well, that probably means it’s okay.

Sam Charrington: [00:50:08] Right.

Cha Zhang: [00:50:08] So the application, you can automate it more.
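The two patterns described here, a customer-chosen confidence threshold and automatic acceptance when OCR agrees with employee-entered data, might look like the following sketch. The field names, confidence values, and the default threshold are illustrative assumptions, not values from any real service response.

```python
def route_fields(fields, threshold=0.9):
    """Split extracted fields into auto-accepted values and values that an
    agent should review, based on a per-field confidence score."""
    accepted = {k: v for k, (v, conf) in fields.items() if conf >= threshold}
    review = {k: v for k, (v, conf) in fields.items() if conf < threshold}
    return accepted, review

def receipt_matches(ocr_total, employee_total, tolerance=0.005):
    """Accept automatically only when the OCR total agrees with the amount
    the employee typed in; any mismatch needs a person anyway."""
    return abs(ocr_total - employee_total) <= tolerance

# Hypothetical extraction result: field -> (value, confidence).
extracted = {
    "vendor_name": ("Contoso Ltd.", 0.97),
    "total": ("42.00", 0.71),
}
accepted, review = route_fields(extracted, threshold=0.9)
auto_ok = receipt_matches(42.00, 42.00)
needs_agent = not receipt_matches(42.00, 24.00)
```

Raising `threshold` sends more fields to review and lowers the chance of a wrong total slipping through; where to set it depends on how costly an error is for the application.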

Sam Charrington: [00:50:13] Got it. So, the question that I was asking is slightly different though, and you know, so say you’ve got someone using automated form recognition and they have their five examples that they haven’t been happy with, and they submit those in through some website or API. Is someone at Microsoft taking those manually through some process to try to figure out why they’re not working, or are they thrown into some training job and then the customer’s result gets better?

Cha Zhang: [00:50:48] Okay. No, we don’t look at the customer’s data. So this is a fully automated product, meaning, you know, the customer basically labels these files. They call an API to train a model. The whole process is automated.

Sam Charrington: [00:51:04] So under the covers, are they kind of forking off their own model? The last few layers are getting cut off and it’s fine tuning, or is it more elaborate than that, or…?

Cha Zhang: [00:51:17] It’s more elaborate than that. Underneath the hood, there are multiple steps. We leverage a lot of information in these sample documents. For example, as I mentioned earlier, there will be words common across these samples. Those are very strong indicators that this might be part of the empty form, the part of the form where you probably think these are not so interesting to the customer. Transfer learning is certainly one way of doing that. Right now we actually train these models without transfer learning.
So the model is actually trained from scratch from very few samples. We’re able to do this because of some very interesting work that we have done to basically augment this data, to make sure that you have sufficient data to still be able to train a model out of five samples only.
This can be a feedback loop as well. So, if the customer’s not happy with a model trained on five samples, you can upload more and we just train a new model for you. So every time you train, you just get a new model. That way, it’s a feedback loop where the customer can keep improving their model until it gets to a certain stage where it’s really performing for the customer.

Sam Charrington: [00:52:53] So when you say augmenting the five that they’re providing, are we talking about data augmentation and the sense of a transformation pipeline that kind of changes, adds noise, rotates, that kind of thing? Or are we talking about, you’ve got some other data set that you’re adding to their five and training it on that aggregate data set, and that’s how you’re producing a better model?

Cha Zhang: [00:53:21] Both. Although I think the latter one is more the case, because actually, when customers label these data, we ask them to provide some additional information. For example, they label, this is a date. We know it’s a date. So in this way we can artificially create more data to fill the form, so that we can produce more data to train the model. Also, we use machine learning algorithms that are robust to very few examples. So, that way we can learn with this limitation. Yeah. Normally, if you look at many of the other offerings that people provide, you have to train with hundreds of examples. Here, we’re pushing it really down to five and we hope to push it even lower in the future.
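Knowing a labeled field's type makes it possible to generate extra training samples by filling the same form with fresh values. The sketch below shows that idea only; the template string, the field names, and the value generators are hypothetical, and the interview does not describe how the actual augmentation works.

```python
import random
from datetime import date, timedelta

def random_date(rng):
    """A plausible date somewhere in a ten-year window."""
    return (date(2015, 1, 1) + timedelta(days=rng.randrange(3650))).isoformat()

def random_amount(rng):
    """A plausible currency amount with two decimals."""
    return f"{rng.uniform(1, 5000):.2f}"

# One generator per field type the customer can assign when labeling.
GENERATORS = {"date": random_date, "amount": random_amount}

def synthesize(template, field_types, n, seed=0):
    """Create n synthetic (text, labels) pairs by filling the template
    with freshly generated values of the right type."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        values = {name: GENERATORS[kind](rng) for name, kind in field_types.items()}
        samples.append((template.format(**values), values))
    return samples

template = "Invoice Date: {invoice_date}  Total: {total}"
samples = synthesize(template, {"invoice_date": "date", "total": "amount"}, n=3)
```

Each synthetic sample comes with its labels for free, since the generator knows which value it placed in which field.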

Sam Charrington: [00:54:11] So I’m assuming that this is a stacked problem and you’ve got some low-level OCR models, for example, that are trained with many, many documents. What you’re doing with this Form Recognizer custom data is more at the top end of that stack. Is the off-the-shelf model that I’m using without the five-example customization also trained on relatively few examples?

Cha Zhang: [00:54:44] What do you mean?

Sam Charrington: [00:54:45] I guess maybe I’ll jump ahead to the conclusion that I’m drawing.
What’s confusing me is how are you getting better results with few examples if you’re not using any kind of transfer? I guess I heard in your explanation that you’re not doing any kind of transfer.

Cha Zhang: [00:55:03] So right now the custom forms support training a model, and these models are usually… each model is geared towards one particular form type. So in some way you can think this problem is actually restricted. It’s actually an easier problem. It’s not like a pre-built invoice model, where essentially you want to handle all your invoices. Here we’re handling one particular invoice coming from, I would say, one particular vendor. I say they usually use this template.

Sam Charrington: [00:55:37] Got it.
So the customer then, do they call a unique API to resolve invoices of this type? Or is that then ensembled, and then there’s something that decides whether it’s of the type that you’ve built the new model for?

Cha Zhang: [00:55:55] Yeah. So here’s kind of the recommendation that we give to customers, right? So you maybe start with the pre-built model, and the pre-built model may work, and then your job is done.
If you’re happy, go. Then, let’s say you have a lot of invoices and out of a thousand, 10 of them don’t work. So what we offer the customer is, well, take these invoices and you can train specific models for these 10 different invoices. You might need to train more than one special model, because these invoices may look very different. So imagine you train like 10 different custom models for this. We actually also offer kind of automatic invoice classification. So there is an API called model compose, where we can compose these 10 small models into one. So, all you need is to call into that one. By calling into that one, we also provide you a confidence, because during testing, when the customer sends the invoice in, we don’t really know whether it’s one that doesn’t work with the pre-built one or whether it works well with the pre-built one. So you send this invoice first to the customized version of the model, and we will tell you, ‘Hey, it doesn’t look like any of the 10 you have trained.’ So in this case, you will revert back and say, okay, now I’m calling the pre-built invoice model, ’cause you sort of know that pre-built actually works well for that. So that’s what we recommend customers to do.
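The recommended flow (try the composed custom model first, fall back to the pre-built model when nothing matches) can be sketched as below. Everything here is a stand-in: the keyword-matching "models", the result dictionaries, and the `min_confidence` cutoff are assumptions made for illustration and do not reflect the real model compose API.

```python
def make_custom_model(name, marker):
    """Toy custom model: confident only on documents containing its
    vendor marker. Stands in for a model trained on one form type."""
    def model(doc):
        confidence = 0.95 if marker in doc else 0.05
        return {"model": name, "confidence": confidence}
    return model

def compose(models):
    """Combine several custom models into one callable that returns the
    most confident result among them."""
    def composed(doc):
        return max((m(doc) for m in models), key=lambda r: r["confidence"])
    return composed

def prebuilt_invoice(doc):
    """Stand-in for the general pre-built invoice model."""
    return {"model": "prebuilt-invoice", "confidence": 0.8}

def extract(doc, composed, prebuilt, min_confidence=0.5):
    """Try the composed custom model first; if the document does not look
    like any trained form type, fall back to the pre-built model."""
    result = composed(doc)
    return result if result["confidence"] >= min_confidence else prebuilt(doc)

composed = compose([
    make_custom_model("vendor-a", "Vendor A"),
    make_custom_model("vendor-b", "Vendor B"),
])
hit = extract("Invoice from Vendor B, total 42.00", composed, prebuilt_invoice)
miss = extract("Invoice from Vendor Z, total 17.50", composed, prebuilt_invoice)
```

With this routing, the caller only needs one entry point, and the general model keeps handling the invoices it already handled well.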

Sam Charrington: [00:57:34] Okay.
I dug into a little bit of the detail there, but it’s interesting to see kind of how the end-to-end problem is put together. In a case like this, the ends of that problem are on the customer side, not just the service that you’re offering, and so seeing how the pieces are put together is kind of interesting.
Awesome! Well, Cha, thanks so much for taking the time and walking us through some of the interesting things that are happening in these domains.

Cha Zhang: [00:58:12] Thank you for having me.

Sam Charrington: [00:58:14] Great! Thank you.