Spatial Analysis for Real-Time Video Processing

Sam Charrington: Hey, what’s up everyone!

We are just a week away from kicking off TWIMLfest, and I’m super excited to share a rundown of what we’ve got in store for week 1. On deck are the Codenames Bot Competition kickoff, an Accessibility and Computer Vision panel, the first of our Wellness Wednesdays sessions featuring meditation and yoga, as well as the first block of our Unconference Sessions proposed and delivered by folks like you. The leaderboard currently includes sessions on Sampling vs Profiling for Data Logging, Deep Learning for Time Series in Industry, and Machine Learning for Sustainable Agriculture. You can check out and vote on the current proposals or submit your own by visiting twimlai.com/twimlfest/vote/. And of course, we’ll have a couple of amazing keynote interviews that we’ll be unveiling shortly! As if great content isn’t reason enough to get registered for TWIMLcon, by popular demand we are extending our TWIMLfest SWAG BAG giveaway by just a few more days! Everyone who registers for TWIMLfest between now and Wednesday October 7th, will be automatically entered into a drawing for one of five TWIMLfest SWAG BAGs, including a mug, t-shirt, and stickers.
Registration and all the action takes place at twimlfest.com, so if you have not registered yet, be sure to jump over and do it now! We’ll wait here for you.

Before we jump into the interview, I’d like to take a moment to thank Microsoft for their support for the show, and their sponsorship of this series of episodes highlighting just a few of the fundamental innovations behind Azure Cognitive Services. Cognitive Services is a portfolio of domain-specific capabilities that brings AI within the reach of every developer—without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand, and accelerate decision-making into your apps. Visit aka.ms/cognitive to learn how customers like Volkswagen, Uber, and the BBC have used Azure Cognitive Services to embed services like real-time translation, facial recognition, and natural language understanding to create robust and intelligent user experiences in their apps. While you’re there, you can take advantage of the $200 credit to start building your own intelligent applications when you open an Azure Free Account. That link again is aka.ms/cognitive.

And now, on to the show!

Sam Charrington: [00:00:00] All right, everyone. I am here with Adina Trufinescu. Adina is a Principal Program Manager at Microsoft, working on Computer Vision. Adina, welcome to the TWIML AI podcast.

Adina Trufinescu: [00:00:12] Thank you so much for having me here.

Sam Charrington: [00:00:14] Absolutely. I’m really looking forward to digging into our chat. We’ll be spending quite a bit of time talking about some of the interesting Computer Vision stuff you’re working on, in particular, the spatial analysis product that you work on, and some of the technical innovation that went into making that happen. But before we do that, I’d love for you to share a little bit about your background and how you came to work in Computer Vision.

Adina Trufinescu: [00:00:40] Definitely. I joined Microsoft in 1998, so I’m a veteran here, and I started as an engineer. So, I have an engineering background, not a [research] background. Then after spending more than 10 years as an engineer working primarily on the Windows OS, I switched to Program Management, and I worked on a bunch of products until eventually I started working on speech recognition for Windows.

At the time I was working on Cortana speech recognition, and then, later on, I worked on speech recognition for HoloLens, the mixed reality device. Then, for the past year and a half, I have been working on computer vision. So I’m a Program Manager. I’m working with both the engineering and the research teams on shipping spatial analysis. Spatial analysis is a feature of Computer Vision in Azure Cognitive Services, and it just shipped this week, at Ignite, in public preview.

Sam Charrington: [00:01:37] Nice.

In any other year, I’d ask you, what’s it like down in Orlando? Because that’s where Ignite is historically held. I’ve been to the last several, and I’ve done podcasts from Ignite, but this time, we’re doing it virtually, as Microsoft is with the event. But I’m super excited to bring our audience a little bit of this update from Ignite. Tell us a little bit about the spatial analysis work that you’re doing there, and start from the top. What’s the problem that spatial analysis is trying to solve?

Adina Trufinescu: [00:02:13] So, before I talk about spatial analysis, let me give you a bit of background information about Azure Cognitive Services for Computer Vision because it’s important to highlight the difference and the novelty that spatial analysis brings. So, the existing Computer Vision services are image-based, meaning that basically, the developer passes in one image at a time, and then the inference happens either in the cloud or in a container at the edge. Then the result of the inference, image by image, is sent back to the developer. Spatial analysis brings the innovation of actually running Computer Vision AI on video streams.

So basically it analyzes live video. The video can also be recorded, but primarily it was designed for live video streams and real-time analysis of these video streams, and in this case for the purpose of understanding people’s movement in physical space. Then when you talk about people’s movement, we’re talking primarily about four things.

The first one is the more basic scenario of people counting. So, basically in a video stream, we run people detection and then either periodically or when the count of people changes, we provide the insights indicating how many people there are. Then we have social distancing, which is actually called people distance, but we call it social distancing for the obvious reason. Basically you can configure the desired threshold at which you want to measure the distance between people, so let’s take the magic six feet number, right? The AI is going to detect the people in the video stream, and then every time the people are closer than the minimal threshold, an event is generated, indicating that the minimal distance has not been respected.
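As a rough illustration of the two operations described here, the sketch below emits a count event when the number of detected people changes and a distance event whenever two people are closer than a configured threshold. It assumes people have already been detected and mapped to floor coordinates in feet; the function names and event fields are hypothetical and are not the service's actual output schema.

```python
from itertools import combinations
from math import hypot


def count_event(previous_count, positions):
    """Return a count event only when the number of people has changed."""
    count = len(positions)
    if count == previous_count:
        return None
    return {"type": "personCount", "count": count}


def distance_events(positions, min_distance_ft=6.0):
    """positions: dict of anonymous id -> (x, y) floor coordinates in feet.

    Returns one event per pair of people closer than min_distance_ft.
    """
    events = []
    for (id_a, pos_a), (id_b, pos_b) in combinations(sorted(positions.items()), 2):
        distance = hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
        if distance < min_distance_ft:
            events.append(
                {"type": "tooClose", "ids": (id_a, id_b), "distance_ft": distance}
            )
    return events
```

Checking every pair is quadratic in the number of people, which is acceptable for the handful of detections in a single camera view at one frame per second.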

So these are the first two, and then the next two are what we call entry and exit of physical spaces. So to actually detect when people enter or leave a physical space, we have two operations. One is called person crossing a zone – in and out of a zone, and the other is person crossing a line.

Let’s take the example of person crossing a line. Let’s say that you have a doorway, so you can draw a directional line, and then every time the bounding box of the detected person crosses and intersects the line, we can generate that event, telling you that the person entered the space or exited the space.
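The directional-line idea can be sketched with basic 2D geometry. This is a simplified illustration, not the service's implementation: instead of intersecting the full bounding box with the line, it tracks the bottom-center of the box (an assumed "foot point") between two consecutive frames and reports a crossing when that movement segment intersects the line. Which side counts as "enter" is a convention chosen here for illustration.

```python
def side(line, point):
    """Signed area test: > 0 on one side of the directed line, < 0 on the other."""
    (x1, y1), (x2, y2) = line
    px, py = point
    return (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)


def segments_intersect(p1, p2, q1, q2):
    """True when segment p1-p2 strictly crosses segment q1-q2."""
    d1, d2 = side((q1, q2), p1), side((q1, q2), p2)
    d3, d4 = side((p1, p2), q1), side((p1, p2), q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def foot_point(box):
    """Bottom-center of an (x_min, y_min, x_max, y_max) bounding box."""
    x_min, _y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, y_max)


def crossing_event(line, prev_box, curr_box):
    """Return 'enter', 'exit', or None for one person between two frames."""
    prev_pt, curr_pt = foot_point(prev_box), foot_point(curr_box)
    if not segments_intersect(prev_pt, curr_pt, line[0], line[1]):
        return None
    # Assumed convention: ending on the positive side of the line is an entry.
    return "enter" if side(line, curr_pt) > 0 else "exit"
```

Swapping the two endpoints of the line flips the direction, which is one way a "directional line" could be configured per doorway.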

Sam Charrington: [00:04:43] Awesome. So, on the context in which this is being offered, you mentioned the comparison to the image-based services. An image-based service might be something I’m using to do object detection or segmentation of an image. I’m passing that to an API and I’m getting a result back where the service is telling me what it thinks is in the image and the probabilities. And this is extending that same general idea to video, essentially.

Adina Trufinescu: [00:05:17] That’s right, and we started with spatial analysis for people movement. We’re looking to extend this to other domains for other relevant scenarios in the future.

Sam Charrington: [00:05:28] Can you give us an example of the other types of scenarios that folks might want to perform on video?

Adina Trufinescu: [00:05:36] So there are many industries where this is relevant. So, basically you can think about retail, which currently is targeted by this person movement analysis, but think about vehicle analysis. That would be another kind of object that, when detected in a video, lets you generate interesting AI insights and enables interesting scenarios.

Sam Charrington: [00:06:02] So, yeah, from even that explanation, I get that unlike an image-based service where generally, these work along the lines of ImageNet where you have these many classes of things that can be detected – toys and fruit and oranges, and things like that.

In video, you’re starting with very specific classes. Can you talk a little bit about why that is? Is it use case driven and that counting people and vehicles, and very specific things are more interesting in video than counting random objects? Or is it more a technical issue or limitation?

Adina Trufinescu: [00:06:46] Oh, it’s not a limitation.

So, we started with understanding people movement because this is where the customer signal was. So, I’ve mentioned retail. We also have many scenarios in manufacturing or in real estate management, and the current events were also informing our decisions on where to start. But the way the video pipeline works, and the way the detection models are inserted in the video pipeline, is fairly generic, which is why we’re looking at enabling other domains in the future. So basically, the detector model that we have for people today can easily be swapped with a different detector model for a different domain.

Sam Charrington: [00:07:24] Okay. Okay. I’m thinking about the use cases. It sounds like the use cases that you are envisioning are camera-based video streams, as opposed to, I’m going to pipe in a stream of commercial television and ask your service to find anytime a particular can of Coca Cola shows up, or something like that. That’s another use case that I see every once in a while, but clearly it’s not one you’re going after at this point.

Adina Trufinescu: [00:07:59] Not for now, not for now. Speaking about the cameras that we work with, we don’t actually require a given model. Any model that supports the RTSP protocol works, which is like the universal protocol. Well, I shouldn’t say universal, but it’s a common protocol for video streaming. So, you can have a camera or you can have an NVR; any video management system that is capable of streaming over the RTSP protocol, we work with that.

Sam Charrington: [00:08:31] Okay. NVR being network video recorder, surveillance use case, or a technology used in that use case.

Adina Trufinescu: [00:08:41] Yeah, that’s right. So basically we’re looking not only at greenfield sites where customers install new cameras, but also at existing cameras and existing video systems.

Sam Charrington: [00:08:51] So when I think about this type of use case, it makes me think of something like a Ring camera, where maybe I can grab a Raspberry Pi or something like that, and have it call out to the service, put a little USB camera on my Raspberry Pi and stick it by my door, and do a roll-your-own Ring camera and have it count people that go into some zone or something like that. Could I do that with this?

Adina Trufinescu: [00:09:22] You can do something like that, but the device that we are supporting and have tested extensively for is actually more of a heavyweight device. It’s an Azure Stack Edge. The idea here is that there are spaces where you can have dozens of cameras or you can have hundreds of cameras. So, imagine a warehouse where you could potentially have hundreds of cameras.

Basically, we want you to have a way to deploy at scale and manage these cameras at scale. Then, because of the sensitivity of video around privacy concerns and data control concerns, that’s where Azure Stack Edge comes in, where you can actually keep the video on your premises. All the processing happens on the Azure Stack Edge device, and then only the resulting de-identified data about the people movement is sent to the cloud, to your own service in your own tenant, and then you can build a solution in the cloud. I should be more specific that the Azure Stack Edge device that we are running with is actually the one that has the Nvidia T4 GPU. So even more of a departure from just a [Nano]. This is the initial release. This is the public preview, and we’re looking at extending the range of devices and hardware acceleration capabilities to something lower, something less than Azure Stack Edge.

Sam Charrington: [00:10:55] Got it. Then for folks that aren’t familiar, Azure Stack Edge is essentially a pretty heavyweight hardware setup where you’re essentially running the Azure Cloud in your data center. That’s the general idea, right?

Adina Trufinescu: [00:11:09] Yeah, that’s right, and if you have a small space where you have, let’s say 20, 50 cameras, you don’t really need something of the extent of a data center. You need a room, a server closet with a reasonable temperature where you can run these devices.

Sam Charrington: [00:11:33] Okay. Okay. So I’m going to have to wait quite a while for this technology to be democratized, if you will, to the point where I’m running it on a Raspberry Pi with a USB camera.

Adina Trufinescu: [00:11:48] I was hoping it’s not quite a while but not yet.

Sam Charrington: [00:11:53] Not yet, and I think in this day and age, I think we have to talk about surveillance and the role of technologies like this, and enabling different types of surveillance use cases, some of which are problematic and some of which are necessary in the course of doing business. What’s the general take on making this kind of service available for those kinds of use cases?

Adina Trufinescu: [00:12:24] So when we released spatial analysis, we had in mind what Microsoft calls responsible AI and innovation. So this is where we recognize the potential for harmful use cases, and with this release, we also released a set of responsible AI guidelines which had three things in mind. The first one is protecting the privacy of the end user; the second is providing transparency such that the end user and the customer understand the impact of the technology; and then, in the end, promoting trust. The idea there is that we want to pass this responsible AI guidance and these practices on to our developers and the people that actually build the end-to-end solutions, such that the end users, the people actually impacted by the technology, can be protected, and the human dignity of these people is upheld.

Sam Charrington: [00:13:18] So it sounds like even if I did have an Azure Stack Edge, I couldn’t necessarily just turn on the service and do whatever I want with it.

Adina Trufinescu: [00:13:26] So, we have a process for that. We take our customers through, at least for this public preview, where you get access to the container. I’m not sure if I mentioned this, but we started not with an Azure service in the cloud but with a Docker container that you run on your premises on Azure Stack Edge, and basically the container, anybody can download it, but to actually access the functionality in the container, we want you to fill in this form.

You describe the use cases that you are considering for your solution and your deployment, and then we will look together whether these use cases align with the responsible AI guidance and then, if they do, obviously you can proceed, and then if they don’t, we’ll have that conversation to make sure that the responsible AI guidance is upheld.

Sam Charrington: [00:14:15] Okay. Well, let’s maybe shift gears and talk a little bit about some of the tech that went into enabling this. In order to do what you’re doing, you’re doing some fairly standard things like object detection. Is this fresh out of research papers, new techniques to do the detection and classification, or what are some of the things that you’re doing there and the challenges that you ran into in productizing this?

Adina Trufinescu: [00:14:44] So, I think the challenges, they vary depending on the four use cases. So let me try to break it down and then address each one. So for instance, we are running a DNN for people detection, and then we started with something like more heavyweight, and then we had to transition because of the performance concerns.

I’m going to come back to that in a second, but basically we had to transition to a lighter model.

Sam Charrington: [00:15:09] A big ResNet…?

Adina Trufinescu: [00:15:11] Let’s say a big ResNet or a smaller ResNet.

Sam Charrington: [00:15:16] Okay.

Adina Trufinescu: [00:15:17] I’m going to leave it at that. But the idea there is that, for instance, for something like people counting, initially for all operations, we started thinking that we could stream at 15 frames per second, and then we did that. Then we noticed that, to get maximum usage out of that Azure Stack Edge, which is quite heavyweight, right, we want to run as many video streams as possible. So basically we tried to go as low as possible in terms of frame rate, and for something like person count, the person count from one second to another doesn’t change dramatically.

So for something like person count or person distance, we went from 15 frames per second to one frame per second. Then we were able to maximize the usage of the GPU because now the DNN runs at the lower frame rate, and this way you can fit in more video streams. The challenge we had, for instance, with social distance and with person count was around generating ground truth.
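To make the frame-rate trade-off concrete, here is a minimal sketch of downsampling a 15 frames-per-second stream to one inference per second, plus a back-of-the-envelope estimate of how many streams fit on one GPU. The `inferences_per_second` budget is a hypothetical parameter, not a measured figure for any real device, and the function names are illustrative.

```python
def sample_frames(frames, source_fps=15, target_fps=1):
    """Yield (index, frame) for only the frames that should reach the detector."""
    step = max(1, round(source_fps / target_fps))
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


def max_streams(inferences_per_second, fps_per_stream):
    """How many streams fit if each stream needs fps_per_stream inferences per second."""
    return inferences_per_second // fps_per_stream
```

With an assumed budget of 60 detector inferences per second, `max_streams(60, 15)` gives 4 streams while `max_streams(60, 1)` gives 60, which is the kind of gain that motivates dropping the frame rate for counting and distance.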

So [we create] a 10 minute video. Let’s say you have a point in the video and you have to annotate the distance between the people; just looking at the video, you cannot figure out the physical distance between people. So that is where we use synthetic video data. Basically, we are using the same technology that our colleague teams in mixed reality for HoloLens are using, where we generate these game scenes where we can control the positioning of the people and their relative positioning.

So that was the first challenge for person distancing. The second challenge is that the DNN is going to tell you whether there are people in a frame, but it’s not going to tell you the actual physical distance. For that, you need the camera to be calibrated. The initial thinking was that we would ask the customer for the camera height, for the angle, for the focal distance, but that wasn’t practical either. So this is where we had to come up with a calibration algorithm for the camera. Before the actual operations, where the DNN runs for the purpose of the operation, the calibration algorithm kicks in, and we ask the customer to have at least two people in the camera field of view. Then the algorithm detects these people and makes assumptions about their positioning, and this way the camera height and the focal distance are calculated. Then we pass them back to the customer as output, and we want to make sure that they reflect reality. So between ground truth and camera calibration, these were the two challenges for person detection.
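To see why camera height and focal length are needed before a distance in feet can be reported, the following sketch projects a pixel onto the floor with a simple pinhole camera model. It shows how calibration outputs could be used, not the calibration algorithm itself, and it assumes a flat floor, no lens distortion, and a camera pitched downward by a known angle. All parameter names are hypothetical.

```python
from math import cos, hypot, radians, sin


def pixel_to_ground(u, v, cam_height, pitch_deg, focal_px, cx, cy):
    """Project image pixel (u, v) onto the floor plane.

    cam_height: camera height above the floor (any length unit).
    pitch_deg:  downward tilt of the optical axis (90 = looking straight down).
    focal_px:   focal length in pixels; (cx, cy) is the principal point.
    Returns (x, y) on the floor, with the camera directly above (0, 0).
    """
    pitch = radians(pitch_deg)
    dx, dy = u - cx, v - cy
    denom = dy * cos(pitch) + focal_px * sin(pitch)
    if denom <= 0:
        raise ValueError("pixel is at or above the horizon; the ray never hits the floor")
    t = cam_height / denom
    return (t * dx, t * (focal_px * cos(pitch) - dy * sin(pitch)))


def ground_distance(foot_a, foot_b, **camera):
    """Floor distance between two people given the pixels where their feet appear."""
    ax, ay = pixel_to_ground(*foot_a, **camera)
    bx, by = pixel_to_ground(*foot_b, **camera)
    return hypot(ax - bx, ay - by)
```

For example, with a camera 10 feet up looking straight down and a 1000-pixel focal length, two foot points 200 pixels apart horizontally come out about 2 feet apart on the floor. The same pixel gap would mean a different physical distance at another height or tilt, which is why the raw detections alone are not enough.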

Sam Charrington: [00:18:06] All right. So just maybe taking a step back. We started out talking about counting people and, it sounds like there’s some research or work that went into getting from this big heavyweight model to the smaller model. So that was one element of it, but also, just fine tuning the end to end process in terms of how quickly you’re able to do it.

In other words, what frame rate you’re using for counting people. That was part of counting people?

Adina Trufinescu: [00:18:43] Yes, that’s right.

Sam Charrington: [00:18:44] Was it just an iterative process, where you keep reducing the frame rate until things start breaking and you’re not able to count accurately? Or was that something where you’re building out models to tell you how low you can go or something? What all went into that?

Adina Trufinescu: [00:19:02] So, it was a little bit of both. It was a constant measurement of performance and accuracy. In terms of frame rate, we would go lower and lower to the point where we could still maintain the accuracy and precision rates. Then you reach a breaking point, and that’s how you know that you have to stop. I wouldn’t say that this was exactly how it happened, but when you talk about frame rate and doing all these tests, this is where the engineering comes in. Then when it comes to the performance of the DNN and the models, this is where research teams are making progress in parallel.

So basically, it was an iterative process where engineering and research worked together to arrive at what seems to be the best balance between performance and accuracy.

Sam Charrington: [00:20:01] As part of that counting people process, you’ve got two sub problems there. One is identifying the people in the frame, and then you also have to know from one frame to the next, which person is which. Is that a part of the challenge here?

Adina Trufinescu: [00:20:16] Yeah, that’s right. So see, especially for person crossing in and out of a zone and person crossing the line, that’s where the tracking part of the algorithm comes in. To be able to tell that it’s the same person from one frame to another, in addition to the DNN model, we are running a combinatorial algorithm. Detection is telling you that I have these people. Then by extracting features, we can run the combinatorial algorithm to tell that from frame T minus one to frame T, we have the same set of people. Then as people are detected across the frames, they get this anonymous identifier which tells you that it is the same person from frame one to frame ten. Something like that.

Sam Charrington: [00:21:09] You mentioned extracting features to help the combinatorial algorithm. Are you pulling those out of the bowels of the DNN or is this a separate pipeline or a separate flow that is identifying features in a more traditional computer vision way?

Adina Trufinescu: [00:21:28] So we actually pull it from the DNN and we have the typical features that you would expect, like motion vector, velocity, and direction in the 2D space, and frame by frame, we’re looking at all these attributes. Then we’re making the decision whether the same person shows up across the various frames.
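The interview does not say which combinatorial algorithm is used, so the sketch below is only one common way to frame frame-to-frame association: as a minimum-cost assignment between existing tracks and new detections. It uses plain 2D position as the only feature and a brute-force search over assignments, which is fine for a handful of people; the names and the `max_distance` gate are assumptions made for illustration.

```python
from itertools import permutations
from math import hypot


def associate(tracks, detections, max_distance=50.0):
    """Match detections in frame T to tracks from frame T-1.

    tracks:     dict of anonymous track id -> (x, y) last known position.
    detections: list of (x, y) positions in the current frame.
    Returns a list with one entry per detection: the matched track id,
    or None when the detection should start a new track.
    """
    # Each detection may also stay unmatched, so pad the options with None.
    slots = list(tracks) + [None] * len(detections)
    best_cost, best = float("inf"), ()
    for assignment in permutations(slots, len(detections)):
        cost = 0.0
        for (dx, dy), track_id in zip(detections, assignment):
            if track_id is None:
                cost += max_distance  # fixed penalty for opening a new track
                continue
            tx, ty = tracks[track_id]
            distance = hypot(dx - tx, dy - ty)
            cost += distance if distance <= max_distance else float("inf")
        if cost < best_cost:
            best_cost, best = cost, assignment
    return list(best)
```

The brute-force search grows factorially with the number of people; a larger scene would typically use a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`, and a richer cost built from features like velocity and direction.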

Then I should say that each person gets an identifier and that is an anonymized identifier. There is no facial recognition or anything of this sort.

Sam Charrington: [00:22:03] Okay.

Adina Trufinescu: [00:22:04] Then I should say that, in our pursuit of performance, we started this process running at 15 frames per second, because when you actually look closely at how people move in and out of a zone or cross a line, the action of crossing and the time the person takes to cross that line is fairly short. So we had to run it more than 15 frames per second. This is where we initially started by running the DNN for the people detection every 15th frame, still keeping it at one frame per second, and running the association algorithm every frame. The problem that we had was that the accuracy and the performance had all the typical challenges where the identity of the people will be switched or the identity of two people will be merged. These are the typical fragmentation and merging challenges with association. So, if you don’t actually run the detection on each frame, every time a person is occluded, or every time a person disappears from the frame or a new person appears, that’s when you have all these association problems of merging and fragmentation. So that was another motivation for us to go to a lighter DNN for person detection, something that we can actually run on each frame at 15 frames per second.

Sam Charrington: [00:23:31] Okay, but you mentioned that there are some parts of the problem that you do down at one frame per second?

Adina Trufinescu: [00:23:36] Right. So, just to recap, a person counting and social distancing, we keep doing it at one frame per second, and then person crossing a line and person crossing in and out of a zone, we run at 15 frames per second.

Sam Charrington: [00:23:51] Got it. The main idea there is that for counting people and counting distance, it’s not an associative problem. You’re just looking at what’s in the frame.

Adina Trufinescu: [00:24:03] Right, right.

Sam Charrington: [00:24:04] Someone bounces in or out between frames, if they’re not in the frame, you don’t count them.

But when you’re talking about entering and exiting physical spaces, you want to keep track of who was already in the space versus who wasn’t in the space in order for you to provide an accurate count. So there’s a bit more accounting that has to happen, and then you get these challenges with people disappearing because they were at the edge or something like that. That’s where you have to focus on these fragmentation and merging problems.

Adina Trufinescu: [00:24:34] Yeah, that’s right. So, imagine that for counting people or social distancing, not a whole lot happens in a second. But imagine that you have a railway station and you have a doorway where a dozen people need to pass through. At that point, you have to run people detection at a higher frame rate so that you do not lose people when they show up and you do not lose them when they disappear.

Sam Charrington: [00:25:00] Yeah. Yeah. Yeah. So you mentioned a bit about the training data challenge that you ran into there, and this is related to that last problem we talked about with entering and exiting physical spaces. Is that correct? Or is it–?

Adina Trufinescu: [00:25:18] Yeah, that’s right. So this is where ground truth was also challenging. Take videos, and these videos can be 10 minutes to one hour. Depending on which space you are using, you could have a few people or you could have a dozen or you could have a hundred people, right? So annotating that data frame by frame at 15 frames per second, that’s a lot of work. Not only that. You have to track that it is the same person from this frame across all the 15 frames per second, times this many minutes.

It’s possible but you don’t want to do that. You don’t want to ask any human to do that. So this is where–

Sam Charrington: [00:26:01] If I can just jump in. If the network isn’t tracking the people but it’s a combinatorial type of algorithm, is that a non-learned algorithm where you don’t need to train on associating people or do you also move that–

Adina Trufinescu: [00:26:22] That is not a DNN. It’s an algorithm and you don’t have to train it.

So what we are training is the people detection model, and then we are testing independently first the people detection model, and then we are testing the tracking aspect of it, and then we are testing the combinatorial algorithm.

So that’s where the ground truth needs to cover all the use cases. But then the most challenging one is the one where you have to generate ground truths that annotates each person and the anonymized identity of each person across the frames.

Sam Charrington: [00:27:04] Okay. Yeah.

I was trying to make sure that you actually had to track that because that would seem to make the data collection process quite a bit more challenging when you’re annotating the identity of folks. If we’re talking about images that look like an overhead image of Grand Central Station or something, I would imagine that to be difficult for a human annotator.

Adina Trufinescu: [00:27:26] Yeah, right. So this is where synthetics plays the same role as before. We are generating all these synthetics videos where, not only that we want to make sure that it’s the same person across the video, but you want to make sure that the padding of the people in physical spaces across the use cases is most realistic, and then you want to annotate that.

You have the different camera angles, you have the different heights and you have the lighting conditions. So trying to go into the real world to collect all that data, and then to annotate that data, that would be a real challenge. So this is where synthetics played a huge role and was a huge time saver.

Sam Charrington: [00:28:12] Where does this synthetic data come from?

Did you take an Xbox game that kind of looked like it had people in a crowd and try to use that, or did you develop a custom data generator for this problem?

Adina Trufinescu: [00:28:27] It’s pretty much the same technology that is being used for HoloLens and for mixed reality; the same kind of the technology that powers the [same] generation.

We didn’t take a game but the concept is very much game-like where you can overlay an image of actual physical space, and then you can start placing all these characters into the 3D space, and then generating the video streams out of that. Then, because you can play with the physics and then with the lighting, you can have a great variation.

That is actually what we need to assure the high quality of the AI models and of the combinatorial algorithm.

Sam Charrington: [00:29:14] Is that synthetic data approach also related to the camera placement approach that you mentioned? Are you varying the camera angle as part of this synthetic generation?

Adina Trufinescu: [00:29:25] Yeah, that’s right.

So, Computer Vision has Custom Vision, and we want people to go and create custom vision models, but to the extent that they don’t have to, and we can actually save them time by creating these high quality models which perform great in all of these conditions, we want to do that. So the goal there was that when we train and when we test, we test with data from all these various conditions. So, part of the synthetic data was to– Like the ceiling in a retail space is different than a ceiling in a manufacturing space. So this is where you need to bring in that variation.

Sam Charrington: [00:30:08] Okay. From a customer perspective, are they sending you pictures from their camera and there’s a model that figures out where their camera might be? You said that you don’t want them to have to send you measurements or anything like that. What’s the input to that process?

Adina Trufinescu: [00:30:27] So, we do not collect data from customers. In the product, none of the video that is being processed is used for training. So the way we are approaching this is visiting customers, looking and learning about their environment, and learning about the parameters of the environment such that we can simulate it. Then we also create simulations of the real world scenarios. Obviously not manufacturing but you might use something like a store layout. That’s something that you can emulate fairly easily, and then in that scenario, you have something where the camera is at 10 feet or camera is at 20 feet.

Then you’re looking at the different angles and the different areas in the store where you want to apply the person crossing zone, person crossing line. That’s how you generate the synthetics data.

Sam Charrington: [00:31:24] Got it. Okay.

Finally, you started to mention a kind of measurement and some of the challenges that measurements pose for this problem. Can you elaborate on the way you score these models and how you assess their accuracy?

Adina Trufinescu: [00:31:45] So we applied the MOT Challenge, and then we used the data set to track the accuracy of the person detection and the person tracking model. We applied the MOT Accuracy and precision formulas.
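For reference, the multi-object tracking accuracy and precision scores mentioned here are usually computed along the following lines (the CLEAR MOT definitions). This is a minimal sketch with hypothetical argument names; how misses, false positives, and identity switches are counted per frame is left to the evaluation tooling.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth_objects):
    """Multi-object tracking accuracy: 1 - (FN + FP + IDSW) / GT.

    All arguments are totals summed over every frame of the sequence.
    """
    if ground_truth_objects <= 0:
        raise ValueError("ground_truth_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth_objects


def motp(match_errors):
    """Multi-object tracking precision: mean localization error over all matches.

    match_errors: one distance (or 1 - overlap) value per matched
    ground-truth/hypothesis pair, collected over all frames.
    """
    if not match_errors:
        raise ValueError("at least one match is required")
    return sum(match_errors) / len(match_errors)
```

Because false positives and identity switches are counted against the ground-truth total, MOTA can drop below zero on a sequence where the tracker makes more errors than there are annotated objects.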

Sam Charrington: [00:32:04] MOT Challenge – Multi-Object Challenge, [inaudible]?

Adina Trufinescu: [00:32:09] Multi-Object Tracking Challenge. So, we apply the industry standards to assess the precision and accuracy of the model. But, the thing that we did a bit different was that the actual output that goes to the customer is not actually frame by frame, the result of the detection or the tracking.

What we actually send to the customer is the count of people, the distance between people, the time they spent in a zone, or the entry and exit events in the zone, such that they can calculate the dwell time. So we looked at the use cases, and we came up with accuracy measures specific to the scenario, and then we generated ground truth such that we can test holistically, not only the tracking part of the algorithm but the end-to-end algorithm, from tracking and association to applying this logic, like person crossing in and out of the zone or person crossing the line.

Sam Charrington: [00:33:11] So did you extend the challenge benchmark to your specific use cases in the higher level metrics that you’re providing to customers, or did you have a separate parallel path that was more reflective of your specific kind of use case specific numbers?

Adina Trufinescu: [00:33:30] It’s pretty much specific to the use case. To give you an example, for the person entering and exiting the zone, we looked at what we call dwell time, which is a fairly common thing that people want to measure. We created ground truth by looking at the timestamps of people entering and exiting the zone. Then we created measures for dwell time, entering, and exiting. It helped us assure that the accuracy of the end product, which is what the customer is consuming, is at a level that satisfies the customer requirements.
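A scenario-level measure like the one described could be sketched as follows: pair each person's entry and exit events to get a dwell time, then compare against dwell times derived the same way from ground-truth timestamps. The event tuple layout and function names are assumptions made for illustration, not the product's event format.

```python
def dwell_times(events):
    """events: iterable of (timestamp_seconds, person_id, kind), kind is 'enter' or 'exit'.

    Returns a dict of person_id -> total seconds spent in the zone.
    Exits without a matching entry are ignored.
    """
    entered = {}
    totals = {}
    for timestamp, person_id, kind in sorted(events):
        if kind == "enter":
            entered[person_id] = timestamp
        elif kind == "exit" and person_id in entered:
            duration = timestamp - entered.pop(person_id)
            totals[person_id] = totals.get(person_id, 0.0) + duration
    return totals


def mean_dwell_error(predicted, ground_truth):
    """Mean absolute dwell-time error over the people present in the ground truth."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    errors = [abs(predicted.get(pid, 0.0) - seconds) for pid, seconds in ground_truth.items()]
    return sum(errors) / len(errors)
```

A person the tracker missed entirely contributes their full ground-truth dwell time as error, so this measure penalizes both timing drift and lost tracks.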

Sam Charrington: [00:34:22] With these measurements in mind, did you give up a lot going from the huge DNNs to more compact DNNs and changing frame rates, and things like that? With all these things that you needed to do to deliver a product that worked in the kind of environment that you were targeting, did you lose a lot in accuracy for the measurements that you’re trying to provide?

Adina Trufinescu: [00:34:49] Not really. The goal is to gain in accuracy. You have to make trade-offs and then you have to balance. It’s always like a tug of war between accuracy and performance. Working with customers, that’s why we have these public previews. Before the public preview, we had the private preview. So, we worked closely with a set of customers to validate the accuracy of the end-to-end algorithm for their use cases. There were some learnings that we took away, and that’s how we arrived at the right trade-offs, such that the accuracy, the performance, and the cost of the end-to-end solutions all make sense.

Sam Charrington: [00:35:31] Awesome. Awesome. You presented on this at Ignite this week when you unveiled the public stage of release. Any takeaways from your presentation or the reception to it?

Adina Trufinescu: [00:35:44] So, it was well received. I would say that you stay so focused on performance and accuracy, and then the feedback that we got was very strong feedback. For instance, the distance between people, we provided only in feet. Obviously, you have to stay focused on everything that matters. I mean, we tried to move fast and everything happened so fast, and this is something that we planned during the pandemic months.

Then the six feet that you hear every day stuck with us. Then we realized that our customers need the metric system. So we had feedback like that. But at this point, we are very excited to have the customer [stride] and I’m pretty sure that there will be more learnings.

Sam Charrington: [00:36:41] Awesome. Awesome.

Well, we’ll be sure to link out to the service where folks can find it in the show notes, but thanks so much for taking the time to share with us an update on this new service, and what you’re up to.

Adina Trufinescu: [00:36:58] Yeah, it was my pleasure. Thank you for having me.

Sam Charrington: [00:37:00] Thanks, Adina.