Dex-Net and the Third Wave of Robot Learning

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Ken Goldberg is involved in several projects in collaboration with multiple organizations at UC Berkeley, including some technology-based art projects. To hear about all of them, check out the recent TWIML AI Podcast interview, The Third Wave of Robotic Learning with Ken Goldberg.


Ever thought you had a good grip on your phone and then watched in slow motion as it fell to the floor? Generally, as humans we’ve learned to gauge how to pick something up, and we usually don’t have to think about the microdecisions and movements involved. But even for us, grasping objects and maintaining stability can be difficult at times. It turns out the seemingly simple task of grasping an object is an even bigger challenge for robots, because they have to learn the physical dexterity grasping requires from zero prior knowledge. So how do we efficiently teach machines this skill?

Ken Goldberg is an engineering professor at the University of California, Berkeley, where he runs the Laboratory for Automation Science and Engineering (AUTOLAB). The lab focuses on several forms of robotic learning, including imitation, deep, and reinforcement learning, for applications ranging from surgery to agriculture. One of its major contributions in recent years is the Dexterity Network (Dex-Net), a project that generates datasets for training robust grasping models.

The Challenge of Robotic Grasping 

Researchers have been studying the problem of grasping for decades, but as Ken states, “Robots remain incredibly clumsy today. They’re much better than they were, but industrial arms, if you give them novel objects, they will drop them with a fairly high frequency.” The topic has drawn more attention in recent years with the rapid growth of e-commerce. Training robots to handle packages of various sizes and weights has massive potential for the industry, and large retailers are eager to find a solution, which has inspired efforts like the Amazon Picking Challenge (renamed the Amazon Robotics Challenge in 2017).

The act of picking something up sounds fairly simple, but because robots lack physical and perceptual context, it’s a much harder problem than it looks. “Humans and animals… seem to cope very well with a problem like grasping, and interacting with the physical world, because we bring to it a sort of inherent understanding, a deeper understanding about the nature of objects. This is very subtle. I can’t describe this exactly. It’s intuitive to us how to pick things up, but it’s very hard for us to formalize that intuition and give that to a robot.” 

According to Ken, there are three fundamental elements of uncertainty that make robot grasping extremely difficult:

Perception. Understanding the precise geometry of where everything is in a scene can be a complex task. There have been developments in depth sensors like LIDAR, “but they still don’t completely solve this problem because if there’s anything reflective or transparent on the surface, that causes the light to react in unpredictable ways, it doesn’t register as a correct position of where that surface really is.” Adding more sensors doesn’t help much because their readings often contradict one another, so “[the agent] doesn’t know what to trust” in order to act correctly. Perception is especially important in grasping because “a millimeter or less can make the difference between holding something and dropping it.”

Control. The robot has to maintain control of its grasp: “The robot has to now get its gripper to the precise position in space, consistent with what it believes is happening from its sensors.” If the gripper is slightly out of position or squeezes too tightly, the object can drop or break.

Physics. This has to do with choosing the right place to grasp the object when friction and mass are significant unknowns. To demonstrate how difficult this is, Ken gives the example of pushing a pencil across the table with your finger. We can estimate the pencil’s center of mass, but we ultimately do not know the frictional properties at play. It’s almost impossible to predict the trajectory because even “one microscopic grain of sand, anything under there is going to cause it to behave extremely differently.”

What Makes a Grasp “Robust”?

To judge the robustness of a grasp, we want to consider what happens when the perception, the control, and the understanding of the physics are all slightly off. “If you pick up a glass of wine, for example…Even if the glass isn’t quite where you thought it was, even if your hand isn’t quite where you thought it was, and even if the thing is slippery, you’re still going to be able to pick it up. That’s a robust grasp.”

What counts as a robust grasp differs from object to object, because objects vary enormously. “It turns out that for most objects, there are grasps that are more or less robust. What we’re trying to do is get a robot to learn that quality, that robustness.”

“We can generate that by using [physics and mechanics]. Actually it goes all the way back to centuries of beautiful mechanics of understanding the physics and forces and torques, or wrenches, in space that characterize what happens if we know everything. But then what we do is perturb that statistically and if it’s robust it works for all these statistical perturbations with high probability then we say it’s a robust grasp.”
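
To make the idea of statistically perturbing a grasp concrete, here is a minimal Python sketch. It is not Dex-Net’s code. The elliptical object, the noise levels, and every function name are assumptions chosen for illustration. The physics check is the standard planar condition for two frictional point contacts: the line joining the contacts must lie inside both friction cones. Robustness is then estimated as the fraction of randomly perturbed grasps that still pass that check.

```python
import math
import random

def ellipse_contact(t, a=2.0, b=1.0):
    """Return (point, inward unit normal) on the ellipse x^2/a^2 + y^2/b^2 = 1."""
    x, y = a * math.cos(t), b * math.sin(t)
    # The gradient (x/a^2, y/b^2) points outward; negate it for the inward normal.
    gx, gy = x / a**2, y / b**2
    norm = math.hypot(gx, gy)
    return (x, y), (-gx / norm, -gy / norm)

def angle_between(u, v):
    """Unsigned angle in radians between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos = dot / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, cos)))

def is_force_closure(t1, t2, mu):
    """Two-contact planar test: the line joining the contacts must lie
    inside both friction cones, whose half-angle is atan(mu)."""
    p1, n1 = ellipse_contact(t1)
    p2, n2 = ellipse_contact(t2)
    d12 = (p2[0] - p1[0], p2[1] - p1[1])
    d21 = (-d12[0], -d12[1])
    half_angle = math.atan(mu)
    return (angle_between(d12, n1) <= half_angle
            and angle_between(d21, n2) <= half_angle)

def robustness(t1, t2, mu=0.5, sigma_t=0.2, sigma_mu=0.1, n=2000, seed=0):
    """Monte Carlo estimate: fraction of perturbed grasps that still succeed.
    sigma_t perturbs where each finger lands; sigma_mu perturbs friction.
    All noise levels are assumed values, not measured ones."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n):
        mu_sample = max(0.0, rng.gauss(mu, sigma_mu))
        if is_force_closure(rng.gauss(t1, sigma_t), rng.gauss(t2, sigma_t), mu_sample):
            successes += 1
    return successes / n

# Two grasps that are both fine when everything is known exactly.
long_axis = robustness(0.0, math.pi)                  # fingers on the pointed ends
short_axis = robustness(math.pi / 2, 3 * math.pi / 2)  # fingers on the flat sides
print(long_axis, short_axis)
```

Both grasps pass the test when nothing is perturbed, but under these assumed noise levels the short-axis grasp keeps working more often, because the surface there is flatter and a small finger error barely changes the contact normal. That gap is the kind of difference a robustness score is meant to capture.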

Physics vs Statistics and The Third Wave of Robot Learning

There’s some debate in the community around the best approaches to robotic learning, which Ken breaks up into three waves. The first wave is the “classic physics” approach, which relies on a traditional understanding of physics in terms of forces, torques, friction, and mass. The second wave is the more modern, “data-driven approaches that say: ‘Forget about the physics, let’s just learn it from observation purely’” and assume the physics will be learned naturally in the process.

Then there’s what Ken advocates for: a third wave of robot learning that combines the two schools of thought. The goal is to synthesize the knowledge from both perspectives to optimize performance. However, “figuring out where that combination is is the challenge. And that’s really the story of Dex-Net.”

The Dexterity Network

The thinking behind Dex-Net was to do for robotics what the development of ImageNet did for computer vision. “ImageNet really transformed machine learning by having a very large data set of labeled images.” That dataset helped spur the development of deep learning in general, and machine vision in particular.

“The question for us was, could we do something analogous in grasping by assembling a very large data set of three-dimensional objects, three-dimensional CAD models, and then labeling them with robust grasps.” 

To create Dex-Net, they combined physics-based modeling with statistical deep learning techniques. They first applied “that whole first wave [of physics], all that beautiful theory” to a large number of simulated models to find which grasps were robust to noise and perturbations.

The Use of Depth Sensors to Produce Simulations

Pure depth sensors were used to create three-dimensional models and map the objects in space. All other information was stripped away: “I don’t care about the color of things or the texture on things. In fact, that’s a distraction.” Depth sensing makes for clean simulations and perfect models to which perturbations and noise can be applied.

In the perfect model, “I have an arrangement of points in space, and then I know when that arrangement corresponds to a successful grasp or not because I’m using the physics and statistical model of the sensor.” After the perturbations, “you have a noisy pattern of points in space, and you know what the true, robust grasp was for that pattern of points…The output is just a scalar or number from zero to one, which is the quality, we call it, the probability that that grasp will succeed.” 
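
The shape of one training example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project’s pipeline: the tiny depth map, the Gaussian sensor noise, and the names `noisy_copy`, `toy_label`, and `make_example` are all hypothetical, and `toy_label` is a crude stand-in for the physics-based robustness computation. The point it shows is the pairing: the input is the noisy observation, while the label is computed from the clean model.

```python
import random

def noisy_copy(clean_depth, sensor_sigma, rng):
    """Simulate sensor noise by adding zero-mean Gaussian error to every pixel."""
    return [[d + rng.gauss(0.0, sensor_sigma) for d in row] for row in clean_depth]

def toy_label(clean_depth, grasp, table_depth=1.0, jitter=1, n=200, seed=0):
    """Toy stand-in for the physics-based label: the fraction of jittered
    grasp centers that still land on the object in the clean model."""
    rng = random.Random(seed)
    rows, cols = len(clean_depth), len(clean_depth[0])
    r, c = grasp
    hits = 0
    for _ in range(n):
        rr = min(max(r + rng.randint(-jitter, jitter), 0), rows - 1)
        cc = min(max(c + rng.randint(-jitter, jitter), 0), cols - 1)
        if clean_depth[rr][cc] < table_depth:  # closer than the table means object
            hits += 1
    return hits / n

def make_example(clean_depth, grasp, rng, sensor_sigma=0.005):
    """One labeled example: noisy input, grasp, and a quality in [0, 1]."""
    return {
        "depth": noisy_copy(clean_depth, sensor_sigma, rng),
        "grasp": grasp,
        "quality": toy_label(clean_depth, grasp),
    }

# Clean model: a flat table at depth 1.0 with a 4x4 block at depth 0.8.
clean = [[1.0] * 10 for _ in range(10)]
for r in range(3, 7):
    for c in range(3, 7):
        clean[r][c] = 0.8

rng = random.Random(0)
dataset = [make_example(clean, (r, c), rng) for r in range(10) for c in range(10)]
print(len(dataset), dataset[44]["quality"])
```

Because everything here is simulated, producing more examples is just a matter of looping over more objects, grasps, and noise draws.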

They’re able to generate millions of these examples fairly quickly (overnight), producing a solid data set to train with. When the machine is shown objects that it has never seen before, it can evaluate the quality of the grasps. “Then what I do is I try a number of different grasps synthetically on that depth map, and it tells me this is the one with highest quality…we consider that the optimal grasp and we execute it. Here’s the thing: It works remarkably well, far better than we thought.” 
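
The try-many-grasps-and-keep-the-best step Ken describes can be sketched as follows. This is a hedged illustration only: `toy_quality` is a hand-written heuristic standing in for the trained network, and the depth map, the uniform random sampling, and all names are assumptions made for the example.

```python
import math
import random

def toy_quality(depth, grasp, jaw_half_width=2):
    """Stand-in for a trained model: maps (depth map, grasp) to a score in (0, 1).
    It rewards grasps whose center sits on something raised above both jaw tips."""
    r, c, theta = grasp
    dr = round(jaw_half_width * math.sin(theta))
    dc = round(jaw_half_width * math.cos(theta))
    rows, cols = len(depth), len(depth[0])

    def at(rr, cc):
        # Clamp to the image so jaws near the border read the nearest pixel.
        return depth[min(max(rr, 0), rows - 1)][min(max(cc, 0), cols - 1)]

    # Positive clearance: both jaw tips are farther from the camera than the center.
    clearance = min(at(r + dr, c + dc), at(r - dr, c - dc)) - at(r, c)
    return 1.0 / (1.0 + math.exp(-10.0 * clearance))

def plan_grasp(depth, quality_fn, n_candidates=200, seed=0):
    """Sample candidate grasps (row, col, angle), score each one on the
    depth map, and return the (quality, grasp) pair with the highest score."""
    rng = random.Random(seed)
    rows, cols = len(depth), len(depth[0])
    candidates = [
        (rng.randrange(rows), rng.randrange(cols), rng.uniform(0.0, math.pi))
        for _ in range(n_candidates)
    ]
    return max((quality_fn(depth, g), g) for g in candidates)

# Depth map: table at depth 1.0 with a long, thin bar at depth 0.8.
depth = [[1.0] * 12 for _ in range(10)]
for r in (4, 5):
    for c in range(2, 10):
        depth[r][c] = 0.8

best_quality, best_grasp = plan_grasp(depth, toy_quality)
print(best_quality, best_grasp)
```

Passing the scorer in as `quality_fn` keeps the planner independent of how quality is computed, so a learned model could replace the heuristic without changing `plan_grasp`.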

Limitations, Improvements and Applications

Those robust examples were then used to train a deep learning system that could generalize to new examples. The system generalized surprisingly well, but as Ken points out, it’s not perfect. The team was able to reach over a 90% success rate, but that was subject to the nature of the objects. “If the objects are all fairly well-behaved like cylinders and cuboids, then it’s fairly easy to do well, but when you have more complex geometries many systems have trouble.” The system still performed well with irregular objects, but did not get close to 100% success. 

Another limitation is that if you change the gripper or sensor, the framework still applies, but you have to retrain the neural network for the new hardware. This is where providing an open dataset and code examples comes in. These can be used to train new grasping models specific to new types of grippers or objects.

For an example of Dex-Net in action, check out this video Sam shot at last year’s Siemens Spotlight on Innovation event:

In the full interview, Sam and Ken discuss the wide variety of projects he and his lab are working on, from telemedicine to agriculture to art. The conversation on applications picks up at 24:51 in the podcast. Enjoy!
