Upside Down Reinforcement Learning

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Jürgen and his research lab at IDSIA are involved in a number of fascinating projects. To hear about their work beyond Upside-Down Reinforcement Learning, check out the full interview! [TWiML Talk #357 with Jürgen Schmidhuber].


Jürgen Schmidhuber is co-founder and chief scientist of NNAISENSE, scientific director at IDSIA (Swiss AI Lab), and professor of AI at USI and SUPSI in Switzerland. His lab is well known for creating the Long Short-Term Memory (LSTM) network, which has become a prevalent model in smartphones. This was the subject of our last conversation with Jürgen back in 2017, when he hosted Sam at his lab in Lugano, where they recorded “LSTM’s, plus a Deep Learning History Lesson.”

The Age of Active AI 

Jürgen and his colleagues have been busy pursuing several new projects since that discussion, with the goal of ushering in what Jürgen calls the Age of Active AI. The term refers to the development of machine learning from passive pattern recognition to actively translating data into actions. One example is their team’s recent work with Festo, a leading robot maker. Festo has developed bionic hands controlled by air muscles.  “You have a compressor, which is generating the pressure that you need to move the fingers around. Nobody knows how to control that, and so NNAISENSE is building the brains that learn to control these soft robot hands.”

The Role of Traditional Reinforcement Learning 

One of the common threads underlying Active AI is reinforcement learning (RL). In RL, the agent learns which actions to take in its environment based on the numerical rewards it receives. The agent learns through trial and error: it samples various actions, observes which ones produce the most reward, and adjusts its future behavior accordingly.

“Traditional reinforcement learning works like this: You have a neural network or some other adaptive machine, which takes in data from the environment and produces actions, which change the environment, which means new inputs are coming in from the environment.

That is something you want to maximize, which is the cumulative expected reward. So when the robot is acting in a good way and it gets a reward, and it wants to maximize that by making some of these connections in the network stronger and others weaker, such that actions come out in response to the inputs that lead to a lot of success.”

Jürgen gives the example of an RL demo project using small Audis one-eighth the size of the real vehicle. “Whenever one of these little cars bumps against an obstacle, it gets negative reward. Whenever it reaches the parking lot, it gets a positive reward. As always in reinforcement learning, it’s trying to minimize pain and maximize pleasure. So it has to translate the incoming video and other data from the radar or LIDAR sensors, into actions that lead to successful parking strategies. That was the first time something like that was done in the real world.”
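To make the parking example concrete, here is a minimal sketch of a reward signal and the cumulative reward the agent tries to maximize. The specific values (-1 for a collision, +1 for parking, 0 otherwise) and the function names are assumptions for illustration; the interview does not give the actual reward scheme used in the demo.

```python
def step_reward(bumped: bool, parked: bool) -> float:
    """Hypothetical per-step reward: penalize collisions, reward parking."""
    if bumped:
        return -1.0  # assumed penalty for hitting an obstacle
    if parked:
        return 1.0   # assumed bonus for reaching the parking spot
    return 0.0       # nothing notable happened this step


def episode_return(events) -> float:
    """Sum the per-step rewards over one episode.

    `events` is a list of (bumped, parked) pairs, one per time step.
    """
    return sum(step_reward(bumped, parked) for bumped, parked in events)


# One collision followed by a successful park nets out to zero.
total = episode_return([(False, False), (True, False), (False, True)])
```

A traditional RL agent adjusts its network weights so that the expected value of this sum goes up over many episodes.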

Lack of Sample Efficiency in Reinforcement Learning  

One challenge characteristic of deep reinforcement learning is that it requires a lot of data for models to exhibit the desired behavior. Practically speaking, this data is often collected via simulation environments. While simulation can be costly, collecting data in the real world is much more so, not to mention potentially dangerous. Consider the Audi example above: working with actual life-size cars would be expensive and potentially hazardous.

One reason for the sample inefficiency described above is that traditional deep reinforcement learning algorithms throw away a lot of data. For example, the agent typically only remembers those actions it took that resulted in large rewards.

How Upside-Down RL Solves for Sample Efficiency

Upside-Down RL turns things around. Instead of predicting rewards as in traditional RL, the “rewards are becoming the inputs to a network which produces the output actions, and which also sees the standard inputs coming from the environment.” UDRL observes commands given in the form of desired rewards and time steps.

For example, a training sequence might begin with a command like: within X amount of time (e.g. 10 minutes), we want Y amount of reward. “Now, the network sees this command and tries to come up with actions that between this time and this time, lead to so much reward. It just tries to obey the command.”
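As a rough sketch of what "rewards as inputs" means, the command can simply be appended to the observation before it is fed to the network. The `Command` class, its field names, and `policy_input` are hypothetical names chosen for this illustration, not taken from the papers or the interview.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Command:
    """Hypothetical command: how much reward is wanted, and within how many steps."""
    desired_return: float
    desired_horizon: int


def policy_input(observation: Sequence[float], command: Command) -> List[float]:
    """Concatenate the environment observation with the command.

    The resulting vector is what a command-conditioned network would
    receive; its output would be an action.
    """
    return list(observation) + [command.desired_return, float(command.desired_horizon)]


# "Within 10 time steps, we want a total reward of 5."
x = policy_input([0.1, 0.2], Command(desired_return=5.0, desired_horizon=10))
```

The network itself can be any ordinary function approximator; the only change from a standard policy is the two extra input values.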

The beauty in UDRL is that if the system returns with less reward than desired within those 10 minutes, it’s okay because valuable learning is still taking place. Jürgen says, “from each of its failures to satisfy the command, it learns something. So it learns each time, something new about how to map commands to action sequences. Over time it will learn the structure of this space, of parameters that you need to translate these commands into these behaviors.” Even by failing, the model is constantly learning and adjusting.

So the model has a self-generated action sequence that it can remember and be trained on. Eventually, the model understands enough about the parameters and “the functions mapping word commands to output actions, such that it can generalize and [produce] desirable behavior.” 
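One way to picture how a "failed" episode still yields training data: whatever reward the agent actually collected can be treated, in hindsight, as the command it successfully obeyed. The sketch below turns one recorded episode into supervised examples under that assumption. The function name and the choice to always measure to the end of the episode are simplifications for illustration, not a description of the team's exact procedure.

```python
def relabel_episode(observations, actions, rewards):
    """Turn one episode into (observation, achieved_return, horizon, action) tuples.

    For each time step t, the "command" is what actually happened from t
    onward: the reward collected until the episode ended, and the number
    of steps that took. The action taken at t becomes the training target.
    """
    if not (len(observations) == len(actions) == len(rewards)):
        raise ValueError("observations, actions and rewards must align")

    examples = []
    remaining = sum(rewards)  # reward still to come, starting from step 0
    total_steps = len(rewards)
    for t in range(total_steps):
        examples.append((observations[t], remaining, total_steps - t, actions[t]))
        remaining -= rewards[t]  # reward at step t is no longer in the future
    return examples


# The agent may have been asked for a return of 10 but only achieved 3;
# the episode is still a valid example of "how to get 3 in 3 steps".
data = relabel_episode(["s0", "s1", "s2"], ["a0", "a1", "a2"], [1.0, 0.0, 2.0])
```

Each tuple is an ordinary supervised example: the inputs are the observation plus the achieved command, and the label is the action.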

Connection to Supervised Learning, Limitations and Future Research

Upside Down Reinforcement Learning is still in the early stages. In the papers, Training Agents using Upside-Down Reinforcement Learning and Reinforcement Learning Upside Down: Don’t Predict Rewards – Just Map Them to Actions, Jürgen and his team discuss their experiments in comparing the approach with traditional RL algorithms, stating that “Remarkably, a simple UDRL pilot version already outperforms traditional RL methods on certain challenging problems.”

There are still some constraints to the approach. Jürgen notes, “its limitations are exactly the limitations of supervised learning, because basically we are translating reinforcement learning into supervised learning, in a way that depends on these deep networks that have to learn complicated mapping from reward commands to action sequences.”

Because of the similarities UDRL shares with supervised learning, many of the same techniques for model design, regularization, and training can be applied to improve it. Applications for UDRL are still being explored, but the approach offers great promise in solving some interesting RL and SL challenges.


Be sure to check out the full interview where you can also find other resources mentioned during the podcast. 
