For more on the latest advancements in reinforcement learning, check out the podcast that inspired this article—[TWIML AI Podcast #355 with Sergey Levine].  In addition to discussing his team’s work on off-policy reinforcement learning, he also updates us on their efforts in model-based and causal RL.

Typically, reinforcement learning involves an agent that interacts with the world, improves its policy, and then continues to interact with the world. This is a very active, online process.

If we want models learned in this manner to generalize, however, they need to be trained on large amounts of data, which is not always easy or scalable to collect. In robotics, where the behavior of the agent and environment is governed by the laws of physics, we might turn to simulation. But what about the case where we want to train an agent to make marketing decisions in an e-commerce environment? Building a simulator for that environment is no trivial feat, and why should we even need to, if we’ve already got a lot of data about how customers have historically responded to offers?

To address these types of scenarios, Sergey and his team are working on off-policy or offline reinforcement learning, also referred to as batch reinforcement learning. “The basic idea is that you have some data, and in the most extreme version, you’re not even allowed to collect any more data. That data is all you’ve got and you just have to extract the best policy you can out of it.”
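To make the setting concrete, here is a minimal sketch of what “offline” means in code: the learner only ever touches a fixed dataset of logged transitions, and there is no environment to query. The names here (`Transition`, `fit_policy`) are illustrative placeholders, not from any of Levine’s codebases.

```python
# A minimal sketch of the offline (batch) RL setting: the agent sees only a
# fixed dataset of logged transitions and never calls env.step() itself.
# All names here are illustrative, not from Levine's work.

import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool

def fit_policy(dataset: List[Transition], num_epochs: int = 10):
    """Train on the logged data only; no new interaction is allowed."""
    for epoch in range(num_epochs):
        random.shuffle(dataset)
        for batch_start in range(0, len(dataset), 32):
            batch = dataset[batch_start:batch_start + 32]
            # <- update a Q-function / policy from `batch` here.
            #    Crucially, there is no env.step() anywhere in this loop.
            pass

# The dataset might come from logs of a previous policy or from humans:
logged_data = [Transition((0.0,), 1, 0.5, (1.0,), False)]
fit_policy(logged_data)
```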

Complications with Offline Reinforcement Learning

Some problems that are at first difficult to understand arise when using offline data. Standard RL methods like Q-learning (a model-free RL algorithm) should in theory be applicable to batch settings, yet in practice the algorithms perform very poorly. Intuitively, this looks like an overfitting problem, but the same issues occur even when more data is added.

The real complication is that Q-learning, by its very structure, performs badly if it’s not allowed to interact with the world on its own and learn from its experiences. As Sergey puts it,

“In Q-learning you’re making counterfactual queries. People often don’t realize this…It comes up when you calculate a target value…you took this action, you got this reward and you’re going to land in this state, but then you’re going to run a different policy, not the one that you used in the data…you don’t get to actually run that policy, you just have to ask your Q-function.”

This is problematic in offline settings because the Q-function will only generalize if the new action you plug in comes from the same distribution as the actions taken in the data; it will not generalize to actions from a different distribution. When you optimize, your policy ends up seeking out actions for which your Q-function incorrectly predicts a high value, essentially exploiting the Q-function with adversarial actions. And because the agent never interacts with the world, it never learns that the nonsensical action it suggested was bad.

“So it’s not an over-fitting problem. It’s actually this counterfactual, out of distribution action problem and once you recognize it for what it is, then you can actually study possible solutions.”
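To see where the counterfactual query sneaks in, here is a toy illustration (not Sergey’s code) of the standard Q-learning target. With an approximate Q-function, the max over next actions can land on an action the dataset never contains, and its value is pure extrapolation that offline training never gets to correct.

```python
# Toy illustration of the counterfactual query inside the Q-learning target.
# The Q-function here is just a random linear function standing in for a
# learned network; the point is where max over actions appears.

import numpy as np

num_actions = 4
gamma = 0.99

rng = np.random.default_rng(0)
W = rng.normal(size=(2, num_actions))  # pretend Q-network weights

def q_values(state):
    return state @ W  # Q(s, a) for all actions, shape (num_actions,)

# One logged transition (s, a, r, s'); suppose the behavior policy
# only ever took actions 0 or 1.
s = np.array([0.5, -1.0])
a, r = 0, 1.0
s_next = np.array([0.2, 0.3])

# Bellman target: r + gamma * max_a' Q(s', a').
# The argmax below may select a' = 2 or 3, actions never present in the data;
# their Q-values are out-of-distribution guesses, yet they drive the target.
q_next = q_values(s_next)
best_next_action = int(np.argmax(q_next))
target = r + gamma * q_next[best_next_action]
print(f"chose a'={best_next_action}, target={target:.3f}")
```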

To address this challenge, Levine’s team found that a particular formulation of the policy constraint can alleviate the issue, the details of which can be found in their paper, Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Their solution works well on many standard benchmark problems, and they will continue researching its use for actual robotics tasks.
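For a rough sense of what a policy constraint looks like in practice, here is a simplified sketch; it is not the exact formulation from the paper. The idea is to penalize the policy for proposing actions far from those seen in the data, here using a kernel MMD estimate between sampled policy actions and logged actions. The names `q_value`, `alpha`, and `constrained_actor_objective` are illustrative placeholders.

```python
# A simplified, illustrative policy-constrained actor objective: maximize the
# Q-value while penalizing divergence from the actions observed in the data.

import numpy as np

def mmd(x, y, sigma=1.0):
    """Gaussian-kernel MMD^2 between two sets of action samples."""
    def k(a, b):
        d = a[:, None, :] - b[None, :, :]
        return np.exp(-np.sum(d ** 2, axis=-1) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def constrained_actor_objective(policy_actions, dataset_actions, q_value, alpha=10.0):
    """Maximize Q while keeping policy actions close to the data's support."""
    return q_value - alpha * mmd(policy_actions, dataset_actions)

# Toy usage: policy actions that drift far from the logged ones get penalized.
logged = np.random.default_rng(0).normal(0.0, 0.3, size=(64, 2))
proposed = logged + 2.0  # far from the data -> large MMD penalty
print(constrained_actor_objective(proposed, logged, q_value=5.0))
```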

Additional Applications

Having a fully data-driven approach to reinforcement learning is one of the exciting implications of this work. For example, in medicine, “you don’t want to run a reinforcement learning agent to interact with real patients, but maybe we can get some logs. Maybe applications for e-commerce, for educational support agents, decision making support, that sort of thing.” These are areas where online data collection can be challenging, costly, or risky. Sergey is super excited about these real-world applications of the research.

The off-policy RL paper is just one of several papers discussed in our recent interview with Sergey on Advancements in Reinforcement Learning. Be sure to check out the full interview. On the show notes page, you’ll also find the other 11 (!) papers Sergey and his team at Berkeley shared at last year’s NeurIPS conference, along with other resources mentioned during the podcast.