Reinforcement Learning
Core Lesson:
Use this link for Sutton and Barto instead:
For Tabular RL & Policy Improvement: I would definitely recommend doing the suggested readings before you get started.
Once you're done with this chapter, read up to (but not including) the section titled "Common Approaches" in Lilian Weng's post. Also check out the two linked resources from 2017. Keep referencing these resources as needed.
Core Lesson:
In DQN, really internalize that the target network predicts the Q-value for the next state while the Q-network predicts the Q-value for the current state! This is probably very obvious, but I didn't quite register it on my first pass. I ended up spending a day banging my head against the wall trying to figure out how in the world DQN worked if both networks were predicting the Q-value for the same state, and the target was almost guaranteed to be worse.
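To make the distinction concrete, here is a minimal PyTorch sketch (the variable names and the stand-in linear layers are my own, not the lesson's code): the Q-network scores the current state-action pair, while the target network only ever scores the next state to build the bootstrapped TD target.

```python
# The Q-network scores the CURRENT state; the target network scores the NEXT state.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Linear(obs_dim, n_actions)            # stand-in for the real Q-network
target_net = nn.Linear(obs_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())   # lagged copy of the Q-network

# fake batch of transitions (s, a, r, s', done)
state = torch.randn(32, obs_dim)
action = torch.randint(0, n_actions, (32, 1))
reward = torch.randn(32)
next_state = torch.randn(32, obs_dim)
done = torch.zeros(32)

q_pred = q_net(state).gather(1, action).squeeze(1)        # Q(s, a): prediction for the CURRENT state
with torch.no_grad():
    q_next = target_net(next_state).max(dim=1).values     # max_a' Q_target(s', a'): the NEXT state
    td_target = reward + gamma * (1 - done) * q_next      # the "label" the Q-network chases
loss = F.mse_loss(q_pred, td_target)                      # TD error drives the gradient step
```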
Progress Check
Understand that DQN is just an extension of Q-learning to deep learning. Instead of using a table to store all the Q-values for each state-action pair, we train a neural network to learn this function for us.
Give a conceptual overview of the steps of DQN.
use some kind of policy (usually epsilon-greedy) to pick actions and interact with the environment
store these experiences in the replay buffer
use the values from the replay buffer to calculate a TD error (temporal difference error) between a target network and the Q-network (where the Q-network provides the prediction for the current Q-value and the target network provides the prediction for the Q-value after the next step)
update the Q-network via gradient descent
every so often, copy the Q-network weights over to the target network (all five steps are sketched in code below)
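For reference, here is one way those five steps might fit together in a single loop, assuming a Gymnasium CartPole environment; the layer sizes, schedules, and hyperparameters are illustrative guesses rather than the lesson's reference implementation.

```python
# Compressed DQN loop: act (epsilon-greedy) -> store -> sample + TD error -> gradient step -> sync target.
import random
from collections import deque
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)                                   # replay buffer
gamma, eps, batch_size, copy_every = 0.99, 0.1, 64, 500

obs, _ = env.reset()
for step in range(1, 5_001):
    # 1) act with an epsilon-greedy policy
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        action = q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    # 2) store the experience in the replay buffer
    buffer.append((obs, action, reward, next_obs, float(terminated)))
    obs = next_obs if not (terminated or truncated) else env.reset()[0]
    if len(buffer) < batch_size:
        continue
    # 3) sample a batch and compute the TD error (same target math as the earlier snippet)
    s, a, r, s2, d = map(lambda x: torch.as_tensor(np.array(x), dtype=torch.float32),
                         zip(*random.sample(buffer, batch_size)))
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    # 4) update the Q-network via gradient descent
    loss = F.mse_loss(q_pred, td_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5) every so often, copy the Q-network weights into the target network
    if step % copy_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```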
Why do you even need a target network?
Ans: Remember how in SARSA you were learning from the next action? This is the same idea (and the first DQN paper actually didn't have a target network at all). Learning turned out to be very unstable because the Q-network was always changing, so the labels it was chasing were always changing too. Keeping a lagged copy of the Q-network gives it a consistent target to optimize towards, while still updating that copy often enough that the targets improve and don't get left behind by the Q-network.
Is DQN suited for continuous action spaces?
No, because the Q-network takes in a state and returns Q-values (estimated future reward) for each discrete action, and acting means taking an argmax over those outputs. It's not even great for large discrete action spaces, since you still need one output per action.
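In code, the limitation is literally the shape of the output head (sizes below are just illustrative): one scalar per discrete action, with action selection as an argmax over that vector, which has no analogue for a real-valued action like a torque.

```python
# DQN's head emits one Q-value per DISCRETE action; acting = argmax over that fixed-size vector.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 6))  # 6 discrete actions
q_values = q_net(torch.randn(1, 4))      # shape (1, 6): one estimate per action
greedy_action = q_values.argmax(dim=1)   # fine for 6 actions; impossible to enumerate for continuous actions
```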
Core Lesson:
For me, PPO felt like it came out of left field, and I was incredibly confused when I first went through the PPO lesson. It took spending more time learning about policy gradients upfront to really grok things.
Before you get started, read the rest of Lilian Weng's post, and read up on the math if you want to go deeper.
Other good resources if the two above don't make much sense:
A different Lilian Weng post, specifically on policy gradients
Even though learning the same information through different lenses can often help (and I usually recommend it), you should also be wary of taking this to the extreme. Once you have a solid mental model, move into the lesson.
Progress Check
Describe the interplay between the actor and critic networks.
The actor network needs the critic's value estimates to form the advantage it uses for its policy-gradient (ascent) update, and the critic is learning the value function of the actor's current policy. Basically, the actor network learns the policy and the critic network learns the value function V(s).
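A rough sketch of that interplay (my own simplification, using a one-step TD advantage rather than the GAE that PPO implementations typically use): the critic's V(s) turns rewards into advantages, and those advantages weight the actor's log-probability gradient.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Linear(obs_dim, n_actions)    # outputs action logits: the policy
critic = nn.Linear(obs_dim, 1)           # outputs V(s) for the actor's current policy

# fake batch of transitions collected by the current policy
state = torch.randn(8, obs_dim)
next_state = torch.randn(8, obs_dim)
action = torch.randint(0, n_actions, (8,))
reward = torch.randn(8)

value = critic(state).squeeze(1)                           # critic's estimate of V(s)
with torch.no_grad():
    td_target = reward + gamma * critic(next_state).squeeze(1)
advantage = (td_target - value).detach()                   # how much better than expected the action was
logp = Categorical(logits=actor(state)).log_prob(action)
actor_loss = -(advantage * logp).mean()                    # gradient ASCENT on expected return
critic_loss = (td_target - value).pow(2).mean()            # regression of V(s) toward the TD target
```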
What is the major innovation of PPO over TRPO?
PPO replaces TRPO's hard KL-divergence trust-region constraint (which requires second-order / conjugate-gradient machinery) with a clipped surrogate objective, so the policy can be updated with ordinary first-order gradient ascent. The result is a much simpler optimization process with similar stability.
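For concreteness, here is a sketch of the clipped surrogate loss (random tensors stand in for real rollout data): the probability ratio is clipped so a single update can't move the policy too far from the one that collected the data.

```python
import torch

eps_clip = 0.2
logp_new = torch.randn(8)                  # log pi_theta(a|s) under the policy being updated
logp_old = torch.randn(8)                  # log pi_theta_old(a|s), stored at rollout time
advantage = torch.randn(8)

ratio = torch.exp(logp_new - logp_old)                                # r_t(theta)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
ppo_loss = -torch.min(unclipped, clipped).mean()                      # maximize the pessimistic (min) objective
```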
Why did you do a shared network for actor and critic only for Atari (and not for Cartpole or Mujoco)?
Ans: Atari games have massive observation spaces (stacks of raw pixel frames) while Cartpole and Mujoco have much smaller ones (4 and 11 dimensions respectively). The shared network builds a high-level representation of the state, and the actor and critic branches that diverge at the end can both work off that same representation.
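This is roughly what a shared trunk looks like for Atari-sized observations (the usual 84x84x4 frame stack; the layer sizes follow the common Nature-DQN-style CNN but should be treated as illustrative): one trunk builds the representation, and the actor and critic heads diverge only at the end.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        # shared trunk: turns the stacked frames into a single high-level representation
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # actor and critic branch off the same representation at the very end
        self.actor = nn.Linear(512, n_actions)   # policy logits
        self.critic = nn.Linear(512, 1)          # V(s)

    def forward(self, frames: torch.Tensor):
        h = self.trunk(frames / 255.0)
        return self.actor(h), self.critic(h)

logits, value = SharedActorCritic(n_actions=6)(torch.zeros(1, 4, 84, 84))
```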
What is the difference between on-policy and off-policy RL? What is the difference in how they update?
Ans: On-policy: The agent learns the value of the policy it is currently following. Basically, the agent learns directly from the actions it is actually taking. Off-policy: The agent learns the value of the optimal policy independent of the actions taken by the current policy. The agent learns about a policy different than the one currently being followed.
Update ans: On-policy (e.g., SARSA): the update bootstraps off the value of the action the current policy actually takes at the next step. Off-policy (e.g., Q-learning): the update bootstraps off the best (greedy) action at the next step, regardless of which action the behavior policy actually selected.
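A tabular side-by-side makes the update difference concrete (toy numbers, my own example): SARSA bootstraps off the action the behavior policy actually takes next, while Q-learning bootstraps off the greedy action regardless of what was actually taken.

```python
import numpy as np

Q = np.zeros((5, 2))                          # tiny state-action table
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0     # a_next is what the behavior policy actually chose

# SARSA (on-policy): target uses Q[s_next, a_next], the action actually taken
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Q-learning (off-policy): target uses max_a' Q[s_next, a'], ignoring a_next
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```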
What is the difference between model-based and model-free RL?
Model-based algorithms have access to (or learn) a model of the environment, i.e. the transition probability distribution and the reward function, and can plan against it; model-free algorithms skip the model entirely and learn value functions or policies directly from sampled interaction with the world.
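One way to see the split in code (a toy two-state MDP that is entirely made up): the model-based update sweeps over the known transition and reward model, while the model-free update only ever touches a sampled transition.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']: known model
R = np.random.rand(n_states, n_actions)                                 # R[s, a]: known rewards

# Model-based: one value-iteration sweep that uses P and R directly (planning, no interaction)
V = np.zeros(n_states)
V = (R + gamma * P @ V).max(axis=1)

# Model-free: a single Q-learning update from one SAMPLED transition (s, a, r, s')
Q = np.zeros((n_states, n_actions))
s, a = 0, 1
s_next = np.random.choice(n_states, p=P[s, a])   # the agent only sees the sample, never P itself
r = R[s, a]
Q[s, a] += 0.1 * (r + gamma * Q[s_next].max() - Q[s, a])
```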