Reinforcement Learning
Core Lesson:
Use this link for Sutton and Barto instead:
For Tabular RL & Policy Improvement: I would definitely recommend doing the suggested readings before you get started.
Once you're done with this chapter, read up to (but not including) the section titled "Common Approaches" in Lilian Weng's post. Also check out the two linked resources from 2017. Keep referencing these resources as needed.
Core Lesson:
In DQN, really internalize that the target network predicts the Q-value for the next state while the Q-network predicts the Q-value for the current state! This is probably very obvious, but I didn't quite register it on my first pass. I ended up spending a day banging my head against the wall trying to figure out how in the world DQN worked if both networks were predicting the Q-value for the same state, and the target was almost guaranteed to be worse.
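To make the distinction concrete, here is a minimal PyTorch sketch (the variable names and the stand-in linear layers are my own, not the lesson's code): the Q-network scores the current state-action pair, while the target network only ever scores the next state to build the bootstrapped TD target.

```python
# The Q-network scores the CURRENT state; the target network scores the NEXT state.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Linear(obs_dim, n_actions)            # stand-in for the real Q-network
target_net = nn.Linear(obs_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())   # lagged copy of the Q-network

# fake batch of transitions (s, a, r, s', done)
state = torch.randn(32, obs_dim)
action = torch.randint(0, n_actions, (32, 1))
reward = torch.randn(32)
next_state = torch.randn(32, obs_dim)
done = torch.zeros(32)

q_pred = q_net(state).gather(1, action).squeeze(1)        # Q(s, a): prediction for the CURRENT state
with torch.no_grad():
    q_next = target_net(next_state).max(dim=1).values     # max_a' Q_target(s', a'): the NEXT state
    td_target = reward + gamma * (1 - done) * q_next      # the "label" the Q-network chases
loss = F.mse_loss(q_pred, td_target)                      # TD error drives the gradient step
```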
Progress Check
Understand that DQN is just an extension of Q-learning to deep learning. Instead of using a table to store all the Q-values for each state-action pair, we train a neural network to learn this function for us.
Give a conceptual overview of the steps of DQN.
use some kind of policy (usually epsilon-greedy) to pick actions and interact with the environment
store these experiences in the replay buffer
use the values from the replay buffer to calculate a TD error (temporal difference error) between a target network and the Q-network (where the Q-network provides the prediction for the current Q-value and the target network provides the prediction for the Q-value after the next step)
update the Q-network via gradient descent
every so often, copy the Q-network weights over to the target network (all five steps are sketched in code below)
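For reference, here is one way those five steps might fit together in a single loop, assuming a Gymnasium CartPole environment; the layer sizes, schedules, and hyperparameters are illustrative guesses rather than the lesson's reference implementation.

```python
# Compressed DQN loop: act (epsilon-greedy) -> store -> sample + TD error -> gradient step -> sync target.
import random
from collections import deque
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)                                   # replay buffer
gamma, eps, batch_size, copy_every = 0.99, 0.1, 64, 500

obs, _ = env.reset()
for step in range(1, 5_001):
    # 1) act with an epsilon-greedy policy
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        action = q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    # 2) store the experience in the replay buffer
    buffer.append((obs, action, reward, next_obs, float(terminated)))
    obs = next_obs if not (terminated or truncated) else env.reset()[0]
    if len(buffer) < batch_size:
        continue
    # 3) sample a batch and compute the TD error (same target math as the earlier snippet)
    s, a, r, s2, d = map(lambda x: torch.as_tensor(np.array(x), dtype=torch.float32),
                         zip(*random.sample(buffer, batch_size)))
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    # 4) update the Q-network via gradient descent
    loss = F.mse_loss(q_pred, td_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5) every so often, copy the Q-network weights into the target network
    if step % copy_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```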
Why do you even need a target network?
Ans: Remember how in SARSA you were learning from the next action? This is the same idea (and the first DQN paper actually didn't have a target network at all). Learning turned out to be very unstable because the Q-network was always changing, so the labels it was chasing were always changing too. Keeping a lagged copy of the Q-network gives it a consistent target to optimize towards, while still updating that copy often enough that the targets improve and don't get left behind by the Q-network.
Is DQN suited for continuous action spaces?
No, because the Q-network takes in a state and returns Q-values (estimated future reward) for each discrete action, and acting means taking an argmax over those outputs. It's not even great for large discrete action spaces, since you still need one output per action.
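In code, the limitation is literally the shape of the output head (sizes below are just illustrative): one scalar per discrete action, with action selection as an argmax over that vector, which has no analogue for a real-valued action like a torque.

```python
# DQN's head emits one Q-value per DISCRETE action; acting = argmax over that fixed-size vector.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 6))  # 6 discrete actions
q_values = q_net(torch.randn(1, 4))      # shape (1, 6): one estimate per action
greedy_action = q_values.argmax(dim=1)   # fine for 6 actions; impossible to enumerate for continuous actions
```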
Core Lesson:
For me, PPO felt like it came out of left field, and I was incredibly confused when I first went through the PPO lesson. It took spending more time learning about policy gradients upfront to really grok things.
Before you get started, read the rest of Lilian Weng's post, and read up on the math if you want to go deeper.
Other good resources if the two above don't make much sense:
A different Lilian Weng post, specifically on policy gradients
Even though learning the same information through different lenses can often help (and I usually recommend it), you should also be wary of taking this to the extreme. Once you have a solid mental model, move into the lesson.
Progress Check
Describe the interplay between the actor and critic networks.
The actor network needs the critic's value estimates to form the advantage it uses for its policy-gradient (ascent) update, and the critic is learning the value function of the actor's current policy. Basically, the actor network learns the policy and the critic network learns the value function V(s).
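A rough sketch of that interplay (my own simplification, using a one-step TD advantage rather than the GAE that PPO implementations typically use): the critic's V(s) turns rewards into advantages, and those advantages weight the actor's log-probability gradient.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Linear(obs_dim, n_actions)    # outputs action logits: the policy
critic = nn.Linear(obs_dim, 1)           # outputs V(s) for the actor's current policy

# fake batch of transitions collected by the current policy
state = torch.randn(8, obs_dim)
next_state = torch.randn(8, obs_dim)
action = torch.randint(0, n_actions, (8,))
reward = torch.randn(8)

value = critic(state).squeeze(1)                           # critic's estimate of V(s)
with torch.no_grad():
    td_target = reward + gamma * critic(next_state).squeeze(1)
advantage = (td_target - value).detach()                   # how much better than expected the action was
logp = Categorical(logits=actor(state)).log_prob(action)
actor_loss = -(advantage * logp).mean()                    # gradient ASCENT on expected return
critic_loss = (td_target - value).pow(2).mean()            # regression of V(s) toward the TD target
```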
What is the major innovation of PPO over TRPO?
PPO replaces TRPO's hard KL-divergence trust-region constraint (which requires second-order / conjugate-gradient machinery) with a clipped surrogate objective, so the policy can be updated with ordinary first-order gradient ascent. The result is a much simpler optimization process with similar stability.
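For concreteness, here is a sketch of the clipped surrogate loss (random tensors stand in for real rollout data): the probability ratio is clipped so a single update can't move the policy too far from the one that collected the data.

```python
import torch

eps_clip = 0.2
logp_new = torch.randn(8)                  # log pi_theta(a|s) under the policy being updated
logp_old = torch.randn(8)                  # log pi_theta_old(a|s), stored at rollout time
advantage = torch.randn(8)

ratio = torch.exp(logp_new - logp_old)                                # r_t(theta)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
ppo_loss = -torch.min(unclipped, clipped).mean()                      # maximize the pessimistic (min) objective
```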
Why did you do a shared network for actor and critic only for Atari (and not for Cartpole or Mujoco)?
Ans: Atari games have massive observation spaces (stacks of raw pixel frames) while Cartpole and Mujoco have much smaller ones (4 and 11 dimensions respectively). The shared network builds a high-level representation of the state, and the actor and critic branches that diverge at the end can both work off that same representation.
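This is roughly what a shared trunk looks like for Atari-sized observations (the usual 84x84x4 frame stack; the layer sizes follow the common Nature-DQN-style CNN but should be treated as illustrative): one trunk builds the representation, and the actor and critic heads diverge only at the end.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        # shared trunk: turns the stacked frames into a single high-level representation
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # actor and critic branch off the same representation at the very end
        self.actor = nn.Linear(512, n_actions)   # policy logits
        self.critic = nn.Linear(512, 1)          # V(s)

    def forward(self, frames: torch.Tensor):
        h = self.trunk(frames / 255.0)
        return self.actor(h), self.critic(h)

logits, value = SharedActorCritic(n_actions=6)(torch.zeros(1, 4, 84, 84))
```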
What is the difference between on-policy and off-policy RL? What is the difference in how they update?
Ans: On-policy: The agent learns the value of the policy it is currently following. Basically, the agent learns directly from the actions it is actually taking. Off-policy: The agent learns the value of the optimal policy independent of the actions taken by the current policy. The agent learns about a policy different than the one currently being followed.
Update ans: On-policy (e.g., SARSA): the update bootstraps off the value of the action the current policy actually takes at the next step. Off-policy (e.g., Q-learning): the update bootstraps off the best (greedy) action at the next step, regardless of which action the behavior policy actually selected.
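A tabular side-by-side makes the update difference concrete (toy numbers, my own example): SARSA bootstraps off the action the behavior policy actually takes next, while Q-learning bootstraps off the greedy action regardless of what was actually taken.

```python
import numpy as np

Q = np.zeros((5, 2))                          # tiny state-action table
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0     # a_next is what the behavior policy actually chose

# SARSA (on-policy): target uses Q[s_next, a_next], the action actually taken
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Q-learning (off-policy): target uses max_a' Q[s_next, a'], ignoring a_next
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```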
What is the difference between model-based and model-free RL?
Model-based algorithms have access to (or learn) a model of the environment, i.e. the transition probability distribution and the reward function, and can plan against it; model-free algorithms skip the model entirely and learn value functions or policies directly from sampled interaction with the world.
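One way to see the split in code (a toy two-state MDP that is entirely made up): the model-based update sweeps over the known transition and reward model, while the model-free update only ever touches a sampled transition.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']: known model
R = np.random.rand(n_states, n_actions)                                 # R[s, a]: known rewards

# Model-based: one value-iteration sweep that uses P and R directly (planning, no interaction)
V = np.zeros(n_states)
V = (R + gamma * P @ V).max(axis=1)

# Model-free: a single Q-learning update from one SAMPLED transition (s, a, r, s')
Q = np.zeros((n_states, n_actions))
s, a = 0, 1
s_next = np.random.choice(n_states, p=P[s, a])   # the agent only sees the sample, never P itself
r = R[s, a]
Q[s, a] += 0.1 * (r + gamma * Q[s_next].max() - Q[s, a])
```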