Reinforcement Learning
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Source
- By David Silver's RL Course at UCL
- By Prof. Zico Kolter at CMU
Table of Contents
1. Reinforcement Learning
%%html
<center><iframe src="https://www.youtube.com/embed/un-FhSC0HfY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Agent interaction with environment
1.1. Markov Decision Processes
Recall a (discounted) Markov decision process is defined by:
$$M = (S,A,P,R,\gamma)$$
$S$: set of states
$A$: set of actions
$P$: $S \times A \times S \rightarrow [0,1]$: transition probability distribution $P(s' \mid s, a)$
$R: S \rightarrow \mathbb{R}$: reward function, where $R(s)$ is the reward for state $s$
$\gamma$: discount factor
Policy $\pi: S \rightarrow A$ is a mapping from states to actions
The RL twist: we do not know $P$ or $R$, or they are too big to enumerate (only have the ability to act in MDP, observe states and rewards)
Limitations of MDP
- Update equations require access to dynamics model $\Rightarrow$ sampling-based approximations
- Iteration over/storage for all states and actions: require small, discrete state-action space $\Rightarrow$ Q/V function fitting
1.2. Solving MDP
(Policy evaluation) Determine value of policy $\pi$
$$ \begin{align*} {v}_{\pi}(s) &= \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s \right] \\ &= R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) {v}_{\pi}\left(s'\right) \end{align*} $$
accomplished via the iteration (similar to value iteration, but for a fixed policy)
$${v}_{\pi}(s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) v_{\pi}\left(s'\right), \quad \forall s \in S$$
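To make the iteration concrete, here is a minimal NumPy sketch of iterative policy evaluation on a small tabular MDP; the arrays P, R, and the policy pi below are arbitrary illustrative values chosen only for this example.
import numpy as np

# Hypothetical tabular MDP (illustrative values only)
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s, a)
R = np.array([0.0, 1.0, 10.0])                                          # R[s]
pi = np.array([0, 1, 0])                                                 # deterministic policy pi(s)

# Iterative policy evaluation: v(s) <- R(s) + gamma * sum_s' P(s'|s, pi(s)) v(s')
v = np.zeros(n_states)
for _ in range(500):
    v = np.array([R[s] + gamma * P[s, pi[s]] @ v for s in range(n_states)])

print(v)   # approximate v_pi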
(Value iteration) Determine value of optimal policy
$$ {v_*}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P(s' \mid s,a) {v_*}\left(s'\right) $$
accomplished via value iteration:
$$ v(s) \leftarrow R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v\left(s'\right), \quad \forall s \in S $$
Optimal policy $\pi_*$ is then $$\pi_*(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v_*\left(s'\right)$$
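Value iteration and the greedy policy extraction admit an equally short sketch, reusing the illustrative n_states, P, R, and gamma from the policy-evaluation snippet above.
# Value iteration: v(s) <- R(s) + gamma * max_a sum_s' P(s'|s, a) v(s')
v = np.zeros(n_states)
for _ in range(500):
    v = R + gamma * np.max(P @ v, axis=1)     # (P @ v)[s, a] = sum_s' P(s'|s, a) v(s')

# Optimal policy: pi*(s) = argmax_a sum_s' P(s'|s, a) v*(s')
pi_star = np.argmax(P @ v, axis=1)
print(v, pi_star)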
How can we compute these quantities when $P$ and $R$ are unknown?
model-based RL
model-free RL
1.3. Overview of RL
2. Model-based RL
A simple approach: just estimate the MDP from data (known as Monte Carlo method)
- Agent acts in the world (according to some policy), observes episodes of experience
$$s_1, r_1, a_1, s_2,r_2,a_2, \cdots, s_m, r_m, a_m$$
- We form the empirical estimate of the MDP via the counts
$$ \begin{align*} \hat{P}\left(s' \mid s,a\right) &= \frac{\sum_{i=1}^{m-1} \mathbf{1}\left\{s_i = s, a_i = a, s_{i+1} = s'\right\} }{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s, a_i = a\}}\\ \\ \hat{R}(s) &= \frac{ \sum_{i=1}^{m-1} \mathbf{1}\{s_i = s \} r_i} {\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s\}} \end{align*} $$
Now solve the MDP $(S,A,\hat{P}, \hat{R})$
Will converge to correct MDP (and hence correct value function/policy) given enough samples of each state
How can we ensure we get the "right" samples? (a challenging problem for all methods we present here)
Advantages (informally): makes "efficient" use of data
Disadvantages: requires we build the actual MDP models, not much help if state space is too large
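As a rough sketch, the count-based estimates above can be computed as follows; the episode data here is a short made-up sequence, since the notes do not give one.
from collections import defaultdict

# Hypothetical episode s_1, r_1, a_1, ..., s_m, r_m, a_m (made-up numbers)
states  = [0, 1, 1, 2, 0, 1]
actions = [1, 0, 1, 0, 1, 0]
rewards = [0.0, 1.0, 1.0, 5.0, 0.0, 1.0]

trans_count = defaultdict(float)   # counts of (s, a, s')
sa_count    = defaultdict(float)   # counts of (s, a)
reward_sum  = defaultdict(float)   # sum of rewards observed in state s
s_count     = defaultdict(float)   # counts of s

for i in range(len(states) - 1):
    s, a, s_next = states[i], actions[i], states[i + 1]
    trans_count[(s, a, s_next)] += 1
    sa_count[(s, a)] += 1
    reward_sum[s] += rewards[i]
    s_count[s] += 1

def P_hat(s_next, s, a):
    return trans_count[(s, a, s_next)] / sa_count[(s, a)]

def R_hat(s):
    return reward_sum[s] / s_count[s]

print(P_hat(1, 0, 1), R_hat(1))    # empirical P(s'=1 | s=0, a=1) and R_hat(1)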
3. Model-free RL
Temporal difference methods (TD, SARSA, Q-learning): directly learn value function $v_{\pi}$ or $v_*$
Direct policy search: directly learn optimal policy $\pi_*$
3.1. Temporal Difference (TD) Methods
Let's consider computing the value function for a fixed policy via the iteration
$$ v_{\pi}(s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S}P\left(s' \mid s,\pi(s) \right) \, v_{\pi}\left(s'\right), \quad \forall s \in S $$
Suppose we are in some state $s_t$, receive reward $R_t$, take action $a_t = \pi(s_t)$ and end up in state $s_{t+1}$
We cannot update $v_{\pi}$ for all $s \in S$, but can we update just for $s_t$?
$$ v_{\pi}(s_t) \; \leftarrow \; R_t + \gamma \sum_{s' \in S}P\left(s' \mid s_t,a_t \right) v_{\pi}\left(s'\right) $$
... No, because we still do not know $P\left(s' \mid s_t,a_t \right)$ for all $s' \in S$
But, $s_{t+1}$ is a sample from the distribution $P(s' \mid s_t,a_t)$, so we could perform the update
$$ v_{\pi}(s_t) \; \leftarrow \; R_t + \gamma v_{\pi}(s_{t+1}) $$
This is too "harsh" an assignment, since it assumes that $s_{t+1}$ is the only possible next state;
Instead "smooth" the update using some $\alpha < 1$
$$ v_{\pi}(s_t) \; \leftarrow \; (1-\alpha)\, \left( v_{\pi}(s_{t}) \right) + \alpha \,\left( R_t + \gamma v_{\pi}(s_{t+1}) \right)$$
This is the temporal difference (TD) algorithm. Its mathematical background will be briefly discussed later.
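A minimal sketch of TD(0) policy evaluation in code; the env object is assumed to follow the gym-style reset/step interface used later in this notebook, and policy is a placeholder function mapping states to actions.
from collections import defaultdict

def td0_evaluation(env, policy, gamma=0.9, alpha=0.1, n_episodes=100):
    v = defaultdict(float)                      # v_pi(s), implicitly initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            # Smooth update toward the one-sample target r + gamma * v(s_next)
            v[s] = (1 - alpha) * v[s] + alpha * (r + gamma * v[s_next])
            s = s_next
    return v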
3.2. Issue with traditional TD algorithms
TD lets us learn the value function of a policy $\pi$ directly, without ever constructing the MDP.
But is this really that helpful?
Consider trying to execute the greedy policy with respect to the estimated $v_{\pi}$
$$\pi'(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$
We need a model $P \left(s' \mid s,a \right)$ anyway.
4. Entering the Q function
The $Q$ function (for MDPs in general) is like the value function, but defined over state-action pairs
$$ \begin{align*} Q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) Q_{\pi}\left(s',\pi\left(s'\right)\right) \\ \\ \\ Q_*(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) \max_{a'}Q_* \left(s',a'\right) \\ \\ &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) v_*\left(s'\right) \end{align*} $$
i.e., $Q$ function is a value of starting state $s$, taking action $a$, and then acting according to $\pi$ (or optimally for $Q_*$)
We can easily construct analogues of value iteration or policy evaluation to construct $Q$ functions directly given an MDP.
Optimal policy
$$ \begin{align*} \pi_*(s) &= \arg \max_a \sum_{s'} P\left(s' \mid s, a \right) v_*\left(s'\right) \quad \text{or}\\ \\ \pi_*(s) &= \arg \max_a Q_*(s,a) \quad \text{without knowing dynamics} \end{align*} $$
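The second form is what makes the $Q$ function so useful: with a tabular $Q_*$, the greedy action is just a row-wise argmax, and no transition model is needed. A tiny sketch, where Q_table is a hypothetical array of shape [n_states, n_actions]:
import numpy as np

Q_table = np.random.rand(10, 2)          # hypothetical Q*(s, a): 10 states, 2 actions
pi_star = np.argmax(Q_table, axis=1)     # pi*(s) = argmax_a Q*(s, a), no P(s'|s,a) required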
4.1. SARSA and Q-learning
The $Q$ function leads to new TD-like methods.
As with TD, observe state $s$, reward $r$, take action $a$ (but not necessarily $a = \pi(s)$), observe next state $s'$
- SARSA: estimate $Q_{\pi}(s,a)$ for the given policy $\pi$
$${Q}_{\pi}(s,a) \leftarrow (1-\alpha) \left( {Q}_{\pi}(s,a) \right) + \alpha \left( R_t + \gamma {Q}_{\pi}\left(s',\pi \left(s'\right)\right) \right)$$
- $Q$-learning: estimate $Q_*(s,a)$ for optimality
$${Q}_*(s,a) \leftarrow (1-\alpha) \left( {Q}_*(s,a) \right) + \alpha \left( R_t + \gamma \max_{a'}{Q}_*\left(s',a'\right) \right)$$
Again, these algorithms converge to true $Q_{\pi}, Q_*$ if all state-action pairs are seen frequently enough
Note: State–Action–Reward–State–Action (SARSA)
The advantage of this approach is that we can now select actions without a model of the MDP
- SARSA, greedy policy with respect to $Q_{\pi}(s,a)$
$$ \pi'(s) = \arg \max_{a} {Q}_{\pi}(s,a)$$
- $Q$-learning, optimal policy
$$\pi_*(s) = \arg \max_{a} {Q}_*(s,a)$$
So with $Q$-learning, for instance, we can learn the optimal policy without a model of the MDP.
4.2. Solving Q-Value
Q-value iteration
$$ \begin{align*} Q_{k+1}(s,a) &\;\leftarrow \; R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; 1 \cdot R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \cdot R(s) + \gamma \sum_{s' \in S}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \\\\ Q_{k+1}(s,a) &\; \leftarrow \; \mathbb{E}_{s' \sim P\left(s'\mid s,a\right)} \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \qquad \text{Rewrite as expectation} \end{align*} $$
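When the transition model is known, this expectation can be evaluated exactly; below is a short NumPy sketch of tabular Q-value iteration on a made-up MDP, where P, R, and gamma are arbitrary illustrative values.
import numpy as np

# Hypothetical tabular MDP: P[s, a, s'] = P(s'|s, a), R[s]
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.array([0.0, 1.0, 10.0])

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Q(s, a) <- R(s) + gamma * sum_s' P(s'|s, a) max_a' Q(s', a')
    Q = R[:, None] + gamma * (P @ np.max(Q, axis=1))

print(np.max(Q, axis=1))   # recovers v*(s)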
Q-Learning Algorithm: replace expectation by samples
For a state-action pair $(s,a)$, receive: $s' \sim P\left(s' \mid s,a\right)$
Consider your old estimate: $Q_k(s,a)$
Consider your new sample estimate:
$$\text{target}\left(s'\right) = R(s) + \gamma \max_{a'}Q_k\left(s',a'\right)$$
Incorporate the new estimate into a running average [temporal difference, or learning incrementally]:
$$ \begin{align*} Q_{k+1}(s,a) \; \leftarrow &\; Q_k(s,a) + \alpha \; \left(\text{target}\left(s'\right) - Q_k(s,a) \right) \\ \; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \;\text{target}\left(s'\right) \\ \; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \left( R(s) + \gamma \max_{a'}{Q}_k\left(s',a'\right) \right) \end{align*}$$
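Written as a single helper, the sample-based update looks like the sketch below; Q is assumed to be a NumPy array indexed by [state, action], and the function name and default parameters are ours.
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.9):
    # target(s') = R(s) + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next, :])
    # Running average: Q(s, a) <- (1 - alpha) Q(s, a) + alpha * target
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q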
4.3. How to Sample Actions (Exploration vs. Exploitation)?
All the methods discussed so far had some condition like "assuming we visit each state enough"
A fundamental question: if we do not know the system dynamics, should we take exploratory actions that will give us more information, or exploit current knowledge to perform as best we can?
The issue is that bad initial estimates in the first few episodes can drive the policy into a sub-optimal region, and it may never explore further.
Choose random actions ? or
Choose action that maximizes $Q_k(s,a)$ (i.e., greedily) ?
$\varepsilon$-Greedy: choose random action with probability $\varepsilon$, otherwise choose action greedily
$$\pi(s) = \begin{cases} \max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \\ \text{random action} & \text{otherwise} \end{cases}$$
- Want to decrease $\varepsilon$ as we see more examples
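A small sketch of $\varepsilon$-greedy action selection with one possible decay schedule; the exponential decay is our choice for illustration, not something prescribed by the notes.
import numpy as np

def epsilon_greedy(Q, s, n_actions, episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    # Decay epsilon as more episodes are seen, but keep a small amount of exploration
    epsilon = max(eps_min, eps_start * decay**episode)
    if np.random.uniform() < epsilon:
        return np.random.randint(n_actions)    # explore: random action
    return int(np.argmax(Q[s, :]))             # exploit: greedy action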
Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy if all state-action pairs are seen frequently enough
With Q-learning, we can learn the optimal policy without a model of the MDP
This is called off-policy learning
4.4. Q-Learning Algorithm
Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$ greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad$$Q(s,a) \leftarrow (1-\alpha) \left( Q(s,a) \right) + \alpha \left( R + \gamma \max_{a'} Q\left(s',a'\right) \right)$ <br> $\quad \quad$$s \; \leftarrow \; s'$
until $s$ is terminal
5. Q-Learning with Gym Environment
- Agent interaction with environment
- A Python API for RL environments
- A set of tools to measure agent performance
- Read https://gym.openai.com/docs/
Examples
# !pip install gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
CartPole-v0
- A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
Objective:
- Balance a pole on top of a movable cart
State:
- [position, horizontal velocity, angle, angular speed]
Action:
- horizontal force applied on the cart (binary)
Reward:
- 1 at each time step if the pole is upright
The process gets started by calling reset(), which returns an initial observation.
import gym
env = gym.make('CartPole-v0')
observation = env.reset() # observation = state
print(observation)
Here’s a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 50 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:
env.reset()

for _ in range(50):
    env.render()
    action = env.action_space.sample()  # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    print(action, observation, reward, done)

env.close()
Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen.
Observations
If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.
The environment’s step function returns exactly what we need. In fact, step returns four values. These are:
- observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag: the episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
for i_episode in range(2):
    observation = env.reset()

    for k in range(50):
        env.render()
        print(observation)

        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        if done:
            print("Episode finished after {} timesteps".format(k+1))
            break

env.close()
Cartpole state = [pos, vel_pos, angle, vel_ang]
In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an action_space and an observation_space.
print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())
print(env.observation_space.low)
print(env.observation_space.high)
n_observations = env.observation_space.shape
n_actions = env.action_space.n
print(n_observations)
print(n_actions)
n_bins_pos = 10
n_bins_vel = 10
n_bins_ang = 10
n_bins_anv = 10
Initialize tabular $Q(s,a)$
n_states = n_bins_pos*n_bins_vel*n_bins_ang*n_bins_anv
n_actions = 2
Q_table = np.random.uniform(0, 1, (n_states, n_actions))
env.reset()
observation, _, _, _ = env.step(0)
env.close()
pos, vel, ang, anv = observation
print(observation)
def map_discrete_state(state):
    # Map the continuous CartPole state to a single discrete index for the Q table
    pos, vel, ang, anv = state

    # Clip each component to a fixed range and find which of the n bins it falls into
    idx_pos = np.where(np.histogram(np.clip(pos, -2, 2), bins = n_bins_pos, range = (-2, 2))[0] == 1)[0][0]
    idx_vel = np.where(np.histogram(np.clip(vel, -2, 2), bins = n_bins_vel, range = (-2, 2))[0] == 1)[0][0]
    idx_ang = np.where(np.histogram(np.clip(ang, -0.4, 0.4), bins = n_bins_ang, range = (-0.4, 0.4))[0] == 1)[0][0]
    idx_anv = np.where(np.histogram(np.clip(anv, -2, 2), bins = n_bins_anv, range = (-2, 2))[0] == 1)[0][0]

    # Flatten the 4-D bin indices into a single state index
    states = np.zeros([n_bins_pos, n_bins_vel, n_bins_ang, n_bins_anv])
    states[idx_pos, idx_vel, idx_ang, idx_anv] = 1
    states = states.reshape(-1, 1)
    s = np.where(states == 1)[0][0]

    return s
- $\varepsilon$-Greedy: choose random action with probability $\varepsilon$, otherwise choose action greedily
$$\pi(s) = \begin{cases} \max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \\ \text{random action} & \text{otherwise} \end{cases}$$
- Q Learning
Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$ greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad$$Q(s,a) \leftarrow (1-\alpha) \left( Q(s,a) \right) + \alpha \left( R + \gamma \max_{a'} Q\left(s',a'\right) \right)$ <br> $\quad \quad$$s \; \leftarrow \; s'$
until $s$ is terminal
alpha = 0.3
gamma = 0.9
Q_table = np.random.uniform(0, 1, (n_states, n_actions))
for episode in range(801):
    done = False
    state = env.reset()
    count = 0

    while not done:
        count += 1
        s = map_discrete_state(state)

        # Exploration vs. exploitation for action selection (epsilon-greedy)
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q_table[s, :])

        # Take the action, observe next state and reward
        next_state, reward, done, _ = env.step(a)

        if done:
            # Penalize the terminal transition (pole fell or cart left the track)
            reward = -100
            Q_table[s, a] = reward
        else:
            # Temporal difference update
            next_s = map_discrete_state(next_state)
            Q_table[s, a] = (1 - alpha)*Q_table[s, a] + alpha*(reward + gamma*np.max(Q_table[next_s, :]))

        state = next_state

    if episode % 100 == 0:
        print("Episode: {} steps: {}".format(episode, count))

env.close()
state = env.reset()
done = False

while not done:
    env.render()

    s = map_discrete_state(state)
    a = np.argmax(Q_table[s, :])

    next_state, _, done, _ = env.step(a)
    state = next_state

env.close()
6. Video Lectures
%%html
<center><iframe src="https://www.youtube.com/embed/h9xplLinwsQ?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/on9QKpIdYeE?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/nvPpjOMZolo?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/vcQkQMYmlUY?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/w1SFeTdvZWs?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
7. Other Tutorials
%%html
<center><iframe src="https://www.youtube.com/embed/JgvyzIkgxF0?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Stanford Univ. by Serena Yeung
%%html
<center><iframe src="https://www.youtube.com/embed/lvoHnicueoE?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
MIT
%%html
<center><iframe src="https://www.youtube.com/embed/xWe58WGWmlk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/s5qqjyGiBdc?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
David Silver's Lecture (DeepMind)
UCL homepage for slides (http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
DeepMind for RL videos (https://www.youtube.com/watch?v=2pWv7GOvuf0)
An Introduction to Reinforcement Learning, Sutton and Barto pdf
%%html
<center><iframe src="https://www.youtube.com/embed/2pWv7GOvuf0?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')