Reinforcement Learning
Table of Contents
%%html
<center><iframe src="https://www.youtube.com/embed/un-FhSC0HfY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Agent interaction with environment
Recall a (discounted) Markov decision process is defined by:
$$M = (S, A, P, R, \gamma)$$
$S$: set of states
$A$: set of actions
$P$: $S \times A \times S \rightarrow [0,1]$: transition probability distribution $P(s' \mid s, a)$
$R$: $S \rightarrow \mathbb{R}$: reward function, where $R(s)$ is the reward for state $s$
$\gamma \in [0,1)$: discount factor
The RL twist: we do not know $P$ or $R$, or they are too big to enumerate; we only have the ability to act in the MDP and observe states and rewards
Limitations of MDP
(Policy evaluation) Determine the value of a policy $\pi$
$$ \begin{align*} {v}_{\pi}(s) &= \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s \right] \\ &= R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) {v}_{\pi}\left(s'\right) \end{align*} $$
accomplished via the iteration (similar to value iteration, but for a fixed policy)
$$ {v}_{\pi}(s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) {v}_{\pi}\left(s'\right) \qquad \text{for all } s \in S, \text{ repeated until convergence} $$
(Value iteration) Determine the value of the optimal policy
$$ {v_*}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P(s' \mid s,a) {v_*}\left(s'\right) $$
accomplished via value iteration:
$$ {v_*}(s) \; \leftarrow \; R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P(s' \mid s,a) {v_*}\left(s'\right) \qquad \text{for all } s \in S, \text{ repeated until convergence} $$
Optimal policy $\pi_*$ is then $$\pi_*(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v_*\left(s'\right)$$
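To make the recap concrete, here is a minimal NumPy sketch of value iteration for a small MDP with known dynamics. The transition tensor P, reward vector R, and the tiny two-state example are made-up inputs for illustration only, not part of the CartPole example later in this notebook.
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P: (|S|, |A|, |S|) transition probabilities, R: (|S|,) rewards
    v = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: v(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) v(s')
        v_new = R + gamma * np.max(P @ v, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    pi = np.argmax(P @ v, axis=1)   # greedy policy with respect to v*
    return v, pi

# made-up 2-state, 2-action MDP, just to exercise the function
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)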
How can we compute these quantities when $P$ and $R$ are unknown?
model-based RL: estimate $P$ and $R$ from data, then solve the resulting MDP
model-free RL: learn value functions or policies directly from experience, without building a model
A simple approach: just estimate the MDP from data (known as the Monte Carlo method)
$$
\begin{align*}
\hat{P}\left(s' \mid s,a\right) &= \frac{\sum_{i=1}^{m-1} \mathbf{1}\left\{s_i = s, a_i = a, s_{i+1} = s'\right\} }{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s, a_i = a\}}\\ \\
\hat{R}(s) &= \frac{ \sum_{i=1}^{m-1} \mathbf{1}\{s_i = s \} r_i} {\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s\}}
\end{align*}
$$
Now solve the MDP $(S,A,\hat{P}, \hat{R})$
This will converge to the correct MDP (and hence the correct value function/policy) given enough samples of each state–action pair
How can we ensure we get the "right" samples? (a challenging problem for all methods we present here)
Advantages (informally): makes "efficient" use of data
Disadvantages: requires that we build the actual MDP model, which is not much help if the state space is too large
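A minimal sketch of this model-based estimation, assuming experience is available as a list of (state, action, reward) tuples indexed as in the formulas above; the function name and trajectory format are illustrative assumptions.
import numpy as np

def estimate_mdp(trajectory, n_states, n_actions):
    # trajectory: list of (s_i, a_i, r_i) tuples (assumed format, for illustration)
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for (s, a, r), (s_next, _, _) in zip(trajectory[:-1], trajectory[1:]):
        counts[s, a, s_next] += 1   # count observed transitions (s, a) -> s'
        reward_sum[s] += r          # accumulate rewards observed in state s
        visits[s] += 1
    # normalize counts, guarding against unvisited states / state-action pairs
    P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    R_hat = reward_sum / np.maximum(visits, 1)
    return P_hat, R_hat
The estimated $(\hat{P}, \hat{R})$ can then be handed to a standard MDP solver, such as the value-iteration sketch above.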
Let's consider computing the value function for a fixed policy via the iteration above
Suppose we are in some state $s_t$, receive reward $R_t$, take action $a_t = \pi(s_t)$ and end up in state $s_{t+1}$
We cannot update $v_{\pi}$ for all $s \in S$, but can we update just for $s_t$?
... No, because we still do not know $P\left(s' \mid s_t,a_t \right)$ for all $s' \in S$
But $s_{t+1}$ is a sample from the distribution $P(s' \mid s_t,a_t)$, so we could perform the update
$$ v_{\pi}(s_t) \; \leftarrow \; R_t + \gamma \, v_{\pi}\left(s_{t+1}\right) $$
This assignment is too "harsh": it treats $s_{t+1}$ as if it were the only possible next state.
Instead, "smooth" the update using some step size $\alpha < 1$:
$$ v_{\pi}(s_t) \; \leftarrow \; (1-\alpha) \, v_{\pi}(s_t) + \alpha \left( R_t + \gamma \, v_{\pi}\left(s_{t+1}\right) \right) $$
This is the temporal difference (TD) algorithm. Its mathematical background will be briefly discussed later.
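A minimal sketch of the resulting tabular TD(0) update for a single observed transition, assuming a discrete state space and a NumPy value table (the function and parameter names are illustrative):
import numpy as np

def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    # v: NumPy array of value estimates, one entry per (discrete) state
    target = r + gamma * v[s_next]              # sampled estimate of v_pi(s)
    v[s] = (1 - alpha) * v[s] + alpha * target  # smoothed update with step size alpha
    return v

v = np.zeros(4)                          # value table for a made-up 4-state example
v = td0_update(v, s=0, r=1.0, s_next=2)  # one observed transition
print(v)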
TD lets us learn the value function of a policy $\pi$ directly, without ever constructing the MDP.
But is this really that helpful?
Consider trying to execute the greedy policy with respect to the estimated $v_{\pi}$:
$$\pi'(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v_{\pi}\left(s'\right)$$
We need a model $P \left(s' \mid s,a \right)$ anyway.
$Q$ functions (for MDPs in general) are like value functions, but defined over state–action pairs
i.e., the $Q$ function is the value of starting in state $s$, taking action $a$, and then acting according to $\pi$ (or optimally, for $Q_*$)
We can easily construct analogues of value iteration or policy evaluation to construct $Q$ functions directly given an MDP.
Optimal policy
$$ \begin{align*} \pi_*(s) &= \arg \max_a \sum_{s'} P\left(s' \mid s, a \right) v_*\left(s'\right) \quad \text{or}\\ \\ \pi_*(s) &= \arg \max_a Q_*(s,a) \quad \text{without knowing dynamics} \end{align*} $$
The $Q$ function leads to new TD-like methods.
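As a one-line illustration of the second form, with a tabular $Q$ stored as a NumPy array of shape (number of states, number of actions); the values here are placeholders only:
import numpy as np

Q = np.random.rand(5, 2)      # placeholder Q table: 5 states, 2 actions
pi_star = Q.argmax(axis=1)    # pi_*(s) = argmax_a Q_*(s, a); no dynamics model needed
print(pi_star)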
As with TD, observe state $s$, reward $r$, take action $a$ (but not necessarily $a = \pi(s)$), observe next state $s'$, and update
$$ \begin{align*} Q_{\pi}(s,a) \; &\leftarrow \; (1-\alpha) \, Q_{\pi}(s,a) + \alpha \left( r + \gamma \, Q_{\pi}\left(s',a'\right) \right) \qquad \text{(SARSA, where } a' \text{ is the next action taken)}\\ \\ Q_{*}(s,a) \; &\leftarrow \; (1-\alpha) \, Q_{*}(s,a) + \alpha \left( r + \gamma \max_{a'} Q_{*}\left(s',a'\right) \right) \qquad \text{(Q-learning)} \end{align*} $$
Again, these algorithms converge to the true $Q_{\pi}, Q_*$ if all state–action pairs are seen frequently enough
Note: SARSA stands for State–Action–Reward–State–Action
The advantage of this approach is that we can now select actions without a model of the MDP
So with $Q$-learning, for instance, we can learn the optimal policy without a model of the MDP.
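A sketch of the two tabular updates above as NumPy functions (names and defaults are illustrative): SARSA bootstraps on the action actually taken next, while Q-learning bootstraps on the greedy action.
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # on-policy: bootstrap on the action a_next actually taken in s_next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # off-policy: bootstrap on the greedy action in s_next
    target = r + gamma * np.max(Q[s_next, :])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q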
Q-value iteration
$$
\begin{align*}
Q_{k+1}(s,a) &\;\leftarrow \; R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\
&\; \leftarrow \; 1 \cdot R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\
&\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \cdot R(s) + \gamma \sum_{s' \in S}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\
&\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \\\\
Q_{k+1}(s,a) &\; \leftarrow \; \mathbb{E}_{s' \sim P\left(s'\mid s,a\right)} \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \qquad \text{Rewrite as expectation}
\end{align*}
$$
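For contrast with the sample-based algorithm that follows, here is a sketch of this Q-value iteration when $P$ and $R$ are known, using the same made-up (|S|, |A|, |S|) tensor layout as the value-iteration sketch above:
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=500):
    # P: (|S|, |A|, |S|) transition probabilities, R: (|S|,) rewards
    Q = np.zeros((P.shape[0], P.shape[1]))
    for _ in range(n_iters):
        # Q_{k+1}(s,a) = E_{s'~P(.|s,a)}[ R(s) + gamma * max_a' Q_k(s',a') ]
        Q = R[:, None] + gamma * (P @ np.max(Q, axis=1))
    return Q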
Q-Learning Algorithm: replace expectation by samples
1) For a state-action pair $(s,a)$, receive: $s' \sim P\left(s' \mid s,a\right)$
2) Consider your old estimate: $Q_k(s,a)$
3) Consider your new sample estimate:
$$\text{target}\left(s'\right) = R(s) + \gamma \max_{a'}Q_k{\left(s',a'\right)}$$
4) Incorporate the new estimate into a running average [Temporal Difference or learning incrementally]:
$$
\begin{align*}
Q_{k+1}(s,a) \; \leftarrow &\; Q_k(s,a) + \alpha \; \left(\text{target}\left(s'\right) - Q_k(s,a) \right) \\
\; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \;\text{target}\left(s'\right) \\
\; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \left( R(s) + \gamma \max_{a'}{Q}_k\left(s',a'\right) \right)
\end{align*}$$
All the methods discussed so far had some condition like "assuming we visit each state enough"
A fundamental question: if we do not know the system dynamics, should we take exploratory actions that will give us more information, or exploit current knowledge to perform as best we can?
The issue is that bad initial estimates in the first few episodes can drive the policy into a sub-optimal region, and it may never explore further.
Choose random actions (explore)? or
Choose the action that maximizes $Q_k(s,a)$, i.e., act greedily (exploit)?
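The usual compromise between the two, and the rule used in the training loop later in this notebook, is epsilon-greedy action selection; a minimal sketch (function name and defaults are illustrative):
import numpy as np

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    # explore with probability epsilon, otherwise exploit the current Q estimate
    if np.random.uniform() < epsilon:
        return np.random.randint(n_actions)   # explore: uniformly random action
    return int(np.argmax(Q[s, :]))            # exploit: greedy action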
Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy if all state–action pairs are seen frequently enough
With Q-learning, we can learn the optimal policy without a model of the MDP
This is called off-policy learning
Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$ greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad$$Q(s,a) \; \leftarrow \; (1-\alpha) \, Q(s,a) + \alpha \left( R + \gamma \max_{a'} Q\left(s',a'\right) \right)$
$\quad \quad$$s \; \leftarrow \; s'$
$\quad$until $s$ is terminal
# !pip install gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
CartPole-v0
The process gets started by calling reset(), which returns an initial observation.
import gym
env = gym.make('CartPole-v0')
observation = env.reset() # observation = state
print(observation)
Here's a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 50 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:
env.reset()

for _ in range(50):
    env.render()
    action = env.action_space.sample()  # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    print(action, observation, reward, done)

env.close()
Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen.
Observations
If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.
The environment's step function returns exactly what we need. In fact, step returns four values. These are:
observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
done (boolean): whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.
This is just an implementation of the classic "agent-environment loop". Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag: the episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
for i_episode in range(2):
    observation = env.reset()
    for k in range(50):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(k+1))
            break

env.close()
CartPole state = [pos, vel, ang, anv] (cart position, cart velocity, pole angle, pole angular velocity)
In the examples above, we've been sampling random actions from the environment's action space. But what actually are those actions? Every environment comes with an action_space and an observation_space.
print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())
print(env.observation_space.low)
print(env.observation_space.high)
n_observations = env.observation_space.shape
n_actions = env.action_space.n
print(n_observations)
print(n_actions)
# number of discretization bins for each state dimension
n_bins_pos = 10
n_bins_vel = 10
n_bins_ang = 10
n_bins_anv = 10
Initialize tabular $Q(s,a)$
n_states = n_bins_pos*n_bins_vel*n_bins_ang*n_bins_anv
n_actions = 2
Q_table = np.random.uniform(0, 1, (n_states, n_actions))
env.reset()
observation, _, _, _ = env.step(0)
env.close()
pos, vel, ang, anv = observation
print(observation)
def map_discrete_state(state):
    """Map a continuous CartPole observation to a single discrete state index."""
    pos, vel, ang, anv = state
    # find the bin index of each (clipped) state component
    idx_pos = np.where(np.histogram(np.clip(pos, -2, 2), bins=n_bins_pos, range=(-2, 2))[0] == 1)[0][0]
    idx_vel = np.where(np.histogram(np.clip(vel, -2, 2), bins=n_bins_vel, range=(-2, 2))[0] == 1)[0][0]
    idx_ang = np.where(np.histogram(np.clip(ang, -0.4, 0.4), bins=n_bins_ang, range=(-0.4, 0.4))[0] == 1)[0][0]
    idx_anv = np.where(np.histogram(np.clip(anv, -2, 2), bins=n_bins_anv, range=(-2, 2))[0] == 1)[0][0]

    # flatten the 4-D bin index into a single state index
    states = np.zeros([n_bins_pos, n_bins_vel, n_bins_ang, n_bins_anv])
    states[idx_pos, idx_vel, idx_ang, idx_anv] = 1
    states = states.reshape(-1, 1)
    s = np.where(states == 1)[0][0]
    return s
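A quick sanity check of the mapping (the observation values here are arbitrary, chosen only to exercise the function):
example_state = np.array([0.1, -0.5, 0.02, 0.3])   # arbitrary [pos, vel, ang, anv]
print(map_discrete_state(example_state))           # flat index in [0, n_states)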
Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$ greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad$$Q(s,a) \; \leftarrow \; (1-\alpha) \, Q(s,a) + \alpha \left( R + \gamma \max_{a'} Q\left(s',a'\right) \right)$
$\quad \quad$$s \; \leftarrow \; s'$
$\quad$until $s$ is terminal
alpha = 0.3   # learning rate
gamma = 0.9   # discount factor
Q_table = np.random.uniform(0, 1, (n_states, n_actions))
for episode in range(801):
    done = False
    state = env.reset()
    count = 0

    while not done:
        count += 1
        s = map_discrete_state(state)

        # exploration vs. exploitation for action selection (epsilon-greedy)
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q_table[s, :])

        # next state and reward
        next_state, reward, done, _ = env.step(a)

        if done:
            # penalize the terminal (failure) state
            reward = -100
            Q_table[s, a] = reward
        else:
            # temporal difference update
            next_s = map_discrete_state(next_state)
            Q_table[s, a] = (1 - alpha)*Q_table[s, a] + alpha*(reward + gamma*np.max(Q_table[next_s, :]))

        state = next_state

    if episode % 100 == 0:
        print("Episode: {} steps: {}".format(episode, count))

env.close()
# run the learned greedy policy
state = env.reset()
done = False

while not done:
    env.render()
    s = map_discrete_state(state)
    a = np.argmax(Q_table[s, :])
    next_state, _, done, _ = env.step(a)
    state = next_state

env.close()
%%html
<center><iframe src="https://www.youtube.com/embed/h9xplLinwsQ?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/on9QKpIdYeE?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/nvPpjOMZolo?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/vcQkQMYmlUY?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/w1SFeTdvZWs?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/JgvyzIkgxF0?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Stanford University lecture by Serena Yeung
%%html
<center><iframe src="https://www.youtube.com/embed/lvoHnicueoE?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
MIT
%%html
<center><iframe src="https://www.youtube.com/embed/xWe58WGWmlk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/s5qqjyGiBdc?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
David Silver's Lecture (DeepMind)
UCL homepage for slides (http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
DeepMind for RL videos (https://www.youtube.com/watch?v=2pWv7GOvuf0)
Reinforcement Learning: An Introduction, Sutton and Barto (pdf)
%%html
<center><iframe src="https://www.youtube.com/embed/2pWv7GOvuf0?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')