Reinforcement Learning

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

• By David Silver's RL Course at UCL
• By Prof. Zico Kolter at CMU

# 1. Reinforcement Learning



Agent interaction with environment

## 1.1. Markov Decision Processes

Recall a (discounted) Markov decision process is defined by:

$$M = (S,A,P,R,\gamma)$$
• $S$: set of states

• $A$: set of actions

• $P$: $S \times A \times S \rightarrow [0,1]$: transition probability distribution $P(s' \mid s, a)$

• $R: S \rightarrow \mathbb{R}$: reward function, where $R(s)$ is the reward for state $s$

• $\gamma$: discount factor

• Policy $\pi: S \rightarrow A$ is a mapping from states to actions

The RL twist: we do not know $P$ or $R$, or they are too big to enumerate (only have the ability to act in MDP, observe states and rewards)

Limitations of MDP

• Update equations require access to dynamics model $\Rightarrow$ sampling-based approximations
• Iteration over/storage for all states and actions: require small, discrete state-action space $\Rightarrow$ Q/V function fitting

## 1.2. Solving MDP

(Policy evaluation) Determine value of policy $\pi$

\begin{align*} {v}_{\pi}(s) &= \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s \right] \\ &= R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) {v}_{\pi}\left(s'\right) \end{align*}

accomplished via the iteration (similar to a value iteration, but for a fixed policy)

$${v}_{\pi}(s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S}P \left(s' \mid s,\pi(s) \right) v_{\pi}\left(s'\right), \quad \forall s \in S$$
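
As a minimal sketch of this iteration, assuming a small 2-state, 2-action MDP whose arrays `P`, `R` and policy `pi` are made up purely for illustration (they are not from the lecture), the fixed-policy update could look like this:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, n_iters=500):
    """Iterate v(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) v(s').

    P has shape (n_states, n_actions, n_states), R has shape (n_states,),
    and pi holds one action index per state.
    """
    n_states = R.shape[0]
    v = np.zeros(n_states)
    # Transition matrix under the fixed policy: P_pi[s, s'] = P(s' | s, pi(s))
    P_pi = P[np.arange(n_states), pi]
    for _ in range(n_iters):
        v = R + gamma * P_pi @ v
    return v

# Hypothetical MDP, used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
pi = np.array([1, 1])  # always take action 1

v_pi = policy_evaluation(P, R, pi)
print(v_pi)
```

Because the policy is fixed, the same values can also be obtained by solving the linear system $(I - \gamma P_{\pi}) v = R$, which is a convenient check on the iteration.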

(Value iteration) Determine value of optimal policy

$${v_*}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P(s' \mid s,a) {v_*}\left(s'\right)$$

accomplished via value iteration:

$$v(s) \leftarrow R(s) + \gamma \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v\left(s'\right), \quad \forall s \in S$$

Optimal policy $\pi_*$ is then $$\pi_*(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a\right) v_*\left(s'\right)$$
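
The value iteration update and the greedy policy extraction can be sketched together as follows. This is only an illustration under the assumption that $P$ and $R$ are known; the 2-state MDP below is hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=500):
    """Return (v, policy) for a known MDP with P of shape (S, A, S) and R of shape (S,)."""
    v = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # expected[s, a] = sum_s' P(s'|s,a) v(s')
        expected = P @ v
        v = R + gamma * expected.max(axis=1)
    # Greedy policy with respect to the final value function
    policy = (P @ v).argmax(axis=1)
    return v, policy

# Hypothetical MDP, used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])

v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```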

How can we compute these quantities when $P$ and $R$ are unknown?

• model-based RL

• model-free RL

# 2. Model-based RL

A simple approach: just estimate the MDP from data (known as Monte Carlo method)

• Agent acts in the world (according to some policy), observes episodes of experience
$$s_1, r_1, a_1, s_2,r_2,a_2, \cdots, s_m, r_m, a_m$$
• We form the empirical estimate of the MDP via the counts

\begin{align*} \hat{P}\left(s' \mid s,a\right) &= \frac{\sum_{i=1}^{m-1} \mathbf{1}\left\{s_i = s, a_i = a, s_{i+1} = s'\right\} }{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s, a_i = a\}}\\ \\ \hat{R}(s) &= \frac{ \sum_{i=1}^{m-1} \mathbf{1}\{s_i = s \} r_i} {\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s\}} \end{align*}

Now solve the MDP $(S,A,\hat{P}, \hat{R})$

• Will converge to correct MDP (and hence correct value function/policy) given enough samples of each state

• How can we ensure we get the "right" samples? (a challenging problem for all methods we present here)

• Advantages (informally): makes "efficient" use of data

• Disadvantages: requires we build the actual MDP models, not much help if state space is too large
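
The count-based estimates $\hat{P}$ and $\hat{R}$ above can be sketched as below. The trajectory is invented for illustration, and the uniform fallback for state-action pairs that were never visited is an assumption of this sketch, not something the formulas specify.

```python
import numpy as np

def estimate_mdp(states, actions, rewards, n_states, n_actions):
    """Count-based estimates of P(s'|s,a) and R(s) from one trajectory."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    # Use transitions i = 1..m-1, as in the sums above
    for i in range(len(states) - 1):
        s, a, s_next = states[i], actions[i], states[i + 1]
        counts[s, a, s_next] += 1
        reward_sum[s] += rewards[i]
        visits[s] += 1
    sa_counts = counts.sum(axis=2, keepdims=True)
    # Assumption: unvisited (s, a) pairs fall back to a uniform distribution
    P_hat = np.where(sa_counts > 0, counts / np.maximum(sa_counts, 1), 1.0 / n_states)
    R_hat = reward_sum / np.maximum(visits, 1)
    return P_hat, R_hat

# Hypothetical episode of experience
states = [0, 1, 1, 0, 1, 0]
actions = [1, 0, 0, 1, 0, 1]
rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

P_hat, R_hat = estimate_mdp(states, actions, rewards, n_states=2, n_actions=2)
print(P_hat)
print(R_hat)
```

The estimated model can then be passed to any planner for known MDPs, such as value iteration.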

# 3. Model-free RL

• Temporal difference methods (TD, SARSA, Q-learning): directly learn value function $v_{\pi}$ or $v_*$
• Direct policy search: directly learn optimal policy $\pi_*$

## 3.1. Temporal Difference (TD) Methods

Let's consider computing the value function for a fixed policy via the iteration

$$v_{\pi}(s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S}P\left(s' \mid s,\pi(s) \right) \, v_{\pi}\left(s'\right), \quad \forall s \in S$$

Suppose we are in some state $s_t$, receive reward $R_t$, take action $a_t = \pi(s_t)$ and end up in state $s_{t+1}$

We cannot update $v_{\pi}$ for all $s \in S$, but can we update just for $s_t$?

$$v_{\pi}(s_t) \; \leftarrow \; R_t + \gamma \sum_{s' \in S}P\left(s' \mid s_t,a_t \right) v_{\pi}\left(s'\right)$$

... No, because we still do not know $P\left(s' \mid s_t,a_t \right)$ for all $s' \in S$

But, $s_{t+1}$ is a sample from the distribution $P(s' \mid s_t,a_t)$, so we could perform the update

$$v_{\pi}(s_t) \; \leftarrow \; R_t + \gamma v_{\pi}(s_{t+1})$$

• Treating $s_{t+1}$ as the only possible next state is too "harsh" an assignment;

• Instead, "smooth" the update using some step size $0 < \alpha < 1$

$$v_{\pi}(s_t) \; \leftarrow \; (1-\alpha)\, \left( v_{\pi}(s_{t}) \right) + \alpha \,\left( R_t + \gamma v_{\pi}(s_{t+1}) \right)$$

This is the temporal difference (TD) algorithm. Its mathematical background will be briefly discussed later.
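
A minimal sketch of the smoothed update, applied to a few invented transitions $(s_t, R_t, s_{t+1})$; the step size and discount values are arbitrary choices for illustration:

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD step: v(s) <- (1 - alpha) v(s) + alpha (r + gamma v(s'))."""
    target = r + gamma * v[s_next]
    v[s] = (1 - alpha) * v[s] + alpha * target
    return v

v = [0.0, 0.0]  # value estimates for two states, initialized to zero

# Hypothetical observed transitions (s_t, R_t, s_{t+1})
for s, r, s_next in [(0, 0.0, 1), (1, 1.0, 1), (0, 0.0, 1)]:
    td0_update(v, s, r, s_next)

print(v)
```

Only the entry for the visited state $s_t$ changes at each step, and no transition probabilities are needed.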

## 3.2. Issue with traditional TD algorithms

TD lets us learn the value function of a policy $\pi$ directly, without ever constructing the MDP.

But is this really that helpful?

Consider trying to execute greedy policy with respect to estimated $v_{\pi}$

$$\pi'(s) = \arg \max_{a \in A} \sum_{s' \in S}P\left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$

We need a model $P \left(s' \mid s,a \right)$ anyway.

# 4. Entering the Q function

$Q$ functions (for MDPs in general) are like value functions but defined over state-action pairs

\begin{align*} Q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) Q_{\pi}\left(s',\pi\left(s'\right)\right) \\ \\ \\ Q_*(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) \max_{a'}Q_* \left(s',a'\right) \\ \\ &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a\right) v_*\left(s'\right) \end{align*}

i.e., the $Q$ function is the value of starting in state $s$, taking action $a$, and then acting according to $\pi$ (or optimally for $Q_*$)

We can easily construct analogues of value iteration or policy evaluation to construct $Q$ functions directly given an MDP.

Optimal policy

\begin{align*} \pi_*(s) &= \arg \max_a \sum_{s'} P\left(s' \mid s, a \right) v_*\left(s'\right) \quad \text{or}\\ \\ \pi_*(s) &= \arg \max_a Q_*(s,a) \quad \text{without knowing dynamics} \end{align*}
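
As one possible analogue of value iteration for $Q_*$ when the MDP is known, the sketch below iterates the $Q_*$ equation directly and then reads the policy off with an $\arg\max$ over actions. The 2-state MDP is hypothetical.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=500):
    """Iterate Q(s,a) <- R(s) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v = Q.max(axis=1)              # v(s') = max_a' Q(s', a')
        Q = R[:, None] + gamma * (P @ v)
    return Q

# Hypothetical MDP, used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])

Q_star = q_value_iteration(P, R)
pi_star = Q_star.argmax(axis=1)  # no access to P is needed for this last step
print(Q_star)
print(pi_star)
```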

## 4.1. SARSA and Q-learning

$Q$ function leads to new TD-like methods.

As with TD, observe state $s$, reward $r$, take action $a$ (but not necessarily $a = \pi(s)$), observe next state $s'$

• SARSA: estimate $Q_{\pi}(s,a)$ for expectation

$${Q}_{\pi}(s,a) \leftarrow (1-\alpha) \left( {Q}_{\pi}(s,a) \right) + \alpha \left( r + \gamma {Q}_{\pi}\left(s',\pi \left(s'\right)\right) \right)$$

• $Q$-learning: estimate $Q_*(s,a)$ for optimality

$${Q}_*(s,a) \leftarrow (1-\alpha) \left( {Q}_*(s,a) \right) + \alpha \left( r + \gamma \max_{a'}{Q}_*\left(s',a'\right) \right)$$

Again, these algorithms converge to true $Q_{\pi}, Q_*$ if all state-action pairs are seen frequently enough

Note: State–Action–Reward–State–Action (SARSA)
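
The two updates differ only in how the next state is evaluated. A small sketch on an invented $Q$ table, where `a_next` stands for the action $\pi(s')$ that the policy would take in $s'$:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Bootstraps from the action the policy actually takes in s'
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Bootstraps from the best action in s', whatever the policy does next
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Hypothetical Q table with 2 states and 2 actions
Q_sarsa = np.array([[0.0, 0.0], [1.0, 2.0]])
Q_qlearn = Q_sarsa.copy()

sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1)

print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

With the same transition, SARSA uses $Q(s', a'{=}0) = 1$ in its target while $Q$-learning uses $\max_{a'} Q(s', a') = 2$, so the two estimates move by different amounts.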

The advantage of this approach is that we can now select actions without a model of MDP

• SARSA, greedy policy with respect to $Q_{\pi}(s,a)$

$$\pi'(s) = \arg \max_{a} {Q}_{\pi}(s,a)$$

• $Q$-learning, optimal policy

$$\pi_*(s) = \arg \max_{a} {Q}_*(s,a)$$

So with $Q$-learning, for instance, we can learn optimal policy without model of MDP.

## 4.2. Solving Q-Value

Q-value iterations

\begin{align*} Q_{k+1}(s,a) &\;\leftarrow \; R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; 1 \cdot R(s) + \gamma \sum_{s'}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \cdot R(s) + \gamma \sum_{s' \in S}P\left(s'\mid s,a\right) \max_{a'} Q_{k}\left(s',a'\right) \\\\ &\; \leftarrow \; \sum_{s'}P\left(s'\mid s,a\right) \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \\\\ Q_{k+1}(s,a) &\; \leftarrow \; \mathbb{E}_{s' \sim P\left(s'\mid s,a\right)} \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] \qquad \text{Rewrite as expectation} \end{align*}

Q-Learning Algorithm: replace expectation by samples

1) For a state-action pair $(s,a)$, receive: $s' \sim P\left(s' \mid s,a\right)$

2) Consider your old estimate: $Q_k(s,a)$

3) Consider your new sample estimate:

$$\text{target}\left(s'\right) = R(s) + \gamma \max_{a'}Q_k{\left(s',a'\right)}$$

4) Incorporate the new estimate into a running average [Temporal Difference or learning incrementally]:

\begin{align*} Q_{k+1}(s,a) \; \leftarrow &\; Q_k(s,a) + \alpha \; \left(\text{target}\left(s'\right) - Q_k(s,a) \right) \\ \; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \;\text{target}\left(s'\right) \\ \; \leftarrow &\; (1-\alpha) \; Q_k(s,a) + \alpha \left( R(s) + \gamma \max_{a'}{Q}_k\left(s',a'\right) \right) \end{align*}

## 4.3. How to Sample Actions (Exploration vs. Exploitation)?

All the methods discussed so far had some condition like "assuming we visit each state enough"

A fundamental question: if we do not know the system dynamics, should we take exploratory actions that will give us more information, or exploit current knowledge to perform as best we can?

The issue is that bad initial estimates in the first few cases can drive the policy into a sub-optimal region, from which it never explores further.

• Choose random actions ? or

• Choose action that maximizes $Q_k(s,a)$ (i.e., greedily) ?

• $\varepsilon$-Greedy: choose random action with probability $\varepsilon$, otherwise choose action greedily
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \\ \text{random action} & \text{otherwise} \end{cases}$$
• Want to decrease $\varepsilon$ as we see more examples
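
A minimal sketch of $\varepsilon$-greedy action selection with a decaying $\varepsilon$. The exponential schedule and its constants (`eps_start`, `eps_min`, `decay`) are assumptions chosen for illustration; the text only says that $\varepsilon$ should decrease over time.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Hypothetical schedule: exponential decay with a lower bound."""
    return max(eps_min, eps_start * decay ** episode)

rng = np.random.default_rng(0)
Q = np.array([[0.2, 0.8], [0.5, 0.1]])  # invented Q table

greedy_actions = [epsilon_greedy(Q, 0, 0.0, rng) for _ in range(20)]
random_actions = [epsilon_greedy(Q, 0, 1.0, rng) for _ in range(20)]
print(greedy_actions)
print(random_actions)
print(decayed_epsilon(0), decayed_epsilon(100), decayed_epsilon(10000))
```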

Q-Learning Properties

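• Amazing result: Q-learning converges to the optimal policy if all state-action pairs are seen frequently enough

• With Q-learning, we can learn the optimal policy without a model of the MDP

• This is called off-policy learning: the policy being learned (greedy with respect to $Q$) can differ from the exploratory policy used to collect experience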

## 4.4. Q-Learning Algorithm

Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad {Q}_*(s,a) \leftarrow (1-\alpha) \left( {Q}_*(s,a) \right) + \alpha \left( R + \gamma \max_{a'} {Q}_*\left(s',a'\right) \right)$
$\quad \quad s \; \leftarrow \; s'$
$\quad$until $s$ is terminal
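
The loop above can be sketched end to end on a toy problem. The 5-state chain environment `step` below is hypothetical (it is not part of the lecture or of Gym), and the hyperparameters are arbitrary; ties in the greedy choice are broken at random so that the all-zero initial table does not favor one action.

```python
import numpy as np

def step(s, a, n_states=5):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the right-most state gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    reward = 1.0 if done else 0.0
    return s_next, reward, done

def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((5, 2))
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(2))
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))
            s_next, r, done = step(s, a)
            # Do not bootstrap from a terminal state
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q

Q = q_learning()
greedy_policy = Q[:4].argmax(axis=1)  # greedy action in each non-terminal state
print(greedy_policy)
```

The agent never sees the transition rule inside `step`; it only observes the sampled next state and reward, which is all the update needs.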

# 5. Q-Learning with Gym Environment

• Agent interaction with environment

• Examples

In [2]:
# !pip install gym

In [3]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


CartPole-v0

• A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical (older Gym descriptions say 15 degrees, but the implementation uses a 12 degree threshold), or the cart moves more than 2.4 units from the center.

• Objective:
• Balance a pole on top of a movable cart
• State:
• [position, horizontal velocity, angle, angular speed]
• Action:
• horizontal force applied on the cart (binary)
• Reward:
• 1 at each time step if the pole is upright

The process gets started by calling reset(), which returns an initial observation.

In [4]:
import gym

env = gym.make('CartPole-v0')
observation = env.reset() # observation = state

print(observation)

[ 0.00210483 -0.02827348  0.04151207 -0.00995403]


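Here’s a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 50 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem. Note that this loop ignores the done flag and keeps stepping after the episode has ended, which is what triggers the warning in the output: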

In [5]:
env.reset()

for _ in range(50):
    env.render()
    action = env.action_space.sample() # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)

    print(action, observation, reward, done)

env.close()

1 [ 0.01836796  0.23618189 -0.02163872 -0.32077347] 1.0 False
1 [ 0.0230916   0.43160521 -0.02805418 -0.6202011 ] 1.0 False
1 [ 0.0317237   0.6271075  -0.04045821 -0.921586  ] 1.0 False
1 [ 0.04426585  0.82275212 -0.05888993 -1.22670424] 1.0 False
1 [ 0.06072089  1.01858064 -0.08342401 -1.53724146] 1.0 False
1 [ 0.08109251  1.21460177 -0.11416884 -1.8547488 ] 1.0 False
1 [ 0.10538454  1.41077825 -0.15126382 -2.18059057] 1.0 False
0 [ 0.13360011  1.21741493 -0.19487563 -1.93815965] 1.0 False
0 [ 0.15794841  1.02483628 -0.23363882 -1.71169092] 1.0 True

c:\users\seungchul\appdata\local\programs\python\python35\lib\site-packages\gym\logger.py:30: UserWarning: WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))

0 [ 0.17844513  0.83303491 -0.26787264 -1.49987221] 0.0 True
0 [ 0.19510583  0.64197197 -0.29787008 -1.30131495] 0.0 True
0 [ 0.20794527  0.45158868 -0.32389638 -1.11460017] 0.0 True
0 [ 0.21697704  0.26181452 -0.34618839 -0.93830985] 0.0 True
0 [ 0.22221333  0.07257295 -0.36495458 -0.77104677] 0.0 True
1 [ 0.22366479  0.27042523 -0.38037552 -1.15320998] 0.0 True
1 [ 0.2290733   0.46801656 -0.40343972 -1.53756593] 0.0 True
1 [ 0.23843363  0.6651877  -0.43419104 -1.92499789] 0.0 True
1 [ 0.25173738  0.86171727 -0.47269099 -2.3161174 ] 0.0 True
0 [ 0.26897173  0.67288998 -0.51901334 -2.19778832] 0.0 True
0 [ 0.28242953  0.48499017 -0.56296911 -2.09888663] 0.0 True
0 [ 0.29212933  0.29794557 -0.60494684 -2.01852606] 0.0 True
1 [ 0.29808824  0.4929017  -0.64531736 -2.42626584] 0.0 True
0 [ 0.30794628  0.30615955 -0.69384268 -2.37930777] 0.0 True
0 [ 0.31406947  0.12012335 -0.74142883 -2.35278428] 0.0 True
0 [ 0.31647193 -0.06531434 -0.78848452 -2.34619328] 0.0 True
1 [ 0.31516565  0.12612208 -0.83540839 -2.75714464] 0.0 True
0 [ 0.31768809 -0.05988032 -0.89055128 -2.78798987] 0.0 True
0 [ 0.31649048 -0.24566883 -0.94631108 -2.84126407] 0.0 True
0 [ 0.31157711 -0.43143178 -1.00313636 -2.91685799] 0.0 True
0 [ 0.30294847 -0.61737827 -1.06147352 -3.01478332] 0.0 True
0 [ 0.2906009  -0.80374236 -1.12176918 -3.13516453] 0.0 True
1 [ 0.27452606 -0.62241851 -1.18447247 -3.51808638] 0.0 True
1 [ 0.26207769 -0.44463786 -1.2548342  -3.89089653] 0.0 True
0 [ 0.25318493 -0.63685602 -1.33265213 -4.08075066] 0.0 True
1 [ 0.24044781 -0.46603762 -1.41426715 -4.42689721] 0.0 True
1 [ 0.23112706 -0.29948369 -1.50280509 -4.75624919] 0.0 True
0 [ 0.22513738 -0.50097732 -1.59793007 -5.02903602] 0.0 True
0 [ 0.21511784 -0.70615177 -1.69851079 -5.33127751] 0.0 True
1 [ 0.2009948  -0.55147895 -1.80513634 -5.5933326 ] 0.0 True
1 [ 0.18996522 -0.39978524 -1.917003   -5.82646176] 0.0 True
0 [ 0.18196952 -0.61660166 -2.03353223 -6.21337689] 0.0 True
1 [ 0.16963748 -0.46952886 -2.15779977 -6.37797875] 0.0 True
0 [ 0.16024691 -0.6929735  -2.28535934 -6.8084028 ] 0.0 True
0 [ 0.14638744 -0.91988177 -2.4215274  -7.25351962] 0.0 True
1 [ 0.1279898  -0.77047234 -2.56659779 -7.27891247] 0.0 True
1 [ 0.11258036 -0.61340529 -2.71217604 -7.2410838 ] 0.0 True
1 [ 0.10031225 -0.4471189  -2.85699772 -7.13670431] 0.0 True
0 [ 0.09136987 -0.65883578 -2.9997318  -7.52405131] 0.0 True
1 [ 0.07819316 -0.47380059 -3.15021283 -7.29085399] 0.0 True
1 [ 0.06871714 -0.27810904 -3.29602991 -6.99479328] 0.0 True


Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen.

Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s step function returns exactly what we need. In fact, step returns four values. These are:

• observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
• reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
• done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
• info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag: the episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

In [6]:
for i_episode in range(2):
    observation = env.reset()

    for k in range(50):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        if done:
            print("Episode finished after {} timesteps".format(k+1))
            break

env.close()

[ 0.01250663 -0.03324781  0.00759072 -0.04267362]
[ 0.01184168 -0.22847779  0.00673725  0.25239455]
[ 0.00727212 -0.42369529  0.01178514  0.5471949 ]
[-0.00120178 -0.61898081  0.02272904  0.84356759]
[-0.0135814  -0.4241763   0.03960039  0.55811806]
[-0.02206493 -0.61983112  0.05076275  0.86300967]
[-0.03446155 -0.42543568  0.06802294  0.58670997]
[-0.04297026 -0.62144109  0.07975714  0.90002146]
[-0.05539908 -0.42748525  0.09775757  0.63343631]
[-0.06394879 -0.62382521  0.1104263   0.95523509]
[-0.07642529 -0.43034791  0.129531    0.69918618]
[-0.08503225 -0.62700513  0.14351472  1.0296765 ]
[-0.09757235 -0.4340541   0.16410825  0.78527407]
[-0.10625344 -0.2415214   0.17981374  0.54838675]
[-0.11108386 -0.04932037  0.19078147  0.3173143 ]
[-0.11207027  0.14264495  0.19712776  0.09034086]
[-0.10921737 -0.05467604  0.19893457  0.43817106]
[-0.11031089  0.13715639  0.207698    0.21419924]
Episode finished after 18 timesteps
[ 0.04089544 -0.01757504 -0.04706547 -0.00768573]
[ 0.04054394 -0.21199151 -0.04721919  0.2697839 ]
[ 0.03630411 -0.40640893 -0.04182351  0.5472077 ]
[ 0.02817593 -0.6009191  -0.03087935  0.82642529]
[ 0.01615755 -0.79560548 -0.01435085  1.10923854]
[ 2.45443264e-04 -9.90535932e-01  7.83392365e-03  1.39738511e+00]
[-0.01956528 -1.18575443  0.03578163  1.69250702]
[-0.04328036 -0.99106345  0.06963177  1.41117504]
[-0.06310163 -1.18697623  0.09785527  1.72478728]
[-0.08684116 -0.99310058  0.13235101  1.46408861]
[-0.10670317 -1.18957222  0.16163279  1.79501637]
[-0.13049461 -1.38609377  0.19753311  2.13326985]
Episode finished after 12 timesteps


Cartpole state = [pos, vel_pos, angle, vel_ang]

In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an action_space and an observation_space.

In [7]:
print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())

Box(4,)
Discrete(2)
1

In [8]:
print(env.observation_space.low)
print(env.observation_space.high)

[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]

In [9]:
n_observations = env.observation_space.shape
n_actions = env.action_space.n

print(n_observations)
print(n_actions)

(4,)
2

In [10]:
n_bins_pos = 10
n_bins_vel = 10
n_bins_ang = 10
n_bins_anv = 10


Initialize tabular $Q(s,a)$

In [11]:
n_states = n_bins_pos*n_bins_vel*n_bins_ang*n_bins_anv
n_actions = 2

Q_table = np.random.uniform(0, 1, (n_states, n_actions))

In [12]:
env.reset()
observation, _, _, _ = env.step(0)
env.close()

pos, vel, ang, anv = observation
print(observation)

[ 0.00585446 -0.23894502  0.03713528  0.28681932]

In [13]:
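def map_discrete_state(state):
    # Map a continuous 4-D observation to a single integer state index
    pos, vel, ang, anv = state

    # Bin index of each clipped feature: the histogram of a single value has exactly one 1
    # Note: pos is clipped to [-2, 2] but binned over [-4, 4], so only the middle bins are used
    idx_pos = np.where(np.histogram(np.clip(pos,-2,2), bins = n_bins_pos, range = (-4,4))[0] == 1)[0][0]
    idx_vel = np.where(np.histogram(np.clip(vel,-2,2), bins = n_bins_vel, range = (-2,2))[0] == 1)[0][0]
    idx_ang = np.where(np.histogram(np.clip(ang,-0.4,0.4), bins = n_bins_ang, range = (-0.4,0.4))[0] == 1)[0][0]
    idx_anv = np.where(np.histogram(np.clip(anv,-2,2), bins = n_bins_anv, range = (-2,2))[0] == 1)[0][0]

    # One-hot encode the four bin indices, then flatten to get the row index into Q_table
    states = np.zeros([n_bins_pos, n_bins_vel, n_bins_ang, n_bins_anv])
    states[idx_pos, idx_vel, idx_ang, idx_anv] = 1

    states = states.reshape(-1,1)

    s = np.where(states == 1)[0][0]

    return s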
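
An alternative way to write the same kind of discretization is sketched below with `np.digitize` and `np.ravel_multi_index`, which avoids building a histogram and a one-hot array on every call. The names `N_BINS`, `BOUNDS` and `discretize` are hypothetical, and unlike the cell above each feature is clipped to the same range it is binned over.

```python
import numpy as np

N_BINS = 10
# (low, high) range per feature: position, velocity, angle, angular velocity
BOUNDS = [(-4.0, 4.0), (-2.0, 2.0), (-0.4, 0.4), (-2.0, 2.0)]

def discretize(state, n_bins=N_BINS, bounds=BOUNDS):
    """Map a continuous observation to one integer in [0, n_bins**4)."""
    idx = []
    for x, (low, high) in zip(state, bounds):
        # n_bins - 1 interior edges split [low, high] into n_bins equal bins
        edges = np.linspace(low, high, n_bins + 1)[1:-1]
        idx.append(int(np.digitize(np.clip(x, low, high), edges)))
    # Flatten the four bin indices into a single row index (row-major order)
    return int(np.ravel_multi_index(idx, (n_bins,) * len(bounds)))

print(discretize([0.1, 0.1, 0.01, 0.1]))
print(discretize([100.0, -100.0, 1.0, -1.0]))  # out-of-range values are clipped
```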

• $\varepsilon$-Greedy: choose random action with probability $\varepsilon$, otherwise choose action greedily
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \\ \text{random action} & \text{otherwise} \end{cases}$$
• Q-Learning

Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
$\quad$Initialize $s$
$\quad$Repeat (for each step of episode):
$\quad \quad$Choose $a$ from $s$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
$\quad \quad$Take action $a$, observe $R, s'$
$\quad \quad {Q}_*(s,a) \leftarrow (1-\alpha) \left( {Q}_*(s,a) \right) + \alpha \left( R + \gamma \max_{a'} {Q}_*\left(s',a'\right) \right)$
$\quad \quad s \; \leftarrow \; s'$
$\quad$until $s$ is terminal

In [14]:
alpha = 0.3   # learning rate
gamma = 0.9   # discount factor

Q_table = np.random.uniform(0, 1, (n_states, n_actions))

for episode in range(801):

    done = False
    state = env.reset()

    count = 0

    while not done:

        count += 1

        s = map_discrete_state(state)

        # Exploration vs. Exploitation for action
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q_table[s, :])

        # next state and reward
        next_state, reward, done, _ = env.step(a)

        if done:
            # Episode ended: overwrite the entry with a large penalty
            # (this also applies when the 200-step time limit of CartPole-v0 is reached)
            reward = -100
            Q_table[s, a] = reward

        else:
            # Temporal Difference Update
            next_s = map_discrete_state(next_state)
            Q_table[s, a] = (1 - alpha)*Q_table[s, a] + alpha*(reward + gamma*np.max(Q_table[next_s, :]))

        state = next_state

    if episode % 100 == 0:
        print("Episode: {} steps: {}".format(episode, count))

env.close()

Episode: 0 steps: 8
Episode: 100 steps: 57
Episode: 200 steps: 69
Episode: 300 steps: 85
Episode: 400 steps: 89
Episode: 500 steps: 93
Episode: 600 steps: 19
Episode: 700 steps: 173
Episode: 800 steps: 145

In [15]:
state = env.reset()

done = False

while not done:

    env.render()

    s = map_discrete_state(state)
    a = np.argmax(Q_table[s,:])

    next_state, _, done, _ = env.step(a)
    state = next_state

env.close()


# 6. Other Tutorials

• Stanford Univ. by Serena Yeung

• MIT

• David Silver's Lecture (DeepMind)