


Reinforcement Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

  • By David Silver's RL Course at UCL
  • By Prof. Zico Kolter at CMU

1. Reinforcement Learning

In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/un-FhSC0HfY?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Agent interaction with environment




1.1. Markov Decision Processes

Recall that a (discounted) Markov decision process is defined by:

$$\mathcal{M} = (S, A, P, R)$$

  • $S$: set of states

  • $A$: set of actions

  • $P: S \times A \times S \to [0,1]$: transition probability distribution $P(s' \mid s, a)$

  • $R: S \to \mathbb{R}$: reward function, where $R(s)$ is the reward for state $s$

  • $\gamma$: discount factor

  • Policy $\pi: S \to A$ is a mapping from states to actions

The RL twist: we do not know $P$ or $R$, or they are too big to enumerate (we only have the ability to act in the MDP and observe states and rewards)


Limitations of MDP

  • Update equations require access to a dynamics model → sampling-based approximations
  • Iteration over / storage for all states and actions requires a small, discrete state-action space → Q/V function fitting

1.2. Solving MDP

(Policy evaluation) Determine the value of policy $\pi$

$$v^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s\right] = R(s) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, v^\pi(s')$$

accomplished via the iteration (similar to value iteration, but for a fixed policy)


$$v^\pi(s) \leftarrow R(s) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, v^\pi(s'), \quad \forall s \in S$$

(Value iteration) Determine the value of the optimal policy

$$v^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, v^*(s')$$

accomplished via value iteration:


$$v^*(s) \leftarrow R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, v^*(s'), \quad \forall s \in S$$

The optimal policy $\pi^*$ is then $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, v^*(s')$
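As a concrete illustration, both iterations can be run on a tiny, fully specified MDP. The following is a minimal numpy sketch (not part of the original notebook); the two-state transition model P, the rewards R, the discount gamma, and the fixed policy pi are made-up numbers for illustration only.

import numpy as np

# A made-up 2-state, 2-action MDP: P[s, a, s'] = P(s'|s,a), R[s] = reward of state s
P = np.array([[[0.9, 0.1],     # state 0, action 0
               [0.2, 0.8]],    # state 0, action 1
              [[0.7, 0.3],     # state 1, action 0
               [0.1, 0.9]]])   # state 1, action 1
R = np.array([1.0, -1.0])
gamma = 0.9

# Policy evaluation: v_pi(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) v_pi(s')
pi = np.array([0, 1])          # some fixed policy
v_pi = np.zeros(2)
for _ in range(500):
    v_pi = R + gamma * np.array([P[s, pi[s]] @ v_pi for s in range(2)])

# Value iteration: v*(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) v*(s')
v_star = np.zeros(2)
for _ in range(500):
    v_star = R + gamma * np.max(P @ v_star, axis=1)

# Optimal policy: pi*(s) = argmax_a sum_s' P(s'|s,a) v*(s')
pi_star = np.argmax(P @ v_star, axis=1)

print(v_pi, v_star, pi_star)

Both loops are plain fixed-point sweeps; with $\gamma < 1$ they converge to $v^\pi$ and $v^*$, respectively.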


How can we compute these quantities when P and R are unknown?

  • model-based RL

  • model-free RL

1.3. Overview of RL




2. Model-based RL

A simple approach: just estimate the MDP from data (known as the Monte Carlo method)

  • Agent acts in the world (according to some policy) and observes episodes of experience
$$s_1, r_1, a_1, s_2, r_2, a_2, \ldots, s_m, r_m, a_m$$
  • We form the empirical estimate of the MDP via the counts (a small counting sketch follows at the end of this section)


$$\hat{P}(s' \mid s, a) = \frac{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s,\, a_i = a,\, s_{i+1} = s'\}}{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s,\, a_i = a\}} \qquad\quad \hat{R}(s) = \frac{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s\}\, r_i}{\sum_{i=1}^{m-1} \mathbf{1}\{s_i = s\}}$$

Now solve the MDP $(S, A, \hat{P}, \hat{R})$

  • Will converge to correct MDP (and hence correct value function/policy) given enough samples of each state

  • How can we ensure we get the "right" samples? (a challenging problem for all methods we present here)

  • Advantages (informally): makes "efficient" use of data

  • Disadvantages: requires that we build the actual MDP model; not much help if the state space is too large
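The counting estimates above translate directly into code. Below is a minimal sketch (not part of the original notebook), assuming the logged experience is a list of (s, a, r, s_next) tuples with integer state and action indices; the function name estimate_mdp and the handling of unvisited states and state-action pairs are illustrative choices.

import numpy as np

def estimate_mdp(transitions, n_states, n_actions):
    # transitions: list of (s, a, r, s_next) tuples collected while acting in the MDP
    counts = np.zeros((n_states, n_actions, n_states))   # N(s, a, s')
    r_sum = np.zeros(n_states)                           # sum of rewards observed in s
    r_cnt = np.zeros(n_states)                           # number of visits to s

    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        r_sum[s] += r
        r_cnt[s] += 1

    # P_hat(s'|s,a) = N(s,a,s') / N(s,a); unseen (s,a) pairs are left uniform
    sa_counts = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(sa_counts > 0, counts / np.maximum(sa_counts, 1), 1.0 / n_states)

    # R_hat(s) = average reward observed in state s (0 if s was never visited)
    R_hat = np.where(r_cnt > 0, r_sum / np.maximum(r_cnt, 1), 0.0)

    return P_hat, R_hat

The estimated MDP $(S, A, \hat{P}, \hat{R})$ can then be solved with the value-iteration sketch from Section 1.2.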

3. Model-free RL

  • Temporal difference methods (TD, SARSA, Q-learning): directly learn the value function $v^\pi$ or $v^*$
  • Direct policy search: directly learn the optimal policy $\pi^*$

3.1. Temporal Difference (TD) Methods

Let's consider computing the value function for a fixed policy via the iteration


$$v^\pi(s) \leftarrow R(s) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, v^\pi(s'), \quad \forall s \in S$$


Suppose we are in some state $s_t$, receive reward $R_t$, take action $a_t = \pi(s_t)$, and end up in state $s_{t+1}$

We cannot update $v^\pi$ for all $s \in S$, but can we update just for $s_t$?


$$v^\pi(s_t) \leftarrow R_t + \gamma \sum_{s' \in S} P(s' \mid s_t, a_t)\, v^\pi(s')$$


... No, because we still do not know $P(s' \mid s_t, a_t)$ for all $s' \in S$


But $s_{t+1}$ is a sample from the distribution $P(s' \mid s_t, a_t)$, so we could perform the update


$$v^\pi(s_t) \leftarrow R_t + \gamma v^\pi(s_{t+1})$$


  • This is too "harsh" an assignment if we assume that $s_{t+1}$ is the only possible next state;

  • Instead, "smooth" the update using some $\alpha < 1$


$$v^\pi(s_t) \leftarrow (1-\alpha)\, v^\pi(s_t) + \alpha \left(R_t + \gamma v^\pi(s_{t+1})\right)$$


This is the temporal difference (TD) algorithm. Its mathematical background will be briefly discussed later.
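A minimal sketch of this TD update in code (not part of the original notebook): it assumes a tabular environment exposing the same gym-style reset()/step() interface used later in this notebook, integer states 0, ..., n_states-1, and a fixed policy given as a function pi(s); all names are placeholders.

import numpy as np

def td_policy_evaluation(env, pi, n_states, alpha=0.1, gamma=0.9, n_episodes=1000):
    # TD(0) evaluation of a fixed policy pi on a tabular, gym-style environment
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(pi(s))
            # v(s) <- (1 - alpha) v(s) + alpha (r + gamma v(s'))
            v[s] = (1 - alpha) * v[s] + alpha * (r + gamma * v[s_next])
            s = s_next
    return v

Each transition updates only v[s] for the state actually visited, exactly as in the smoothed update above.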

3.2. Issue with traditional TD algorithms

TD lets us learn the value function of a policy π directly, without ever constructing the MDP.

But is this really that helpful?

Consider trying to execute the greedy policy with respect to the estimated $v^\pi$


$$\pi'(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, v^\pi(s')$$


We need the model $P(s' \mid s, a)$ anyway.

4. Entering the Q function

Q functions (for MDPs in general) are like value functions, but defined over state-action pairs


$$\begin{aligned} Q^\pi(s,a) &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, Q^\pi(s', \pi(s')) \\ Q^*(s,a) &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a'} Q^*(s', a') = R(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v^*(s') \end{aligned}$$


i.e., the Q function is the value of starting in state $s$, taking action $a$, and then acting according to $\pi$ (or optimally, for $Q^*$)


We can easily construct analogues of value iteration or policy evaluation to construct Q functions directly given an MDP.


Optimal policy

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, v^*(s') \qquad \text{or} \qquad \underbrace{\pi^*(s) = \arg\max_a Q^*(s,a)}_{\text{without knowing the dynamics}}$$
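The practical difference between these two forms is easy to see in code. A small illustrative sketch (not from the original notebook), assuming tabular arrays P, v_star, and Q_star like those in the earlier sketches: extracting the greedy action from $v^*$ requires the transition model, while extracting it from $Q^*$ is a plain table lookup.

import numpy as np

def greedy_from_v(P, v_star, s):
    # needs the model: argmax_a sum_s' P(s'|s,a) v*(s')
    return np.argmax(P[s] @ v_star)

def greedy_from_q(Q_star, s):
    # model-free: argmax_a Q*(s,a)
    return np.argmax(Q_star[s])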

4.1. SARSA and Q-learning

The Q function leads to new TD-like methods.

As with TD: observe state $s$, reward $r$, take action $a$ (but not necessarily $a = \pi(s)$), and observe the next state $s'$


  • SARSA: estimate $Q^\pi(s,a)$ (the expectation under policy $\pi$)


$$Q^\pi(s,a) \leftarrow (1-\alpha)\, Q^\pi(s,a) + \alpha \left(r + \gamma Q^\pi(s', \pi(s'))\right)$$


  • Q-learning: estimate $Q^*(s,a)$ (the optimal Q function)


$$Q^*(s,a) \leftarrow (1-\alpha)\, Q^*(s,a) + \alpha \left(r + \gamma \max_{a'} Q^*(s', a')\right)$$


Again, these algorithms converge to the true $Q^\pi$, $Q^*$ if all state-action pairs are seen frequently enough


Note: State–Action–Reward–State–Action (SARSA)


The advantage of this approach is that we can now select actions without a model of the MDP


  • SARSA: greedy policy with respect to $Q^\pi(s,a)$


$$\pi'(s) = \arg\max_a Q^\pi(s,a)$$


  • Q-learning: optimal policy


$$\pi^*(s) = \arg\max_a Q^*(s,a)$$


So with Q-learning, for instance, we can learn the optimal policy without a model of the MDP.
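The two updates can be written as one-line tabular rules. The sketch below (not part of the original notebook) assumes Q is an (n_states, n_actions) numpy array and that a single transition (s, a, r, s') has been observed; for SARSA, a_next is the action the current policy actually takes in s'.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # on-policy: bootstrap with the action a' chosen by the current policy in s'
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next, a_next])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # off-policy: bootstrap with the greedy action in s', regardless of what is executed next
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

Only the bootstrap term differs: $Q(s', \pi(s'))$ for SARSA versus $\max_{a'} Q(s', a')$ for Q-learning.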

4.2. Solving Q-Value

Q-value iteration


$$\begin{aligned} Q_{k+1}(s,a) &\leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q_k(s', a') \\ &= \underbrace{\sum_{s'} P(s' \mid s, a)}_{=\,1} R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q_k(s', a') \\ &= \sum_{s'} P(s' \mid s, a) \left[ R(s) + \gamma \max_{a'} Q_k(s', a') \right] \\ Q_{k+1}(s,a) &\leftarrow \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s) + \gamma \max_{a'} Q_k(s', a') \right] \qquad \text{(rewrite as an expectation)} \end{aligned}$$


Q-Learning Algorithm: replace the expectation by samples

1) For a state-action pair $(s,a)$, receive a sample $s' \sim P(s' \mid s, a)$

2) Consider your old estimate: $Q_k(s,a)$

3) Consider your new sample estimate:

$$\text{target}(s') = R(s) + \gamma \max_{a'} Q_k(s', a')$$

4) Incorporate the new estimate into a running average [Temporal Difference, i.e., learning incrementally]:

$$\begin{aligned} Q_{k+1}(s,a) &\leftarrow Q_k(s,a) + \alpha \left( \text{target}(s') - Q_k(s,a) \right) \\ &= (1-\alpha)\, Q_k(s,a) + \alpha\, \text{target}(s') \\ &= (1-\alpha)\, Q_k(s,a) + \alpha \left( R(s) + \gamma \max_{a'} Q_k(s', a') \right) \end{aligned}$$


4.3. How to Sample Actions (Exploration vs. Exploitation)?

All the methods discussed so far had some condition like "assuming we visit each state enough"

A fundamental question: if we do not know the system dynamics, should we take exploratory actions that will give us more information, or exploit current knowledge to perform as best we can?

The issue is that bad initial estimates early on can drive the policy into a sub-optimal region that is never explored further.

  • Choose random actions? or

  • Choose the action that maximizes $Q_k(s,a)$ (i.e., act greedily)?

  • $\varepsilon$-Greedy: choose a random action with probability $\varepsilon$, otherwise choose the action greedily (a short code sketch follows this list)
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{otherwise} \end{cases}$$
  • Want to decrease $\varepsilon$ as we see more examples
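A minimal sketch of ε-greedy action selection with a decaying ε (not part of the original notebook; the decay schedule shown in the comment is just one common, assumed choice).

import numpy as np

def epsilon_greedy(Q, s, epsilon):
    # explore with probability epsilon, otherwise act greedily w.r.t. the current Q table
    if np.random.uniform() < epsilon:
        return np.random.randint(Q.shape[1])
    return np.argmax(Q[s])

# example schedule (assumed): shrink epsilon as the episode index k grows
# epsilon_k = max(0.01, 1.0 / (1 + 0.01 * k))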


Q-Learning Properties

  • Amazing result: Q-learning converges to optimal policy if all state-action pairs seen frequently enough

  • With Q-learning, we can learn optimal policy without model of MDP

  • This is called off-policy learning


4.4. Q-Learning Algorithm

Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
        Choose $a$ from $s$ using a policy derived from $Q$ (e.g., $\varepsilon$-greedy)
        Take action $a$, observe $R$, $s'$
        $Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left(R + \gamma \max_{a'} Q(s', a')\right)$
        $s \leftarrow s'$
    until $s$ is terminal

5. Q-Learning with Gym Environment


  • Agent interaction with environment



  • Examples




In [2]:
# !pip install gym
In [3]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

CartPole-v0

  • A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.



  • Objective:
    • Balance a pole on top of a movable cart
  • State:
    • [position, horizontal velocity, angle, angular speed]
  • Action:
    • horizontal force applied on the cart (binary)
  • Reward:
    • 1 at each time step if the pole is upright

The process gets started by calling reset(), which returns an initial observation.

In [4]:
import gym

env = gym.make('CartPole-v0')
observation = env.reset() # observation = state

print(observation)
[ 0.0269356  -0.04532015 -0.04633726  0.00219446]

Here’s a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 50 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:

In [5]:
env.reset()

for _ in range(50):
    env.render()    
    action = env.action_space.sample() # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    
    print(action, observation, reward, done)
    
env.close()
1 [-0.01315711  0.20808312 -0.00669239 -0.29433073] 1.0 False
1 [-0.00899545  0.40329984 -0.012579   -0.5891168 ] 1.0 False
1 [-0.00092945  0.59859566 -0.02436134 -0.88573547] 1.0 False
1 [ 0.01104246  0.79403972 -0.04207605 -1.1859761 ] 1.0 False
1 [ 0.02692326  0.98968134 -0.06579557 -1.4915455 ] 1.0 False
0 [ 0.04671689  0.79541901 -0.09562648 -1.22011246] 1.0 False
0 [ 0.06262527  0.60165072 -0.12002873 -0.95885929] 1.0 False
0 [ 0.07465828  0.40832909 -0.13920591 -0.70616698] 1.0 False
0 [ 0.08282486  0.21538217 -0.15332925 -0.4603408 ] 1.0 False
0 [ 0.08713251  0.02272214 -0.16253607 -0.21964354] 1.0 False
0 [ 0.08758695 -0.16974847 -0.16692894  0.01768178] 1.0 False
1 [ 0.08419198  0.02732511 -0.1665753  -0.32266902] 1.0 False
0 [ 0.08473848 -0.16508166 -0.17302868 -0.08680068] 1.0 False
1 [ 0.08143685  0.03204374 -0.1747647  -0.42869049] 1.0 False
1 [ 0.08207772  0.22915413 -0.18333851 -0.77097201] 1.0 False
0 [ 0.08666081  0.03696469 -0.19875795 -0.54111941] 1.0 False
0 [ 0.0874001  -0.15489036 -0.20958033 -0.31705338] 1.0 True
0 [ 0.08430229 -0.34650739 -0.2159214  -0.09708373] 0.0 True
0 [ 0.07737214 -0.53798691 -0.21786308  0.12047738] 0.0 True
1 [ 0.06661241 -0.34051651 -0.21545353 -0.23227263] 0.0 True
0 [ 0.05980208 -0.5320141  -0.22009898 -0.01452193] 0.0 True
1 [ 0.04916179 -0.33452541 -0.22038942 -0.3677965 ] 0.0 True
1 
c:\users\seungchul\appdata\local\programs\python\python35\lib\site-packages\gym\logger.py:30: UserWarning: WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[ 0.04247129 -0.13706345 -0.22774535 -0.72109647] 0.0 True
1 [ 0.03973002  0.06036198 -0.24216728 -1.07596753] 0.0 True
0 [ 0.04093726 -0.13088724 -0.26368663 -0.86796785] 0.0 True
0 [ 0.03831951 -0.32164377 -0.28104599 -0.66835172] 0.0 True
1 [ 0.03188664 -0.1239339  -0.29441302 -1.03482512] 0.0 True
0 [ 0.02940796 -0.3142042  -0.31510952 -0.8470123 ] 0.0 True
0 [ 0.02312387 -0.50398342 -0.33204977 -0.66747653] 0.0 True
0 [ 0.01304421 -0.69335493 -0.3453993  -0.49477413] 0.0 True
0 [-8.22893262e-04 -8.82403222e-01 -3.55294784e-01 -3.27489679e-01] 0.0 True
0 [-0.01847096 -1.07121243 -0.36184458 -0.16423709] 0.0 True
0 [-0.03989521 -1.25986572 -0.36512932 -0.00365739] 0.0 True
1 [-0.06509252 -1.06180817 -0.36520247 -0.38613768] 0.0 True
1 [-0.08632868 -0.86380201 -0.37292522 -0.76855829] 0.0 True
0 [-0.10360472 -1.05243569 -0.38829639 -0.61217252] 0.0 True
0 [-0.12465344 -1.24069581 -0.40053984 -0.46211674] 0.0 True
0 [-0.14946735 -1.42865865 -0.40978217 -0.31712334] 0.0 True
0 [-0.17804053 -1.61639849 -0.41612464 -0.17596113] 0.0 True
1 [-0.2103685  -1.41835411 -0.41964386 -0.56651707] 0.0 True
0 [-0.23873558 -1.60598829 -0.4309742  -0.42927213] 0.0 True
0 [-0.27085535 -1.79334939 -0.43955965 -0.29674945] 0.0 True
0 [-0.30672233 -1.98050533 -0.44549463 -0.16781141] 0.0 True
0 [-0.34633244 -2.16752164 -0.44885086 -0.04135266] 0.0 True
1 [-0.38968287 -1.96952163 -0.44967792 -0.4365093 ] 0.0 True
0 [-0.4290733  -2.1565249  -0.4584081  -0.31168479] 0.0 True
0 [-0.4722038  -2.34332106 -0.4646418  -0.19051954] 0.0 True
1 [-0.51907022 -2.14537811 -0.46845219 -0.58769791] 0.0 True
0 [-0.56197779 -2.33208948 -0.48020615 -0.47054567] 0.0 True
1 [-0.60861958 -2.13428026 -0.48961706 -0.86951791] 0.0 True

Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen.

Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s step function returns exactly what we need. In fact, step returns four values. These are:

  • observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
  • reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
  • done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
  • info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag: the episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

In [6]:
for i_episode in range(2):
    observation = env.reset()
    
    for k in range(50):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        
        if done:            
            print("Episode finished after {} timesteps".format(k+1))
            break

env.close()
[ 0.01572238 -0.01097918 -0.03519514  0.01641505]
[ 0.01550279  0.18462937 -0.03486684 -0.28716131]
[ 0.01919538 -0.00997844 -0.04061007 -0.00567579]
[ 0.01899581 -0.20449517 -0.04072359  0.27392266]
[ 0.01490591 -0.39901312 -0.03524513  0.55348825]
[ 0.00692565 -0.20341442 -0.02417537  0.2499125 ]
[ 0.00285736 -0.00795572 -0.01917712 -0.05029674]
[ 0.00269824  0.18743588 -0.02018305 -0.34896799]
[ 0.00644696 -0.00739328 -0.02716241 -0.06271717]
[ 0.00629909 -0.20211547 -0.02841676  0.2212736 ]
[ 0.00225679 -0.39681996 -0.02399128  0.50485902]
[-0.00567961 -0.59159574 -0.0138941   0.78988584]
[-0.01751153 -0.39628577  0.00190361  0.49286443]
[-0.02543724 -0.20119072  0.0117609   0.20078205]
[-0.02946106 -0.39647888  0.01577654  0.49715166]
[-0.03739064 -0.2015829   0.02571958  0.20948218]
[-0.04142229 -0.00683796  0.02990922 -0.07497789]
[-0.04155905 -0.20237564  0.02840966  0.22698944]
[-0.04560657 -0.39789184  0.03294945  0.52849671]
[-0.0535644  -0.20324857  0.04351938  0.24637567]
[-0.05762937 -0.00877432  0.0484469  -0.03226885]
[-0.05780486  0.18562061  0.04780152 -0.30928129]
[-0.05409245  0.38003004  0.0416159  -0.58651404]
[-0.04649185  0.57454518  0.02988561 -0.86580258]
[-0.03500094  0.3790295   0.01256956 -0.56387497]
[-0.02742035  0.57397285  0.00129206 -0.85257154]
[-0.0159409   0.76907717 -0.01575937 -1.1448479 ]
[-5.59353064e-04  5.74164583e-01 -3.86563252e-02 -8.57148387e-01]
[ 0.01092394  0.76979129 -0.05579929 -1.16173136]
[ 0.02631976  0.57543872 -0.07903392 -0.88705271]
[ 0.03782854  0.38147338 -0.09677497 -0.6202247 ]
[ 0.04545801  0.18782669 -0.10917947 -0.35952125]
[ 0.04921454 -0.00558759 -0.11636989 -0.10316229]
[ 0.04910279  0.19099314 -0.11843314 -0.43017464]
[ 0.05292265  0.38757561 -0.12703663 -0.75772076]
[ 0.06067416  0.58419801 -0.14219105 -1.08752609]
[ 0.07235812  0.39120798 -0.16394157 -0.84262601]
[ 0.08018228  0.19865752 -0.18079409 -0.6056562 ]
[ 0.08415543  0.00646263 -0.19290721 -0.37492704]
[ 0.08428469 -0.185471   -0.20040575 -0.14873047]
[ 0.08057527 -0.37724342 -0.20338036  0.07464522]
[ 0.0730304  -0.56895718 -0.20188746  0.2969064 ]
[ 0.06165125 -0.76071489 -0.19594933  0.5197485 ]
[ 0.04643696 -0.95261706 -0.18555436  0.74485203]
[ 0.02738462 -1.14475988 -0.17065732  0.97387834]
[ 0.00448942 -1.33723254 -0.15117975  1.2084633 ]
[-0.02225523 -1.14051639 -0.12701049  0.87247693]
[-0.04506556 -0.94391734 -0.10956095  0.54271298]
[-0.06394391 -0.74743996 -0.09870669  0.21761746]
[-0.07889271 -0.55105569 -0.09435434 -0.10449774]
[ 0.04233323  0.04516275  0.00307717 -0.02175756]
[ 0.04323648  0.24024044  0.00264201 -0.31346802]
[ 0.04804129  0.43532465 -0.00362735 -0.60531657]
[ 0.05674778  0.24025361 -0.01573368 -0.31377838]
[ 0.06155286  0.04535929 -0.02200925 -0.02609858]
[ 0.06246004  0.24078984 -0.02253122 -0.32564361]
[ 0.06727584  0.04599581 -0.02904409 -0.04015035]
[ 0.06819575 -0.14869787 -0.0298471   0.24322924]
[ 0.0652218   0.04683742 -0.02498251 -0.05871679]
[ 0.06615855 -0.14791759 -0.02615685  0.22598047]
[ 0.06320019 -0.34265614 -0.02163724  0.51029914]
[ 0.05634707 -0.5374667  -0.01143125  0.79608573]
[ 0.04559774 -0.34218976  0.00449046  0.49982874]
[ 0.03875394 -0.1471314   0.01448703  0.20856434]
[ 0.03581131 -0.34245748  0.01865832  0.50578174]
[ 0.02896216 -0.14760336  0.02877396  0.21903667]
[ 0.0260101   0.04709572  0.03315469 -0.06443268]
[ 0.02695201 -0.14848551  0.03186604  0.23852363]
[ 0.0239823  -0.34404787  0.03663651  0.54108527]
[ 0.01710134 -0.14945951  0.04745821  0.26016732]
[ 0.01411215  0.04495396  0.05266156 -0.01717706]
[ 0.01501123 -0.15088212  0.05231802  0.29164517]
[ 0.01199359 -0.3467095   0.05815092  0.6003588 ]
[ 0.0050594  -0.15244721  0.0701581   0.32654465]
[0.00201046 0.04160939 0.07668899 0.05678541]
[ 0.00284264  0.23555276  0.0778247  -0.21075013]
[ 0.0075537   0.42948069  0.0736097  -0.47790418]
[ 0.01614331  0.62349038  0.06405161 -0.74650896]
[ 0.02861312  0.81767284  0.04912143 -1.01836705]
[ 0.04496658  0.62193176  0.02875409 -0.7106737 ]
[ 0.05740521  0.81664395  0.01454062 -0.99416872]
[ 0.07373809  1.0115684  -0.00534276 -1.28224969]
[ 0.09396946  0.81651491 -0.03098775 -0.99124439]
[ 0.11029976  0.62182106 -0.05081264 -0.70845276]
[ 0.12273618  0.4274384  -0.06498169 -0.43218759]
[ 0.13128495  0.23329383 -0.07362544 -0.16067654]
[ 0.13595082  0.42938838 -0.07683898 -0.47564782]
[ 0.14453859  0.62550648 -0.08635193 -0.79152539]
[ 0.15704872  0.43166957 -0.10218244 -0.52720932]
[ 0.16568211  0.6280696  -0.11272663 -0.85026209]
[ 0.1782435   0.82453332 -0.12973187 -1.17615874]
[ 0.19473417  0.63131313 -0.15325504 -0.92679828]
[ 0.20736043  0.82813519 -0.17179101 -1.26345188]
[ 0.22392314  1.02498554 -0.19706005 -1.60463949]
Episode finished after 44 timesteps

Cartpole state = [pos, vel_pos, angle, vel_ang]

In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an action_space and an observation_space.

In [7]:
print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())
Box(4,)
Discrete(2)
0
In [8]:
print(env.observation_space.low)
print(env.observation_space.high)
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
In [9]:
n_observations = env.observation_space.shape
n_actions = env.action_space.n

print(n_observations)
print(n_actions)
(4,)
2
In [10]:
n_bins_pos = 10
n_bins_vel = 10
n_bins_ang = 10
n_bins_anv = 10

Initialize tabular Q(s,a)

In [11]:
n_states = n_bins_pos*n_bins_vel*n_bins_ang*n_bins_anv
n_actions = 2

Q_table = np.random.uniform(0, 1, (n_states, n_actions))
In [12]:
env.reset()
observation, _, _, _ = env.step(0)
env.close()

pos, vel, ang, anv = observation
print(observation)
[-0.01088748 -0.15746192 -0.00746352  0.28721886]
In [13]:
def map_discrete_state(state):
    # Discretize the continuous cart-pole state into a single index in [0, n_states)
    pos, vel, ang, anv = state
    
    # Bin each component into its 10 bins (values clipped to the binning range)
    idx_pos = np.where(np.histogram(np.clip(pos,-2,2), bins = n_bins_pos, range = (-2,2))[0] == 1)[0][0]
    idx_vel = np.where(np.histogram(np.clip(vel,-2,2), bins = n_bins_vel, range = (-2,2))[0] == 1)[0][0]
    idx_ang = np.where(np.histogram(np.clip(ang,-0.4,0.4), bins = n_bins_ang, range = (-0.4,0.4))[0] == 1)[0][0]
    idx_anv = np.where(np.histogram(np.clip(anv,-2,2), bins = n_bins_anv, range = (-2,2))[0] == 1)[0][0]

    # One-hot over the 4-D grid of bins, flattened to a single discrete state index
    states = np.zeros([n_bins_pos, n_bins_vel, n_bins_ang, n_bins_anv])
    states[idx_pos, idx_vel, idx_ang, idx_anv] = 1

    states = states.reshape(-1,1)

    s = np.where(states == 1)[0][0]
    
    return s
  • $\varepsilon$-Greedy: choose a random action with probability $\varepsilon$, otherwise choose the action greedily
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q_k(s,a) & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{otherwise} \end{cases}$$
  • Q-Learning

Initialize $Q(s,a)$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
        Choose $a$ from $s$ using a policy derived from $Q$ (e.g., $\varepsilon$-greedy)
        Take action $a$, observe $R$, $s'$
        $Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left(R + \gamma \max_{a'} Q(s', a')\right)$
        $s \leftarrow s'$
    until $s$ is terminal

In [14]:
alpha = 0.3
gamma = 0.9

Q_table = np.random.uniform(0, 1, (n_states, n_actions))

for episode in range(801):
    
    done = False
    state = env.reset()        
    
    count = 0
    
    while not done:
        
        count += 1
        
        s = map_discrete_state(state)    

        # Exploration vs. Exploitation for action       
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q_table[s, :])                

        # next state and reward
        next_state, reward, done, _ = env.step(a) 
                
        if done:        
            reward = -100        
            Q_table[s, a] = reward
                 
        else:                                               
            # Temporal Difference Update
            next_s = map_discrete_state(next_state)
            Q_table[s, a] = (1 - alpha)*Q_table[s, a] + alpha*(reward + gamma*np.max(Q_table[next_s, :]))

        state = next_state

    if episode % 100 == 0:
        print("Episode: {} steps: {}".format(episode, count))
        
env.close()
Episode: 0 steps: 51
Episode: 100 steps: 38
Episode: 200 steps: 130
Episode: 300 steps: 110
Episode: 400 steps: 174
Episode: 500 steps: 17
Episode: 600 steps: 163
Episode: 700 steps: 167
Episode: 800 steps: 104
In [15]:
state = env.reset()

done = False

while not done:
   
    env.render()
    
    s = map_discrete_state(state)
    a = np.argmax(Q_table[s,:])
    
    next_state, _, done, _ = env.step(a)      
    state = next_state    

env.close()

6. Video Lectures

In [16]:
%%html
<center><iframe src="https://www.youtube.com/embed/h9xplLinwsQ?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [17]:
%%html
<center><iframe src="https://www.youtube.com/embed/on9QKpIdYeE?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [18]:
%%html
<center><iframe src="https://www.youtube.com/embed/nvPpjOMZolo?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [19]:
%%html
<center><iframe src="https://www.youtube.com/embed/vcQkQMYmlUY?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [20]:
%%html
<center><iframe src="https://www.youtube.com/embed/w1SFeTdvZWs?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>

7. Other Tutorials

In [21]:
%%html
<center><iframe src="https://www.youtube.com/embed/JgvyzIkgxF0?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Stanford Univ. by Serena Yeung

In [22]:
%%html
<center><iframe src="https://www.youtube.com/embed/lvoHnicueoE?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

MIT

In [23]:
%%html
<center><iframe src="https://www.youtube.com/embed/xWe58WGWmlk?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [24]:
%%html
<center><iframe src="https://www.youtube.com/embed/s5qqjyGiBdc?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

David Silver's Lecture (DeepMind)

In [25]:
%%html
<center><iframe src="https://www.youtube.com/embed/2pWv7GOvuf0?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [26]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')