Deep Q-Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

  • David Silver's RL Course at UCL
  • Prof. Zico Kolter at CMU

Table of Contents

1. Deep Q-Learning (DQN)
2. Q Network with Gym Environment
3. DQN with Gym Environment
4. Other Tutorials


Value iteration algorithm: use the Bellman equation as an iterative update


$$ Q_{k+1}(s,a) \; \leftarrow \; \mathbb{E}_{s' \sim P\left(s'\mid s,a\right)} \left[ R(s) + \gamma \max_{a'} Q_{k}\left(s',a'\right) \right] $$

$Q_k$ will converge to $Q_*$ as $k$ goes to $\infty$
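
For a small MDP with known dynamics, this update can be run directly as a table sweep. Below is a minimal sketch, assuming a made-up 3-state, 2-action MDP (the arrays P and R are illustrative values, not from the lecture):

In [ ]:
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# made-up MDP: P[s, a, s'] = transition probability, R[s] = reward at state s
P = np.random.dirichlet(np.ones(n_states), size = (n_states, n_actions))
R = np.array([0.0, 1.0, -1.0])

Q = np.zeros((n_states, n_actions))

for k in range(100):
    # Bellman backup: Q(s,a) <- E_{s'}[ R(s) + gamma * max_a' Q(s',a') ]
    Q = R[:, None] + gamma * P @ np.max(Q, axis = 1)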

1.1. What is the problem with this?

  • Not scalable. We must compute a tabular $Q(s,a)$ for every state-action pair. If the state is, for instance, the current game screen in pixels, it is computationally infeasible to compute this for the entire state space!
    • too many states to visit them all in training
    • too many states to hold the Q-table in memory
  • States in Tetris: a naive encoding of the board configuration plus the shape of the falling piece already gives $\sim 10^{60}$ states!



  • Solution: use a function approximator to estimate $Q(s,a)$
    • learn about some small number of training states from experience
    • generalize that experience to new, similar situations

1.2. Approximate Q-Learning

Instead of a table, we have a parameterized Q function: $Q_{\omega}(s,a) \approx Q_*(s,a)$

  • can be a linear function in features


$$Q_{\omega}(s,a) = \omega_0 f_0(s,a) + \omega_1 f_1(s,a) + \cdots + \omega_n f_n(s,a)$$

  • or a more complicated neural network $\rightarrow$ Deep Q-Network (DQN)

Learning rule:

  • transition = $\left(s,a,r,s' \right)$

  • recall the Q-learning update:


$$\text{target}\left(s'\right) = R(s) + \gamma \max_{a'}Q_{\omega}{\left(s',a'\right)}$$

$$Q_{k+1}(s,a) \; \leftarrow \; Q_k(s,a) + \alpha \; \underbrace{\left(\text{target}\left(s'\right) - Q_k(s,a) \right)}_{\text{difference}}$$


  • difference
$$ \text{difference} = \left[R(s) + \gamma \max_{a'}Q_{\omega}{\left(s',a'\right)} \right] - Q_{\omega}{\left(s,a\right)}$$


  • loss function: mean-squared error in the Q-value


$$\ell(\omega) = \left[ \frac{1}{2} \left(\text{target}\left(s'\right) - Q_{\omega}(s,a) \right)^2 \right]$$


  • GD update:


$$ \begin{align*} \omega_{k+1} \; &\leftarrow \; \omega_k - \alpha \; \nabla_{\omega} \ell(\omega) \qquad \qquad \implies \text{Deep Learning Framework}\\ \\ \; &\leftarrow \; \omega_k - \alpha \; \nabla_{\omega} \left[ \frac{1}{2} \left(\text{target}\left(s'\right) - Q_{\omega}(s,a)\right)^2 \right] \\ \\ \; &\leftarrow \; \omega_k + \alpha \; \left[ \left(\text{target}\left(s'\right) - Q_{\omega}(s,a)\right) \right] \nabla_{\omega} Q_{\omega}(s,a) \end{align*} $$
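
As a concrete instance of the last line, here is a minimal sketch of a single update step, assuming a Q function that is linear in features, $Q_{\omega}(s,a) = \omega^\top f(s,a)$, so that $\nabla_{\omega} Q_{\omega}(s,a) = f(s,a)$. The feature map and the transition values below are hypothetical placeholders:

In [ ]:
import numpy as np

n_features = 4
n_actions = 2
w = np.zeros(n_features)
alpha, gamma = 0.01, 0.9

def f(s, a):
    # hypothetical feature map f(s,a); replace with problem-specific features
    return np.array([s, a, s*a, 1.0])

def Q(s, a):
    return w @ f(s, a)

# one observed transition (s, a, r, s')
s, a, r, s_next = 0.5, 1, 1.0, 0.7

target = r + gamma*max(Q(s_next, a2) for a2 in range(n_actions))
difference = target - Q(s, a)

# GD step: w <- w + alpha * difference * grad_w Q_w(s,a)
w = w + alpha*difference*f(s, a)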

1.3. Linear Function Approximator


$$Q_{\omega}(s,a) = \omega_0 f_0(s,a) + \omega_1 f_1(s,a) + \cdots + \omega_n f_n(s,a)$$


  • Exact $Q$
$$Q(s,a) \;\leftarrow \; Q(s,a) + \alpha \, [\text{difference}] $$
  • Approximate $Q$
$$\omega_i \;\leftarrow\; \omega_i + \alpha \, [\text{difference}] \, f_i(s,a)$$



Example: linear regression


$$\ell = \frac{1}{2} \left(y - \hat{y} \right)^2$$


  • difference
$$y - \hat{y} = y - \left(\omega_0 f_0(x) + \omega_1 f_1(x) + \cdots + \omega_n f_n(x)\right) = \underbrace{y}_{\text{target}} - \underbrace{\sum_i \omega_i f_i(x)}_{\text{prediction}} $$
  • For one point $x$
$$ \begin{align*} \frac{\partial \ell(\omega)}{\partial \omega_k} &= - \left(y - \sum_i \omega_i f_i(x) \right)f_k(x)\\ \\ \omega_k \;&\leftarrow\; \omega_k + \alpha \left(y - \sum_i \omega_i f_i(x) \right)f_k(x) \\ \\ \omega_k \;&\leftarrow\; \omega_k + \alpha \, \left[\text{difference} \right] \,f_k(x) \end{align*}$$
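
The same step, written as a tiny sketch for one data point (the feature values, target, and step size are hypothetical):

In [ ]:
import numpy as np

f_x = np.array([1.0, 0.5, -0.2])   # hypothetical feature values f_i(x)
y = 2.0                            # hypothetical target
w = np.zeros(3)
alpha = 0.1

difference = y - w @ f_x           # target - prediction
w = w + alpha*difference*f_x       # omega_k <- omega_k + alpha * [difference] * f_k(x)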

1.4. Deep Q-Networks (DQN)

The DQN method (Q-learning with a deep network as the Q function approximator) became famous in 2013 for learning to play a wide variety of Atari games, many of them better than human players.

  • DeepMind, 2015
    • Used a deep neural network to represent $Q$



In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/V1eYniJ0Rnk?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
  • (Naive) Q-learning with the network updated one transition at a time: like DQN, but inefficient




  • Better: DQN with experience replay



2. Q Network with Gym Environment
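
A Q-network for CartPole-v1, trained online: the network is updated from each transition as it is experienced, with no replay buffer (compare with the DQN in Section 3).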

In [2]:
# !pip install gym
In [26]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import random
import gym
In [4]:
env = gym.make('CartPole-v1')
In [5]:
n_input = env.observation_space.shape[0] # 4
n_actions = env.action_space.n # 2
n_output = n_actions

n_hidden1 = 100
n_hidden2 = 50
In [6]:
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [7]:
weights = {
    'hidden1' : tf.Variable(tf.random_normal([n_input, n_hidden1], stddev = 0.01)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden1, n_hidden2], stddev = 0.01)),
    'output' : tf.Variable(tf.random_normal([n_hidden2, n_output], stddev = 0.01))
}

biases = {
    'hidden1' : tf.Variable(tf.random_normal([n_hidden1], stddev = 0.01)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden2], stddev = 0.01)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.01))
}
In [8]:
def Qnet(x, weights, biases):
    hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
    hidden1 = tf.nn.relu(hidden1)
    
    hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
    hidden2 = tf.nn.relu(hidden2)
    
    output = tf.add(tf.matmul(hidden2, weights['output']), biases['output'])
    return output
In [9]:
Qpred = Qnet(x, weights, biases)
loss = tf.reduce_mean(tf.square(y - Qpred))

LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
In [10]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
In [11]:
gamma = 0.9

for episode in range(801):
    
    done = False
    state = env.reset()        
    
    count = 0
    
    while not done:
        
        count += 1
        state = np.reshape(state, [1, n_input])
        Q = sess.run(Qpred, feed_dict = {x: state})
            
        # Exploration vs. Exploitation for action       
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q)

        # next state and reward
        next_state, reward, done, _ = env.step(a) 
                
        if done:        
            reward = -200        
            Q[0][a] = reward
                 
        else:                                               
            next_state = np.reshape(next_state, [1, n_input])
            next_Q = sess.run(Qpred, feed_dict = {x: next_state})
            Q[0][a] = reward + gamma*np.max(next_Q)

        sess.run(optm, feed_dict = {x: state, y: Q})
        
        state = next_state

    if episode % 100 == 0:
        print("Episode: {} steps: {}".format(episode, count))
        
env.close()
Episode: 0 steps: 10
Episode: 100 steps: 55
Episode: 200 steps: 32
Episode: 300 steps: 51
Episode: 400 steps: 61
Episode: 500 steps: 77
Episode: 600 steps: 109
Episode: 700 steps: 208
Episode: 800 steps: 228
In [12]:
state = env.reset()

done = False

while not done:
    env.render()
    
    state = np.reshape(state, [1, n_input])
    Q = sess.run(Qpred, feed_dict = {x: state})
    action = np.argmax(Q)
    
    next_state, reward, done, _ = env.step(action)
    state = next_state
        
env.close()
In [13]:
sess.close()

3. DQN with Gym Environment
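
Compared with Section 2, the training loop below stores transitions in a replay buffer and periodically updates the network on random minibatches sampled from it (experience replay), rather than on each transition as it arrives.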

In [14]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import random
from collections import deque

import gym
%matplotlib inline
In [15]:
env = gym.make('CartPole-v1')
In [16]:
n_input = env.observation_space.shape[0] # 4
n_actions = env.action_space.n # 2
n_output = n_actions

n_hidden1 = 100
n_hidden2 = 50
In [17]:
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [18]:
weights = {
    'hidden1' : tf.Variable(tf.random_normal([n_input, n_hidden1], stddev = 0.01)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden1, n_hidden2], stddev = 0.01)),
    'output' : tf.Variable(tf.random_normal([n_hidden2, n_output], stddev = 0.01))
}

biases = {
    'hidden1' : tf.Variable(tf.random_normal([n_hidden1], stddev = 0.01)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden2], stddev = 0.01)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.01))
}
In [19]:
def DQN(x, weights, biases):
    hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
    hidden1 = tf.nn.relu(hidden1)
    
    hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
    hidden2 = tf.nn.relu(hidden2)
    
    output = tf.add(tf.matmul(hidden2, weights['output']), biases['output'])
    return output
In [20]:
Qpred = DQN(x, weights, biases)
loss = tf.reduce_mean(tf.square(y - Qpred))

LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
In [21]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
In [22]:
replay_buffer = []   # experience replay memory
n_buffer = 5000      # maximum number of stored transitions
n_batch = 10         # minibatch size (also used as the training interval in episodes)

gamma = 0.9

for episode in range(2001):
        
    done = False
    state = env.reset()
    
    count = 0
    
    while not done:                        
        
        count += 1        
        Q = sess.run(Qpred, feed_dict = {x: np.reshape(state, [1, n_input])}) 
        
        # Exploration vs. Exploitation for action 
        epsilon = 0.1
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:                          
            action = np.argmax(Q)
        
        next_state, reward, done, _ = env.step(action) 
        
        if done:        
            reward = -200
            #next_state = np.zeros(state.shape)
      
        replay_buffer.append([state, action, reward, next_state, done])
        
        if len(replay_buffer) > n_buffer:            
            replay_buffer.pop(0)
        
        state = next_state                        
        
    if episode % n_batch == 1:
        
        for _ in range(50):
         
            if len(replay_buffer) < n_batch:    
                break
                
            minibatch = random.sample(replay_buffer, n_batch)

            x_stack = np.empty(0).reshape(0, n_input)
            y_stack = np.empty(0).reshape(0, n_output)

            for state, action, reward, next_state, done in minibatch:                

                Q = sess.run(Qpred, feed_dict = {x: np.reshape(state, [1, n_input])}) 

                if done:
                    Q[0][action] = reward                        
                else:                    
                    next_Q = sess.run(Qpred, feed_dict = {x: np.reshape(next_state, [1, n_input])})               
                    Q[0][action] = reward + gamma*np.max(next_Q)

                x_stack = np.vstack([x_stack, state])
                y_stack = np.vstack([y_stack, Q])

            sess.run(optm, feed_dict = {x: x_stack, y: y_stack})                

    if episode % 200 == 1:
        print("Episode: {} steps: {}".format(episode, count))
        
env.close()   
Episode: 1 steps: 9
Episode: 201 steps: 35
Episode: 401 steps: 43
Episode: 601 steps: 42
Episode: 801 steps: 39
Episode: 1001 steps: 67
Episode: 1201 steps: 64
Episode: 1401 steps: 101
Episode: 1601 steps: 102
Episode: 1801 steps: 103
In [23]:
state = env.reset()

done = False

while not done:
    env.render()
    
    state = np.reshape(state, [1, n_input])
    Q = sess.run(Qpred, feed_dict = {x: state})  
    action = np.argmax(Q)
    
    next_state, reward, done, _ = env.step(action)
    state = next_state
        
env.close()
In [24]:
sess.close()

4. Other Tutorials


In [25]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')