Markov Decision Process


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

  • David Silver's RL Course at UCL
  • Prof. Zico Kolter at CMU

Table of Contents

1. Markov Decision Process

  • So far, we have analyzed the passive behavior of a Markov chain with rewards
  • A Markov decision process (MDP) is a Markov reward process with decisions (or actions).

1.1. MDP Definition

Definition: A Markov Decision Process is a tuple $\langle S,\color{red}{A},P,R,\gamma \rangle$

  • $S$ is a finite set of states
  • $A$ is a finite set of actions
  • $P$ is a state transition probability matrix
$$P_{ss'}^\color{red}{a} = P\left[S_{t+1} = s' \mid S_t = s, A_t = \color{red}{a} \right]$$
  • $R$ is a reward function, $R_s^\color{red}{a} = \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = \color{red}{a} \right] \quad \left(= \mathbb{E} \left[ R_{t+1} \mid S_t = s \right], \; \text{often assumed} \right)$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$
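
In code, one convenient way to encode such a tuple is a nested dictionary for $P$, a list for $R$, and a scalar for $\gamma$; this is the layout used in Section 3 below. A minimal sketch with made-up numbers:

# Minimal illustrative sketch: a hypothetical 2-state, 2-action MDP
# P[state][action] is a list of (probability, next_state) pairs
P = {
    0: {0: [(1.0, 0)],             # in state 0, action 0 keeps us in state 0
        1: [(0.7, 0), (0.3, 1)]},  # action 1 moves to state 1 with probability 0.3
    1: {0: [(1.0, 1)],
        1: [(0.6, 0), (0.4, 1)]},
}
R = [0, 1]      # reward depends only on the state, as assumed above
gamma = 0.9     # discount factor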

Example: Startup MDP

  • You run a startup company. In every state, you must choose between Saving money and Advertising



1.2. Policies

  • A policy is a mapping from states to actions, $\pi: S \rightarrow A$
  • A policy fully defines the behavior of an agent
    • It can be deterministic or stochastic
  • Given a state, it specifies a distribution over actions
$$\pi (a \mid s) = P(A_t = a \mid S_t = s)$$
  • MDP policies depend on the current state (not the history)
  • Policies are stationary (time-independent); it turns out that a stationary policy is sufficient for optimality
  • Let $P^{\pi}$ be a matrix containing the probabilities for each transition under policy $\pi$ (a sketch for building $P^{\pi}$ follows this list)
  • Given an MDP $\mathcal{M} = \langle S,A,P,R,\gamma \rangle$ and a policy $\pi$

    • The state sequence $s_1, s_2, \cdots $ is a Markov process $\langle S, P^{\pi} \rangle$

    • The state and reward sequence is a Markov reward process $\langle S,P^{\pi},R^{\pi},\gamma \rangle$

  • Questions on MDP policy

    • How many possible policies are there in our example?

    • Which of the above two policies is best?

    • How do you compute the optimal policy?
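
As promised above, a minimal sketch for building $P^{\pi}$, assuming the dictionary encoding of $P$ used in Section 3 and a deterministic policy given as a list of action indices (the helper name is hypothetical):

import numpy as np

# Build the transition matrix P^pi of the Markov chain induced by a
# deterministic policy, where P[state][action] = [(probability, next_state), ...]
def induced_transition_matrix(P, policy, n_states):
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for prob, s_next in P[s][policy[s]]:
            P_pi[s, s_next] += prob
    return P_pi

# toy 2-state MDP with made-up numbers; policy: action 1 in state 0, action 0 in state 1
P_toy = {0: {0: [(1.0, 0)], 1: [(0.7, 0), (0.3, 1)]},
         1: {0: [(1.0, 1)], 1: [(0.6, 0), (0.4, 1)]}}
print(induced_transition_matrix(P_toy, policy=[1, 0], n_states=2))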

2. Bellman Equation

2.1. Value Function

State-value function

  • The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$


$$ \begin{align*} v_{\pi}(s) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s \right]\\ &= \mathbb{E}_{\pi}[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s ] \end{align*} $$


Action-value function

  • The action-value function $q_{\pi}(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$


$$ \begin{align*} q_{\pi}(s,a) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s, A_t = a \right]\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + \gamma q_{\pi} \left(S_{t+1}, A_{t+1} \right) \mid S_t =s, A_t = a \right] \end{align*} $$


2.2. Bellman Expectation Equation

2.2.1. Relationship between $v_{\pi}(s)$ and $q_{\pi}(s,a)$

  • State-value function using policy $\pi$


$$v_{\pi}(s) = \sum_{a \in A}\pi(a \mid s) q_{\pi}(s,a)$$


  • Action-value function using policy $\pi$


$$q_{\pi}(s,a) = R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$


2.2.2. Bellman Expectation Equation

  • Bellman Expectation Equation for $v_{\pi}(s)$


$$ \begin{align*} v_{\pi}(s) &= \sum_{a \in A}\pi(a \mid s) \underline{q_{\pi}(s,a)} \\ \\ &= \sum_{a \in A}\pi(a \mid s) \left( R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right) \right) \end{align*} $$

  • Bellman Expectation Equation for $q_{\pi}(s,a)$


$$ \begin{align*} q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \underline{v_{\pi} \left(s'\right)}\\ \\ &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \left( \sum_{a' \in A} \pi \left(a' \mid s' \right) q_{\pi} \left(s', a' \right) \right) \end{align*} $$
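
To make the two relations concrete, the sketch below evaluates a small hypothetical two-state MDP (all arrays are made-up numbers) by alternating them: compute $q_{\pi}$ from the current $v_{\pi}$, then recombine $q_{\pi}$ under $\pi$ to update $v_{\pi}$.

import numpy as np

# Illustrative policy evaluation via the two Bellman expectation relations
P = np.array([[[1.0, 0.0], [0.7, 0.3]],   # P[s, a, s'], hypothetical numbers
              [[0.0, 1.0], [0.6, 0.4]]])
R = np.array([0.0, 1.0])                  # R(s), state-dependent reward
gamma = 0.9
pi = np.array([[0.5, 0.5],                # pi(a|s), a uniformly random policy
               [0.5, 0.5]])

v_pi = np.zeros(2)
for _ in range(1000):
    q_pi = R[:, None] + gamma * (P @ v_pi)   # q_pi(s,a) = R(s) + gamma sum_s' P(s'|s,a) v_pi(s')
    v_pi = (pi * q_pi).sum(axis=1)           # v_pi(s) = sum_a pi(a|s) q_pi(s,a)

print(v_pi)
print(q_pi)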


2.2.3. Solving the Bellman Expectation Equation

  • The Bellman expectation equation can be expressed concisely in matrix form,


$$v_{\pi} = R + \gamma \, P^{\pi} \,v_{\pi} \quad \implies \quad v_{\pi} = (I - \gamma \,P^{\pi})^{-1}R $$


  • Iterative


$$ v_{\pi} (s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S} P^{\pi}\left(s' \mid s \right) \;v_{\pi} \left(s' \right)$$

Example

  • Under policy 1, the MDP reduces to an MRP

In [1]:
# [PU PF RU RF] = [0 1 2 3]

import numpy as np

P = [[1, 0, 0, 0],
    [0, 1, 0, 0],
    [0.5, 0, 0.5, 0],
    [0, 1, 0, 0]]

R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T

gamma = 0.9
v = (np.eye(4) - gamma*P).I*R
print(v)
[[ 0.        ]
 [ 0.        ]
 [18.18181818]
 [10.        ]]
In [2]:
v = np.zeros([4,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
[[ 0.        ]
 [ 0.        ]
 [18.18181818]
 [10.        ]]
  • Under policy 2, the MDP reduces to an MRP

In [3]:
# [PU PF RU RF] = [0 1 2 3]

import numpy as np

P = [[0.5, 0.5, 0, 0],
    [0, 1, 0, 0],
    [0.5, 0.5, 0, 0],
    [0, 1, 0, 0]]

R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T

gamma = 0.9
v = (np.eye(4) - gamma*P).I*R
print(v)
[[ 0.]
 [ 0.]
 [10.]
 [10.]]
In [4]:
v = np.zeros([4,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
[[ 0.]
 [ 0.]
 [10.]
 [10.]]

2.3. Bellman Optimality Equation

2.3.1. Optimal Value Function

  • The optimal state-value function $v_*(s)$ is the maximum value function over all policies


$$v_*(s) = \max_{\pi} v_{\pi}(s)$$


  • The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies


$$q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$$


  • Bellman Optimality Equations


$$ \begin{align*} v_*(s) &= \max_a q_{*}(s,a) \\ & = \max_a \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \right)\\ & = R(s) + \gamma \max_a \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \end{align*}$$


$$ \begin{align*} q_*(s,a) &= \max_{\pi} q_{\pi}(s,a) \\ & = \max_{\pi} \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{\pi}(s') \right) \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \max_{\pi} v_{\pi}(s') \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \max_{a'} q_{*}(s',a') \end{align*} $$


2.3.2. Optimal Policy

  • The optimal policy is the policy that achieves the highest value for every state


$$\pi_{*}(s) = \arg\max_{\pi} v_{\pi}(s)$$

$\quad \;\;$and its optimal value function is written $v_{*}(s)$


  • An optimal policy can be found by maximizing over $q_{*} (s,a)$


$$\pi_{*} (a \mid s) = \begin{cases} 1 \quad \text{if } a = \arg \max_{a \in A} q_{*}(s,a)\\ 0 \quad \text{otherwise} \end{cases} $$


  • There is always a deterministic optimal policy for any MDP
  • If we know $q_{*} (s,a)$, we immediately have the optimal policy
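
For example, if an optimal action-value table $q_{*}(s,a)$ were available as an array (the numbers below are made up), the deterministic optimal policy is just the greedy action in each row:

import numpy as np

q_star = np.array([[1.2, 3.4],        # hypothetical q*(s,a) values
                   [0.5, 0.1],
                   [2.0, 2.7]])
pi_star = np.argmax(q_star, axis=1)   # pi*(s) = argmax_a q*(s,a)
print(pi_star)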

2.4. Solving the Bellman Optimality Equation

We can directly define the optimal value function using the Bellman optimality equation


$$v_* (s) = R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s, a) \;v_* \left(s' \right)$$


and the optimal policy is simply the action that attains this max


$$\pi_*(s) = \arg\max_{a} \sum_{s' \in S} P\left(s' \mid s,a \right) \, v_* \left(s'\right)$$


  • Bellman Optimality Equation is non-linear
  • No closed form solution (in general)
  • (Will learn later) many iterative solution methods
    • Value Iteration
    • Policy Iteration
    • Q-learning
    • SARSA
  • You will go into the details in a reinforcement learning course

2.4.1. Value Iteration

Algorithm


$\quad$ 1. Initialize an estimate of the value function arbitrarily (e.g., with zeros)


$$ v(s) \; \leftarrow \; 0 \quad \forall s \in S $$

$\quad$ 2. Repeat, update


$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s,a) \;v \left(s' \right), \quad \forall s \in S$$
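
A compact sketch of this update loop, assuming the dictionary encoding of $P$ used in Section 3 and stopping when successive sweeps agree to a tolerance (the worked examples below instead run a fixed number of sweeps):

import numpy as np

# Value iteration sketch with a convergence check;
# P[state][action] = [(probability, next_state), ...]
def value_iteration(P, R, gamma, states, actions, tol=1e-8):
    v = np.zeros(len(states))
    while True:
        v_new = np.array([R[s] + gamma * max(sum(p * v[s2] for p, s2 in P[s][a])
                                             for a in actions)
                          for s in states])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new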



Note

  • If we know the solution to the subproblems $v_* \left(s' \right)$
  • Then the solution $v_* \left(s \right)$ can be found by a one-step lookahead


$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P\left(s' \mid s,a\right) \;v \left(s' \right), \quad \forall s \in S$$


  • The idea of value iteration is to apply these updates iteratively

2.4.2. Policy Iteration

Algorithm

$\quad$ 1. Initialize policy $\hat{\pi}$ (e.g., randomly)

$\quad$ 2. Compute the value function of the policy, $v_{\pi}$ (e.g., by solving the linear system or by iterating the Bellman expectation equation)

$\quad$ 3. Update $\pi$ to be the greedy policy with respect to $v_{\pi}$

$$\pi(s) \leftarrow \arg\max_{a} \sum_{s' \in S}P \left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$

$\quad$ 4. If policy $\pi$ changed in the last iteration, return to step 2




Note

  • Given a policy $\pi$, first evaluate it to obtain $v_{\pi}$

  • Improve the policy by acting greedily with respect to $v_{\pi}$

  • Policy iteration requires fewer iterations than value iteration, but each iteration requires solving a linear system instead of just applying the Bellman operator

  • In practice, policy iteration is often faster, especially if the transition probabilities are structured (e.g., sparse), making the solution of the linear system efficient
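
Putting the four steps together, a sketch under the same assumptions (the dictionary encoding of $P$, a state-only reward $R(s)$, and exact policy evaluation via a linear solve):

import numpy as np

# Policy iteration sketch: exact policy evaluation (linear solve) + greedy improvement
def policy_iteration(P, R, gamma, states, actions):
    n = len(states)
    policy = [actions[0]] * n
    while True:
        # policy evaluation: solve (I - gamma P_pi) v = R
        P_pi = np.zeros((n, n))
        for s in states:
            for p, s2 in P[s][policy[s]]:
                P_pi[s, s2] += p
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, np.asarray(R, dtype=float))
        # greedy policy improvement
        new_policy = [max(actions, key=lambda a: sum(p * v[s2] for p, s2 in P[s][a]))
                      for s in states]
        if new_policy == policy:
            return policy, v
        policy = new_policy

With the startup MDP defined in Section 3, policy_iteration(P, R, gamma, States, Actions) should reproduce the greedy policy found there.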

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/XLEpZzdLxI4?rel=0" 
 width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [6]:
%%html
<center><iframe src="https://www.youtube.com/embed/4R6kDYDcq8U?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3. Example


Define MDP as a two-level dictionary

P is a two-level dictionary where the first key is the state and the second key is the action.

  • State indices [0, 1, 2, 3] correspond to [PU, PF, RU, RF]
  • Action indices [0, 1] correspond to [Saving money, Advertising]

P[state][action] is a list of tuples (probability, nextstate).

For example,

  • the transition information for s = 0, a = 0 is P[0][0] = [(1, 0)]
  • the transition information for s = 3, a = 0 is P[3][0] = [(0.5, 2), (0.5, 3)]
In [7]:
P = {    
    0: {0: [(1, 0)], 
        1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 
        1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 
        1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 
        1: [(1, 1)]},
}

R = [0, 0, 10, 10]
gamma = 0.9

States = [0, 1, 2, 3]
Actions = [0, 1]

v = [0]*4
$$\sum_{s' \in S} P\left(s' \mid s,a\right) \;v \left(s' \right)$$
In [8]:
P[2][0]
Out[8]:
[(0.5, 0), (0.5, 2)]
In [9]:
# compute the above summation

v = [0, 0, 10, 10]

temp = 0
for trans in P[2][0]:
    temp = temp + trans[0]*v[trans[1]]
    
print(temp)
5.0
In [10]:
# shorten

sum(trans[0]*v[trans[1]] for trans in P[2][0])
Out[10]:
5.0

3.1. Value Iteration

$$\pi_*(s) = \arg\max_{a} \sum_{s' \in S} P\left(s' \mid s,a\right) \, v_{*}\left(s'\right)$$
In [11]:
# optimal value function

v = [0]*4

for i in range(100):
    for s in States:
        q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
        q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])

        v[s] = R[s] + gamma*max(q_0, q_1)
    
print(v)
[31.58508953413495, 38.60400287377479, 44.02416232966445, 54.20158563176306]
In [12]:
# optimal policy

# once v computed

optPolicy = [0]*4

for s in States:
    q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
    q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])   
    
    optPolicy[s] = np.argmax([q_0, q_1])

print(optPolicy)    
[1, 0, 0, 0]
In [13]:
# shorten

v = [0]*4

for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])    
    
print(v)
[31.58508953413495, 38.60400287377479, 44.02416232966445, 54.20158563176306]
In [14]:
optPolicy = [0]*4

for s in States:       
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)    
[1, 0, 0, 0]

3.2. Policy Iteration

In [15]:
policy = np.random.randint(0,2,4)
policy
Out[15]:
array([1, 1, 0, 0])
In [16]:
def cal_value(policy):
    v = [0]*4
    for i in range(100):
        for s in States:
            q = sum(trans[0]*v[trans[1]] for trans in P[s][policy[s]])
            v[s] = R[s] + gamma*q
    return v
In [17]:
V_pi = cal_value(policy)
print(V_pi)
[0.0, 0.0, 18.18181818181818, 33.05785123966942]
In [18]:
for _ in range(10): 
    for s in range(4): 
        values_temp = []
        for action in [0,1]: 
            temp = 0
            for trans in P[s][action]: 
                temp = temp + trans[0]*V_pi[trans[1]]
            values_temp.append(temp) 

        policy[s] = np.argmax(values_temp) 
        V_pi = cal_value(policy)

optPolicy = policy

print(optPolicy)
[1 0 0 0]

4. Exercise (Gridworld Domain)

  • Simple grid world with a goal state with reward +1 and a “bad state” with reward -100
  • Actions move in the desired direction with probability 0.8, and in one of the perpendicular directions with probability 0.1 each
  • Taking an action that would bump into a wall leaves the agent where it is

In [19]:
P = {
     0: {0: [(0.9,0), (0.1,1), (0,4)], 
         1: [(0.8,1), (0.1,4), (0.1,0)], 
         2: [(0.8,4), (0.1,1), (0.1,0)], 
         3: [(0.9,0), (0.1,4)]},
     1: {0: [(0.8,1), (0.1,2), (0.1,0)], 
         1: [(0.8,2), (0.2,1)], 
         2: [(0.8,1), (0.1,0), (0.1,2)], 
         3: [(0.8,0), (0.2,1)]},
     2: {0: [(0.8,2), (0.1,3), (0.1,1)], 
         1: [(0.8,3), (0.1,5), (0.1,2)], 
         2: [(0.8,5), (0.1,1), (0.1,3)], 
         3: [(0.8,1), (0.1,2), (0.1,5)]},
     3: {0: [(0.9,3), (0.1,2)], 
         1: [(0.9,3), (0.1,6)], 
         2: [(0.8,6), (0.1,2), (0.1,3)], 
         3: [(0.8,2), (0.1,3), (0.1,6)]},
     4: {0: [(0.8,0), (0.2,4)], 
         1: [(0.8,4), (0.1,7), (0.1,0)], 
         2: [(0.8,7), (0.2,4)], 
         3: [(0.8,4), (0.1,0), (0.1,7)]},
     5: {0: [(0.8,2), (0.1,6), (0.1,5)], 
         1: [(0.8,6), (0.1,9), (0.1,2)], 
         2: [(0.8,9), (0.1,5), (0.1,6)], 
         3: [(0.8,5), (0.1,2), (0.1,9)]},
     6: {0: [(0.8,3), (0.1,6), (0.1,5)], 
         1: [(0.8,6), (0.1,10), (0.1,3)], 
         2: [(0.8,10), (0.1,5), (0.1,6)], 
         3: [(0.8,5), (0.1,3), (0.1,10)]},
     7: {0: [(0.8,4), (0.1,8), (0.1,7)], 
         1: [(0.8,8), (0.1,7), (0.1,4)], 
         2: [(0.9,7), (0.1,8)], 
         3: [(0.9,7), (0.1,4)]},
     8: {0: [(0.8,8), (0.1,9), (0.1,7)], 
         1: [(0.8,9), (0.2,8)], 
         2: [(0.8,8), (0.1,7), (0.1,9)], 
         3: [(0.8,7), (0.2,8)]},
     9: {0: [(0.8,5), (0.1,10), (0.1,8)], 
         1: [(0.8,9), (0.1,9), (0.1,5)], 
         2: [(0.8,9), (0.1,8), (0.1,10)], 
         3: [(0.8,8), (0.1,5), (0.1,9)]},
     10: {0: [(0.8,6), (0.1,10), (0.1,9)], 
          1: [(0.9,10), (0.1,6)], 
          2: [(0.9,10), (0.1,9)], 
          3: [(0.8,9), (0.1,6), (0.1,10)]}
}

R = [0, 0, 0, 1, 0, 0, -100, 0, 0, 0, 0]
gamma = 0.9

States = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Actions = [0, 1, 2, 3] # [north, east, south, west]
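
Before running value iteration on this MDP, a quick sanity check (a minimal sketch using only the P, States, and Actions defined above) confirms that each P[state][action] entry is a proper probability distribution over next states:

for s in States:
    for a in Actions:
        total = sum(prob for prob, _ in P[s][a])
        assert abs(total - 1.0) < 1e-9, (s, a, total)
print('all transition distributions sum to 1')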
In [20]:
# optimal value function

v = [0]*11

for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])  
    
print(v)
[5.46991289990088, 6.313016781079707, 7.189835364530538, 8.668832766371658, 4.8028486314273, 3.346646443535637, -96.67286272722137, 4.161433444369266, 3.6539401768050603, 3.2220160316109103, 1.526193402980731]
In [21]:
# optimal policy

optPolicy = [0]*11

for s in States:       
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)    
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]

5. Video Lectures

In [22]:
%%html
<center><iframe src="https://www.youtube.com/embed/3TXa2IQ7Vhg?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [23]:
%%html
<center><iframe src="https://www.youtube.com/embed/FG4WrSCpENc?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [24]:
%%html
<center><iframe src="https://www.youtube.com/embed/M3sbfB5C0W4?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [25]:
%%html
<center><iframe src="https://www.youtube.com/embed/HaBUSk6rat4?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>