Markov Decision Process
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Source
- David Silver's RL Course at UCL
- Prof. Zico Kolter's lectures at CMU
Table of Contents
1. Markov Decision Process¶
So far, we have analyzed the passive behavior of a Markov chain with rewards (a Markov reward process)
A Markov decision process (MDP) is a Markov reward process with decisions (or actions).
1.1. MDP Definition¶
Definition: A Markov Decision Process is a tuple $\langle S,\color{red}{A},P,R,\gamma \rangle$
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix
$$P_{ss'}^\color{red}{a} = P\left[S_{t+1} = s' \mid S_t = s, A_t = \color{red}{a} \right]$$
- $R$ is a reward function, $R_s^\color{red}{a} = \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = \color{red}{a} \right] \quad \left(= \mathbb{E} \left[ R_{t+1} \mid S_t = s \right], \; \text{often assumed} \right)$
- $\gamma$ is a discount factor, $\gamma \in [0,1]$
Example: Startup MDP
- You run a startup company. In every state, you must choose between Saving money or Advertising
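The tuple $\langle S,A,P,R,\gamma \rangle$ above can be stored directly in Python. The sketch below is illustrative only; it mirrors the two-level dictionary layout used for the startup example in Section 3, with the rewards and discount factor taken from that example.

# illustrative containers for an MDP <S, A, P, R, gamma>
S = [0, 1, 2, 3]                           # states [PU, PF, RU, RF]
A = [0, 1]                                 # actions [Saving money, Advertising]
P = {s: {a: [] for a in A} for s in S}     # P[s][a] = list of (probability, next_state), filled in Section 3
R = [0, 0, 10, 10]                         # reward for each state
gamma = 0.9                                # discount factor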
1.2. Policies¶
A policy is a mapping from states to actions, $\pi: S \rightarrow A$
A policy fully defines the behavior of an agent
- It can be deterministic or stochastic
For a stochastic policy, given a state, it specifies a distribution over actions
$$\pi (a \mid s) = P(A_t = a \mid S_t = s)$$
MDP policies depend on the current state (not the history)
Policies are stationary (time-independent); it turns out that a stationary policy is sufficient for optimality
Let $P^{\pi}$ be a matrix containing probabilities for each transition under policy $\pi$
Given an MDP $\mathcal{M} = \langle S,A,P,R,\gamma \rangle$ and a policy $\pi$
The state sequence $s_1, s_2, \cdots $ is a Markov process $\langle S, P^{\pi} \rangle$
The state and reward sequence is a Markov reward process $\langle S,P^{\pi},R^{\pi},\gamma \rangle$
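To make the reduction concrete, the following sketch builds the induced transition matrix $P^{\pi}$ for a deterministic policy, assuming the (probability, next_state) dictionary layout used in Section 3; the function name build_P_pi is illustrative.

import numpy as np

def build_P_pi(P, policy, n_states):
    # P[s][a] is a list of (probability, next_state) pairs; policy[s] is the action chosen in s
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for prob, s_next in P[s][policy[s]]:
            P_pi[s, s_next] += prob
    return P_pi

# with the Section 3 dictionary, policy [0, 1, 0, 1] reproduces the transition
# matrix used for policy 1 in the example of Section 2.2.3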
Questions on MDP policy
How many possible policies are there in our example?
Which of the above two policies is best?
How do you compute the optimal policy?
2. Bellman Equation¶
2.1. Value Function¶
State-value function
- The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$ \begin{align*} v_{\pi}(s) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s \right]\\ &= \mathbb{E}_{\pi}[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s ] \end{align*} $$
Action-value function
- The action-value function $q_{\pi}(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$ \begin{align*} q_{\pi}(s,a) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s, A_t = a \right]\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + \gamma q_{\pi} \left(S_{t+1}, A_{t+1} \right) \mid S_t =s, A_t = a \right] \end{align*} $$
2.2. Bellman Expectation Equation¶
2.2.1. Relationship between $v_{\pi}(s)$ and $q_{\pi}(s,a)$¶
- State-value function using policy $\pi$
$$v_{\pi}(s) = \sum_{a \in A}\pi(a \mid s) q_{\pi}(s,a)$$
- Action-value function using policy $\pi$
$$q_{\pi}(s,a) = R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$
2.2.2. Bellman Expectation Equation¶
- Bellman Expectation Equation for $v_{\pi}(s)$
$$ \begin{align*} v_{\pi}(s) &= \sum_{a \in A}\pi(a \mid s) \underline{q_{\pi}(s,a)} \\ \\ &= \sum_{a \in A}\pi(a \mid s) \left( R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right) \right) \end{align*} $$
- Bellman Expectation Equation for $q_{\pi}(s,a)$
$$ \begin{align*} q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \underline{v_{\pi} \left(s'\right)}\\ \\ &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \left( \sum_{a' \in A} \pi \left(a' \mid s' \right) q_{\pi} \left(s', a' \right) \right) \end{align*} $$
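As a sanity check on these relationships, here is a minimal Python sketch of the two one-step lookaheads, assuming the (probability, next_state) dictionary format of Section 3; the function names are illustrative.

import numpy as np

def q_from_v(P, R, gamma, v, s, a):
    # q_pi(s,a) = R(s) + gamma * sum_s' P(s'|s,a) v_pi(s')
    return R[s] + gamma*sum(prob*v[s_next] for prob, s_next in P[s][a])

def v_from_q(pi_s, q_s):
    # v_pi(s) = sum_a pi(a|s) q_pi(s,a), with pi(.|s) and q_pi(s,.) given as vectors
    return float(np.dot(pi_s, q_s))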
2.2.3. Solving the Bellman Expectation Equation¶
- The Bellman expectation equation can be expressed concisely in matrix form,
$$v_{\pi} = R + \gamma \, P^{\pi} \,v_{\pi} \quad \implies \quad v_{\pi} = (I - \gamma \,P^{\pi})^{-1}R $$
- Iterative
$$ v_{\pi} (s) \; \leftarrow \; R(s) + \gamma \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P\left(s' \mid s, a \right) \;v_{\pi} \left(s' \right)$$
Example
- Given policy 1, the problem reduces to an MRP
# [PU PF RU RF] = [0 1 2 3]
import numpy as np

P = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0.5, 0, 0.5, 0],
     [0, 1, 0, 0]]

R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T

gamma = 0.9

# closed-form solution: v = (I - gamma*P)^(-1) R
v = (np.eye(4) - gamma*P).I*R
print(v)

# iterative solution: repeatedly apply v <- R + gamma*P*v
v = np.zeros([4,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
- Given policy 2, the problem reduces to an MRP
# [PU PF RU RF] = [0 1 2 3]
import numpy as np

P = [[0.5, 0.5, 0, 0],
     [0, 1, 0, 0],
     [0.5, 0.5, 0, 0],
     [0, 1, 0, 0]]

R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T

gamma = 0.9

# closed-form solution: v = (I - gamma*P)^(-1) R
v = (np.eye(4) - gamma*P).I*R
print(v)

# iterative solution: repeatedly apply v <- R + gamma*P*v
v = np.zeros([4,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
2.3. Bellman Optimality Equation¶
2.3.1. Optimal Value Function¶
- The optimal state-value function $v_*(s)$ is the maximum value function over all policies
$$v_*(s) = \max_{\pi} v_{\pi}(s)$$
- The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies
$$q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$$
- Bellman Optimality Equations
$$ \begin{align*} v_*(s) &= \max_a q_{*}(s,a) \\ & = \max_a \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \right)\\ & = R(s) + \gamma \max_a \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \end{align*}$$
$$ \begin{align*} q_*(s,a) &= \max_{\pi} q_{\pi}(s,a) \\ & = \max_{\pi} \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{\pi}(s') \right) \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \max_{\pi} v_{\pi}(s') \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \max_{a'} q_{*}(s',a') \end{align*} $$
2.3.2. Optimal Policy¶
- The optimal policy is the policy that achieves the highest value for every state
$$\pi_{*}(s) = \arg\max_{\pi} v_{\pi}(s)$$
$\quad \;\;$and its optimal value function is written $v_{*}(s)$
- An optimal policy can be found by maximizing over $q_{*} (s,a)$
$$\pi_{*} (a \mid s) = \begin{cases} 1 \quad \text{if } a = \arg \max_{a \in A} q_{*}(s,a)\\ 0 \quad \text{otherwise} \end{cases} $$
There is always a deterministic optimal policy for any MDP
If we know $q_{*} (s,a)$, we immediately have the optimal policy
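For instance, if $q_*$ is available as a $|S| \times |A|$ table, the deterministic optimal policy is just a row-wise argmax; a small illustrative sketch:

import numpy as np

def greedy_policy(q_star):
    # pi_*(s) = argmax_a q_*(s,a), one action index per state
    return np.argmax(q_star, axis=1)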
2.4. Solving the Bellman Optimality Equation¶
We can directly define the optimal value function using the Bellman optimality equation
$$v_* (s) = R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s, a) \;v_* \left(s' \right)$$
and the optimal policy is simply the action that attains this max
$$\pi_*(s) = \arg\max_{a} \sum_{s' \in S} P\left(s' \mid s,a \right) \, v_* \left(s'\right)$$
Bellman Optimality Equation is non-linear
No closed-form solution (in general)
Many iterative solution methods (covered later)
- Value Iteration
- Policy Iteration
- Q-learning
- SARSA
We will get into the details in a course on reinforcement learning
2.4.1. Value Iteration¶
Algorithm
$\quad$ 1. Initialize an estimate for the value function arbitrarily (or zeros)
$$ v(s) \; \leftarrow \; 0 \quad \forall s \in S $$
$\quad$ 2. Repeat the update
$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s,a) \;v \left(s' \right), \quad \forall s \in S$$
Note
If we know the solution to the subproblems $v_* \left(s' \right)$
Then the solution $v_* \left(s \right)$ can be found by a one-step lookahead
$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P\left(s' \mid s,a\right) \;v \left(s' \right), \quad \forall s \in S$$
- The idea of value iteration is to apply these updates iteratively
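A minimal sketch of this update loop, assuming the (probability, next_state) dictionary format of Section 3 and a fixed number of sweeps; the function name and stopping rule are illustrative, and Section 3.1 applies the same update to the startup example.

def value_iteration(P, R, gamma, States, Actions, num_sweeps=100):
    # repeatedly apply v(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) v(s')
    v = [0]*len(States)
    for _ in range(num_sweeps):
        for s in States:
            v[s] = R[s] + gamma*max(sum(prob*v[s_next] for prob, s_next in P[s][a])
                                    for a in Actions)
    return v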
2.4.2. Policy Iteration¶
Algorithm
$\quad$ 1. Initialize a policy $\pi$ (e.g., randomly)
$\quad$ 2. Compute the value function of the policy, $v_{\pi}$ (e.g., by solving the linear system or by iterating the Bellman expectation equation)
$\quad$ 3. Update $\pi$ to be greedy policy with respect to $v_{\pi}$
$$\pi(s) \leftarrow \arg\max_{a} \sum_{s' \in S}P \left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$
$\quad$ 4. If the policy $\pi$ changed in the last iteration, return to step 2
Note
Given a policy $\pi$, first evaluate it to obtain $v_{\pi}$
Then improve the policy by acting greedily with respect to $v_{\pi}$
Policy iteration requires fewer iterations than value iteration, but each iteration requires solving a linear system instead of just applying the Bellman operator
In practice, policy iteration is often faster, especially if the transition probabilities are structured (e.g., sparse) to make solution of linear system efficient
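One way to carry out step 2 exactly rather than iteratively is to solve the linear system $v_{\pi} = (I - \gamma \, P^{\pi})^{-1}R$ from Section 2.2.3. Below is a sketch, assuming a deterministic policy and the Section 3 dictionary layout; the function name is illustrative.

import numpy as np

def evaluate_policy_exact(P, R, gamma, policy, n_states):
    # build P_pi for the deterministic policy, then solve (I - gamma*P_pi) v = R
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for prob, s_next in P[s][policy[s]]:
            P_pi[s, s_next] += prob
    return np.linalg.solve(np.eye(n_states) - gamma*P_pi, np.asarray(R, dtype=float))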
%%html
<center><iframe src="https://www.youtube.com/embed/XLEpZzdLxI4?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/4R6kDYDcq8U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
3. Example¶
Define the MDP as a two-level dictionary.

P is a two-level dictionary where the first key is the state and the second key is the action.
- State indices [0, 1, 2, 3] correspond to [PU, PF, RU, RF]
- Action indices [0, 1] correspond to [Saving money, Advertising]

P[state][action] is a list of tuples (probability, nextstate).

For example,
- the transition information for s = 0, a = 0 is P[0][0] = [(1, 0)]
- the transition information for s = 3, a = 0 is P[3][0] = [(0.5, 2), (0.5, 3)]
P = {
0: {0: [(1, 0)],
1: [(0.5, 0), (0.5, 1)]},
1: {0: [(0.5, 0), (0.5, 3)],
1: [(1, 1)]},
2: {0: [(0.5, 0), (0.5, 2)],
1: [(0.5, 0), (0.5, 1)]},
3: {0: [(0.5, 2), (0.5, 3)],
1: [(1, 1)]},
}
R = [0, 0, 10, 10]
gamma = 0.9
States = [0, 1, 2, 3]
Actions = [0, 1]
v = [0]*4
$$\sum_{s' \in S} P\left(s' \mid s,a\right) \;v \left(s' \right)$$
P[2][0]
# compute the above summation
v = [0, 0, 10, 10]

temp = 0
for trans in P[2][0]:
    temp = temp + trans[0]*v[trans[1]]

print(temp)

# shortened version using a generator expression
sum(trans[0]*v[trans[1]] for trans in P[2][0])
3.1. Value Iteration¶
$$\pi_*(s) = \arg\max_{a} \sum_{s' \in S} P\left(s' \mid s,a\right) \, v_{*}\left(s'\right)$$
# optimal value function via value iteration
v = [0]*4

for i in range(100):
    for s in States:
        q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
        q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
        v[s] = R[s] + gamma*max(q_0, q_1)

print(v)
# optimal policy (greedy with respect to v, once v has been computed)
optPolicy = [0]*4

for s in States:
    q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
    q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
    optPolicy[s] = np.argmax([q_0, q_1])

print(optPolicy)
# shortened version
v = [0]*4

for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(v)

optPolicy = [0]*4

for s in States:
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)
3.2. Policy Iteration¶
policy = np.random.randint(0,2,4)
policy
def cal_value(policy):
    # iterative policy evaluation for a deterministic policy
    v = [0]*4
    for i in range(100):
        for s in States:
            q = sum(trans[0]*v[trans[1]] for trans in P[s][policy[s]])
            v[s] = R[s] + gamma*q
    return v

V_pi = cal_value(policy)
print(V_pi)
# policy iteration: alternate greedy policy improvement and policy evaluation
for _ in range(10):
    for s in range(4):
        values_temp = []
        for action in [0, 1]:
            temp = 0
            for trans in P[s][action]:
                temp = temp + trans[0]*V_pi[trans[1]]
            values_temp.append(temp)
        policy[s] = np.argmax(values_temp)
    V_pi = cal_value(policy)

optPolicy = policy
print(optPolicy)
4. Exercise (Gridworld Domain)¶
- Simple grid world with a goal state with reward +1 and a “bad state” with reward -100
- Actions move in the desired direction with probability 0.8, and in each of the perpendicular directions with probability 0.1
- Taking an action that would bump into a wall leaves the agent where it is
P = {
0: {0: [(0.9,0), (0.1,1), (0,4)],
1: [(0.8,1), (0.1,4), (0.1,0)],
2: [(0.8,4), (0.1,1), (0.1,0)],
3: [(0.9,0), (0.1,4)]},
1: {0: [(0.8,1), (0.1,2), (0.1,0)],
1: [(0.8,2), (0.2,1)],
2: [(0.8,1), (0.1,0), (0.1,2)],
3: [(0.8,0), (0.2,1)]},
2: {0: [(0.8,2), (0.1,3), (0.1,1)],
1: [(0.8,3), (0.1,5), (0.1,2)],
2: [(0.8,5), (0.1,1), (0.1,3)],
3: [(0.8,1), (0.1,2), (0.1,5)]},
3: {0: [(0.9,3), (0.1,2)],
1: [(0.9,3), (0.1,6)],
2: [(0.8,6), (0.1,2), (0.1,3)],
3: [(0.8,2), (0.1,3), (0.1,6)]},
4: {0: [(0.8,0), (0.2,4)],
1: [(0.8,4), (0.1,7), (0.1,0)],
2: [(0.8,7), (0.2,4)],
3: [(0.8,4), (0.1,0), (0.1,7)]},
5: {0: [(0.8,2), (0.1,6), (0.1,5)],
1: [(0.8,6), (0.1,9), (0.1,2)],
2: [(0.8,9), (0.1,5), (0.1,6)],
3: [(0.8,5), (0.1,2), (0.1,9)]},
6: {0: [(0.8,3), (0.1,6), (0.1,5)],
1: [(0.8,6), (0.1,10), (0.1,3)],
2: [(0.8,10), (0.1,5), (0.1,6)],
3: [(0.8,5), (0.1,3), (0.1,10)]},
7: {0: [(0.8,4), (0.1,8), (0.1,7)],
1: [(0.8,8), (0.1,7), (0.1,4)],
2: [(0.9,7), (0.1,8)],
3: [(0.9,7), (0.1,4)]},
8: {0: [(0.8,8), (0.1,9), (0.1,7)],
1: [(0.8,9), (0.2,8)],
2: [(0.8,8), (0.1,7), (0.1,9)],
3: [(0.8,7), (0.2,8)]},
9: {0: [(0.8,5), (0.1,10), (0.1,8)],
1: [(0.8,9), (0.1,9), (0.1,5)],
2: [(0.8,9), (0.1,8), (0.1,10)],
3: [(0.8,8), (0.1,5), (0.1,9)]},
10: {0: [(0.8,6), (0.1,10), (0.1,9)],
1: [(0.9,10), (0.1,6)],
2: [(0.9,10), (0.1,9)],
3: [(0.8,9), (0.1,6), (0.1,10)]}
}
R = [0, 0, 0, 1, 0, 0, -100, 0, 0, 0, 0]
gamma = 0.9
States = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Actions = [0, 1, 2, 3] # [north, east, south, west]
# optimal value function via value iteration
v = [0]*11

for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(v)
# optimal policy
optPolicy = [0]*11

for s in States:
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)
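To read the result more easily, the action indices can be mapped back to the direction names defined above; this small convenience snippet only uses States and optPolicy from the previous cells.

# map action indices back to direction names
action_names = ['north', 'east', 'south', 'west']

for s in States:
    print(s, action_names[optPolicy[s]])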
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')