Markov Decision Process


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

  • David Silver's RL Course at UCL
  • Prof. Zico Kolter's course at CMU

Table of Contents

  1. Markov Decision Process
  2. Bellman Equation
  3. Example

1. Markov Decision Process

  • So far, we have analyzed the passive behavior of a Markov chain with rewards
  • A Markov decision process (MDP) is a Markov reward process with decisions (or actions).

1.1. MDP Definition

Definition: A Markov Decision Process is a tuple $\langle S,\color{red}{A},P,R,\gamma \rangle$

  • $S$ is a finite set of states
  • $A$ is a finite set of actions
  • $P$ is a state transition probability matrix
$$P_{ss'}^\color{red}{a} = P\left[S_{t+1} = s' \mid S_t = s, A_t = \color{red}{a} \right]$$
  • $R$ is a reward function, $R_s^\color{red}{a} = \mathbb{E} \left[ R_{t+1} \mid S_t = s, A_t = \color{red}{a} \right] \quad \left(= \mathbb{E} \left[ R_{t+1} \mid S_t = s \right], \; \text{often assumed} \right)$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$
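
For concreteness, below is a minimal sketch of how such a tuple might be stored with NumPy arrays; the state and action names, transition probabilities, and rewards are hypothetical placeholders, not the startup example that follows.

In [ ]:
import numpy as np

# A hypothetical 3-state, 2-action MDP stored as plain NumPy arrays
S = ['s0', 's1', 's2']                  # finite state set
A = ['a0', 'a1']                        # finite action set

# P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a); each row sums to 1
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.6, 0.4],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.3, 0.7]]])

R = np.array([0.0, 1.0, 10.0])          # R(s), state-only reward as assumed above
gamma = 0.9                             # discount factor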

Example: Startup MDP

  • You run a startup company. In every state, you must choose between Saving money and Advertising



1.2. Policies

  • A policy is a mapping from states to actions, $\pi: S \rightarrow A$
  • A policy fully defines the behavior of an agent
    • It can be deterministic or stochastic
  • Given a state, a stochastic policy specifies a distribution over actions
$$\pi (a \mid s) = P(A_t = a \mid S_t = s)$$
  • MDP policies depend on the current state (not the history)
  • Policies are stationary (time-independent); it turns out that a stationary policy is sufficient for optimality
  • Let $P^{\pi}$ be a matrix containing probabilities for each transition under policy $\pi$
  • Given an MDP $\mathcal{M} = \langle S,A,P,R,\gamma \rangle$ and a policy $\pi$

    • The state sequence $s_1, s_2, \cdots $ is a Markov process $\langle S, P^{\pi} \rangle$

    • The state and reward sequence is a Markov reward process $\langle S,P^{\pi},R^{\pi},\gamma \rangle$

  • Questions on MDP policy

    • How many possible policies are there in our example? (see the sketch below)

    • Which of the above two policies is best?

    • How do you compute the optimal policy?
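
As a rough answer to the first question: a deterministic policy can be stored as a simple state-to-action mapping, and with the 4 states and 2 actions of the startup example there are $2^4 = 16$ deterministic policies. The two dictionaries below are hypothetical policies, written only to show the representation.

In [ ]:
# Two hypothetical deterministic policies for the startup example,
# written as state -> action dictionaries (states follow [PU PF RU RF])
always_save = {'PU': 'Save', 'PF': 'Save', 'RU': 'Save', 'RF': 'Save'}
always_advertise = {'PU': 'Advertise', 'PF': 'Advertise',
                    'RU': 'Advertise', 'RF': 'Advertise'}

# With |S| = 4 states and |A| = 2 actions there are |A|**|S| deterministic policies
n_states, n_actions = 4, 2
print(n_actions**n_states)      # 16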

2. Bellman Equation

2.1. Value Function

State-value function

  • The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$


$$ \begin{align*} v_{\pi}(s) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s \right]\\ &= \mathbb{E}_{\pi}[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s ] \end{align*} $$


Action-value function

  • The action-value function $q_{\pi}(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$


$$ \begin{align*} q_{\pi}(s,a) &= \mathbb{E}_{\pi} \left[G_t \mid S_t =s, A_t = a \right]\\ & = \mathbb{E}_{\pi} \left[R_{t+1} + \gamma q_{\pi} \left(S_{t+1}, A_{t+1} \right) \mid S_t =s, A_t = a \right] \end{align*} $$


2.2. Bellman Expectation Equation

2.2.1. Relationship between $v_{\pi}(s)$ and $q_{\pi}(s,a)$

  • State-value function using policy $\pi$


$$v_{\pi}(s) = \sum_{a \in A}\pi(a \mid s) q_{\pi}(s,a)$$


  • Action-value function using policy $\pi$


$$q_{\pi}(s,a) = R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$
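
These two identities can be checked numerically. The sketch below assumes a small hypothetical MDP (a placeholder transition tensor $P[a,s,s']$, a state-only reward $R(s)$, and a uniform random policy): it computes $v_{\pi}$ exactly with the matrix form of Section 2.2.3 below, forms $q_{\pi}$ from $v_{\pi}$, and then recovers $v_{\pi}$ by averaging $q_{\pi}$ over the policy.

In [ ]:
import numpy as np

# Hypothetical 3-state, 2-action MDP (placeholder numbers)
P = np.array([[[0.8, 0.2, 0.0],         # P[a, s, s'] = P(s' | s, a)
               [0.0, 0.6, 0.4],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.3, 0.7]]])
R = np.array([0.0, 1.0, 10.0])          # R(s)
pi = np.full((3, 2), 0.5)               # pi(a | s): uniform random policy
gamma = 0.9

# Transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
P_pi = np.einsum('sa,asj->sj', pi, P)

# Exact v_pi from the matrix form of the Bellman expectation equation
v = np.linalg.solve(np.eye(3) - gamma*P_pi, R)

# q_pi(s,a) = R(s) + gamma * sum_{s'} P(s'|s,a) v_pi(s')
q = R[:, None] + gamma*np.einsum('asj,j->sa', P, v)

# v_pi(s) = sum_a pi(a|s) q_pi(s,a)  -- recovers v_pi
print(np.allclose((pi*q).sum(axis=1), v))   # True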


2.2.2. Bellman Expectation Equation

  • Bellman Expectation Equation for $v_{\pi}(s)$


$$ \begin{align*} v_{\pi}(s) &= \sum_{a \in A}\pi(a \mid s) \underline{q_{\pi}(s,a)} \\ \\ &= \sum_{a \in A}\pi(a \mid s) \left( R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right) \right) \end{align*} $$

  • Bellman Expectation Equation for $q_{\pi}(s,a)$


$$ \begin{align*} q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \underline{v_{\pi} \left(s'\right)}\\ \\ &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \left( \sum_{a' \in A} \pi \left(a' \mid s' \right) q_{\pi} \left(s', a' \right) \right) \end{align*} $$


2.2.3. Solving the Bellman Expectation Equation

  • The Bellman expectation equation can be expressed concisely in a matrix form,


$$v_{\pi} = R + \gamma \, P^{\pi} \,v_{\pi} \quad \implies \quad v_{\pi} = (I - \gamma \,P^{\pi})^{-1}R $$


  • Iterative solution


$$ v_{\pi} (s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S} P^{\pi}\left(s' \mid s \right) \;v_{\pi} \left(s' \right)$$

Example

  • Given policy 1, the MDP reduces to an MRP

$$v_{\pi} = R + \gamma \, P^{\pi} \,v_{\pi} \quad \implies \quad v_{\pi} = (I - \gamma \,P^{\pi})^{-1}R $$
In [1]:
# [PU PF RU RF] = [0 1 2 3]

import numpy as np

P = [[1, 0, 0, 0],      # transition matrix P^pi under policy 1
     [0, 1, 0, 0],
     [0.5, 0, 0.5, 0],
     [0, 1, 0, 0]]

R = [0, 0, 10, 10]      # reward R(s) for each state

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T                 # column vector

gamma = 0.9
v = (np.eye(4) - gamma*P).I*R   # closed-form solution v = (I - gamma*P^pi)^{-1} R

print(v)
[[ 0.        ]
 [ 0.        ]
 [18.18181818]
 [10.        ]]
$$ v_{\pi} (s) \; \leftarrow \; R(s) + \gamma \sum_{s' \in S} P^{\pi}\left(s' \mid s \right) \;v_{\pi} \left(s' \right)$$
In [2]:
v = np.zeros([4,1])

for _ in range(10):
    v = R + gamma*P*v           # iterative Bellman expectation update

print(v)    
[[ 0.        ]
 [ 0.        ]
 [18.17562716]
 [10.        ]]
  • Given policy 2, the MDP reduces to an MRP

In [3]:
# [PU PF RU RF] = [0 1 2 3]

import numpy as np

P = [[0.5, 0.5, 0, 0],  # transition matrix P^pi under policy 2
     [0, 1, 0, 0],
     [0.5, 0.5, 0, 0],
     [0, 1, 0, 0]]

R = [0, 0, 10, 10]      # reward R(s) for each state

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T                 # column vector

gamma = 0.9
v = (np.eye(4) - gamma*P).I*R   # closed-form solution v = (I - gamma*P^pi)^{-1} R
print(v)
[[ 0.]
 [ 0.]
 [10.]
 [10.]]
In [4]:
v = np.zeros([4,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
[[ 0.]
 [ 0.]
 [10.]
 [10.]]

2.3. Bellman Optimality Equation

2.3.1. Optimal Value Function

  • The optimal state-value function $v_*(s)$ is the maximum value function over all policies


$$ \begin{align*} v_*(s) &= \max_{\pi} v_{\pi}(s) \\ & = \max_a q_{*}(s,a) \\ & = \max_a \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \right)\\ & = R(s) + \gamma \max_a \sum_{s' \in S} P(s' \mid s,a) v_{*}(s') \end{align*}$$


  • The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies


$$ \begin{align*} q_*(s,a) &= \max_{\pi} q_{\pi}(s,a) \\ & = \max_{\pi} \left( R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) v_{\pi}(s') \right) \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \max_{\pi} v_{\pi}(s') \\ & = R(s) + \gamma \sum_{s' \in S} P(s' \mid s,a) \, v_{*}(s') \end{align*} $$


2.3.2. Optimal Policy

  • The optimal policy is the policy that achieves the highest value for every state


$$\pi_{*}(s) = \arg\max_{\pi} v_{\pi}(s)$$

$\quad \;\;$and its optimal value function is written $v_{*}(s)$


  • An optimal policy can be found by maximizing over $q_{*} (s,a)$


$$\pi_{*} (a \mid s) = \begin{cases} 1 \quad \text{if } a = \arg \max_{a \in A} q_{*}(s,a)\\ 0 \quad \text{otherwise} \end{cases} $$


  • There is always a deterministic optimal policy for any MDP
  • If we know $q_{*} (s,a)$, we immediately have the optimal policy (see the sketch below)
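
A small sketch of this greedy read-off is given below; the $q_*$ table is a made-up placeholder, not computed from any particular MDP.

In [ ]:
import numpy as np

# Hypothetical optimal action-value table q_*(s, a): 4 states, 2 actions
q_star = np.array([[ 3.1,  4.0],
                   [ 7.5,  6.2],
                   [12.0, 11.4],
                   [15.3, 15.9]])

# Deterministic optimal policy: pick the maximizing action in every state
pi_star = np.argmax(q_star, axis=1)
print(pi_star)      # [1 0 0 1]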

2.4. Solving the Bellman Optimality Equation

We can directly define the optimal value function using the Bellman optimality equation


$$v_* (s) = R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s, a) \;v_* \left(s' \right)$$


and the optimal policy is simply the action that attains this max


$$\pi_*(s) = \arg\max_{a} \sum_{s' \in S} P\left(s' \mid s,a \right) \, v_* \left(s'\right)$$


  • The Bellman optimality equation is non-linear
  • No closed-form solution (in general)
  • (Will learn later) many iterative solution methods
    • Value Iteration
    • Policy Iteration
    • Q-learning
    • SARSA
  • You will get into the details in a reinforcement learning course

2.4.1. Value Iteration

Algorithm


$\quad$ 1. Initialize an estimate of the value function arbitrarily (e.g., with zeros)


$$ v(s) \; \leftarrow \; 0 \quad \forall s \in S $$

$\quad$ 2. Repeat the update


$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P(s' \mid s,a) \;v \left(s' \right), \quad \forall s \in S$$



Note

  • If we know the solutions to the subproblems $v_* \left(s' \right)$
  • Then the solution $v_* \left(s \right)$ can be found by one-step lookahead


$$v (s) \; \leftarrow \; R(s) + \gamma \max_{a} \sum_{s' \in S} P\left(s' \mid s,a\right) \;v \left(s' \right), \quad \forall s \in S$$


  • The idea of value iteration is to apply these updates iteratively (a minimal sketch follows below)
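
A minimal NumPy sketch of this update loop is given below, assuming a state-only reward $R(s)$ and a per-action transition tensor $P[a,s,s']$; the small MDP is a hypothetical placeholder, not the startup example.

In [ ]:
import numpy as np

# Hypothetical 3-state, 2-action MDP (placeholder numbers)
P = np.array([[[0.8, 0.2, 0.0],         # P[a, s, s'] = P(s' | s, a)
               [0.0, 0.6, 0.4],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.3, 0.7]]])
R = np.array([0.0, 1.0, 10.0])          # R(s)
gamma = 0.9

v = np.zeros(3)                         # 1. initialize v(s) = 0 for all s
for _ in range(1000):                   # 2. repeat the Bellman optimality update
    # v(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) v(s')
    v_new = R + gamma*np.max(P @ v, axis=0)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v)
print(np.argmax(P @ v, axis=0))         # greedy policy attaining the max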

2.4.2. Policy Iteration

Algorithm

$\quad$ 1. initialize policy $\hat{\pi}$ (e.g., randomly)

$\quad$ 2. Compute the value function of the policy, $v_{\pi}$ (e.g., by solving the linear system or by iterating the Bellman expectation equation)

$\quad$ 3. Update $\pi$ to be the greedy policy with respect to $v_{\pi}$

$$\pi(s) \leftarrow \arg\max_{a} \sum_{s' \in S}P \left(s' \mid s,a \right) v_{\pi}\left(s'\right)$$

$\quad$ 4. If the policy $\pi$ changed in the last iteration, return to step 2




Note

  • Given a policy $\pi$, first evaluate the policy $\pi$

  • Improve the policy by acting greedily with respect to $v_{\pi}$

  • Policy iteration requires fewer iterations than value iteration, but each iteration requires solving a linear system instead of just applying the Bellman operator

  • In practice, policy iteration is often faster, especially if the transition probabilities are structured (e.g., sparse) so that the linear system can be solved efficiently (a sketch follows below)
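
A matching sketch of policy iteration on the same kind of hypothetical MDP (all numbers are placeholders): policy evaluation solves the linear system exactly, and policy improvement acts greedily with respect to $v_{\pi}$.

In [ ]:
import numpy as np

# Hypothetical 3-state, 2-action MDP (same placeholder numbers as above)
P = np.array([[[0.8, 0.2, 0.0],         # P[a, s, s'] = P(s' | s, a)
               [0.0, 0.6, 0.4],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.3, 0.7]]])
R = np.array([0.0, 1.0, 10.0])          # R(s)
gamma = 0.9
n_S = R.size

pi = np.zeros(n_S, dtype=int)           # 1. initialize a deterministic policy
while True:
    # 2. policy evaluation: solve (I - gamma*P_pi) v = R for the current policy
    P_pi = P[pi, np.arange(n_S), :]     # P_pi[s, s'] = P(s' | s, pi(s))
    v = np.linalg.solve(np.eye(n_S) - gamma*P_pi, R)

    # 3. policy improvement: act greedily with respect to v_pi
    pi_new = np.argmax(P @ v, axis=0)

    # 4. stop when the policy no longer changes
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(pi, v)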

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/XLEpZzdLxI4?rel=0" 
 width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [6]:
%%html
<center><iframe src="https://www.youtube.com/embed/4R6kDYDcq8U?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3. Example