Markov Reward Process


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Source

  • David Silver's RL Course at UCL
  • Prof. Zico Kolter's course at CMU

Table of Contents

1. Markov Reward Process
  1.1. Definition of MRP
  1.2. Reward over Multiple Transitions (= Return $G_t$)
  1.3. Value Function
2. Bellman Equations for MRP
  2.1. Bellman Equation in Matrix Form
  2.2. Solving the Bellman Equation

1. Markov Reward Process

  • Suppose that each transition in a Markov chain is associated with a reward, $r$
  • As the Markov chain proceeds from state to state, there is an associated sequence of rewards
  • Discount factor $\gamma$
  • Later, we will study dynamic programming and Markov decision theory $\implies$ Markov Decision Process (MDP)

1.1. Definition of MRP

Definition: A Markov Reward Process is a tuple $\langle S,P,R,\gamma \rangle$

  • $S$ is a finite set of states
  • $P$ is a state transition probability matrix
$$P_{ss'} = P\left[S_{t+1} = s' \mid S_t = s \right]$$
  • $R$ is a reward function, $R_s = \mathbb{E} \left[ R_{t+1} \mid S_t = s \right]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$


Example: student MRP

1.2. Reward over Multiple Transitions (= Return $G_t$)

Definition: The return $G_t$ is the total discounted reward from time-step $t$


$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$


Discount factor $\gamma$

  • It is reasonable to maximize the sum of rewards
  • It is also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially
  • Mathematically convenient (avoid infinite returns and values)
  • Humans often act as if there is a discount factor $\gamma < 1$
    • $\gamma = 0$: Only care about immediate reward
    • $\gamma = 1$: Future reward is as beneficial as immediate reward
In [1]:
import numpy as np
In [2]:
# [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]

R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9

# if a state sequence is given, compute its discounted return
S = [0, 1, 2, 4, 2, 4]

# discounted sum of rewards over the first five states of the sequence
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]
    
print(G)    
-6.0032
In [3]:
R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9

P = [[0, 0.5, 0, 0, 0, 0.5, 0],
    [0, 0, 0.8, 0, 0, 0, 0.2],
    [0, 0, 0, 0.6, 0.4, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0.2, 0.4, 0.4, 0, 0, 0, 0],
    [0.1, 0, 0, 0, 0, 0.9, 0],
    [0, 0, 0, 0, 0, 0, 1]]

# sequence generated by Markov chain
# [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]

# starting from 0
x = 0 
S = []
S.append(x)

for i in range(5):
    x = np.random.choice(len(P),1,p = P[x][:])[0]    
    S.append(x)

# discounted return of the sampled sequence (first five states)
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]

print(S)      
print(G)    
[0, 1, 2, 3, 6, 6]
1.870000000000001

1.3. Value Function


Definition: The state value function $v(s)$ of an MRP is the expected return starting from state $s$

$$ \begin{align*} v(s) & = \mathbb{E}[G_t \mid S_t = s] \\ & = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] \end{align*} $$


  • The value function $v(s)$ gives the long-term value of state $s$


  • Sample returns for the student MRP, starting from $S_1 = C_1$ with $\gamma = \frac{1}{2}$ (one sample episode is computed below)


$$G_1 = R_2 + \gamma R_3 + \cdots + \gamma^{T-2} R_T$$
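
For instance, take one sample episode C1 → C2 → C3 → Pass → Sleep (the particular episode is just an illustrative choice, not from the original notebook) and compute its return with the reward values used throughout this notebook:

In [ ]:
# sample return of the episode C1 C2 C3 Pass Sleep with gamma = 1/2
# (the episode itself is an illustrative sample)
S_ep = [0, 1, 2, 3, 6]              # state indices of the sample episode
R_ep = [-2, -2, -2, 10, 1, -1, 0]   # same reward vector as in the cells above

G1 = sum((0.5**i)*R_ep[s] for i, s in enumerate(S_ep))
print(G1)                           # -2.25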


  • (Naive) computation of the value function of an MRP
    • Generate a large number of episodes and compute the average return (a sketch follows below)
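
A minimal Monte-Carlo sketch of this idea for the student MRP, reusing `np`, `P`, and `R` from the cells above; the number of episodes, the 100-step cap, and the random seed are illustrative assumptions, not part of the original notebook.

In [ ]:
# Monte-Carlo estimate of v(C1): average the sampled returns of many episodes
np.random.seed(0)                   # assumption: fixed seed for reproducibility

gamma = 0.9
n_episodes = 10000
returns = []

for _ in range(n_episodes):
    x = 0                           # start every episode from C1
    G, t = 0, 0
    while x != 6 and t < 100:       # stop at the absorbing Sleep state (or cap)
        G = G + (gamma**t)*R[x]
        x = np.random.choice(len(P), 1, p = P[x][:])[0]
        t = t + 1
    returns.append(G)

print(np.mean(returns))             # should be close to v(C1) computed in Section 2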

2. Bellman Equations for MRP

  • The value function $v(S_t)$ can be decomposed into two parts:
    • Immediate reward $R_{t+1}$ at state $S_t$
    • Discounted value of successor state $\gamma v (S_{t+1})$


$$ \begin{align*} v(s) &= \mathbb{E} \left[G_t \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma v \left(S_{t+1} \right) \mid S_t = s \right] \end{align*} $$



  • Bellman Equation for Student MRP

2.1. Bellman Equation in Matrix Form



$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}v\left(s'\right) \qquad \forall s$$

The Bellman equation can be expressed concisely using matrices,


$$v = R + \gamma P v$$

where $v$ is a column vector with one entry per state


$$ \begin{bmatrix} v(1)\\ \vdots \\v(n) \end{bmatrix} = \begin{bmatrix} R_1\\ \vdots \\R_n \end{bmatrix} + \gamma \begin{bmatrix} p_{11} & \cdots & p_{1n}\\ &\vdots& \\p_{n1}& \cdots &p_{nn} \end{bmatrix} \begin{bmatrix} v(1)\\ \vdots \\v(n) \end{bmatrix} $$

2.2. Solving the Bellman Equation

  • The Bellman equation is a linear equation
  • It can be solved directly:


$$ \begin{align*} v &= R + \gamma P v \\ (I-\gamma P) v & = R \\ v & = (I - \gamma P)^{-1}R \end{align*} $$


  • Direct solution is only possible for small MRPs
    • Computational complexity is $O(n^3)$ for $n$ states


  • There are many iterative methods for large MRPs
    • Dynamic programming
    • Monte-Carlo simulation
    • Temporal-Difference learning
  • Iterative algorithm for value function (Value Iteration)

    • Initialize $v_1(s) = 0$ for all $s$

    • For $k=1$ until convergence

      • for all $s$ in $S$
$$v_{k+1}(s) \;\; \longleftarrow \;\; R(s) + \gamma \sum_{s' \in S} p\left(s' \mid s\right) v_k \left(s'\right) $$
    



In [4]:
# [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]

P = [[0, 0.5, 0, 0, 0, 0.5, 0],
    [0, 0, 0.8, 0, 0, 0, 0.2],
    [0, 0, 0, 0.6, 0.4, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0.2, 0.4, 0.4, 0, 0, 0, 0],
    [0.1, 0, 0, 0, 0, 0.9, 0],
    [0, 0, 0, 0, 0, 0, 1]]

R = [-2, -2, -2, 10, 1, -1, 0]

# convert to numpy matrices so that * denotes matrix multiplication
P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T          # column vector with one reward per state

gamma = 0.9

# closed-form (direct) solution: v = (I - gamma*P)^{-1} R
v = (np.eye(7) - gamma*P).I*R
print(v)
[[-5.01272891]
 [ 0.9426553 ]
 [ 4.08702125]
 [10.        ]
 [ 1.90839235]
 [-7.63760843]
 [ 0.        ]]
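
As a quick sanity check (not part of the original notebook), the computed vector should satisfy the Bellman equation $v = R + \gamma P v$:

In [ ]:
# sanity check: the direct solution satisfies v = R + gamma*P*v
print(np.allclose(v, R + gamma*P*v))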
In [5]:
gamma = 0.9

v = np.zeros([7,1])

# iterative fixed-point update: v <- R + gamma*P*v
for i in range(100):
    v = R + gamma*P*v

print(v)
[[-5.01272786]
 [ 0.94265541]
 [ 4.08702138]
 [10.        ]
 [ 1.90839268]
 [-7.63760653]
 [ 0.        ]]
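
The loop above runs a fixed number of sweeps in matrix form. Below is a per-state sketch that mirrors the pseudocode in Section 2.2 with an explicit convergence test; the tolerance of 1e-8 and the use of plain arrays are assumptions, not part of the original notebook.

In [ ]:
# per-state value iteration with a convergence check
P_arr = np.asarray(P)               # 7 x 7 transition matrix as a plain array
R_arr = np.asarray(R).flatten()     # reward vector, one entry per state

gamma = 0.9
v_k = np.zeros(7)

while True:
    v_next = np.zeros(7)
    for s in range(7):
        # v_{k+1}(s) = R(s) + gamma * sum_{s'} P(s'|s) * v_k(s')
        v_next[s] = R_arr[s] + gamma*np.dot(P_arr[s], v_k)
    if np.max(np.abs(v_next - v_k)) < 1e-8:
        break
    v_k = v_next

print(v_next)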

In [6]:
gamma = 1      # undiscounted case

v = np.zeros([7,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
[[-12.4182877 ]
 [  1.47030894]
 [  4.33713371]
 [ 10.        ]
 [  0.84103688]
 [-22.31800952]
 [  0.        ]]