Markov Reward Process
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Sources
- David Silver's RL Course at UCL
- Prof. Zico Kolter at CMU
Table of Contents
1. Markov Reward Process
Suppose that each transition in a Markov chain is associated with a reward, $r$
As the Markov chain proceeds from state to state, there is an associated sequence of rewards
Discount factor $\gamma$
Later, we will study dynamic programming and Markov decision theory $\implies$ Markov Decision Process (MDP)
1.1. Definition of MRP
Definition: A Markov Reward Process is a tuple $\langle S,P,R,\gamma \rangle$
- $S$ is a finite set of states
- $P$ is a state transition probability matrix
$$P_{ss'} = P\left[S_{t+1} = s' \mid S_t = s \right]$$
- $R$ is a reward function, $R_s = \mathbb{E} \left[ R_{t+1} \mid S_t = s \right]$
- $\gamma$ is a discount factor, $\gamma \in [0,1]$
Example: student MRP (the seven-state example from David Silver's course, with states C1, C2, C3, Pass, Pub, FB, and Sleep; its rewards and transition probabilities are defined in the code below)
1.2. Reward over Multiple Transitions (= Return $G_t$)
Definition: The return $G_t$ is the total discounted reward from time-step $t$
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
Discount factor $\gamma$
- It is reasonable to maximize the sum of rewards
- It is also reasonable to prefer rewards now to rewards later
- One solution: let the value of rewards decay exponentially
- Mathematically convenient (avoids infinite returns and values)
- Humans often act as if there were a discount factor $\gamma < 1$
- $\gamma = 0$: Only care about immediate reward
- $\gamma = 1$: Future reward is as beneficial as immediate reward
import numpy as np

# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
R = [-2, -2, -2, 10, 1, -1, 0]   # reward for each state
gamma = 0.9

# return G for a given state sequence
S = [0, 1, 2, 4, 2, 4]

# discounted sum of the rewards of the first five states in the sequence
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]

print(G)
# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9

# state transition probability matrix
P = [[0,   0.5, 0,   0,   0,   0.5, 0  ],
     [0,   0,   0.8, 0,   0,   0,   0.2],
     [0,   0,   0,   0.6, 0.4, 0,   0  ],
     [0,   0,   0,   0,   0,   0,   1  ],
     [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
     [0.1, 0,   0,   0,   0,   0.9, 0  ],
     [0,   0,   0,   0,   0,   0,   1  ]]

# sequence generated by the Markov chain, starting from state 0 (C1)
x = 0
S = [x]
for i in range(5):
    x = np.random.choice(len(P), 1, p=P[x])[0]
    S.append(x)

# discounted return of the sampled sequence (first five states)
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]

print(S)
print(G)
1.3. Value Function
Definition: The state value function $v(s)$ of an MRP is the expected return starting from state $s$
$$ \begin{align*} v(s) & = \mathbb{E}[G_t \mid S_t = s] \\ & = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] \end{align*} $$
- The value function $v(s)$ gives the long-term value of state $s$
- Sample returns for student MRP: starting from $S_1 = C_1$ with $\gamma = \frac{1}{2}$
$$G_1 = R_2 + \gamma R_3 + \cdots + \gamma^{T-2} R_T$$
- (Naive) Computing the value function of an MRP
- Generate a large number of episodes and compute the average return from each state (see the sketch below)
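For example, one possible episode of the student MRP is C1 → C2 → C3 → Pass → Sleep; with $\gamma = \frac{1}{2}$ its return is
$$G_1 = -2 + \frac{1}{2}(-2) + \frac{1}{4}(-2) + \frac{1}{8}(10) = -2.25$$
A minimal Monte-Carlo sketch of this naive approach, assuming the lists P, R and the value of gamma from the cells above are still in scope (the names sample_return, n_steps, n_episodes, and v_mc are illustrative, and each episode is truncated after a fixed number of steps):
# naive Monte-Carlo estimate of v(s): average many sampled returns
def sample_return(s0, n_steps=50):
    # roll out one (truncated) episode from s0 and accumulate the discounted return
    x, G = s0, 0
    for k in range(n_steps):
        G = G + (gamma**k)*R[x]
        x = np.random.choice(len(P), p=P[x])
    return G

n_episodes = 1000
v_mc = [np.mean([sample_return(s) for _ in range(n_episodes)]) for s in range(7)]
print(v_mc)
The resulting averages should be close to the values obtained from the Bellman equation in the next section.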
2. Bellman Equations for MRP
- The value function $v(S_t)$ can be decomposed into two parts:
- Immediate reward $R_{t+1}$ at state $S_t$
- Discounted value of successor state $\gamma v (S_{t+1})$
$$ \begin{align*} v(s) &= \mathbb{E} \left[G_t \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma v \left(S_{t+1} \right) \mid S_t = s \right] \end{align*} $$
- Bellman Equation for Student MRP
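For example, using the rewards and transition probabilities of the student MRP from the code above, the Bellman equation for state C3 reads
$$v(\text{C3}) = -2 + \gamma \left( 0.6 \, v(\text{Pass}) + 0.4 \, v(\text{Pub}) \right)$$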
2.1. Bellman Equation in Matrix Form
$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}v\left(s'\right) \qquad \forall s$$
The Bellman equation can be expressed concisely using matrices,
$$v = R + \gamma P v$$ where $v$ is a column vector with one entry per state
$$ \begin{bmatrix} v(1)\\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} R_1\\ \vdots \\ R_n \end{bmatrix} + \gamma \begin{bmatrix} p_{11} & \cdots & p_{1n}\\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix} \begin{bmatrix} v(1)\\ \vdots \\ v(n) \end{bmatrix} $$
2.2. Solving the Bellman Equation
The Bellman equation is a linear equation
It can be solved directly:
$$ \begin{align*} v &= R + \gamma P v \\ (I-\gamma P) v & = R \\ v & = (I - \gamma P)^{-1}R \end{align*} $$
- Direct solution is only possible for small MRPs
- Computational complexity is $O(n^3)$ for $n$ states
There are many iterative methods for large MRPs:
- Dynamic programming
- Monte-Carlo simulation
- Temporal-Difference learning
Iterative algorithm for value function (Value Iteration)
Initialize $v_1(s) = 0$ for all $s$
For $k=1$ until convergence
- for all $s$ in $S$
$$v_{k+1}(s) \;\; \longleftarrow \;\; R(s) + \gamma \sum_{s' \in S} p\left(s' \mid s\right) v_k \left(s'\right) $$
# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
P = [[0,   0.5, 0,   0,   0,   0.5, 0  ],
     [0,   0,   0.8, 0,   0,   0,   0.2],
     [0,   0,   0,   0.6, 0.4, 0,   0  ],
     [0,   0,   0,   0,   0,   0,   1  ],
     [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
     [0.1, 0,   0,   0,   0,   0.9, 0  ],
     [0,   0,   0,   0,   0,   0,   1  ]]
R = [-2, -2, -2, 10, 1, -1, 0]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T          # column vector of rewards

gamma = 0.9

# direct solution: v = (I - gamma*P)^{-1} R
v = (np.eye(7) - gamma*P).I*R
print(v)
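As a quick sanity check, a minimal sketch assuming the cell above has been run: the computed v should satisfy the Bellman equation $v = R + \gamma P v$.
# verify that v is a fixed point of the Bellman equation
print(np.allclose(v, R + gamma*P*v))   # expected: True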
# value iteration: v <- R + gamma*P*v, starting from v = 0
gamma = 0.9
v = np.zeros([7,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
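As another check, a sketch assuming the cells above have been run (so that P, R, gamma, and the iterated v are in scope; the name v_direct is illustrative): the fixed point reached by the iteration should agree with the direct solution.
# compare the iterated value estimate with the closed-form solution
v_direct = (np.eye(7) - gamma*P).I*R
print(np.max(np.abs(v - v_direct)))   # should be (near) zero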
# with gamma = 1 the iteration still converges for this MRP:
# every trajectory eventually reaches the absorbing Sleep state (reward 0)
gamma = 1
v = np.zeros([7,1])

for i in range(100):
    v = R + gamma*P*v

print(v)