Week 9: Policy evaluation#

What you see#

The example shows the policy evaluation algorithm on a simple (deterministic) gridworld with a living reward of \(-0.05\) per step and no discount (\(\gamma = 1\)). The goal is to get to the upper-right corner.

Every time you move pacman, the game executes a single update of the policy-evaluation algorithm, applied to the random policy in which each action is taken with probability \(\pi(a|s) = \frac{1}{4}\). You can switch between the value function and the action-value function by pressing m.

The algorithm converges after about 20 steps, at which point it has computed both \(v_\pi(s)\) and \(q_\pi(s, a)\) (shown depending on the view mode). These represent the expected accumulated reward (the return) obtained by following the (random) policy \(\pi\).
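
The two view modes are related in the standard way: \(q_\pi(s, a)\) is the expected return of taking action \(a\) in state \(s\) and following \(\pi\) afterwards, while \(v_\pi(s)\) averages the action values under the policy:

\[q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a], \qquad v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a)\]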

How it works#

When computing, for example, the value function \(v_\pi(s)\), the algorithm performs, in each step and for every state \(s\), the update:

\[V(s) \leftarrow \mathbb{E}_{\pi}[R_{t+1} + \gamma V(S_{t+1}) | S_{t} = s]\]

The expectation is with respect to the next state \(S_{t+1}\) (which, since the gridworld is deterministic, is determined by the action drawn from \(\pi\)). Let’s consider a concrete example. In the starting state \(s_0\) (the bottom-left corner), the random policy will with probability \(\frac{1}{2}\) move pacman into a wall (so that it stays in state \(s_0\)) and with probability \(\frac{1}{4}\) each move pacman up or right, reaching the states \(s'\) and \(s''\). Since the living reward is \(-0.05\), we can insert these into the update and get:

\[V(s_0) \leftarrow \frac{1}{2} (-0.05 + V(s_0) ) + \frac{1}{4} (-0.05 + V(s') ) + \frac{1}{4} (-0.05 + V(s'') )\]
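
For instance, if all values are initialized to zero, the very first sweep produces

\[V(s_0) \leftarrow \frac{1}{2}(-0.05 + 0) + \frac{1}{4}(-0.05 + 0) + \frac{1}{4}(-0.05 + 0) = -0.05\]

and later sweeps then propagate the accumulated step costs (and the value of reaching the goal) back towards \(s_0\).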

You can verify for yourself that the update in the example agrees with the general update rule above.
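
To make the procedure concrete, here is a minimal Python sketch of iterative policy evaluation under the random policy. The grid size, the absence of interior walls, the zero initialization, and the treatment of the goal as a zero-value terminal state are illustrative assumptions and need not match the demo exactly:

```python
# Minimal sketch of iterative policy evaluation under the uniformly random
# policy pi(a|s) = 1/4 on a small deterministic gridworld. The grid size
# (3x3), the lack of interior walls, and the goal being an absorbing state
# with value 0 are illustrative assumptions; the demo's layout may differ.
ROWS, COLS = 3, 3
GOAL = (0, COLS - 1)                 # upper-right corner
START = (ROWS - 1, 0)                # bottom-left corner, s_0 in the text
LIVING_REWARD = -0.05                # reward received on every step
GAMMA = 1.0                          # no discounting
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Deterministic transition: move one cell, or stay put when the move
    would leave the grid (i.e. pacman bumps into the wall)."""
    r, c = state
    nr, nc = r + action[0], c + action[1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state


def sweep(V):
    """One synchronous sweep of V(s) <- E_pi[R_{t+1} + gamma * V(S_{t+1})]."""
    V_new = {}
    for s in V:
        if s == GOAL:                # terminal: no further reward accrues
            V_new[s] = 0.0
            continue
        V_new[s] = sum(0.25 * (LIVING_REWARD + GAMMA * V[step(s, a)])
                       for a in ACTIONS)
    return V_new


# Start from V(s) = 0 everywhere and sweep until the values stop changing.
# The number of sweeps needed depends on the grid and on the tolerance.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for k in range(1, 1001):
    V_new = sweep(V)
    delta = max(abs(V_new[s] - V[s]) for s in V)
    V = V_new
    if delta < 1e-8:
        break

print(f"Converged after {k} sweeps; V(s_0) = {V[START]:.3f}")
```

The sketch uses synchronous sweeps (every state is updated from the previous value table); an in-place variant that overwrites values as it sweeps also converges and typically needs fewer sweeps.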