Week 9: Policy evaluation#
What you see#
The example shows the policy evaluation algorithm on a simple (deterministic) gridworld with a living reward of \(-0.05\) per step and no discount (\(\gamma = 1\)). The goal is to get to the upper-right corner.
Every time you move pacman, the game executes a single update of the policy-evaluation algorithm (applied to the random policy, where each action is taken with probability \(\pi(a|s) = \frac{1}{4}\)). You can switch between the value-function and action-value-function views by pressing m.
The algorithm will converge after about 20 steps and thereby compute both \(v_\pi(s)\) and \(q_\pi(s, a)\) (depending on the view-mode). These represent the expected return, i.e. the expected accumulated reward, obtained when following the (random) policy \(\pi\).
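The two view-modes are linked by \(v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)\). The sketch below illustrates this relation for the uniform random policy; the action names and the helper `v_from_q` are made up for illustration and are not the course's actual code.

```python
ACTIONS = ["north", "south", "east", "west"]

def pi(action, state):
    """Uniform random policy: pi(a|s) = 1/4 for every action in every state."""
    return 1 / len(ACTIONS)

def v_from_q(q, state):
    """Recover v_pi(s) from q_pi(s, a) via v_pi(s) = sum_a pi(a|s) q_pi(s, a).
    `q` is assumed to be a dict mapping (state, action) pairs to values."""
    return sum(pi(a, state) * q[(state, a)] for a in ACTIONS)
```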
How it works#
When computing e.g. the value function \(v(s)\), the algorithm will in each step, and for all states \(s\), perform the update:

\[
v_{k+1}(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma v_k(S_{t+1}) \,\middle|\, S_t = s \right] = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]
\]
The expectation is with respect to the next state \(S_{t+1}\), which is random because the action is drawn from \(\pi\). Let's consider a concrete example. In the starting state \(s_0\) (the bottom-left corner), the random policy will with probability \(\frac{1}{2}\) move pacman into a wall (and therefore leave it in state \(s_0\)), and with probability \(\frac{1}{4}\) each move pacman up or right, taking it to states \(s'\) and \(s''\) respectively. Since the living reward is \(-0.05\) and \(\gamma = 1\), we can insert and get:

\[
v_{k+1}(s_0) = \frac{1}{2}\left(-0.05 + v_k(s_0)\right) + \frac{1}{4}\left(-0.05 + v_k(s')\right) + \frac{1}{4}\left(-0.05 + v_k(s'')\right).
\]
You can verify for yourself, from the values displayed in the game, that this is indeed the update being applied in each step.
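To make the procedure concrete, here is a minimal sketch of iterative policy evaluation on a small gridworld of this kind. The \(3 \times 3\) layout, the state names, and the function `policy_evaluation` are illustrative assumptions rather than the course's actual environment code; the living reward of \(-0.05\), \(\gamma = 1\), and the uniform random policy match the description above.

```python
# Minimal sketch: iterative policy evaluation of the random policy on a small,
# deterministic gridworld (layout and names are illustrative assumptions).
ROWS, COLS = 3, 3
GOAL = (0, COLS - 1)          # terminal goal in the upper-right corner
START = (ROWS - 1, 0)         # s_0: the bottom-left corner
LIVING_REWARD = -0.05
GAMMA = 1.0                   # no discounting
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Deterministic transition: moving into a wall leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state

def policy_evaluation(tol=1e-6):
    """Repeat the update v_{k+1}(s) = sum_a pi(a|s) [r + gamma * v_k(s')]
    for all states until the largest change in any state is below `tol`."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    v = {s: 0.0 for s in states}          # v_0(s) = 0 for all states
    while True:
        v_new = {}
        for s in states:
            if s == GOAL:                 # terminal state keeps value 0
                v_new[s] = 0.0
            else:
                v_new[s] = sum(0.25 * (LIVING_REWARD + GAMMA * v[step(s, a)])
                               for a in ACTIONS)
        delta = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new
        if delta < tol:
            return v

v = policy_evaluation()
# Check the concrete update from the text at the start state s_0: two of the
# four actions bump into walls (stay in s_0), the other two move up / right.
backup = sum(0.25 * (LIVING_REWARD + GAMMA * v[step(START, a)]) for a in ACTIONS)
print(f"v_pi(s_0) = {v[START]:.3f}, one-step backup = {backup:.3f}")  # agree up to tol
```

At convergence the update leaves the values unchanged, which is exactly the fixed-point check printed on the last line.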