{% import macros as m with context %}

.. _policy_evaluation:

Week 9: Policy evaluation
=============================================================================================================

{{ m.embed_game('week9_policy_evaluation') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the policy evaluation algorithm
    :kbd:`Space`
        Take a single action according to the (random) policy
    :kbd:`m`
        Change between the value function :math:`v` and action-value function :math:`q`
    :kbd:`p`
        Follow the current (random) policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec09/unf_policy_evaluation_gridworld.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the policy evaluation algorithm on a simple (deterministic) gridworld with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman, the game executes a single sweep of the policy-evaluation algorithm (applied to the random policy that takes each action with probability :math:`\pi(a|s) = \frac{1}{4}`). You can switch between the value function and the action-value function by pressing :kbd:`m`.

The algorithm converges after about 20 sweeps and thereby computes both :math:`v_\pi(s)` and :math:`q_\pi(s, a)` (depending on the view mode). These represent the expected accumulated reward when following the (random) policy :math:`\pi`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When computing e.g. the value function :math:`v(s)`, the algorithm will in each step, and for all states, perform the update:

.. math::
    V(s) \leftarrow \mathbb{E}_{\pi}[R_{t+1} + \gamma V(S_{t+1}) | S_{t} = s]

where the expectation is with respect to the action chosen by the policy :math:`\pi` (and, in general, the resulting next state).

Let's consider a concrete example. In the starting state :math:`s_0` (the bottom-left corner), the random policy will with probability :math:`\frac{1}{2}` move pacman into a wall (and therefore stay in state :math:`s_0`), and with probability :math:`\frac{1}{4}` each move pacman up or right, thereby reaching states :math:`s'` and :math:`s''`. Since the living reward is :math:`-0.05`, we can insert and get:

.. math::
    V(s_0) \leftarrow \frac{1}{2} (-0.05 + V(s_0) ) + \frac{1}{4} (-0.05 + V(s') ) + \frac{1}{4} (-0.05 + V(s'') )

You can verify for yourself that this is indeed the update that is applied.
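To make the update above concrete in code, below is a minimal sketch of iterative policy evaluation for the uniform random policy on a small deterministic gridworld. It is not the implementation behind the game (see :gitref:`../irlc/lectures/lec09/unf_policy_evaluation_gridworld.py` for that); the grid dimensions, the goal being the only terminal state, and the absence of interior walls are assumptions made purely for illustration.

.. code-block:: python

    # Sketch of iterative policy evaluation for the uniform random policy
    # pi(a|s) = 1/4 on a small deterministic gridworld.
    # NOTE: grid size, goal position and the lack of interior walls are
    # illustrative assumptions, not the exact environment used in the game.
    ROWS, COLS = 3, 4                 # assumed grid dimensions (rows x columns)
    GOAL = (0, COLS - 1)              # assumed goal: the upper-right corner
    LIVING_REWARD = -0.05             # reward per step, as in the example
    GAMMA = 1.0                       # no discounting
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def step(state, action):
        """Deterministic dynamics: move one cell, or stay put when the move
        would leave the grid (pacman walks into a wall)."""
        r, c = state
        nr, nc = r + action[0], c + action[1]
        return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state

    def q_value(V, s, a):
        """Action value q(s, a) = living reward + gamma * V(s') under the
        deterministic dynamics above."""
        return LIVING_REWARD + GAMMA * V[step(s, a)]

    def sweep(V):
        """One sweep of V(s) <- E_pi[ R_{t+1} + gamma * V(S_{t+1}) | S_t = s ],
        applied to every non-terminal state under the uniform random policy."""
        V_new = dict(V)
        for s in V:
            if s == GOAL:             # the goal is terminal; its value stays 0
                continue
            V_new[s] = sum(0.25 * q_value(V, s, a) for a in ACTIONS)
        return V_new

    if __name__ == "__main__":
        # Start from V(s) = 0 everywhere and sweep until the values stop changing.
        V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
        for k in range(1, 10_000):
            V_new = sweep(V)
            delta = max(abs(V_new[s] - V[s]) for s in V)
            V = V_new
            if delta < 1e-6:
                break
        print(f"Converged after {k} sweeps")
        for r in range(ROWS):
            print(" ".join(f"{V[(r, c)]:7.3f}" for c in range(COLS)))

        # The update in the start state (bottom-left corner) matches the text:
        # two actions bounce off walls, two move to the neighbouring states.
        s0 = (ROWS - 1, 0)
        print("q-values in s0 (up, down, left, right):",
              [round(q_value(V, s0, a), 3) for a in ACTIONS])

Note how, in this sketch, the start state receives exactly the update written out for :math:`V(s_0)` above: the down and left actions bounce off the walls and contribute :math:`-0.05 + V(s_0)`, while the up and right actions contribute the living reward plus the current estimate of the successor state's value.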