{% import macros as m with context %}

.. _policy_evaluation:

Week 9: Policy evaluation
=============================================================================================================

{{ m.embed_game('week9_policy_evaluation') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the policy evaluation algorithm
    :kbd:`Space`
        Take a single action according to the (random) policy
    :kbd:`m`
        Change between the value function :math:`v` and action-value function :math:`q`
    :kbd:`p`
        Follow the current (random) policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec09/unf_policy_evaluation_gridworld.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the policy evaluation algorithm on a simple (deterministic) gridworld with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman, the game executes a single sweep of the policy-evaluation algorithm (applied to the random policy that takes each action with probability :math:`\pi(a|s) = \frac{1}{4}`). You can switch between the value function and the action-value function by pressing :kbd:`m`.

The algorithm converges after about 20 sweeps and thereby computes both :math:`v_\pi(s)` and :math:`q_\pi(s, a)` (depending on the view mode). These represent the expected accumulated reward when following the (random) policy :math:`\pi`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When computing e.g. the value function :math:`v(s)`, the algorithm will in each step, and for all states, perform the update:

.. math::
    V(s) \leftarrow \mathbb{E}_{\pi}[R_{t+1} + \gamma V(S_{t+1}) | S_{t} = s]

where the expectation is with respect to the action chosen by the policy :math:`\pi` (and, in general, the resulting next state).

Let's consider a concrete example. In the starting state :math:`s_0` (the bottom-left corner), the random policy will with probability :math:`\frac{1}{2}` move pacman into a wall (and therefore stay in state :math:`s_0`), and with probability :math:`\frac{1}{4}` each move pacman up or right, thereby reaching states :math:`s'` and :math:`s''`. Since the living reward is :math:`-0.05`, we can insert and get:

.. math::
    V(s_0) \leftarrow \frac{1}{2} (-0.05 + V(s_0) ) + \frac{1}{4} (-0.05 + V(s') ) + \frac{1}{4} (-0.05 + V(s'') )

You can verify for yourself that this is indeed the update that is applied.
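To make the update above concrete in code, below is a minimal sketch of iterative policy evaluation for the uniform random policy on a small deterministic gridworld. It is not the implementation behind the game (see :gitref:`../irlc/lectures/lec09/unf_policy_evaluation_gridworld.py` for that); the grid dimensions, the goal being the only terminal state, and the absence of interior walls are assumptions made purely for illustration.

.. code-block:: python

    # Sketch of iterative policy evaluation for the uniform random policy
    # pi(a|s) = 1/4 on a small deterministic gridworld.
    # NOTE: grid size, goal position and the lack of interior walls are
    # illustrative assumptions, not the exact environment used in the game.
    ROWS, COLS = 3, 4                 # assumed grid dimensions (rows x columns)
    GOAL = (0, COLS - 1)              # assumed goal: the upper-right corner
    LIVING_REWARD = -0.05             # reward per step, as in the example
    GAMMA = 1.0                       # no discounting
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def step(state, action):
        """Deterministic dynamics: move one cell, or stay put when the move
        would leave the grid (pacman walks into a wall)."""
        r, c = state
        nr, nc = r + action[0], c + action[1]
        return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state

    def q_value(V, s, a):
        """Action value q(s, a) = living reward + gamma * V(s') under the
        deterministic dynamics above."""
        return LIVING_REWARD + GAMMA * V[step(s, a)]

    def sweep(V):
        """One sweep of V(s) <- E_pi[ R_{t+1} + gamma * V(S_{t+1}) | S_t = s ],
        applied to every non-terminal state under the uniform random policy."""
        V_new = dict(V)
        for s in V:
            if s == GOAL:             # the goal is terminal; its value stays 0
                continue
            V_new[s] = sum(0.25 * q_value(V, s, a) for a in ACTIONS)
        return V_new

    if __name__ == "__main__":
        # Start from V(s) = 0 everywhere and sweep until the values stop changing.
        V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
        for k in range(1, 10_000):
            V_new = sweep(V)
            delta = max(abs(V_new[s] - V[s]) for s in V)
            V = V_new
            if delta < 1e-6:
                break
        print(f"Converged after {k} sweeps")
        for r in range(ROWS):
            print(" ".join(f"{V[(r, c)]:7.3f}" for c in range(COLS)))

        # The update in the start state (bottom-left corner) matches the text:
        # two actions bounce off walls, two move to the neighbouring states.
        s0 = (ROWS - 1, 0)
        print("q-values in s0 (up, down, left, right):",
              [round(q_value(V, s0, a), 3) for a in ACTIONS])

Note how, in this sketch, the start state receives exactly the update written out for :math:`V(s_0)` above: the down and left actions bounce off the walls and contribute :math:`-0.05 + V(s_0)`, while the up and right actions contribute the living reward plus the current estimate of the successor state's value.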