{% import macros as m with context  %}

.. _policy_improvement:

Week 9: Policy iteration
=============================================================================================================

{{ m.embed_game('week9_policy_iteration') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the policy iteration algorithm
    :kbd:`Space`
        Take a single action according to the current policy :math:`\pi`
    :kbd:`m`
        Switch between the value function :math:`v` and the action-value function :math:`q`.
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec09/unf_policy_improvement_gridworld.py`.



What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the policy iteration algorithm on a simple (deterministic) gridworld with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

The algorithm first evaluates the current policy for exactly :math:`20` sweeps and then performs a single policy-improvement step. In other words, during the first 20 steps the algorithm behaves exactly as in the policy-evaluation example.
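
To make the evaluation step concrete, here is a minimal sketch of iterative policy evaluation run for a fixed number of sweeps. It is not the ``irlc`` implementation; it assumes the MDP is given as a dictionary ``P`` where ``P[s][a]`` is a list of ``(probability, next_state, reward, done)`` tuples and ``pi[s][a]`` is the probability of taking action ``a`` in state ``s``.

.. code-block:: python

    def evaluate_policy(P, pi, gamma=1.0, sweeps=20):
        """Iterative policy evaluation of pi for a fixed number of sweeps."""
        v = {s: 0.0 for s in P}                     # initialize v(s) = 0 for all states
        for _ in range(sweeps):
            v_new = {}
            for s in P:
                # Expected one-step reward plus discounted value of the next state,
                # averaged over the policy's action probabilities.
                v_new[s] = sum(pi[s][a] * sum(p * (r + (0.0 if done else gamma * v[sp]))
                                              for p, sp, r, done in P[s][a])
                               for a in P[s])
            v = v_new
        return v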

The algorithm begins with a uniformly random policy (each action is taken with probability :math:`\pi(a|s) = \frac{1}{4}`). You can
switch between the value function and the action-value function by pressing :kbd:`m`.

Note that the current policy is shown as arrows.

The algorithm will converge after a few policy-iteration steps and thereby compute both :math:`v^*(s)` and :math:`q^*(s, a)` (depending on the view-mode). These represent the expected accumulated reward when following the optimal policy :math:`\pi^*`.
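
As a reminder (this is a standard identity rather than something specific to the demo), the two optimal functions are related by

.. math::

    q^*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v^*(s') \right],
    \qquad
    v^*(s) = \max_a q^*(s, a),

so the two view-modes show the same information, before and after maximizing over the actions.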

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As stated, you should consult the policy-evaluation example :ref:`policy_evaluation` to understand what the algorithm does during the first 20 steps. After the (initially random) policy has been evaluated, the algorithm updates the
current policy by acting greedily with respect to the value/action-value function (the step is the same in either case). The policy is always indicated by the arrows.
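
A matching sketch of the improvement step, under the same assumed ``P`` representation as above (again, not the ``irlc`` code): compute :math:`q(s, a)` by a one-step lookahead on the current :math:`v` and make the policy deterministic on the maximizing action.

.. code-block:: python

    def improve_policy(P, v, gamma=1.0):
        """Greedy policy improvement with respect to the current value function v."""
        pi = {}
        for s in P:
            # Action values q(s, a) from a one-step lookahead on v.
            q = {a: sum(p * (r + (0.0 if done else gamma * v[sp]))
                        for p, sp, r, done in P[s][a])
                 for a in P[s]}
            a_star = max(q, key=q.get)              # ties broken by the first maximizer
            pi[s] = {a: (1.0 if a == a_star else 0.0) for a in P[s]}
        return pi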

After the update, the process starts over with 20 more evaluation sweeps of the (updated) policy, as sketched below.
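
Putting the two sketches together, the loop the demo steps through looks roughly like this (``P`` is the assumed transition dictionary from above):

.. code-block:: python

    # Start from the uniformly random policy pi(a|s) = 1/4 and alternate
    # 20 evaluation sweeps with one greedy improvement step.
    pi = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    for _ in range(10):                                   # a few policy-iteration rounds
        v = evaluate_policy(P, pi, gamma=1.0, sweeps=20)  # policy evaluation
        pi = improve_policy(P, v, gamma=1.0)              # policy improvement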