{% import macros as m with context %}

.. _value_iteration:

Week 9: Value iteration
=============================================================================================================

{{ m.embed_game('week9_vi') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the value iteration algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`m`
        Change between the value function :math:`v` and action-value function :math:`q`
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec09/unf_vi_gridworld.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the value iteration algorithm on a simple (deterministic) gridworld with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman the game executes a single sweep of the value-iteration algorithm. You can switch between the value function and the action-value function by pressing :kbd:`m`. The algorithm converges after about 20 steps and thereby computes both :math:`v^*(s)` and :math:`q^*(s, a)` (depending on the view mode). These represent the expected return when following the optimal policy :math:`\pi^*`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When computing e.g. the action-value function :math:`q(s,a)`, which you can see by pressing :kbd:`m`, the algorithm will in each step, and for all states and actions, perform the update:

.. math::
    Q(s,a) \leftarrow \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') \,\middle|\, S_{t} = s, A_{t} = a \right]

where the expectation is with respect to the next state :math:`S_{t+1}`. Since the problem is deterministic, this expression simplifies greatly: if taking action :math:`a` in state :math:`s` leads to state :math:`s'`, the update in this example becomes

.. math::
    Q(s,a) \leftarrow -0.05 + \max_{a'} Q(s',a')

If the problem were not deterministic, we would instead average over the possible next states as given by the MDP transition probabilities :math:`p(s', r | s, a)`.
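
To make the update concrete, here is a minimal self-contained sketch of synchronous value iteration on a deterministic gridworld with a living reward of :math:`-0.05` and :math:`\gamma = 1`. The grid dimensions, goal placement, start state and helper names (``step``, ``living_reward``) are assumptions for illustration only and do not mirror the implementation in :gitref:`../irlc/lectures/lec09/unf_vi_gridworld.py`.

.. code-block:: python

    # Illustrative sketch only: a 3x4 deterministic gridworld with the goal in the
    # upper-right corner. Layout and names are assumptions, not the course code.
    rows, cols = 3, 4
    goal = (0, 3)                       # upper-right corner (row 0, last column)
    actions = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}
    living_reward, gamma = -0.05, 1.0

    def step(s, a):
        """Deterministic transition: move if the target cell exists, else stay put."""
        (r, c), (dr, dc) = s, actions[a]
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else s

    states = [(r, c) for r in range(rows) for c in range(cols)]
    Q = {(s, a): 0.0 for s in states for a in actions}

    for sweep in range(25):             # a couple of dozen sweeps suffice here
        Q_new = {}
        for s in states:
            for a in actions:
                if s == goal:           # terminal state: no further reward
                    Q_new[(s, a)] = 0.0
                    continue
                s_next = step(s, a)
                # Deterministic problem: the expectation collapses to a single term,
                # Q(s,a) <- -0.05 + max_a' Q(s', a')
                Q_new[(s, a)] = living_reward + gamma * max(Q[(s_next, b)] for b in actions)
        Q = Q_new

    # v*(s) and a greedy (optimal) policy follow directly from q*(s,a)
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a, s=s: Q[(s, a)]) for s in states}
    print("v*(bottom-left) =", round(V[(rows - 1, 0)], 2), " greedy action:", pi[(rows - 1, 0)])

Each pass of the outer loop corresponds to one key press in the game: every state-action pair is updated once from the previous sweep's table, and the values propagate outward from the goal until the table stops changing.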