{% import macros as m with context %}

.. _nstep_game:

Week 11: N-step Sarsa
=============================================================================================================

{{ m.embed_game('week11_nstep') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the Sarsa algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec11/lecture_11_nstep_open.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the n-step Sarsa algorithm applied to a gridworld. The gridworld has a single, terminal reward. The n-step Sarsa method will eventually learn the optimal policy.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The n-step method is a bit annoying to implement, and I would refer to the pseudo-code in (:cite:t:`sutton`, Section 7.2) for a complete picture. However, the basic idea is that we define the n-step return:

.. math::
    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})

What this means is that at a given time :math:`t`, we need to know what happens :math:`n` steps into the future, and knowing this, we can compute the above quantity. So concretely, once Pacman has taken at least :math:`n` steps, then in each subsequent step we can compute the corresponding :math:`n`-step return (in the past, so to speak). This allows Pacman to update :math:`Q(S_t, A_t)` -- but remember that :math:`t` lies :math:`n` steps in the past, so :math:`S_t`, :math:`A_t` is the state/action Pacman visited :math:`n` steps previously. The update is then:

.. math::
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (G_{t:t+n} - Q(S_t, A_t))

This corresponds to regular Sarsa when :math:`n=1`. There is one small wrinkle (and quite frankly, this is why the algorithm is annoying to implement): after the episode has terminated, we are 'missing' :math:`n` updates, since we were updating the values :math:`n` steps in the past. These have to be applied, which is why you see :math:`n` Q-values change in the last step. This is concretely implemented in the last :python:`if`-statement in the pseudo-code in (:cite:t:`sutton`, Section 7.2).
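
To make the bookkeeping concrete, here is a minimal sketch of tabular n-step Sarsa in the style of the pseudo-code in (:cite:t:`sutton`, Section 7.2). It is not the course implementation: it assumes an old-style Gym environment (``env.reset()`` returning a state and ``env.step(a)`` returning ``(s, r, done, info)``) with hashable states and a discrete action space, and the names (``nstep_sarsa``, ``eps_greedy``) are purely illustrative.

.. code-block:: python

    import numpy as np
    from collections import defaultdict

    def nstep_sarsa(env, n=4, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=500):
        """Sketch of tabular n-step Sarsa; assumes an old-style Gym interface."""
        Q = defaultdict(lambda: np.zeros(env.action_space.n))

        def eps_greedy(s):
            # epsilon-greedy action selection from the current Q-estimate.
            if np.random.rand() < epsilon:
                return env.action_space.sample()
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            # Buffers so that S[k], A[k], R[k] are the state, action and reward
            # at time k (R[0] is an unused placeholder).
            S, A, R = [env.reset()], [], [0.0]
            A.append(eps_greedy(S[0]))
            T = float('inf')   # time of termination (unknown until the episode ends)
            t = 0
            while True:
                if t < T:
                    s_next, r, done, _ = env.step(A[t])
                    S.append(s_next)
                    R.append(r)
                    if done:
                        T = t + 1
                    else:
                        A.append(eps_greedy(s_next))
                tau = t - n + 1          # the time whose Q-value is updated now
                if tau >= 0:
                    # n-step return G_{tau:tau+n}, truncated at termination.
                    G = sum(gamma ** (k - tau - 1) * R[k]
                            for k in range(tau + 1, min(tau + n, T) + 1))
                    if tau + n < T:
                        G += gamma ** n * Q[S[tau + n]][A[tau + n]]
                    Q[S[tau]][A[tau]] += alpha * (G - Q[S[tau]][A[tau]])
                if tau == T - 1:         # all outstanding updates have been applied
                    break
                t += 1
        return Q

Note how ``tau`` is the time whose value is being updated: it trails ``t`` by :math:`n-1` steps, and the loop only stops once ``tau`` reaches :math:`T-1`. The final iterations, where :math:`t \geq T`, are precisely the 'missing' updates mentioned above, which is why several Q-values change when the episode ends.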