{% import macros as m with context  %}

.. _sarsalambda_game:


Week 12: Sarsa(Lambda)
=============================================================================================================

{{ m.embed_game('week12_sarsa_lambda') }}


.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the Sarsa(Lambda) algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec12/lecture_12_sarsa_lambda_open.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the Sarsa(Lambda) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square.
The light-blue numbers are the values of the eligibility trace, i.e. :math:`E(s,a)`, which are used to update the :math:`Q`-values.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

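When the agent takes action :math:`a` in state :math:`s`, receives an immediate reward :math:`r`, moves to a new state :math:`s'`, and there selects action :math:`a'`,
the :math:`Q`-values are updated as follows. First, the temporal-difference error :math:`\delta` is computed and the eligibility trace of the visited pair :math:`(s, a)` is incremented: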

.. math::
    :label: sarsalambda

    \delta & = r + \gamma Q(s', a') - Q(s, a) \\
    E(s, a) & = E(s, a) + 1

Then for *all* states and actions the following updates are performed:

.. math::

    Q(s, a) & = Q(s, a) + \alpha \delta E(s,a) \\
    E(s,a) & = \gamma \lambda E(s,a)
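
The two steps above can be sketched in Python as follows. This is a minimal illustration and not the course implementation: the function name ``sarsa_lambda_step``, the dictionary-based tables ``Q`` and ``E`` keyed by ``(state, action)``, and the default parameter values are all assumptions made for this example.

.. code-block:: python

    from collections import defaultdict


    def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                          alpha=0.5, gamma=0.9, lam=0.8):
        """Apply one Sarsa(Lambda) update in place and return delta.

        Q and E are assumed to be defaultdict(float) tables, so pairs
        that were never visited (and terminal states) have value 0.
        """
        # Temporal-difference error for the transition (s, a, r, s', a').
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]

        # Increment the eligibility trace of the pair just visited.
        E[(s, a)] += 1.0

        # Update every pair with a stored trace, then decay that trace.
        for key in list(E):
            Q[key] += alpha * delta * E[key]
            E[key] *= gamma * lam

        return delta

Looping only over the pairs stored in ``E`` has the same effect as looping over all states and actions, because a pair that has never been visited has a trace of zero and its :math:`Q`-value is therefore unchanged.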

Actions are selected epsilon-greedy with respect to the Q-values.
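
Epsilon-greedy selection could look roughly like the sketch below. The function name ``epsilon_greedy``, the ``(state, action)``-keyed dictionary ``Q``, the default ``epsilon`` and the random tie-breaking are assumptions for illustration; the simulation may handle ties and exploration differently.

.. code-block:: python

    import random


    def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
        """Pick an action for state s from the list `actions`."""
        # With probability epsilon, explore: pick a uniformly random action.
        if rng.random() < epsilon:
            return rng.choice(actions)

        # Otherwise exploit: pick an action with the highest Q-value,
        # treating missing entries as 0 and breaking ties at random.
        best = max(Q.get((s, a), 0.0) for a in actions)
        greedy = [a for a in actions if Q.get((s, a), 0.0) == best]
        return rng.choice(greedy)

With ``epsilon=0`` the choice is purely greedy, and with ``epsilon=1`` it is uniformly random over ``actions``.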


.. warning::

    Similar to :ref:`sarsa_game`, the updates to the :math:`Q`-values seem to lag a step behind Pacman, because we need to know the next action :math:`a'` in order to update :math:`Q(s, a)` (see above).

    For visualization purposes, the simulation shows the value of the eligibility trace just after the update :math:`Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s,a)`, since I think it is easier to understand
    the update when one can see the value of the eligibility trace used in the update, rather than the value after it has been exponentially decayed by :math:`E(s,a) \leftarrow \gamma \lambda E(s,a)`.