{% import macros as m with context  %}

.. _tdlambda_game:


Week 12: TD(Lambda)
=============================================================================================================

{{ m.embed_game('week12_td_lambda') }}


.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the TD(Lambda) algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec12/lecture_12_td_lambda.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the TD(Lambda) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square.
The light-blue numbers are the values of the eligibility trace, i.e. :math:`E(s)`, which are used to update the state values :math:`V(s)`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the agent takes action :math:`a` in state :math:`s`, receives an immediate reward :math:`r`, and moves to a new state :math:`s'`, the value function is updated as follows:

.. math::
    :label: tdlambda

    \delta & = r + \gamma V(s') - V(s) \\
    E(s) & = E(s) + 1

Then for *all* states the following updates are performed:

.. math::

    V(s) & = V(s) + \alpha \delta E(s) \\
    E(s) & = \gamma \lambda E(s)
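
Below is a minimal Python sketch of this update for a tabular problem, assuming the values and eligibility traces are stored in dictionaries keyed by state. The names (``V``, ``E``, ``td_lambda_step`` and the parameter values) are illustrative and are not taken from the course code.

.. code-block:: python

    from collections import defaultdict

    V = defaultdict(float)   # state values V(s)
    E = defaultdict(float)   # eligibility traces E(s)
    alpha, gamma, lmbda = 0.5, 1.0, 0.9  # learning rate, discount, trace decay

    def td_lambda_step(s, r, sp, done):
        """Apply one TD(lambda) update after observing (s, r, s')."""
        # TD error; the value of a terminal state is taken to be 0.
        delta = r + (0.0 if done else gamma * V[sp]) - V[s]
        # Accumulating trace for the visited state.
        E[s] += 1.0
        # Update all states with a nonzero trace, then decay their traces.
        for x in list(E):
            V[x] += alpha * delta * E[x]
            E[x] *= gamma * lmbda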

In this implementation, the game is refreshed (re-drawn)
right after :math:`V(s) \leftarrow V(s) + \alpha \delta E(s)`, but before the eligibility trace is exponentially decayed. I think this makes the visualization easier to follow.
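
As a usage sketch, the per-step update above could be driven from an episode loop like the one below; the gymnasium-style ``reset``/``step`` interface and the ``policy`` function are assumptions and do not refer to the gridworld implementation used in this example.

.. code-block:: python

    def run_episode(env, policy):
        """Run one episode, applying the TD(lambda) update at every step."""
        s, _ = env.reset()
        E.clear()                  # traces are reset at the start of an episode
        done = False
        while not done:
            a = policy(s)
            sp, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            td_lambda_step(s, r, sp, done)
            s = sp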