{% import macros as m with context %}

.. _game_td0:

Week 10: TD-learning
=============================================================================================================

{{ m.embed_game('week10_td') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of TD(0)
    :kbd:`Space`
        Take a single action according to a random policy
    :kbd:`p`
        Follow the random policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec10/lecture_10_td_keyboard.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the TD(0) algorithm applied to a deterministic gridworld environment with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman, the game executes a single update of the TD(0) algorithm. It takes quite a few steps for the algorithm to converge, but once it has converged it shows the value function :math:`v_\pi(s)` for the current policy -- which by default is the random policy. This means the algorithm computes the same result as the policy evaluation seen in :ref:`policy_evaluation`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you transition from a state :math:`s` to a state :math:`s'` and receive the reward :math:`R_{t+1}`, the algorithm iteratively updates :math:`V(s)` according to the rule:

.. math::
    V(s) \leftarrow V(s) + \alpha \left( R_{t+1} + \gamma V(s') - V(s) \right)

where :math:`\alpha = 0.5` is the learning rate.
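
A minimal, self-contained sketch of the same update is shown below. It does not use the course's ``irlc`` environments; the 4x4 layout, the coordinate convention and the transition function are illustrative assumptions, while the living reward of :math:`-0.05`, :math:`\gamma = 1` and :math:`\alpha = 0.5` match the example above.

.. code-block:: python

    # Tabular TD(0) evaluation of a random policy on a tiny deterministic gridworld
    # (illustrative sketch; the grid layout and dynamics are assumptions).
    import random
    from collections import defaultdict

    def td0_estimate(episodes=5000, alpha=0.5, gamma=1.0, size=4, living_reward=-0.05):
        goal = (size - 1, size - 1)            # the "upper-right corner" in (x, y) coordinates
        actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        V = defaultdict(float)                 # V(s) is initialized to 0 for every state

        def step(s, a):
            # Deterministic transition: moves are clipped to the grid; reaching the goal ends the episode.
            x = min(max(s[0] + a[0], 0), size - 1)
            y = min(max(s[1] + a[1], 0), size - 1)
            s_next = (x, y)
            return s_next, living_reward, s_next == goal

        for _ in range(episodes):
            s, done = (0, 0), False
            while not done:
                a = random.choice(actions)             # the random policy pi
                s_next, r, done = step(s, a)
                target = r if done else r + gamma * V[s_next]
                V[s] += alpha * (target - V[s])        # the TD(0) update
                s = s_next
        return V

    if __name__ == "__main__":
        V = td0_estimate()
        for y in range(3, -1, -1):                     # print rows from top to bottom
            print(" ".join(f"{V[(x, y)]:7.2f}" for x in range(4)))

Because the step size :math:`\alpha = 0.5` is kept constant, the estimates remain somewhat noisy; decaying :math:`\alpha` over time would bring the values closer to the exact policy-evaluation result.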