{% import macros as m with context  %}
{% set week = 12 %}
{{ m.exercise_head(week) }}

The main exercise today will be the tabular version of the :math:`\textrm{TD}(\lambda)` algorithm described
at `http://incompleteideas.net/book/first/ebook/node77.html <http://incompleteideas.net/book/first/ebook/node77.html>`__. The algorithm will
be described in Todays lectures before the version which uses function approximators.


Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}


.. admonition:: Solution to the conceptual problem 12.1
    :class: dropdown

    First, the :math:`n`-step returns are defined in (:cite:t:`sutton`, Equation 12.1) as:

    .. math::

        G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n}  + v(S_{t+n}, w_{t+n-1} )

    From this we can see that

    .. math::

        G_{t+1:t+1} & = v(S_{t+1}, w_t) \\
        G_{t:t+n} & = R_{t+1} + \gamma G_{t+1:t+n}

    And we also need the power series: :math:`(1-\lambda)\sum_{k=1}^\infty \lambda^{k-1} = 1`.
    Given this, the result is a fairly straight-forward but tedious calculation:

    .. math::

        G_t^\lambda & = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1} G_{t:t+n} \\
                & = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} (R_{t+1} + \gamma G_{t+1:t+1+(n-1) } ) \\
                & = R_{t+1} + (1-\lambda)  \sum_{n=1}^\infty \lambda^{n-1}  G_{t+1:t+1+(n-1) } \\
                & = R_{t+1} + (1-\lambda) \gamma ( G_{t+1:t+1} + \lambda \sum_{n=2}^\infty \lambda^{n-2}  G_{t+1:t+1+(n-1) } ) \\
                & = R_{t+1} + (1-\lambda) \gamma (v(S_{t+1}, w_t) + \lambda \sum_{m=1}^\infty \lambda^{m-1}  G_{t+1:t+1+m } ) \\
                & = R_{t+1} + (1-\lambda) \gamma v(S_{t+1}, w_t) + \lambda \gamma (1-\lambda) \sum_{m=1}^\infty \lambda^{m-1}  G_{t+1:t+1+m } \\
                & = R_{t+1} + (1-\lambda) \gamma v(S_{t+1}, w_t) + \lambda \gamma G_{t+1}^\lambda \\
                & = R_{t+1} + \gamma v(S_{t+1}, w_t) +  \lambda \gamma (G_{t+1}^\lambda - v(S_{t+1}, w_t) )

    Intuitively, the first two terms in this decomposition is the TD(0) error, whereas the expression in the last term will
    have mean 0 since both :math:`G_{t+1}^\lambda` and :math:`v(S_{t+1}, w_t)` are estimates of the value function at the next state.


{% endif %}


{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=dbb516d9-5a59-4376-bfe7-aff100ccc0b8', 'Problem 12.1: Tabular Sarsa(Lambda)', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=f562a730-b161-4454-88bf-aff100d10466', 'Problem 12.2: Tabular Sarsa(Lambda) and the open gridworld environment', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=bc7383ad-92cd-4960-9c22-aff100d530fa', 'Problem 12.3: Semi-gradient Sarsa(Lambda)', True) }}