{% import macros as m with context %} {% set week = 12 %} {{ m.exercise_head(week) }} The main exercise today will be the tabular version of the :math:`\textrm{TD}(\lambda)` algorithm described at `http://incompleteideas.net/book/first/ebook/node77.html `__. The algorithm will be described in Todays lectures before the version which uses function approximators. Solutions to selected exercises ------------------------------------------------------------------------------------------------------- {% if show_solution[week] %} .. admonition:: Solution to the conceptual problem 12.1 :class: dropdown First, the :math:`n`-step returns are defined in (:cite:t:`sutton`, Equation 12.1) as: .. math:: G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + v(S_{t+n}, w_{t+n-1} ) From this we can see that .. math:: G_{t+1:t+1} & = v(S_{t+1}, w_t) \\ G_{t:t+n} & = R_{t+1} + \gamma G_{t+1:t+n} And we also need the power series: :math:`(1-\lambda)\sum_{k=1}^\infty \lambda^{k-1} = 1`. Given this, the result is a fairly straight-forward but tedious calculation: .. math:: G_t^\lambda & = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1} G_{t:t+n} \\ & = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} (R_{t+1} + \gamma G_{t+1:t+1+(n-1) } ) \\ & = R_{t+1} + (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} G_{t+1:t+1+(n-1) } \\ & = R_{t+1} + (1-\lambda) \gamma ( G_{t+1:t+1} + \lambda \sum_{n=2}^\infty \lambda^{n-2} G_{t+1:t+1+(n-1) } ) \\ & = R_{t+1} + (1-\lambda) \gamma (v(S_{t+1}, w_t) + \lambda \sum_{m=1}^\infty \lambda^{m-1} G_{t+1:t+1+m } ) \\ & = R_{t+1} + (1-\lambda) \gamma v(S_{t+1}, w_t) + \lambda \gamma (1-\lambda) \sum_{m=1}^\infty \lambda^{m-1} G_{t+1:t+1+m } \\ & = R_{t+1} + (1-\lambda) \gamma v(S_{t+1}, w_t) + \lambda \gamma G_{t+1}^\lambda \\ & = R_{t+1} + \gamma v(S_{t+1}, w_t) + \lambda \gamma (G_{t+1}^\lambda - v(S_{t+1}, w_t) ) Intuitively, the first two terms in this decomposition is the TD(0) error, whereas the expression in the last term will have mean 0 since both :math:`G_{t+1}^\lambda` and :math:`v(S_{t+1}, w_t)` are estimates of the value function at the next state. {% endif %} {{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=dbb516d9-5a59-4376-bfe7-aff100ccc0b8', 'Problem 12.1: Tabular Sarsa(Lambda)', True) }} {{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=f562a730-b161-4454-88bf-aff100d10466', 'Problem 12.2: Tabular Sarsa(Lambda) and the open gridworld environment', True) }} {{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=bc7383ad-92cd-4960-9c22-aff100d530fa', 'Problem 12.3: Semi-gradient Sarsa(Lambda)', True) }}