Exercise 12: Eligibility traces and value-function approximations#
Note
The exercises material is divided into general information (found on this page) and the actual exercise instructions. You can download this weeks exercise instructions from here:
You are encouraged to prepare the homework problems 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
The main exercise today will be the tabular version of the \(\textrm{TD}(\lambda)\) algorithm described at http://incompleteideas.net/book/first/ebook/node77.html. The algorithm will be described in Todays lectures before the version which uses function approximators.
Solutions to selected exercises#
Solution to the conceptual problem 12.1
First, the \(n\)-step returns are defined in (Sutton and Barto [SB18], Equation 12.1) as:
From this we can see that
And we also need the power series: \((1-\lambda)\sum_{k=1}^\infty \lambda^{k-1} = 1\). Given this, the result is a fairly straight-forward but tedious calculation:
Intuitively, the first two terms in this decomposition is the TD(0) error, whereas the expression in the last term will have mean 0 since both \(G_{t+1}^\lambda\) and \(v(S_{t+1}, w_t)\) are estimates of the value function at the next state.
Problem 12.1: Tabular Sarsa(Lambda)
Problem 12.2: Tabular Sarsa(Lambda) and the open gridworld environment
Problem 12.3: Semi-gradient Sarsa(Lambda)