Exercise 12: Eligibility traces

Note

This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
- 02465ex12_Python.pdf
Slides: [1x] ([6x]). Reading: Chapter 10.2; 12-12.7, [SB18].
You are encouraged to prepare homework problem 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date.

The main exercise today is the tabular version of the TD($λ$) algorithm described at http://incompleteideas.net/book/first/ebook/node77.html. Today's lecture covers this tabular algorithm first, before the version which uses function approximators.
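As a rough illustration of the tabular algorithm, the sketch below runs TD($λ$) with accumulating traces on a small random-walk chain. The environment, the function name `td_lambda_random_walk`, and all parameter values are assumptions made for this example and are not taken from the exercise code.

```python
import numpy as np


def td_lambda_random_walk(n_states=5, episodes=500, alpha=0.05,
                          gamma=1.0, lam=0.8, seed=0):
    """Tabular TD(lambda) with accumulating traces (illustrative sketch).

    Hypothetical toy environment: a 1-D random walk with non-terminal
    states 1..n_states. States 0 and n_states + 1 are terminal, and the
    only non-zero reward is +1 for stepping into the right terminal state.
    The evaluated policy moves left or right with equal probability.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)          # terminal values stay at 0
    for _ in range(episodes):
        z = np.zeros_like(V)            # eligibility trace, reset per episode
        s = (n_states + 1) // 2         # start in the middle of the chain
        while 0 < s < n_states + 1:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == n_states + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]   # one-step TD error
            z *= gamma * lam            # decay all traces
            z[s] += 1.0                 # accumulate trace for the visited state
            V += alpha * delta * z      # update every state by its trace
            s = s_next
    return V[1:-1]                      # estimates for non-terminal states


if __name__ == "__main__":
    print(td_lambda_random_walk())
```

For this chain with `gamma=1.0`, the true value of state $i$ is $i / (n + 1)$ with $n$ the number of non-terminal states, so the printed estimates should increase from left to right.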

Solutions to selected exercises

Solution to the conceptual problem 12.1

First, the $n$-step returns are defined in (Sutton and Barto [SB18], Equation 12.1) as:

$$
G_{t : t + n} = R_{t + 1} + γ R_{t + 2} + \dots + γ^{n - 1} R_{t + n} + γ^{n} v (S_{t + n}, w_{t + n - 1})
$$

From this we can see that

$$
\begin{aligned}
G_{t + 1 : t + 1} & = v (S_{t + 1}, w_{t}) \\
G_{t : t + n} & = R_{t + 1} + γ G_{t + 1 : t + n}
\end{aligned}
$$
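The recursion can be checked numerically. The following sketch assumes the bootstrap term of the $n$-step return is discounted by $γ^{n}$, as in Equation 12.1 of the book, and that a single fixed weight vector is used, so the value estimates can be stored in one array. The helper name `n_step_return` and the numbers are made up for illustration.

```python
import math


def n_step_return(rewards, values, t, n, gamma):
    """Compute G_{t:t+n} under the assumptions stated above.

    rewards[k] holds R_{k+1} and values[k] holds the estimate v(S_k, w).
    With n = 0 the return reduces to the value estimate v(S_t, w).
    """
    g = sum(gamma**i * rewards[t + i] for i in range(n))
    return g + gamma**n * values[t + n]


# Arbitrary illustrative data (not from the exercise).
rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 1.5, -0.5, 2.0, 0.0]
gamma = 0.9

for n in range(1, 5):
    lhs = n_step_return(rewards, values, 0, n, gamma)
    # G_{t:t+n} = R_{t+1} + gamma * G_{t+1:t+n}, where the inner return
    # starts at t + 1 and therefore spans n - 1 steps.
    rhs = rewards[0] + gamma * n_step_return(rewards, values, 1, n - 1, gamma)
    assert math.isclose(lhs, rhs)
```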

We also need the geometric series identity $(1 - λ) \sum_{k = 1}^{\infty} λ^{k - 1} = 1$, which holds for $0 \le λ < 1$. Given this, the result is a fairly straightforward but tedious calculation:

$$
\begin{aligned}
G_{t}^{λ} & = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} G_{t : t + n} \\
& = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} (R_{t + 1} + γ G_{t + 1 : t + 1 + (n - 1)}) \\
& = R_{t + 1} + (1 - λ) γ \sum_{n = 1}^{\infty} λ^{n - 1} G_{t + 1 : t + 1 + (n - 1)} \\
& = R_{t + 1} + (1 - λ) γ (G_{t + 1 : t + 1} + λ \sum_{n = 2}^{\infty} λ^{n - 2} G_{t + 1 : t + 1 + (n - 1)}) \\
& = R_{t + 1} + (1 - λ) γ (v (S_{t + 1}, w_{t}) + λ \sum_{m = 1}^{\infty} λ^{m - 1} G_{t + 1 : t + 1 + m}) \\
& = R_{t + 1} + (1 - λ) γ v (S_{t + 1}, w_{t}) + λ γ (1 - λ) \sum_{m = 1}^{\infty} λ^{m - 1} G_{t + 1 : t + 1 + m} \\
& = R_{t + 1} + (1 - λ) γ v (S_{t + 1}, w_{t}) + λ γ G_{t + 1}^{λ} \\
& = R_{t + 1} + γ v (S_{t + 1}, w_{t}) + λ γ (G_{t + 1}^{λ} - v (S_{t + 1}, w_{t}))
\end{aligned}
$$

Intuitively, the first two terms in this decomposition form the one-step TD(0) target, whereas the difference in the last term should be close to zero on average, since both $G_{t + 1}^{λ}$ and $v (S_{t + 1}, w_{t})$ are estimates of the value of the next state.
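The recursive form of the $λ$-return can also be verified numerically. The sketch below makes some simplifying assumptions that are not part of the problem statement: the episode is finite with a terminal state of value 0, the weights are held fixed so the value estimates fit in one array, and every $n$-step return that reaches the end of the episode equals the full return. The function names and data are hypothetical.

```python
def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return G_t^lambda (illustrative sketch).

    rewards[k] holds R_{k+1}, values[k] holds v(S_k, w), and the episode
    has length T = len(rewards) with values[T] == 0 at the terminal state.
    """
    T = len(rewards)
    if t >= T:
        return 0.0  # nothing is collected after termination

    def n_step(n):
        # G_{t:t+n}: n discounted rewards plus the discounted bootstrap value.
        g = sum(gamma**i * rewards[t + i] for i in range(n))
        return g + gamma**n * values[t + n]

    # Returns that stop before the end of the episode get weight
    # (1 - lam) * lam^(n - 1); the remaining weight lam^(T - t - 1)
    # goes to the full return.
    partial = (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, T - t))
    return partial + lam**(T - t - 1) * n_step(T - t)


def max_recursion_gap(rewards, values, gamma, lam):
    """Largest deviation between the direct and the recursive lambda-return."""
    gaps = []
    for t in range(len(rewards)):
        direct = lambda_return(rewards, values, t, gamma, lam)
        recursive = rewards[t] + gamma * (
            (1 - lam) * values[t + 1]
            + lam * lambda_return(rewards, values, t + 1, gamma, lam)
        )
        gaps.append(abs(direct - recursive))
    return max(gaps)


# Arbitrary illustrative episode (not from the exercise).
rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 1.5, -0.5, 2.0, 0.0]   # last entry is the terminal state
print(max_recursion_gap(rewards, values, gamma=0.9, lam=0.7))
```

The printed gap should be zero up to floating-point rounding, which matches the second-to-last line of the derivation.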

Problem 12.1: Tabular Sarsa(Lambda)

Problem 12.2: Tabular Sarsa(Lambda) and the open gridworld environment

Problem 12.3: Semi-gradient Sarsa(Lambda)
