
Week 12: TD(Lambda)#

What you see#

The example shows the TD(Lambda) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square. The light-blue numbers are the values of the eligibility trace, i.e. \(E(s)\), which are used to update the value function \(V\).

How it works#

When the agent takes action \(a\) in state \(s\), receives an immediate reward \(r\), and moves to a new state \(s'\), the value function is updated as follows:

\[\begin{split}\delta & = r + \gamma V(s') - V(s) \\ E(s) & = E(s) + 1\end{split}\]
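A minimal sketch of this step, assuming a tabular setup where the value function and the eligibility traces are stored as Python dicts (the names `V`, `E`, and `gamma` are illustrative, not taken from the implementation shown in the example):

```python
def td_error_and_bump_trace(V, E, s, r, s_next, gamma):
    """Compute the TD error for the transition and increment the trace of s.

    V and E map states to floats; unseen states default to 0.
    """
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # delta = r + gamma V(s') - V(s)
    E[s] = E.get(s, 0.0) + 1.0                              # E(s) = E(s) + 1 (accumulating trace)
    return delta
```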

Then, for all states, the following updates are performed:

\[\begin{split}V(s) & = V(s) + \alpha \delta E(s) \\ E(s) & = \gamma \lambda E(s)\end{split}\]
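A corresponding sketch of this sweep, under the same assumptions (`alpha` and `lam` are illustrative names for the learning rate \(\alpha\) and the trace-decay parameter \(\lambda\)):

```python
def td_lambda_sweep(V, E, delta, alpha, gamma, lam):
    """Credit every traced state by its eligibility, then decay all traces."""
    for s in list(E):
        V[s] = V.get(s, 0.0) + alpha * delta * E[s]  # V(s) = V(s) + alpha * delta * E(s)
        E[s] *= gamma * lam                          # E(s) = gamma * lambda * E(s)
```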

In this implementation, the display is refreshed right after the value update \(V(s) \leftarrow V(s) + \alpha \delta E(s)\), but before the eligibility traces are exponentially decayed; this makes the traces easier to follow in the visualization.
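Putting the two steps together, here is a sketch of an episode loop that refreshes the display at exactly that point, between the value update and the trace decay. The `env` interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) and the `policy` callable are hypothetical stand-ins, not the actual environment used in the example:

```python
def run_episode(env, policy, V, E, alpha=0.1, gamma=0.95, lam=0.9):
    """One TD(lambda) episode; hyperparameter values are illustrative."""
    s, done = env.reset(), False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)  # hypothetical environment interface
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        E[s] = E.get(s, 0.0) + 1.0
        for st in list(E):
            V[st] = V.get(st, 0.0) + alpha * delta * E[st]  # value update first
        print({st: round(E[st], 2) for st in E})            # refresh/visualize traces here
        for st in list(E):
            E[st] *= gamma * lam                            # decay traces afterwards
        s = s_next
```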