Week 12: Sarsa(Lambda)#

What you see#

The example shows the Sarsa(\(\lambda\)) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square. The light-blue numbers are the values of the eligibility trace \(E(s,a)\), which are used to update the \(Q\)-values.

How it works#

When the agent takes action \(a\) in a state \(s\), receives an immediate reward of \(r\), moves to a new state \(s'\), and there takes action \(a'\), the Q-values are updated according to the rule

(1)#\[\begin{split}\delta & = r + \gamma Q(s', a') - Q(s, a) \\ E(s, a) & = E(s, a) + 1\end{split}\]

Then for all states and actions the following updates are performed:

\[\begin{split}Q(s, a) & = Q(s, a) + \alpha \delta E(s,a) \\ E(s,a) & = \gamma \lambda E(s,a)\end{split}\]
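
To make the two-step update concrete, here is a minimal Python sketch of a single Sarsa(\(\lambda\)) step with accumulating traces. It assumes a tabular setting where `Q` and `E` are dictionaries keyed by (state, action) pairs; the function name and the parameter values are purely illustrative and not taken from the simulation's code.

```python
# A minimal sketch of one Sarsa(lambda) step with accumulating traces, assuming a
# tabular setting where Q and E are dictionaries keyed by (state, action) pairs.
# The name sarsa_lambda_update and the default parameter values are illustrative only.
from collections import defaultdict

Q = defaultdict(float)  # action-value estimates Q(s, a)
E = defaultdict(float)  # eligibility traces E(s, a)

def sarsa_lambda_update(s, a, r, sp, ap, alpha=0.5, gamma=1.0, lamb=0.9):
    """Apply the update above to the transition (s, a, r, s', a')."""
    delta = r + gamma * Q[(sp, ap)] - Q[(s, a)]   # TD error
    E[(s, a)] += 1                                # accumulate the trace for (s, a)
    # Update every state-action pair with a trace, then decay the traces.
    for key in list(E.keys()):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lamb
```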

Actions are selected \(\varepsilon\)-greedily with respect to the Q-values.
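
A small sketch of how such an \(\varepsilon\)-greedy choice could be implemented is shown below; the function signature and the use of NumPy are assumptions made for illustration, not the simulation's actual implementation.

```python
# A sketch of epsilon-greedy action selection over tabular Q-values.
# The signature and use of NumPy are assumptions for illustration.
import numpy as np

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Return a random action with probability epsilon, otherwise a greedy one."""
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    values = [Q[(s, a)] for a in actions]
    return actions[int(np.argmax(values))]
```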

Warning

Similar to Week 11: Sarsa, the updates to the \(Q\)-values appear to lag one step behind where Pacman is, because we need to know the next action \(a'\) in order to update \(Q(s, a)\) (see above).

For visualization purposes, the simulation shows the value of the eligibility trace just after the update \(Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s,a)\). I think this makes the update easier to follow, since one can see the value of the eligibility trace that was actually used in the update, rather than its value after the exponential decay \(E(s,a) \leftarrow \gamma \lambda E(s,a)\).