Week 11: Sarsa

Warning

A small note: Think about the update rule and what it means when we apply it to the first state \(s_0\):

  • First we need to select an action \(a_0\)

  • Then we go to a square \(s_1\)

  • Then finally we need the action in that square \(a_1\) (see (1))

In other words, we need action \(a_1\) before we can apply the update to \(Q(s_0,a_0)\), and this is why the updates look a bit sluggish when you play by keyboard.
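To make the dependence on \(a_1\) concrete, here is a minimal sketch of a single Sarsa update in Python. The function name `sarsa_update`, the dictionary-based Q-table, and the values of `alpha` and `gamma` are assumptions made for illustration and are not taken from the course code.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0, done=False):
    """Apply one Sarsa update to Q[s, a] in place and return the new value."""
    # The target uses the action a_next that was actually chosen in s_next,
    # so the update cannot be applied before a_next is known.
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q[s, a]

# Hypothetical example: states and actions are just labels here.
Q = defaultdict(float)          # unseen (state, action) pairs default to 0.0
Q["s1", "right"] = 2.0          # pretend we already have an estimate for (s_1, a_1)
new_value = sarsa_update(Q, "s0", "up", r=-1.0, s_next="s1", a_next="right")
```

Note that `a_next` is a required argument: without it there is no target to move \(Q(s_0,a_0)\) towards.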

There is another small issue: The algorithmic code in [SB18] assumes we can compute \(a_1\) from the policy when we update \(Q(s_0, a_0)\). This is fine when actions are determined by a policy we can evaluate at any time, but it won’t work when the actions are decided by keyboard input, since the computer cannot predict which key you will press next.

Therefore, the implementation shown on this page waits to apply each Q-update until the next action has been pressed (hence the delay), whereas the version you are asked to implement in the exercises follows the pseudo-code in [SB18].

However, both methods will compute the same Q-values (one will just do it one step later and thus be suitable for keyboard input!), so please don’t be confused by this point! As long as you understand the update rule, you should be all set.
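The equivalence can be checked on a small example. The sketch below is an illustration, not the implementation used on this page: the trajectory, the constants `ALPHA` and `GAMMA`, and the `pending` buffer are all made up, and terminal-state handling is left out. It applies the same Sarsa updates to one fixed sequence of states, actions, and rewards in two ways: immediately after each step, and one step later when the next action arrives.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # hypothetical step size and discount factor

def update(Q, s, a, r, s_next, a_next):
    """Standard Sarsa update of Q[s, a]."""
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])

# A fixed, made-up trajectory: states s_t, actions a_t, rewards r_{t+1}.
states = [0, 1, 2, 1, 3]
actions = ["right", "right", "left", "up", "up"]
rewards = [-1.0, -1.0, -1.0, 5.0]

# Variant A: the next action is known right after the step (pseudo-code style).
Q_a = defaultdict(float)
for t in range(len(rewards)):
    update(Q_a, states[t], actions[t], rewards[t], states[t + 1], actions[t + 1])

# Variant B: the action arrives later (e.g. a key press), so the previous
# transition is buffered and its update is applied once the action is known.
Q_b = defaultdict(float)
pending = None  # holds (s, a, r) of the transition still waiting for a_next
for t, (s, a) in enumerate(zip(states, actions)):
    if pending is not None:
        s_prev, a_prev, r_prev = pending
        update(Q_b, s_prev, a_prev, r_prev, s, a)
    if t < len(rewards):
        pending = (s, a, rewards[t])

same = (Q_a == Q_b)  # both variants performed the same updates in the same order
```

Both loops perform identical updates in identical order; the second loop simply runs each one a step later, which is the delay you notice when playing by keyboard.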