Week 13: DynaQ

What you see

The example shows the Dyna-Q algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 on the exit square. The obstacles make the problem harder to solve, since exploration becomes very slow.

How it works

When the agent takes action \(a\) in state \(s\), receives an immediate reward \(r\), and moves to a new state \(s'\) (where \(a'\) ranges over the actions available in \(s'\)), the Q-values are updated according to the usual Q-learning rule (see Week 11: Q-learning):

\[Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right] \tag{1}\]
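
As a concrete illustration (not the code used in the example), the update in (1) can be written as a small function on a tabular Q-table stored as a dictionary; the names `q_update`, `alpha`, and `gamma` are assumptions made for this sketch:

```python
from collections import defaultdict

# Tabular Q-values; Q[(s, a)] defaults to 0.0 for unseen state-action pairs.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.95  # assumed learning rate and discount factor

def q_update(s, a, r, sp, actions):
    """Apply the Q-learning update (1) for one observed transition (s, a, r, s')."""
    max_next = max(Q[(sp, ap)] for ap in actions)  # max over a' of Q(s', a')
    Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
```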

Second, the agent pushes the tuple \((s, a, r, s')\) onto a list. Finally, since these four pieces of information are all that is needed to perform a \(Q\)-update, the agent can select \(n=5\) such tuples at random from the list and perform \(n\) additional Q-updates on past transitions; since \(Q\)-learning is off-policy, this is perfectly fine from a convergence perspective.
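
Putting the storage and replay steps together, one real Dyna-Q step might look like the following self-contained sketch (it repeats the update helper from above for completeness); the list-based memory, the function names, and the hyperparameter values are assumptions, not the course's actual implementation:

```python
import random
from collections import defaultdict

Q = defaultdict(float)           # tabular Q-values, as in the sketch above
memory = []                      # list of stored transitions (s, a, r, s')
alpha, gamma, n = 0.1, 0.95, 5   # assumed learning rate, discount factor, and planning steps

def q_update(s, a, r, sp, actions):
    """Q-learning update (1) for a single transition (s, a, r, s')."""
    Q[(s, a)] += alpha * (r + gamma * max(Q[(sp, ap)] for ap in actions) - Q[(s, a)])

def dyna_q_step(s, a, r, sp, actions):
    """One real environment step: direct update, store the transition, then plan."""
    q_update(s, a, r, sp, actions)         # direct RL update from the real transition
    memory.append((s, a, r, sp))           # push the tuple onto the list
    for _ in range(n):                     # replay n randomly chosen past transitions
        q_update(*random.choice(memory), actions)
```

Sampling uniformly from the stored list corresponds to the random selection of past transitions described above.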

This means that more than one \(Q(s, a)\)-value is updated in each step, which makes Dyna-Q converge much faster after the first episode.