Week 9: Value iteration

What you see

The example shows the value iteration algorithm on a simple (deterministic) gridworld with a living reward of \(-0.05\) per step and no discount (\(\gamma = 1\)). The goal is to get to the upper-right corner.

Every time you move Pacman, the game executes a single update of the value-iteration algorithm. You can switch between the value function and the action-value function by pressing m.

The algorithm will converge after about 20 steps, thereby computing both \(v^*(s)\) and \(q^*(s, a)\) (depending on the view mode). These represent the expected accumulated reward when following the optimal policy \(\pi^*\).
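The convergence of the value function can be sketched in a few lines. The grid below is a hypothetical 3x3 layout (not necessarily the one in the example), but it uses the same settings: a living reward of \(-0.05\) per step, \(\gamma = 1\), and a terminal goal in the upper-right corner.

```python
# Minimal value-iteration sketch for v* on a hypothetical 3x3
# deterministic gridworld: living reward -0.05 per move, gamma = 1,
# upper-right corner is a terminal goal with value 0.
LIVING_REWARD = -0.05
ROWS, COLS = 3, 3
GOAL = (0, 2)  # upper-right corner
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition: move, staying in place at walls."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for sweep in range(30):
    V_new = {}
    for s in V:
        if s == GOAL:
            V_new[s] = 0.0  # terminal state keeps value 0
        else:
            V_new[s] = max(LIVING_REWARD + V[step(s, a)] for a in ACTIONS)
    converged = max(abs(V_new[s] - V[s]) for s in V) < 1e-12
    V = V_new
    if converged:
        break

# From the lower-left corner the shortest path takes 4 moves,
# so its value is approximately 4 * (-0.05) = -0.20.
print(V[(2, 0)])
```

Because the problem is deterministic and \(\gamma = 1\), the converged value of each state is simply minus \(0.05\) times its shortest-path distance to the goal.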

How it works

When computing e.g. the action-value function \(q(s,a)\), which you can see by pressing m, the algorithm will in each step, and for all state-action pairs, perform the update:

\[Q(s,a) \leftarrow \mathbb{E}[R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') | S_{t} = s, A_{t} = a]\]

Where the expectation is with respect to the next state \(S_{t+1}\). Since the problem is deterministic, this expression simplifies greatly: if the next state when we take action \(a\) in state \(s\) is \(s'\), then (using \(\gamma = 1\) and the living reward of \(-0.05\)) the update in this example becomes:

\[Q(s,a) \leftarrow -0.05 + \max_{a'} Q(s',a')\]
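The simplified update above is a one-liner in code. This is a sketch assuming a dict-of-dicts table `Q[s][a]` and a hypothetical function `next_state(s, a)` returning the unique successor state:

```python
LIVING_REWARD = -0.05  # the gridworld's per-step reward

def q_update(Q, s, a, next_state):
    """One value-iteration backup for a deterministic MDP with gamma = 1:
    Q(s,a) <- r + max_a' Q(s', a')."""
    s_next = next_state(s, a)
    Q[s][a] = LIVING_REWARD + max(Q[s_next].values())

# Tiny two-state illustration: from state 0, action "right" leads to
# the terminal state 1, whose action-values are all zero.
Q = {0: {"right": 0.0}, 1: {"stay": 0.0}}
q_update(Q, 0, "right", lambda s, a: 1)
print(Q[0]["right"])  # -0.05 + max(0.0) = -0.05
```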

If the problem were not deterministic, we would instead average over the possible next states and rewards according to the MDP dynamics \(p(s', r \mid s, a)\).
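The general (stochastic) backup can be sketched the same way. Here the model is represented by a hypothetical function `P(s, a)` returning a list of `(probability, next_state, reward)` triples, i.e. the dynamics \(p(s', r \mid s, a)\):

```python
GAMMA = 1.0  # no discounting, as in the example

def q_update_stochastic(Q, s, a, P):
    """Q(s,a) <- sum over outcomes of p * (r + gamma * max_a' Q(s',a'))."""
    Q[s][a] = sum(p * (r + GAMMA * max(Q[s_next].values()))
                  for p, s_next, r in P(s, a))

# Illustration: a "slippery" move that reaches the terminal state 1
# with probability 0.8 and stays in state 0 with probability 0.2;
# both outcomes pay the living reward of -0.05.
Q = {0: {"right": 0.0}, 1: {"stay": 0.0}}
model = lambda s, a: [(0.8, 1, -0.05), (0.2, 0, -0.05)]
q_update_stochastic(Q, 0, "right", model)
print(Q[0]["right"])
```

When every action has a single successor with probability 1, this reduces exactly to the deterministic update above.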