Week 9: Value iteration

What you see

The example shows the value iteration algorithm on a simple (deterministic) gridworld with a living reward of 0.05 per step and no discount (γ=1). The goal is to get to the upper-right corner.

Every time you move Pacman, the game executes a single update of the value-iteration algorithm. You can switch between the value function and the action-value function by pressing m.

The algorithm converges after about 20 steps and thereby computes both v(s) and q(s,a) (depending on the view mode). These represent the expected return under the (optimal!) policy π.
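
Once the values have converged, the state values and the greedy policy can be read off directly from the q-values. The sketch below assumes the q-values are stored as a dictionary mapping each state to a dictionary of action values; this layout is only for illustration and is not the course's actual data structure.

```python
def values_and_policy(Q):
    """Read off the state values v(s) = max_a q(s, a) and the greedy
    (optimal) policy from converged q-values.

    Q maps each state to a dict of action -> q-value; states with no
    actions (terminal states) are skipped."""
    v = {s: max(qs.values()) for s, qs in Q.items() if qs}
    pi = {s: max(qs, key=qs.get) for s, qs in Q.items() if qs}
    return v, pi
```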

How it works

When computing e.g. the action-value function q(s,a), which you can see by pressing m, the algorithm performs the following update in each step, for all states and actions:

$$Q(s,a) \leftarrow \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right]$$

where the expectation is with respect to the next state $S_{t+1}$. Since the problem is deterministic, this expression simplifies greatly. For instance, if the next state when we take action $a$ in state $s$ is $s'$, then the update in this example becomes:

$$Q(s,a) \leftarrow 0.05 + \max_{a'} Q(s', a')$$
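
As a minimal sketch of this deterministic backup (assuming q-values are stored per state as a dictionary of action values, an illustrative layout rather than the course's, and that terminal states have an empty action dictionary and value 0):

```python
def q_backup_deterministic(Q, next_state, living_reward=0.05, gamma=1.0):
    """Deterministic backup: there is exactly one next state, so the
    expectation over S_{t+1} disappears. The living reward 0.05 and
    gamma = 1 match the example above."""
    v_next = max(Q[next_state].values()) if Q[next_state] else 0.0  # terminal states have value 0
    return living_reward + gamma * v_next
```

In each sweep this backup is applied to every state-action pair, with next_state given by the deterministic transition function.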

If the problem were not deterministic, we would need to average over the possible next states, weighted by the MDP transition probabilities $p(s', r \mid s, a)$.
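
In that case each backup becomes a probability-weighted sum over the possible outcomes. The tiny numerical example below illustrates this; the slip probabilities, rewards and next-state values are made up for illustration and are not part of the gridworld above.

```python
# Hypothetical stochastic outcomes for a single (s, a) pair: the move succeeds
# with probability 0.8 and slips to a neighbouring cell with probability 0.2.
# Each tuple is (probability, next state, reward) taken from p(s', r | s, a).
outcomes = [(0.8, "intended", 0.05), (0.2, "slip", 0.05)]
# Next-state values max_a' Q(s', a'); the numbers are made up for illustration.
v_next = {"intended": 1.0, "slip": 0.4}
gamma = 1.0

q_sa = sum(p * (r + gamma * v_next[sp]) for p, sp, r in outcomes)
print(round(q_sa, 2))   # 0.8 * (0.05 + 1.0) + 0.2 * (0.05 + 0.4) = 0.93
```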