Week 9: Policy iteration#
What you see#
The example shows the policy-iteration algorithm on a simple (deterministic) gridworld with a fixed living reward collected on every non-terminal step.
The algorithm first evaluates the current policy for exactly 20 steps.
The algorithm begins with a random policy, in which each action is taken with equal probability.
Note that the algorithm shows the current policy as arrows.
The algorithm converges after a few policy-iteration steps, thereby computing both the optimal value function and the optimal policy.
How it works#
To understand what happens during the first 20 steps, you should first see the policy-evaluation example (Week 9: Policy evaluation). After the (initially random) policy has been evaluated, the algorithm updates the current policy by acting greedily with respect to the value/action-value function (the step is the same in either case). The policy is always indicated by the arrows.
After the update, the process starts over with 20 more evaluation sweeps of the (updated) policy.
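The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the course code: the 4x4 grid layout, the terminal state at the top-left corner, the living reward of -0.05, and the tie-breaking behaviour of `argmax` in the greedy step are all assumptions chosen for the sketch.

```python
# Minimal sketch of policy iteration on a tiny deterministic gridworld.
# Grid size, terminal state, and living reward are assumed values; the
# lecture example may use different ones.
import numpy as np

N = 4                                          # grid is N x N
TERMINAL = (0, 0)                              # assumed terminal state
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
LIVING_REWARD = -0.05                          # assumed living reward
GAMMA = 1.0

def step(s, a):
    """Deterministic transition: move if inside the grid, else stay put."""
    if s == TERMINAL:
        return s, 0.0
    ns = (s[0] + a[0], s[1] + a[1])
    if not (0 <= ns[0] < N and 0 <= ns[1] < N):
        ns = s
    return ns, LIVING_REWARD

def evaluate(policy, sweeps=20):
    """Iterative policy evaluation for a fixed number of sweeps."""
    V = np.zeros((N, N))
    for _ in range(sweeps):
        Vn = np.zeros_like(V)
        for i in range(N):
            for j in range(N):
                ns, r = step((i, j), ACTIONS[policy[i, j]])
                Vn[i, j] = r + GAMMA * V[ns]
        V = Vn
    return V

def improve(V):
    """Greedy policy improvement with respect to V (ties -> first action)."""
    policy = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            q = [step((i, j), a)[1] + GAMMA * V[step((i, j), a)[0]]
                 for a in ACTIONS]
            policy[i, j] = int(np.argmax(q))
    return policy

policy = np.zeros((N, N), dtype=int)   # start from a fixed initial policy
while True:
    V = evaluate(policy, sweeps=20)    # 20 evaluation sweeps per iteration
    new_policy = improve(V)            # then one greedy improvement step
    if np.array_equal(new_policy, policy):
        break                          # policy is stable: done
    policy = new_policy
```

After convergence, `V` holds the optimal values (here, the living reward times the shortest-path distance to the terminal state) and `policy` points every cell toward the terminal corner.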