{% import macros as m with context %}

.. _value_iteration:

Week 9: Value iteration
=============================================================================================================

{{ m.embed_game('week9_vi') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the value iteration algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`m`
        Change between the value function :math:`v` and action-value function :math:`q`
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec09/unf_vi_gridworld.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the value iteration algorithm on a simple (deterministic) gridworld with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman the game executes a single sweep of the value-iteration algorithm. You can switch between the value function and the action-value function by pressing :kbd:`m`. The algorithm converges after about 20 steps and thereby computes both :math:`v^*(s)` and :math:`q^*(s, a)` (depending on the view mode). These represent the expected return when following the optimal policy :math:`\pi^*`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When computing e.g. the action-value function :math:`q(s,a)`, which you can see by pressing :kbd:`m`, the algorithm will in each step, and for all states and actions, perform the update:

.. math::
    Q(s,a) \leftarrow \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') \,\middle|\, S_{t} = s, A_{t} = a \right]

where the expectation is with respect to the next state :math:`S_{t+1}`. Since the problem is deterministic, this expression simplifies greatly: if taking action :math:`a` in state :math:`s` leads to state :math:`s'`, the update in this example becomes

.. math::
    Q(s,a) \leftarrow -0.05 + \max_{a'} Q(s',a')

If the problem were not deterministic, we would instead average over the possible next states as given by the MDP transition probabilities :math:`p(s', r | s, a)`.
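
To make the update concrete, here is a minimal self-contained sketch of synchronous value iteration on a deterministic gridworld with a living reward of :math:`-0.05` and :math:`\gamma = 1`. The grid dimensions, goal placement, start state and helper names (``step``, ``living_reward``) are assumptions for illustration only and do not mirror the implementation in :gitref:`../irlc/lectures/lec09/unf_vi_gridworld.py`.

.. code-block:: python

    # Illustrative sketch only: a 3x4 deterministic gridworld with the goal in the
    # upper-right corner. Layout and names are assumptions, not the course code.
    rows, cols = 3, 4
    goal = (0, 3)                       # upper-right corner (row 0, last column)
    actions = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}
    living_reward, gamma = -0.05, 1.0

    def step(s, a):
        """Deterministic transition: move if the target cell exists, else stay put."""
        (r, c), (dr, dc) = s, actions[a]
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else s

    states = [(r, c) for r in range(rows) for c in range(cols)]
    Q = {(s, a): 0.0 for s in states for a in actions}

    for sweep in range(25):             # a couple of dozen sweeps suffice here
        Q_new = {}
        for s in states:
            for a in actions:
                if s == goal:           # terminal state: no further reward
                    Q_new[(s, a)] = 0.0
                    continue
                s_next = step(s, a)
                # Deterministic problem: the expectation collapses to a single term,
                # Q(s,a) <- -0.05 + max_a' Q(s', a')
                Q_new[(s, a)] = living_reward + gamma * max(Q[(s_next, b)] for b in actions)
        Q = Q_new

    # v*(s) and a greedy (optimal) policy follow directly from q*(s,a)
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a, s=s: Q[(s, a)]) for s in states}
    print("v*(bottom-left) =", round(V[(rows - 1, 0)], 2), " greedy action:", pi[(rows - 1, 0)])

Each pass of the outer loop corresponds to one key press in the game: every state-action pair is updated once from the previous sweep's table, and the values propagate outward from the goal until the table stops changing.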