{% import macros as m with context %}
{% set week = 9 %}
{{ m.exercise_head(week) }}

Markov decision processes
-------------------------------------------------------------------------
The Markov Decision Process (MDP) is a mathematical description of the problem solved by most reinforcement learning methods.
It is represented by the class :class:`~irlc.ex09.mdp.MDP`, and it stands in the same relationship to methods such as value iteration as the :class:`~irlc.ex02.dp_model.DPModel` does to the dynamic programming method.

Let's look at the MDP for the frozen-lake gridworld environment:

.. plot::
    :include-source:

    from irlc.gridworld.gridworld_environments import FrozenLake
    from irlc import Agent, interactive
    env = FrozenLake(render_mode="human")       # Pass render_mode='human' for visualization.
    env, agent = interactive(env, Agent(env))   # Make the environment interactive.
    env.reset()                                 # You always need to call reset.
    env.plot()                                  # Plot the environment.
    env.close()

We can instantiate the MDP class directly, or extract it from the gridworld environment as ``env.mdp``. A couple of examples follow. This code prints the initial state and the actions available in the initial state:

.. runblock:: pycon

    >>> from irlc.gridworld.gridworld_environments import FrozenLake
    >>> mdp = FrozenLake().mdp              # Get the MDP.
    >>> mdp.initial_state                   # States are tuples (x, y).
    >>> mdp.A(mdp.initial_state)            # Get all actions in the initial state.

.. warning::
    The actions available in the different states are **not** the same in the gridworld environment. Specifically, the winning (goal) state only has a single action, which takes the environment to the terminal state. Let's take a look:

    .. runblock:: pycon

        >>> from irlc.gridworld.gridworld_environments import FrozenLake
        >>> mdp = FrozenLake().mdp
        >>> mdp.A( (0, 3) )     # Actions in the starting state.
        >>> mdp.A( (0, 0) )     # Actions in the bottom-left state.

Transition probabilities
********************************************************************
The transition probabilities

.. math::
    P(s', r | s, a)

are represented by the function :func:`~irlc.ex09.mdp.MDP.Psr`, which returns a dictionary of the form ``{(s1, r1): p1, (s2, r2): p2, ...}`` (compare with :func:`irlc.ex02.dp_model.DPModel.Pw`). An example:

.. runblock:: pycon

    >>> from irlc.gridworld.gridworld_environments import FrozenLake
    >>> mdp = FrozenLake().mdp
    >>> transitions = mdp.Psr(mdp.initial_state, 0)
    >>> for (sp, r), p in transitions.items():
    ...     print(f"P(s'={sp}, r={r} | s={mdp.initial_state}, a=0) = {p}")

.. note::
    It is easy to miss, but (:cite:t:`sutton`) distinguish between

    - the set of all states, :math:`\mathcal{S}^+`, corresponding to :func:`~irlc.ex09.mdp.MDP.states`, and
    - the set of all non-terminal states, :math:`\mathcal{S}`, corresponding to :func:`~irlc.ex09.mdp.MDP.nonterminal_states`.

    The policy, the value function, the action-value function, and the transition probabilities :math:`P(s',r|s,a)` are **only** defined in non-terminal states, and it is a common mistake in reinforcement learning to update or evaluate them in terminal states. Let's take a look:

    .. runblock:: pycon

        >>> from irlc.gridworld.gridworld_environments import FrozenLake
        >>> mdp = FrozenLake().mdp      # Get the MDP.
        >>> mdp.nonterminal_states
        >>> mdp.states
        >>> terminal_states = [s for s in mdp.states if s not in mdp.nonterminal_states]
        >>> terminal_states
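These functions are all you need in order to evaluate the right-hand side of Bellman's equations. As a small illustration, the sketch below computes the one-step lookahead values :math:`\sum_{s',r} P(s',r|s,a)\left(r + \gamma V(s')\right)` for every action in the initial state. The all-zero value function and the discount factor :math:`\gamma = 0.9` are arbitrary choices made for this example; note how terminal successor states, which are not in ``V``, simply contribute :math:`\gamma \cdot 0`.

.. code-block:: python

    from irlc.gridworld.gridworld_environments import FrozenLake

    mdp = FrozenLake().mdp
    gamma = 0.9                                   # Discount factor, chosen for illustration.
    V = {s: 0 for s in mdp.nonterminal_states}    # Value function; only defined in non-terminal states.

    s = mdp.initial_state
    for a in mdp.A(s):
        # One-step lookahead: sum over (s', r) of P(s', r | s, a) * (r + gamma * V(s')).
        # Terminal successor states are not in V and contribute gamma * 0.
        q = sum(p * (r + gamma * V.get(sp, 0)) for (sp, r), p in mdp.Psr(s, a).items())
        print(f"Q(s={s}, a={a}) = {q}")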
Building an MDP
************************************************************************
Building your own MDP is easy. Let's implement an MDP corresponding to a small gridworld example (see (:cite:t:`sutton`), Example 4.1); a second, even smaller example is sketched after the notes below.

.. literalinclude:: ../../shared/output/small_gridworld.py
    :language: python

Note that:

- The states are tuples of (row, col) coordinates ``(i, j)``.
- Each state has 4 actions, so the action-space function ``A(state)`` simply returns the list ``[0, 1, 2, 3]``.
- The ``is_terminal`` function determines whether we are in a corner (:math:`(0,0)` or :math:`(3,3)`).
- The transition probabilities are deterministic, so the dictionary returned by ``Psr`` has a single key ``(next_state, reward)`` with the value ``1``.
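To make the pattern stand on its own, here is a minimal sketch of an even smaller MDP with a single non-terminal state. It is only meant as an illustration: the class name is made up, and it is assumed that the base class accepts the initial state through its constructor; consult the class documentation below if your version differs.

.. code-block:: python

    from irlc.ex09.mdp import MDP

    class TwoStateMDP(MDP):
        """A tiny MDP: state 0 is the only non-terminal state, state 1 is terminal.

        Action 0 stays in state 0 with reward 0; action 1 moves to the terminal
        state with reward +1.
        """
        def __init__(self):
            # Assumption: the base class stores the initial state passed to it here.
            super().__init__(initial_state=0)

        def is_terminal(self, state):
            return state == 1

        def A(self, state):
            return [0, 1]               # Both actions are available in the non-terminal state.

        def Psr(self, state, action):
            if action == 0:
                return {(0, 0.0): 1.0}  # Stay in state 0, reward 0, with probability 1.
            else:
                return {(1, 1.0): 1.0}  # Move to the terminal state, reward +1.

    mdp = TwoStateMDP()
    print(mdp.A(mdp.initial_state))      # [0, 1]
    print(mdp.Psr(mdp.initial_state, 1)) # {(1, 1.0): 1.0}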
Classes and functions
-------------------------------------------------------------------------
.. autoclass:: irlc.ex09.mdp.MDP
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to Sutton problem 3.14
    :class: dropdown

    The update is :math:`V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r |s, a) (r + \gamma V(s'))`. In our case the per-step reward is :math:`r=0`, the discount is :math:`\gamma = 0.9`, and the policy is uniform, :math:`\pi(a | s) = \frac{1}{4}`. Reading off the four values of the value function surrounding the center square we get:

    .. math::
        V(s) \leftarrow \sum_{a=1}^4 \frac{1}{4} \left(0 + \gamma V(s') \right) = \frac{0.9(2.3 + 0.7 - 0.4 + 0.4)}{4} = 0.675.

.. admonition:: Solution to Sutton problem 4.1
    :class: dropdown

    According to the relationships we saw at the beginning of today's lecture, the action-value function satisfies:

    .. math::
        q_{\pi}(s,a) = \mathbb{E}\left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a') \mid s, a\right] = \mathbb{E}[-1 + \gamma v_\pi(s') \mid s, a]

    Furthermore, when we take an action we know what the next state :math:`s'` is, so the expectation disappears. We can now read off the value function from Figure 4.1, and we get that in the first case :math:`q_\pi(11, \text{down}) = -1` and in the second that :math:`q_\pi(7, \text{down}) = -1 + \gamma v_\pi(11) = -1 - 14\gamma = -15`.

.. admonition:: Solution to Problem 2 (Bellman's equations)
    :class: dropdown

    **Part a:** Recall that Bellman's equation (it does not matter which one) says :math:`v(s) = \mathbb{E}[R + \gamma v(s') \mid S_t = s]`. Therefore:

    .. math::
        v_1 = \frac{1}{2} \left( R + \gamma v_1 \right) + \frac{1}{2} \gamma v_2 = \frac{1}{2} \left( 1 + \gamma v_1 + \gamma v_2 \right)

    Solving this, we get :math:`v_1 = \frac{1 + \gamma v_2}{2-\gamma} = 2`.

    **Part b:** We already have the first equation from the previous part. Applying a similar argument to the second state leaves us with:

    .. math::
        -1 & = (\gamma - 2) v_1 + \gamma v_2 \\
        -2 & = (\gamma - 2) v_2 + \gamma v_1

    This can be written as a linear system of equations:

    .. math::
        \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} \gamma - 2 & \gamma \\ \gamma & \gamma - 2 \end{bmatrix} \mathbf{v}

    .. tip::
        During the exam, a few students correctly found the system of two equations (in my opinion, the difficult part), but when they used Maple to rewrite it as a linear system, Maple produced some very strange expressions, probably due to the :math:`\gamma`. When you use Maple to simplify or solve, try to check that you can reproduce the result by hand if you have time.

    **Bonus question:** Using the standard formula for the inverse of a :math:`2 \times 2` matrix (Google it!), the inverse of :math:`A` is :math:`A^{-1} = \frac{1}{4 - 4\gamma}\begin{bmatrix} \gamma - 2 & -\gamma \\ -\gamma & \gamma - 2 \end{bmatrix}`. Then find :math:`\mathbf{v}` as

    .. math::
        \mathbf{v} = A^{-1} b = \frac{1}{4 - 4\gamma} \begin{bmatrix} 2 + \gamma \\ 4 - \gamma \end{bmatrix}.

{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=3c44f154-2419-48d4-acd2-b146015dfe76', 'Problem 9.1-9.3', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=2bf01550-197a-4a81-938c-b146015dfe76', 'Problem 9.4: Policy evaluation', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=349172af-f639-44c8-9698-b146015dfe76', 'Problem 9.5: Policy iteration', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=6c860e51-c2fd-4776-9d2c-b1460169dd7c', 'Problem 9.6: Value iteration', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=df72bb2e-2a18-4789-b10f-b1460169dd7c', 'Problem 9.8', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=566dd3c0-69ec-4e5f-8a53-b1460169dd60', 'Problem 9.9 - Gambler', True) }}