{% import macros as m with context %}
{% set week = 10 %}
{{ m.exercise_head(week) }}

Tabular methods (Q-learning, Sarsa, etc.)
-------------------------------------------------------------------------
As the name suggests, tabular methods require us to maintain a table of :math:`Q`-values or state-values :math:`V`.
The :math:`Q`-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class
:class:`irlc.ex09.rl_agent.TabularAgent` which will hopefully simplify the process.

.. note::
    The main complication we need to deal with when representing the Q-values is that different states may have
    different action spaces, i.e. that :math:`\mathcal{A}(s) \neq \mathcal{A}(s')`. Gymnasium's way of dealing with
    this situation is to use the ``info``-dictionary, so that :python:`s, info = env.reset()` will specify an
    :python:`info['mask']` variable. This is a numpy ndarray such that a given action ``a`` is available exactly when
    :python:`info['mask'][a] == 1`. You can read more about this choice at
    `the gymnasium discrete space documentation`_.

The :math:`Q`-values behave like a 2d numpy ndarray:

.. runblock:: pycon

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env, epsilon=0.3)  # Use epsilon-greedy exploration.
    >>> state, _ = env.reset()
    >>> state
    >>> agent.Q[state, 1] = 2   # Update a Q-value
    >>> agent.Q[state, 1]       # Get a Q-value
    >>> agent.Q[state, 0]       # Q-values are by default zero

To implement masking, the :python:`agent.Q`-table has two special functions which require the ``info``-dictionary.
**As long as you stick to these two functions and pass the correct info dictionary, you will not get into trouble**.

- To get the optimal action, use ``agent.Q.get_optimal_action(s, info_s)``:

  .. runblock:: pycon

      >>> from irlc.ex09.rl_agent import TabularAgent
      >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
      >>> env = BookGridEnvironment()
      >>> agent = TabularAgent(env)
      >>> state, info = env.reset()                # Get the info-dictionary corresponding to s
      >>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
      >>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state

- To get all Q-values corresponding to a state, use ``agent.Q.get_Qs(s, info_s)``:

  .. runblock:: pycon

      >>> from irlc.ex09.rl_agent import TabularAgent
      >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
      >>> env = BookGridEnvironment()
      >>> agent = TabularAgent(env)
      >>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
      >>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
      >>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
      >>> actions                                    # All actions that are available in this state (after masking)
      >>> Qs                                         # All Q-values available in this state (after masking)

You can combine this functionality to get e.g. the maximal Q-value using ``agent.Q[s, agent.Q.get_optimal_action(s, info)]``.

.. note::
    The Q-table will remember the masking information for a given state and warn you if you try to access an action
    that has previously been masked.

We often want to perform :math:`\varepsilon`-greedy exploration. To simplify this, the agent has the function
``agent.pi_eps``. Since this function uses the Q-values, it also requires an ``info``-dictionary:

.. runblock:: pycon

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env, epsilon=0.1)  # epsilon-greedy exploration
    >>> state, info = env.reset()               # to get a state and info-dictionary
    >>> a = agent.pi_eps(state, info)           # Epsilon-greedy action selection
    >>> a

.. warning::
    In the ``train(s, a, r, sp, done, info_s, info_sp)``-method, remember to use the ``info``-dictionary corresponding
    to the state:

    - use ``self.Q.get_Qs(s, info_s)`` and ``self.Q.get_Qs(sp, info_sp)``
    - **never** use ``self.Q.get_Qs(s, info_sp)``

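As an illustration of the warning above, here is a minimal sketch of how a Q-learning style ``train``-method could be
written on top of :class:`irlc.ex09.rl_agent.TabularAgent`. It is not part of the course code: the class name
``SketchQAgent``, the attributes ``self.alpha`` and ``self.gamma`` and their default values are made up for this
example, and only the functionality documented above (the ``Q[s, a]``-indexing, ``Q.get_optimal_action`` and the
``train``-signature) is assumed to exist. Note how ``info_sp`` is only ever paired with ``sp``:

.. code-block:: python

    from irlc.ex09.rl_agent import TabularAgent

    class SketchQAgent(TabularAgent):
        """Hypothetical Q-learning agent; it only illustrates the info-dictionary convention."""
        def __init__(self, env, gamma=0.95, alpha=0.1, epsilon=0.1):
            self.gamma, self.alpha = gamma, alpha   # discount factor and learning rate (made-up attribute names)
            super().__init__(env, epsilon=epsilon)

        def train(self, s, a, r, sp, done=False, info_s=None, info_sp=None):
            if done:
                # No future reward can be obtained from a terminal state.
                max_q = 0
            else:
                # Bootstrap from sp: info_sp is paired with sp, never with s.
                a_star = self.Q.get_optimal_action(sp, info_sp)
                max_q = self.Q[sp, a_star]
            # Standard Q-learning update of Q(s, a).
            self.Q[s, a] += self.alpha * (r + self.gamma * max_q - self.Q[s, a])
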
Classes and functions
-------------------------------------------------------------------------

.. autoclass:: irlc.ex09.rl_agent.TabularAgent
    :show-inheritance:
    :members:

.. autoclass:: irlc.ex09.rl_agent.TabularQ
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to Problem 1 - The single-state example
    :class: dropdown

    **Part a:** Consider first the :math:`T=3` example. Clearly :math:`N^{\text{first}}(s_1) = 1`, and the return is:

    .. math::
        G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 = 1 + \gamma + \gamma^2 = \frac{1 - \gamma^3}{1-\gamma}

    Therefore, for a general episode length :math:`T`, the accumulated reward is:

    .. math::
        S^{\text{first}}(s_1) = G_0 = \sum_{t=0}^{T-1}\gamma^t R_{t+1} = \sum_{t=0}^{T-1}\gamma^t
          = \frac{1-\gamma^{T}}{1-\gamma} \equiv h(T).

    **Part b:** From the previous part it is clear that the return only depends on the number of future visits to the
    state. If we again consider the :math:`T=3` example, we see that :math:`N^{\text{every}}(s_1) = T` since we visit
    :math:`s_1` a total of :math:`T` times. Each visit contributes a (future) return :math:`G_t`: the first visit gives
    the return computed above, the next visit is based on :math:`T-1` future rewards, and so on. For :math:`T=3`:

    .. math::
        S^{\text{every}}(s_1) & = (R_1 + \gamma R_2 + \gamma^2 R_3) + (R_2 + \gamma R_3) + R_3 \\
        & = h(3) + h(2) + h(1)

    So in general:

    .. math::
        S^{\text{every}}(s_1) & = \sum_{t=1}^T h(t) = \sum_{t=1}^T \frac{1-\gamma^{t}}{1-\gamma}
          = \frac{T - \gamma \sum_{k=0}^{T-1} \gamma^k}{1-\gamma} \\
        & = \frac{T - \gamma \frac{1-\gamma^T}{1-\gamma}}{1-\gamma} \\
        & = \frac{T - \gamma h(T)}{1-\gamma}

    **Part c:** Given the above we can compute the two estimators for :math:`m=1` and see that they clearly give
    different results. Although they agree when :math:`T \rightarrow \infty`, this limit is not interesting since the
    value of :math:`T` is random and not under our control.

    **Part d:** Given :math:`m` episodes of lengths :math:`T_1, \dots, T_m`, clearly :math:`N^{\text{first}}(s_1) = m` and

    .. math::
        S^{\text{first}}(s_1) = \sum_{k=1}^m h(T_k) = \frac{m - \sum_{k=1}^m \gamma^{T_k}}{1-\gamma}

    **Part e:** Similarly, clearly :math:`N^{\text{every}}(s_1) = \sum_{k=1}^m T_k` and

    .. math::
        S^{\text{every}}(s_1) = \sum_{k=1}^m \frac{T_k - \gamma h(T_k)}{1-\gamma}
          = \frac{\sum_{k=1}^m T_k - \gamma \sum_{k=1}^m h(T_k)}{1-\gamma}

    **Part f:** By direct calculation we get:

    .. math::
        \frac{m}{N^{\text{every}}(s_1)} = \frac{1}{\frac{1}{m}\sum_{k=1}^m T_k} \approx \frac{1}{\mathbb{E}[T]}
          = \frac{1}{\frac{1}{1-p}} = 1-p

    According to the law of large numbers, the error in this approximation scales as :math:`\frac{1}{\sqrt{m}}`.

    **Part g:** Starting with first-visit we have:

    .. math::
        \frac{S^{\text{first}}(s_1)}{N^{\text{first}}(s_1)} & = \frac{1}{m} \sum_{k=1}^m h(T_k)
          = \frac{1 - \frac{1}{m} \sum_{k=1}^m \gamma^{T_k}}{1-\gamma} \\
        & \approx \frac{1}{1-\gamma}\left[ 1 - \mathbb{E}[\gamma^T] \right] \\
        & = \frac{1}{1-\gamma}\left[1 - \gamma \frac{1-p}{1 - \gamma p} \right] \\
        & = \frac{1}{1-\gamma p}

    For every-visit we get:

    .. math::
        \frac{S^{\text{every}}(s_1)}{N^{\text{every}}(s_1)} & = \frac{1}{1-\gamma} \frac{1}{m}
          \left[ \sum_{k=1}^m T_k - \gamma \sum_{k=1}^m h(T_k) \right] \frac{m}{N^{\text{every}}(s_1)} \\
        & \approx \left[ \frac{1}{m} \sum_{k=1}^m T_k - \gamma \frac{1}{m} \sum_{k=1}^m h(T_k) \right] \frac{1-p}{1-\gamma} \\
        & \approx \left[ \frac{1}{1-p} - \gamma \frac{1}{1-\gamma p} \right] \frac{1-p}{1-\gamma} \\
        & = \frac{1}{1-\gamma p}

    Here the first approximation uses Part f, and the second uses
    :math:`\frac{1}{m}\sum_{k} T_k \approx \frac{1}{1-p}` together with the fact that
    :math:`\frac{1}{m}\sum_{k} h(T_k)` is exactly the first-visit estimate :math:`\approx \frac{1}{1-\gamma p}`
    computed above. So we see that first-visit and every-visit agree when :math:`m` is large, and the true value is
    :math:`v_\pi(s_1) = \frac{1}{1-\gamma p}`.

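    As an optional sanity check (not part of the official solution), the limit :math:`\frac{1}{1-\gamma p}` can be
    verified numerically. The following self-contained script does not use the course code, and the function names
    ``sample_T`` and ``mc_estimates`` are made up for this check: it samples :math:`m` episode lengths from
    :math:`p(T) = p^{T-1}(1-p)` and computes both estimators.

    .. code-block:: python

        import numpy as np

        def sample_T(p, rng):
            """Sample the number of visits to s1 before jumping to the terminal state s2."""
            T = 1
            while rng.random() < p:  # stay in s1 with probability p
                T += 1
            return T

        def mc_estimates(p=0.5, gamma=0.9, m=100_000, seed=0):
            """Return the first-visit and every-visit Monte Carlo estimates of v_pi(s1)."""
            rng = np.random.default_rng(seed)
            h = lambda T: (1 - gamma ** T) / (1 - gamma)  # discounted return of T unit rewards
            S_first = N_first = S_every = N_every = 0.0
            for _ in range(m):
                T = sample_T(p, rng)
                S_first += h(T)                                # first visit: one return G_0 = h(T) per episode
                N_first += 1
                S_every += sum(h(t) for t in range(1, T + 1))  # every visit: h(T), h(T-1), ..., h(1)
                N_every += T
            return S_first / N_first, S_every / N_every

        p, gamma = 0.5, 0.9
        v_first, v_every = mc_estimates(p, gamma)
        print(v_first, v_every, 1 / (1 - gamma * p))  # all three numbers should be close
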
.. admonition:: Bonus proofs

    - The chance of staying in :math:`s_1` in a single step is :math:`p`, so the chance of staying :math:`T-1` times
      is :math:`p^{T-1}`, and the chance of then jumping to :math:`s_2` is :math:`1-p`. Therefore
      :math:`p(T) = p^{T-1}(1-p)`.

    - .. math::
          \mathbb{E}[T] & = (1-p)\sum_{T=1}^\infty T p^{T-1} = (1-p) \sum_{T=0}^\infty \frac{d}{dp} p^T \\
          & = (1-p) \frac{d}{dp} \sum_{T=0}^\infty p^T = (1-p)\frac{d}{dp} \frac{1}{1-p}
            = (1-p)\frac{1}{(1-p)^2} = \frac{1}{1-p}.

      (the :math:`T=0` term is zero, so the lower limit may be shifted before differentiating)

    - .. math::
          \mathbb{E}[\gamma^T] = (1-p)\sum_{T=1}^\infty \gamma^T p^{T-1}
            = (1-p)\gamma \sum_{T=1}^\infty (\gamma p)^{T-1} = \frac{(1-p)\gamma}{1-\gamma p}.

{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=6e346c07-86a5-4675-8e86-b14f01472f3b', 'Problem 10.1-10.2: MC Value estimation', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=02e99fd4-3d50-464e-a089-afe30129c52d', 'Problem 10.3-10.4: MC control', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=3bb0db76-7868-40ba-8e0d-afe3012fecf9', 'Problem 10.5: TD Learning', True) }}