{% import macros as m with context %}
{% set week = 11 %}
{{ m.exercise_head(week) }}

Linear function approximators
----------------------------------------------------------------------------------

The idea behind linear function approximation of :math:`Q`-values is that

- We initialize (and eventually learn) a :math:`d`-dimensional weight vector :math:`w \in \mathbb{R}^d`
- We assume there exists a function to compute a :math:`d`-dimensional feature vector :math:`x(s,a) \in \mathbb{R}^d`
- The :math:`Q`-values are then represented as

.. math::
    Q(s,a) = x(s,a)^\top w

Learning is therefore entirely about updating :math:`w`.

We are going to use a class, :class:`~irlc.ex11.feature_encoder.LinearQEncoder`, to implement the tile-coding procedure for defining :math:`x(s,a)` as described in (:cite:t:`sutton`). The following example shows how to initialize the linear :math:`Q`-values and compute them in a given state:

.. runblock:: pycon

    >>> import gymnasium as gym
    >>> env = gym.make('MountainCar-v0')
    >>> from irlc.ex11.feature_encoder import LinearQEncoder
    >>> Q = LinearQEncoder(env, tilings=8)  # as in (:cite:t:`sutton`)
    >>> s, _ = env.reset()
    >>> a = env.action_space.sample()
    >>> Q(s,a)          # Compute a Q-value.
    >>> Q.d             # Get the number of dimensions
    >>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
    >>> Q.w[:4]         # Get the first four coordinates of the w-vector

For learning, you can simply update :math:`w` as any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:

.. runblock:: pycon

    >>> import gymnasium as gym
    >>> env = gym.make('MountainCar-v0')
    >>> from irlc.ex11.feature_encoder import LinearQEncoder
    >>> Q = LinearQEncoder(env, tilings=8)
    >>> s, _ = env.reset()
    >>> a = env.action_space.sample()
    >>> Q.w = Q.w + 2 * Q.w      # w <-- 3*w
    >>> Q.get_optimal_action(s)  # Get the optimal action in state s

.. note::
    Depending on how :math:`x(s,a)` is defined, the linear encoder can behave very differently. I have therefore included a few different classes in ``irlc.ex11.feature_encoder`` which differ only in how :math:`x(s,a)` is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (:cite:t:`sutton`). The API for the other classes is entirely similar.
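Putting the two examples together, a semi-gradient update can be written directly in terms of ``Q.w`` and ``Q.x``: since :math:`Q(s,a) = x(s,a)^\top w`, the gradient of :math:`Q(s,a)` with respect to :math:`w` is simply :math:`x(s,a)`. The following is a minimal sketch of one episode of semi-gradient Sarsa written against this API; it assumes ``Q.x(s, a)`` returns a NumPy array of the same shape as ``Q.w``, and the hyperparameters are illustrative values, not those used in the exercises.

.. code-block:: python

    import numpy as np
    import gymnasium as gym
    from irlc.ex11.feature_encoder import LinearQEncoder

    env = gym.make('MountainCar-v0')
    Q = LinearQEncoder(env, tilings=8)
    alpha, gamma, epsilon = 0.5 / 8, 1.0, 0.05  # illustrative values only

    def eps_greedy(s):
        # Random action with probability epsilon, otherwise the greedy action.
        return env.action_space.sample() if np.random.rand() < epsilon else Q.get_optimal_action(s)

    s, _ = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        sp, r, terminated, truncated, _ = env.step(a)
        ap = eps_greedy(sp)
        # Sarsa target; drop the bootstrap term when the episode terminates.
        target = r if terminated else r + gamma * Q(sp, ap)
        # Semi-gradient step: grad_w Q(s,a) = x(s,a) for the linear encoder.
        Q.w = Q.w + alpha * (target - Q(s, a)) * Q.x(s, a)
        s, a = sp, ap
        done = terminated or truncated
    env.close()

For semi-gradient Q-learning (cf. Problem 11.3), only the target changes: :math:`Q(s', a')` is replaced by :math:`\max_{a'} Q(s', a')`.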
Classes and functions
-------------------------------------------------------------------------

.. autoclass:: irlc.ex11.feature_encoder.FeatureEncoder
    :show-inheritance:
    :members:

.. autoclass:: irlc.ex11.feature_encoder.LinearQEncoder
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to the conceptual exam problem
    :class: dropdown

    **Part a:** Since the immediate reward is zero, the next :math:`Q`-value will be determined by the :math:`Q`-value associated with the state north of the agent and the action the agent generates in that state:

    .. math::
        Q(s,a) = Q(s,a) + \alpha (r + \gamma Q(s', a') - Q(s,a) )

    If the exploration rate is non-zero, all actions :math:`a'` may occur, giving rise to two different new values. This means the :math:`Q`-value can be updated to:

    .. math::
        Q(s, \texttt{North}) = 0.0 \quad \text{or} \quad Q(s, \texttt{North}) = 0.432

    **Part b:** It is evident that we need to propagate the :math:`Q`-value from the northern square to the :math:`Q`-value we wish to update.
    To do this, we first go :math:`\texttt{north}`, but then, to change that :math:`Q`-value, we must select :math:`\texttt{east}` in that state. We then backtrack (:math:`\texttt{west}`, :math:`\texttt{south}`), so that Pacman is back in the original state, and a single step :math:`\texttt{east}` means Pacman updates the green :math:`Q`-value. It can be updated to a non-zero value since the next action generated by Sarsa can select the :math:`Q`-value associated with the red square. The answer is therefore 5, and the actions are:

    .. math::
        \texttt{north}, \texttt{east}, \texttt{west}, \texttt{south}, \texttt{east}

    **Part c:** After convergence, Sarsa will have learned the :math:`Q`-values associated with the :math:`\varepsilon`-soft policy :math:`\pi`, i.e. it converges to :math:`q_\pi`. The policy will clearly attempt to move the agent towards the goal square with a :math:`+1` reward, and since :math:`\gamma<1` it will attempt to do so quickly. The fastest way to do that is either north or south of the central pillar. The southern route, associated with :math:`Q_e`, takes the agent next to the dangerous square with a :math:`-1` reward. There is a chance of at least :math:`\frac{\varepsilon}{4}` of randomly falling into that square under the :math:`\varepsilon`-soft policy. We can therefore conclude that this path is far more dangerous, and this must be reflected in the :math:`Q`-values. Hence, :math:`Q_e < Q_n`. Note that this will not be true for :math:`Q`-learning, where the two paths are valued the same. This argument is similar to the cliff-walking example we saw in the exercises and in (:cite:t:`sutton`).
{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=28374cb3-7857-4dfe-9326-afea011f039a', 'Problem 11.1: Q-learning agent', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=e87906f1-ce7b-44c0-8d03-afea0124acd6', 'Problem 11.2: Sarsa-learning agent', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=32fbdba5-3ba5-4c8f-a956-afea0129b143', 'Problem 11.3: Semi-gradient Q-agent', True) }}