Exercise 11: Model-Free Control with tabular and linear methods

Exercise 11: Model-Free Control with tabular and linear methods#

Note

The exercises material is divided into general information (found on this page) and the actual exercise instructions. You can download this weeks exercise instructions from here:
- 02465ex11_Python.pdf
You are encouraged to prepare the homework problems 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date

Linear function approximators#

The idea behind linear function approximation of $Q$-values is that

We initialize (and eventually learn) a $d$-dimensional weight vector $w \in \mathbb{R}^d$
We assume there exists a function to compute a $d$-dimensional feature vector $x(s,a) \in \mathbb{R}^d$
The $Q$-values are then represented as

\[Q(s,a) = x(s,a)^\top w\]

Learning is therefore entirely about updating $w$. We are going to use a class, LinearQEncoder, to implement the tile-coding procedure for defining $x(s,a)$ as described in (Sutton and Barto [SB18]).

The following example shows how you initialize the linear $Q$-values and compute them in a given state:

>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8) # as in (:cite:t:`sutton`)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
0.0
>>> Q.d             # Get the number of dimensions
2048
>>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4]         # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])

For learning, you can simply update $w$ as any other variable, and there is a convenience method to get the optimal action. The following example will illustrate a basic usage:

>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = Q.w + 2 * Q.w     # w <-- 3*w
>>> Q.get_optimal_action(s) # Get the optimal action in state s
2

Note

Depending on how $x(s,a)$ is defined, the linear encoder can behave very differently. I have therefore included a few different classes in irlc.ex09.feature_encoder which only differ in how $x(s,a)$ is computed. I have chosen to focus this guide on the linear tile-encoder which is used in the MountainCar environment and is the main example in (Sutton and Barto [SB18]). The API for the other classes is entirely similar.

Classes and functions#

class irlc.ex11.feature_encoder.FeatureEncoder(env)[source]#

Bases: object

The idea behind linear function approximation of $Q$-values is that

We initialize (and eventually learn) a $d$-dimensional weight vector $w \in \mathbb{R}^d$
We assume there exists a function to compute a $d$-dimensional feature vector $x(s,a) \in \mathbb{R}^d$
The $Q$-values are then represented as

\[Q(s,a) = x(s,a)^\top w\]

Learning is therefore entirely about updating $w$.

The following example shows how you initialize the linear $Q$-values and compute them in a given state:

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
0.0
>>> Q.d             # Get the number of dimensions
2048
>>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4]         # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])

__init__(env)[source]#

Initialize the feature encoder. It requires an environment to know the number of actions and dimension of the state space.

Parameters:: env – An openai Gym Env.

property d#

Get the number of dimensions of $w$

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> Q.d
2048

x(s, a)[source]#

Computes the $d$-dimensional feature vector $x(s,a)$

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())

Parameters:

s – A state $s$
a – An action $a$

Returns:

Feature vector $x(s,a)$

get_Qs(state, info_s=None)[source]#

This is a helper function, it is only for internal use.

Parameters:

state
info_s

Returns:

get_optimal_action(state, info=None)[source]#

For a given state state, this function returns the optimal action for that state.

\[a^* = \arg\max_a Q(s,a)\]

An example:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
... 

Parameters:

state – State to find the optimal action in $s$
info – The info-dictionary corresponding to this state

Returns:

The optimal action according to the Q-values $a^*$

class irlc.ex11.feature_encoder.LinearQEncoder(env, tilings=8, max_size=2048)[source]#

Bases: FeatureEncoder

__init__(env, tilings=8, max_size=2048)[source]#

Implements the tile-encoder described by (SB18)

Parameters:

env – The openai Gym environment we wish to solve.
tilings – Number of tilings (translations). Typically 8.
max_size – Maximum number of dimensions.

x(s, a)[source]#

Computes the $d$-dimensional feature vector $x(s,a)$

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())

Parameters:

s – A state $s$
a – An action $a$

Returns:

Feature vector $x(s,a)$

property d#

Get the number of dimensions of $w$

>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # as in (SB18)
>>> Q.d
2048

Solutions to selected exercises#

Solution to the conceptual exam problem

Part a: Since the immediate reward is zero, the next $Q$-value will be determined by the $Q$-value associated with the state north of the agent and the action the agent generates in that state:

\[Q(s,a) = Q(s,a) + \alpha (r + \gamma Q(s', a') - Q(s,a) )\]

If the exploration rate is non-zero, all actions $a’$ may occur, giving rise to two different new values. This mean the $Q$-value can be updated to:

\[Q(s, \texttt{North} )= 0.0, 0.432\]

Part b: It is evident that we need to propagate the $Q$-value from the northern square to the $Q$-value we wish to update. To do this, we first go $\texttt{north}$, but then to change the $Q$-value we must select $\texttt{east}$ in that state. Therefore, we backtrack ($\texttt{west}$, $\texttt{south}$). Pacman will now be on the current state, and a single step $\texttt{east}$ will mean pacman updates the green $Q$-value, and it can be updated to a non-zero value since the next action generated by Sarsa can select the $Q$-value associated with the red square. The answer is therefore 5 and the actions are:

\[\texttt{north}, \texttt{east}, \texttt{west}, \texttt{south}, \texttt{east}\]

Part c: After convergence, Sarsa will have learned the $Q$-values associated with the $\varepsilon$-soft policy $\pi$, i.e. to $q_\pi$. It will clearly attempt to move the agent towards the goal square with a $+1$ reward, and since $\gamma<1$ it will attempt to do so quickly. The fastest way to do that is either north or south of the central pillar. The southern way, associate with $Q_e$, takes the agent next to the dangerous square with a $-1$ reward. There is a chance of at least $\frac{\varepsilon}{4}$ of randomly falling into that square using the $\varepsilon$-soft policy. We can therefore conclude this path is far more dangerous, and this must be reflected in the $Q$-values. Hence, $Q_e < Q_n$. Note that this will not be true for $Q$-learning, where the two paths are the same. Note this argument is similar to the cliffwalking example we saw in the exercises and in (Sutton and Barto [SB18]).

Problem 11.1: Q-learning agent

Problem 11.2: Sarsa-learning agent

Problem 11.3: Semi-gradient Q-agent

Exercise 11: Model-Free Control with tabular and linear methods

Contents

Exercise 11: Model-Free Control with tabular and linear methods#

Linear function approximators#

Classes and functions#

Solutions to selected exercises#