Exercise 11: Model-Free Control with tabular and linear methods#
Note
The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:
You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Linear function approximators#
The idea behind linear function approximation of \(Q\)-values is that
We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\)
We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\)
The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
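Before introducing the encoder used in the exercises, here is a minimal numpy sketch of the dot-product representation above (the dimension and feature vector below are made up purely for illustration):
>>> import numpy as np
>>> d = 4                              # (hypothetical) number of features
>>> w = np.zeros(d)                    # weight vector w, initialized to zero
>>> x_sa = np.array([1., 0., 1., 0.])  # (hypothetical) feature vector x(s,a)
>>> x_sa @ w                           # Q(s,a) = x(s,a)^T w; zero since w = 0
np.float64(0.0)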
We are going to use a class, LinearQEncoder, to implement the tile-coding procedure for defining \(x(s,a)\) as described in (Sutton and Barto [SB18]).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
For learning, you can simply update \(w\) like any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:
>>> import gymnasium as gym
>>> env = gym.make('MountainCar-v0')
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = Q.w + 2 * Q.w # w <-- 3*w
>>> Q.get_optimal_action(s) # Get the optimal action in state s
1
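Continuing the example above, the following is a sketch (not the course solution) of how a single semi-gradient Q-learning update of \(w\) could look, i.e. \(w \leftarrow w + \alpha \delta\, x(s,a)\); the step size and discount factor below are made-up placeholders:
>>> alpha, gamma = 0.5, 0.99                       # (hypothetical) step size and discount factor
>>> sp, r, terminated, truncated, _ = env.step(a)  # take one step in the environment
>>> ap = Q.get_optimal_action(sp)                  # greedy action in the next state
>>> target = r + (0 if terminated else gamma * Q(sp, ap))
>>> delta = target - Q(s, a)                       # TD error
>>> Q.w = Q.w + alpha * delta * Q.x(s, a)          # semi-gradient update of the weights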
Note
Depending on how \(x(s,a)\) is defined, the linear encoder can behave very differently. I have therefore included a few different classes in irlc.ex11.feature_encoder which only differ in how \(x(s,a)\) is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (Sutton and Barto [SB18]). The other classes share the same API.
Classes and functions#
- class irlc.ex11.feature_encoder.FeatureEncoder(env)[source]#
Bases: object
The idea behind linear function approximation of \(Q\)-values is that
We initialize (and eventually learn) a \(d\)-dimensional weight vector \(w \in \mathbb{R}^d\)
We assume there exists a function to compute a \(d\)-dimensional feature vector \(x(s,a) \in \mathbb{R}^d\)
The \(Q\)-values are then represented as
\[Q(s,a) = x(s,a)^\top w\]
Learning is therefore entirely about updating \(w\).
The following example shows how you initialize the linear \(Q\)-values and compute them in a given state:
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q(s,a) # Compute a Q-value.
np.float64(0.0)
>>> Q.d # Get the number of dimensions
2048
>>> Q.x(s,a)[:4] # Get the first four coordinates of the x-vector
array([1., 1., 1., 1.])
>>> Q.w[:4] # Get the first four coordinates of the w-vector
array([0., 0., 0., 0.])
- __init__(env)[source]#
Initialize the feature encoder. It requires an environment to know the number of actions and dimension of the state space.
- Parameters:
env – A gymnasium Env.
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
- get_Qs(state, info_s=None)[source]#
This is a helper function; it is only for internal use.
- Parameters:
state
info_s
- Returns:
- get_optimal_action(state, info=None)[source]#
For a given state state, this function returns the optimal action for that state:
\[a^* = \arg\max_a Q(s,a)\]
An example:
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
...
- Parameters:
state – The state \(s\) in which to find the optimal action
info – The info-dictionary corresponding to this state
- Returns:
The optimal action \(a^*\) according to the Q-values
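For context, here is a hedged sketch of how get_optimal_action could be combined with \(\varepsilon\)-greedy exploration in an agent. The class name is hypothetical, and the sketch assumes the agent stores an exploration rate as self.epsilon and the environment as self.env; adapt the names to your own agent:
>>> import numpy as np
>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyEpsilonGreedyAgent(TabularAgent):  # hypothetical agent, for illustration
...     def pi(self, s, k, info=None):
...         if np.random.rand() < self.epsilon:        # assumed attribute: explore with prob. epsilon
...             return self.env.action_space.sample()  # assumed attribute: pick a random action
...         return self.Q.get_optimal_action(s, info)  # otherwise act greedily w.r.t. Q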
- class irlc.ex11.feature_encoder.LinearQEncoder(env, tilings=8, max_size=2048)[source]#
Bases: FeatureEncoder
- __init__(env, tilings=8, max_size=2048)[source]#
Implements the tile-encoder described in (Sutton and Barto [SB18])
- Parameters:
env – The gymnasium environment we wish to solve.
tilings – Number of tilings (translations). Typically 8.
max_size – Maximum number of dimensions.
- x(s, a)[source]#
Computes the \(d\)-dimensional feature vector \(x(s,a)\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> s, info = env.reset()
>>> x = Q.x(s, env.action_space.sample())
- Parameters:
s – A state \(s\)
a – An action \(a\)
- Returns:
Feature vector \(x(s,a)\)
- property d#
Get the number of dimensions of \(w\)
>>> import gymnasium as gym
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8) # Same encoding as Sutton & Barto
>>> Q.d
2048
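As a small sanity check of the linear structure, the sketch below verifies that the encoder's Q-value equals the dot product \(x(s,a)^\top w\) (assuming, as in the earlier example, that Q.w can be reassigned to any vector of length Q.d):
>>> import gymnasium as gym
>>> import numpy as np
>>> from irlc.ex11.feature_encoder import LinearQEncoder
>>> env = gym.make('MountainCar-v0')
>>> Q = LinearQEncoder(env, tilings=8)
>>> s, _ = env.reset()
>>> a = env.action_space.sample()
>>> Q.w = np.random.randn(Q.d)                    # give the weights arbitrary (random) values
>>> bool(np.isclose(Q(s, a), Q.x(s, a) @ Q.w))    # Q(s,a) = x(s,a)^T w
True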
Solutions to selected exercises#
Problem 11.1: Q-learning agent
Problem 11.2: Sarsa-learning agent
Problem 11.3: Semi-gradient Q-agent