Exercise 10: Monte Carlo methods and TD learning#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions here:

  • You are encouraged to prepare homework problem 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Tabular methods (Q-learning, Sarsa, etc.)#

As the name suggests, tabular methods require us to maintain a table of \(Q\)-values or state-values \(V\). The \(Q\)-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class irlc.ex09.rl_agent.TabularAgent which will hopefully simplify the process.

Note

The main complication we need to deal with when representing the Q-values is when different states have different action spaces, i.e. when \(\mathcal{A}(s) \neq \mathcal{A}(s')\). Gymnasium's way of dealing with this situation is to use the info-dictionary, e.g. so that s, info = env.reset() will specify an info['mask'] entry, which is a numpy ndarray such that a given action a is available if info['mask'][a] == 1. You can read more about this choice at The gymnasium discrete space documentation.
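
As a minimal sketch, and assuming the environment provides info['mask'] as described above, the available actions could be recovered like this:

>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> s, info = env.reset()
>>> available = [a for a in range(env.action_space.n) if info['mask'][a] == 1]  # Actions allowed in s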

The \(Q\)-values behave like a 2d numpy ndarray:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.3) # Use epsilon-greedy exploration.
>>> state, _ = env.reset()
>>> state
(0, 0)
>>> agent.Q[state, 1] = 2 # Update a Q-value
>>> agent.Q[state, 1]     # Get a Q-value
2
>>> agent.Q[state, 0]     # Q-values are by default zero
0

To implement masking, the agent.Q-table has two special functions which require the info-dictionary. As long as you stick to these two functions and pass the correct info-dictionary, you will not get into trouble.

  • To get the optimal action use agent.Q.get_optimal_action(s, info_s)

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env)
    >>> state, info = env.reset()               # Get the info-dictionary corresponding to s
    >>> agent.Q[state, 1] = 2.5                 # Update a Q-value; action a=1 is now optimal.
    >>> agent.Q.get_optimal_action(state, info) # Note we pass along the info-dictionary corresponding to this state
    1
    
  • To get all Q-values corresponding to a state use agent.Q.get_Qs(s, info_s)

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env)
    >>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
    >>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
    >>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
    >>> actions                                    # All actions that are available in this state (after masking)
    (0, 1, 2, 3)
    >>> Qs                                         # All Q-values available in this state (after masking)
    (0, 2.5, 0, 0)
    

You can combine this functionality to get e.g. the maximal Q-value using agent.Q[s, agent.Q.get_optimal_action(s, info)].
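
For instance, a minimal sketch of how to look up the maximal Q-value \(\max_a Q(s,a)\) in a state, building on the example above:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.Q[state, 1] = 2.5                           # Action a=1 now has the largest Q-value
>>> a_star = agent.Q.get_optimal_action(state, info)  # Optimal action according to the Q-table
>>> agent.Q[state, a_star]                            # ... and hence the maximal Q-value
2.5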

Note

The Q-table will remember the masking information for a given state and warn you if you are trying to access an action that has been previously masked.

We often want to perform \(\varepsilon\)-greedy exploration. To simplify this, the agent has the function agent.pi_eps. Since this function uses the Q-values, it also requires an info-dictionary:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env, epsilon=0.1)  # epsilon-greedy exploration
>>> state, info = env.reset()               # to get a state and info-dictionary
>>> a = agent.pi_eps(state, info)           # Epsilon-greedy action selection
>>> a
0

Warning

In the train(s, a, r, sp, done, info_s, info_sp)-method, remember to use the info-dictionary corresponding to the state.

  • use self.Q.get_Qs(s, info_s) and self.Q.get_Qs(sp, info_sp)

  • never use self.Q.get_Qs(s, info_sp) (see the sketch of a train-method below)
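
To make this concrete, here is a minimal sketch of what a Q-learning-style train-method could look like. The update rule and the learning rate alpha are purely illustrative, and the signature is assumed to match the one given in the warning above:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyQAgent(TabularAgent):
...     def train(self, s, a, r, sp, done=False, info_s=None, info_sp=None):
...         if done:
...             max_Q_sp = 0                                  # A terminal state has no future value
...         else:
...             actions, Qs = self.Q.get_Qs(sp, info_sp)      # Use the info-dictionary belonging to sp
...             max_Q_sp = max(Qs)
...         alpha = 0.1                                       # Illustrative learning rate
...         # Standard Q-learning update using the stored discount factor self.gamma:
...         self.Q[s, a] += alpha * (r + self.gamma * max_Q_sp - self.Q[s, a])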

Classes and functions#

class irlc.ex09.rl_agent.TabularAgent(env, gamma=0.99, epsilon=0)[source]#

Bases: Agent

This helper class will simplify the implementation of most basic reinforcement learning methods. Specifically, it provides:

  • A \(Q(s,a)\)-table data structure

  • An epsilon-greedy exploration method

The code for the class is very simple, and I think it is a good idea to at least skim it.

The Q-data structure can be used as follows:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()               # Get the info-dictionary corresponding to s
>>> agent.Q[state, 1] = 2.5                 # Update a Q-value; action a=1 is now optimal.
>>> agent.Q[state, 1]                       # Check it has indeed been updated.
2.5
>>> agent.Q[state, 0]                       # Q-values are 0 by default.
0
>>> agent.Q.get_optimal_action(state, info) # Note we pass along the info-dictionary corresponding to this state
1

Note

The get_optimal_action-function requires an info-dictionary, since the info-dictionary contains information about which actions are available. To read more about the Q-values, see TabularQ.

__init__(env, gamma=0.99, epsilon=0)[source]#

Initialize a tabular agent. For convenience, it stores the discount factor \(\gamma\) and exploration parameter \(\varepsilon\) for epsilon-greedy exploration. Access them as e.g. self.gamma.

When you implement an agent and overwrite the __init__-method, you should include a call such as super().__init__(env, gamma, epsilon); see the sketch below the parameter list.

Parameters:
  • env – The gym environment

  • gamma – The discount factor \(\gamma\)

  • epsilon – Exploration parameter \(\varepsilon\) for epsilon-greedy exploration
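
As a sketch, a subclass could look as follows (the extra alpha-parameter is purely illustrative):

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def __init__(self, env, gamma=0.99, epsilon=0.1, alpha=0.1):
...         super().__init__(env, gamma=gamma, epsilon=epsilon)   # Sets up self.Q, self.gamma and self.epsilon
...         self.alpha = alpha                                    # Store any extra parameters yourself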

pi_eps(s, info)[source]#

Performs \(\varepsilon\)-greedy exploration with \(\varepsilon =\) self.epsilon and returns the action. Recall this means that with probability \(\varepsilon\) it returns a random action, and otherwise it returns an action associated with a maximal Q-value (\(\arg\max_a Q(s,a)\)). An example:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.pi_eps(state, info) # Note we pass along the info-dictionary corresponding to this state
0

Note

The info dictionary is used to mask (exclude) actions that are not possible in the state. It is similar to the info dictionary in agent.pi(s,info).

Parameters:
  • s – A state \(s_t\)

  • info – The corresponding info-dictionary returned by the gym environment

Returns:

An action computed using \(\varepsilon\)-greedy action selection based on the Q-values stored in the self.Q class.

class irlc.ex09.rl_agent.TabularQ(env)[source]#

Bases: object

This is a helper class for storing Q-values. It is used by the TabularAgent to store Q-values, which can be accessed as self.Q[s,a].

__init__(env)[source]#

Initialize the table. It requires a gym environment to know how many actions there are for each state.

Parameters:
  • env – A gym environment

get_Qs(state, info_s=None)[source]#

Get a list of all known Q-values for this particular state. That is, in a given state, it will return the two lists:

\[\begin{split}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix}, \quad \begin{bmatrix} Q(s,a_1) \\ Q(s,a_2) \\ \vdots \\ Q(s,a_k) \end{bmatrix}\end{split}\]

The info_s parameter ensures that actions are correctly masked. An example of how to use this function from a policy:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         actions, q_values = self.Q.get_Qs(s, info)
... 
Parameters:
  • state – The state to query

  • info_s – The info-dictionary returned by the environment for this state. Used for action-masking.

Returns:

  • actions - A tuple containing all actions available in this state (a_1, a_2, ..., a_k)

  • Qs - A tuple containing all Q-values available in this state (Q[s,a1], Q[s, a2], ..., Q[s,ak])

get_optimal_action(state, info_s)[source]#

For a given state state, this function returns the optimal action for that state.

\[a^* = \arg\max_a Q(s,a)\]

An example:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> class MyAgent(TabularAgent):
...     def pi(self, s, k, info=None):
...         a_star = self.Q.get_optimal_action(s, info)
Parameters:
  • state – The state \(s\) in which to find the optimal action

  • info_s – The info-dictionary corresponding to this state

Returns:

The optimal action \(a^*\) according to the Q-table

to_dict()[source]#

This helper function converts the known Q-values to a dictionary. This function is only used for visualization purposes in some of the examples.

Returns:

A dictionary q of all known Q-values of the form q[s][a]
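
A minimal usage sketch, assuming the returned dictionary exposes the value we just stored in the q[s][a]-form described above:

>>> from irlc.ex09.rl_agent import TabularAgent
>>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
>>> env = BookGridEnvironment()
>>> agent = TabularAgent(env)
>>> state, info = env.reset()
>>> agent.Q[state, 1] = 2.5     # Store a single Q-value
>>> q = agent.Q.to_dict()       # Convert all known Q-values to a nested dictionary
>>> q[state][1]                 # Assumed access pattern q[s][a]
2.5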

Solutions to selected exercises#