Exercise 9: Bellman's equations and exact planning#
Note
This page contains background information which may be useful in future exercises or projects. You can download this week's exercise instructions from here:
You are encouraged to prepare the homework problems 1 and 2 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Markov decision processes#
The Markov Decision Process (MDP) is a mathematical description of the problem solved by most reinforcement learning methods.
It will be represented as the class MDP, and it stands in the same relationship to methods such as value iteration as the DPModel class does to the dynamic programming method.
Let’s look at the MDP for the frozen-lake gridworld environment. It looks as follows:
# BookGridEnvironment
from irlc.gridworld.gridworld_environments import FrozenLake
from irlc import train, Agent, interactive
env = FrozenLake(render_mode="human") # Pass render_mode='human' for visualization.
env, agent = interactive(env, Agent(env)) # Make the environment interactive.
env.reset() # You always need to call reset
env.plot() # Plot the environment.
env.close()

We can instantiate the MDP class directly, or extract it from the gridworld environment as env.mdp. A couple of examples: we can inspect the initial state, all non-terminal states, and the actions available in the initial state.
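A minimal sketch of such an inspection, assuming the MDP object stores its start state in an initial_state attribute:

from irlc.gridworld.gridworld_environments import FrozenLake

env = FrozenLake()                                      # No rendering needed for inspection.
mdp = env.mdp                                           # Extract the MDP from the environment.
print("Initial state:", mdp.initial_state)              # Assumed attribute set by MDP.__init__.
print("Non-terminal states:", mdp.nonterminal_states)
print("Actions in the initial state:", mdp.A(mdp.initial_state))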
Warning
The actions available in the different states are not the same in the gridworld environment. Specifically, the winning (goal) state only has a single action, which takes the environment to the terminal state. Let’s take a look:
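Continuing with the mdp object from the sketch above, one way to inspect this is to loop over all non-terminal states and print the available actions (the exact coordinates of the goal state depend on the FrozenLake layout):

for s in mdp.nonterminal_states:
    print(f"Actions in state {s}: {mdp.A(s)}")   # The goal state should list a single action.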
Transition probabilities#
The transition probabilities are represented as a function Psr() that returns a dictionary of the form {(s1, r1): p1, (s2, r2): p2, ...}; compare to irlc.ex02.dp_model.DPModel.Pw(). An example:
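A sketch of calling Psr() on the FrozenLake MDP, reusing the mdp object and the assumed initial_state attribute from above:

s0 = mdp.initial_state
a0 = mdp.A(s0)[0]                                  # Pick the first available action.
for (sp, r), p in mdp.Psr(s0, a0).items():
    print(f"P(s'={sp}, r={r} | s={s0}, a={a0}) = {p}")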
Note
It is easy to miss, but (Sutton and Barto [SB18]) distinguish between:
The set of all states, $\mathcal{S}^+$, corresponding to states()
The set of all non-terminal states, $\mathcal{S}$, corresponding to nonterminal_states()
The value function for a terminal state is always zero. The policy, value function, action-value function and transition probabilities are only defined for the non-terminal states.
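As an illustration of the distinction, one could compare the two sets on the FrozenLake MDP (a sketch, again reusing the mdp object from above):

print(len(mdp.states))                  # All states, including terminal ones.
print(len(mdp.nonterminal_states))      # Non-terminal states only.
print([s for s in mdp.states if mdp.is_terminal(s)])   # The terminal states themselves.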
Building an MDP#
Building your own MDP is easy. Let's try to implement an MDP corresponding to a small gridworld example (see (Sutton and Barto [SB18]), Example 4.1):
# small_gridworld.py
from irlc.ex09.mdp import MDP

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

class SmallGridworldMDP(MDP):
    def __init__(self, rows=4, cols=4):
        self.rows, self.cols = rows, cols  # Number of rows, columns.
        super().__init__(initial_state=(rows//2, cols//2))  # Initial state is in the middle of the board.

    def A(self, state):
        return [UP, DOWN, RIGHT, LEFT]  # All four directions available.

    def Psr(self, state, action):
        row, col = state  # state is in the format state = (row, col)
        if action == UP:    row -= 1
        if action == DOWN:  row += 1
        if action == LEFT:  col -= 1
        if action == RIGHT: col += 1
        col = min(self.cols - 1, max(col, 0))  # Check boundary conditions.
        row = min(self.rows - 1, max(row, 0))
        reward = -1  # Always get a reward of -1.
        next_state = (row, col)
        # Note that P(next_state, reward | state, action) = 1 because the environment is deterministic.
        return {(next_state, reward): 1}

    def is_terminal(self, state):
        row, col = state
        return (row == 0 and col == 0) or (row == self.rows - 1 and col == self.cols - 1)
Note that:
The states are tuples of (row, col) coordinates (i, j)
Each state has 4 actions, so the action-space function A(state) simply returns the list [0, 1, 2, 3]
The is_terminal function determines if we are in a corner, i.e. in state (0, 0) or (rows-1, cols-1)
The transition probabilities are deterministic, so the dictionary returned by Psr has a single key (next_state, reward) with a value of 1
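A short usage sketch of the class above (the initial_state attribute is assumed to be set by MDP.__init__):

mdp = SmallGridworldMDP()
print(mdp.initial_state)          # (2, 2) for the default 4x4 grid.
print(mdp.A(mdp.initial_state))   # [0, 2, 1, 3], i.e. UP, DOWN, RIGHT, LEFT.
print(mdp.Psr((1, 1), UP))        # {((0, 1), -1): 1}: moving up is deterministic, reward -1.
print(mdp.is_terminal((0, 0)))    # True: the top-left corner is terminal.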
Classes and functions#
- class irlc.ex09.mdp.MDP(initial_state=None, verbose=False)[source]#
Bases:
object
This class represents a Markov Decision Process. It defines three main components:
- The actions available in a given state
- The transition probabilities
- A terminal check to determine if a state is terminal
It also provides a way to specify the initial state:
- As a single state the MDP always begins in (most common)
- As a general distribution $P(s_0)$
In addition to this it allows you to access either:
- The set of all states (including terminal states) as mdp.states
- The set of all non-terminal states as mdp.nonterminal_states
Note
The states and nonterminal_states properties are computed lazily. This means that if you don't access them, they won't use memory. This allows you to specify MDPs with an infinite number of states without running out of memory.
- __init__(initial_state=None, verbose=False)[source]#
Initialize the MDP. In the case where initial_state is set to a value $s_0$, the initial state distribution will be the deterministic distribution with $p(s_0) = 1$.
- Parameters:
initial_state – An optional initial state.
verbose – If True, the class will print out debug information (useful for very large MDPs)
- is_terminal(state)[source]#
Determines if a state is terminal (i.e., the episode has ended). In (SB18), the terminal states are those in $\mathcal{S}^+$ which are not in $\mathcal{S}$.
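A sketch of how it could be called, using the SmallGridworldMDP defined earlier on this page:

mdp = SmallGridworldMDP()
print(mdp.is_terminal((0, 0)))    # True: a terminal corner state.
print(mdp.is_terminal((1, 1)))    # False: an interior state.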
- Parameters:
state – The state to check
- Return type:
bool
- Returns:
True if the state is terminal and otherwise False.
- Psr(state, action)[source]#
Represents the transition probabilities:
When called with state state and action action, the function returns a dictionary of the form {(s1, r1): p1, (s2, r2): p2, ...}, so that p2 is the probability of transitioning to s2 (and obtaining reward r2) given we are in state state and take action action.
An example:
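A sketch of such a call, using the SmallGridworldMDP defined earlier on this page (the probabilities in the returned dictionary sum to one):

mdp = SmallGridworldMDP()
transitions = mdp.Psr((1, 1), UP)
print(transitions)                 # {((0, 1), -1): 1}
print(sum(transitions.values()))   # 1: a valid probability distribution.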
- Parameters:
state – The state to compute the transition probabilities in
action – The action to compute the transition probabilities in
- Return type:
dict
- Returns:
A dictionary where the keys are (state, reward) pairs we may transition to, and the values are their probabilities.
- A(state)[source]#
Returns a list of actions available in the given state:
An example to get the actions in the initial state:
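A sketch, assuming the initial state is stored in an initial_state attribute:

mdp = SmallGridworldMDP()
print(mdp.A(mdp.initial_state))   # [0, 2, 1, 3]: all four directions are always available.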
- Parameters:
state – State to compute the actions in
- Return type:
list
- Returns:
The list of available actions
- initial_state_distribution()[source]#
(Optional) specify the initial state distribution. Should return a dictionary of the form:
{s0: p0, s1: p1, ..., sn: pn}
, in which case $p_i$ is the probability of starting in state $s_i$. You will typically not overwrite this function but just set the initial state. In that case the initial state distribution is deterministic:
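A sketch of what the default (deterministic) case looks like for the SmallGridworldMDP defined above:

mdp = SmallGridworldMDP()
print(mdp.initial_state_distribution())   # Expected: {(2, 2): 1} for the default 4x4 grid.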
- Returns:
An initial state distribution as a dictionary, where the keys are states, and the values are their probabilities.
- property nonterminal_states#
The list of non-terminal states, i.e. $\mathcal{S}$ in (SB18).
- Returns:
The list of non-terminal states
- property states#
The list of all states including terminal ones, i.e. $\mathcal{S}^+$ in (SB18). The terminal states are those where is_terminal(state) is true.
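A sketch comparing the two properties on the SmallGridworldMDP (a 4x4 grid with two terminal corner states):

mdp = SmallGridworldMDP()
print(len(mdp.states))                  # 16: all states, including the two terminal corners.
print(len(mdp.nonterminal_states))      # 14: terminal states excluded.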
- Returns:
The list of all states
Solutions to selected exercises#
Solution to Sutton problem 3.14
The update is
Solution to Sutton problem 4.1
According to the relationships we saw at the beginning of today's lecture, the action-value function satisfies:
Furthermore, when we take an action, we know what the next state will be.
Solution to Problem 2 (Bellman's equations)
Part a: Recall that Bellman's equation (it does not matter which) says:
Solving, we get that
Part b: We already got the first equation from the previous problem. Applying a similar argument to the second equation leaves us with:
This can be written as a linear system of equations as:
Tip
During the exam, a few students correctly found the system of two equations (in my opinion, the difficult part), but when they used Maple to rewrite them as a linear system, Maple obtained some very strange expressions, probably due to the
Bonus question: The inverse of
Problem 9.1-9.3
Problem 9.4: Policy evaluation
Problem 9.5: Policy iteration
Problem 9.6: Value iteration
Problem 9.8
Problem 9.9 - Gambler