{% import macros as m with context  %}
{% set week = 3 %}
{{ m.exercise_head(week) }}


This section will begin our investigation of control theory. In control theory, the time :math:`t` is a continuous variable, and this will place additional requirements on the course software than
the discrete-time case we have seen up to now.

To manage this complexity, the code you need to implement in order to build or test a control environment has been kept as simple as possible, and in a format that matches what you have seen so far. In other words, when you
implement a new control problem, you only need to write a single class (:class:`~irlc.ex03.control_model.ControlModel`),
and let other code deal with the tedious issues of simulation and discretization.

In this week we will only focus on defining the environment, and then next week, we will introduce some
additional helper classes that will prove very helpful for building visualizations and making experiments, but which are **not** necessary to implement a control problem.

When we define a control environment we need to specify 3 things:

The simple bounds
    These provide restrictions on e.g. the possible values of the states and actions, i.e. :math:`\mathbf{x}_\text{low} \leq \mathbf{x} \leq \mathbf{x}_\text{high}` (see (:cite:t:`herlau`, Section 10.2))
The dynamics
    This is simply the differential equation :math:`\mathbf{f}` defined in (:cite:t:`herlau`, Section 10.1)

    .. math::

        \frac{ dx(t)}{dt} = f(x(t), u(t), t)

The cost function
    This measures how good (or bad) a control trajectory is and is defined in (:cite:t:`herlau`, eq. (10.12)), and
    is defined in the class :class:`~irlc.ex03.control_cost.SymbolicQRCost`.

These quantities are specified in the class :class:`~irlc.ex03.control_model.ControlModel`, and
parallel what you saw in the dynamical programming model.


Specifying a model
---------------------------------------------------------------------------------------------------------------
Specifying a model is easiest done by creating a class that inherits from :class:`~irlc.ex03.control_model.ControlModel` and implementing the bounds, dynamics and cost.
As an example, we will consider a simple version of the Pendulum model. Recall that in this environment the state is 2-dimensional, corresponding to the pendulum angle and angular velocity:

.. math::

    \mathbf{x} = \begin{bmatrix} \theta \\ \dot \theta \end{bmatrix}

and the control is 1-dimensional corresponding to a torque. The implementation of this model looks as follows:

.. literalinclude:: ../../shared/output/basic_pendulum_a.py

This is a fairly long listing, so lets go over each of the three components:

.. tip::
    :class: margin

    You are not required to specify bounds. If you do not specify bounds the system will default to no restrictions on all variables except for :math:`t_0` which is set to :math:`t_0 = 0`.

The dynamics:
    The dynamics is specified in the function  :class:`~irlc.ex03.control_model.ControlModel.sym_f` as a function that accepts two input arguments, ``x`` and ``u`` , and should return a new list where each element corresponds to the corresponding
    element in :math:`\mathbf{f}`. The output is therefore two-dimensional.
The Cost:
    The cost is defined in the function :class:`~irlc.ex03.control_model.ControlModel.get_cost`. It should return a
    :class:`~irlc.ex03.control_cost.SymbolicQRCost`. object, in this case it uses :math:`Q=I` and :math:`R=I`.
The bounds:
    There are two bounds specified. The first imposes a restriction on the controls:
    :math:`-10 \leq \mathbf{u}(t) \leq 10` and the second that the system starts in the hangin-down position: :math:`\begin{bmatrix} \pi \\ 0 \end{bmatrix} \leq \mathbf{x}(t_0) \leq \begin{bmatrix} \pi \\ 0 \end{bmatrix}`.
    Note that the functions should return a :class:`gymnasium.spaces.Box`-instance.

Once defined, you can access the bounds and the cost-matrices as follows:

.. runblock:: pycon

    >>> from irlc.ex03.basic_pendulum import BasicPendulumModel
    >>> model = BasicPendulumModel()
    >>> cost = model.get_cost() # Get the cost function object
    >>> print("The Q-matrix is")
    >>> print(cost.Q)
    >>> print("The upper and lower bounds on u is", model.u_bound().low, model.u_bound().high)
    >>> print("Unspecified bounds default to inf:", model.x_bound().low, model.x_bound().high)

.. note::
    :class: margin

    The dimensions of ``x`` and ``u`` are determined by the cost function. It is therefore important that you specify at least the :math:`Q` and :math:`R`-matries with the appropriate dimensions.


In the next week, it will turn out to be very convenient to be able to compute derivatives of the dynamics :math:`\mathbf{f}`. To do that effectively, we will use the library ``sympy``. Sympy is easy to work with,
but if you are unfamiliar with it there is a primer below. Once the model is defined, you can use :python:`print` to get an overview
of how it is defined -- I recommend doing this for verification. An example:

.. runblock:: pycon

    >>> from irlc.ex03.basic_pendulum import BasicPendulumModel
    >>> model = BasicPendulumModel()
    >>> print(model)


Simulating a policy
------------------------------------------------
The model contains a build in simulation function :func:`irlc.ex03.control_model.ControlModel.simulate`.
The function requires an action sequence, for which you have two choices:

- To pass a general policy function ``u(x,t)`` which will then be used to compute actions
- To pass a constant (numpy) value of the action sequence :math:`u_0`, in which case the policy is assumed constant :math:`u(x,t) = u_0`

Click on the **Source code** link to see how the following example simulate the policy :math:`u(x,t) = 3 \sin(2t)` and plot the outcome:

.. plot::

    import matplotlib.pyplot as plt
    import numpy as np
    from irlc.ex04.model_pendulum import PendulumModel

    model = PendulumModel()
    x0 = model.x0_bound().low                  # initial position (hanging down)

    def policy(x, t):
        return [3 * np.sin(2 * t)]

    xx, uu, tt = model.simulate(x0, policy, t0=0, tF=10)
    plt.plot(tt, xx[:, 0], label="$\\theta$")
    plt.plot(tt, uu[:, 0], label="$u$")


Defining bounds
-----------------------------------------------------------
The simple bounds serve a simple purposes, namely to tell us what values the states and actions can possibly take (see (:cite:t:`herlau`, Section 10.2) for a full list of bounds). The following table provides an overview and the last column is their default value. See also
:class:`gymnasium.spaces.Box`.

.. list-table:: Available bounds
   :header-rows: 1

   * - Bound
     - Function
     - Default value
   * - Bound on states: :math:`\mathbf{x}^\text{low} \leq \mathbf{x}(t) \leq \mathbf{x}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.x_bound`
     - .. literalinclude:: ../../shared/output/control_model_bc_stripped.py
   * - Bound on actions: :math:`\mathbf{u}^\text{low} \leq \mathbf{u}(t) \leq \mathbf{u}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.u_bound`
     - .. literalinclude:: ../../shared/output/control_model_bd_stripped.py
   * - Bound on start time: :math:`\mathbf{t_F}^\text{low} \leq t_F \leq \mathbf{t_F}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.tF_bound`
     - .. literalinclude:: ../../shared/output/control_model_bf_stripped.py
   * - Bound on end time: :math:`\mathbf{t_0}^\text{low} \leq t_0 \leq \mathbf{t_0}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.t0_bound`
     - .. literalinclude:: ../../shared/output/control_model_be_stripped.py
   * - Bound on initial state: :math:`\mathbf{x_0}^\text{low} \leq \mathbf{x}(t_0) \leq \mathbf{x_0}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.x0_bound`
     - .. literalinclude:: ../../shared/output/control_model_ba_stripped.py
   * - Bound on terminal state: :math:`\mathbf{x_F}^\text{low} \leq \mathbf{x}(t_F) \leq \mathbf{x_F}^\text{high}`
     - :func:`~irlc.ex03.control_model.ControlModel.xF_bound`
     - .. literalinclude:: ../../shared/output/control_model_bb_stripped.py


.. warning::

    Bounds are mostly relevant for direct control methods. I recommend that you don't specify the bounds unless you specifically need them.

According to the default values the system has no restrictions except for the initial state which defaults to the equality constraint
:math:`\mathbf{x}(t_0) = \mathbf{0}` and the initial time :math:`t_0 = 0`. When the initial state is set using an equality
constraint, you can access the initial state as follows:

.. runblock:: pycon

    >>> from irlc.ex03.basic_pendulum import BasicPendulumModel
    >>> model = BasicPendulumModel()
    >>> print("The initial state is x0 =", model.x0_bound().low)


States and actions
*******************************************************************************************
States and actions are represented as vectors. You can get the dimension :math:`n` of the state :math:`x` and dimension :math:`d` of the actions :math:`u` as follows:

.. runblock:: pycon

   >>> from irlc.ex04.model_pendulum import PendulumModel
   >>> model = PendulumModel()
   >>> print("n =", model.state_size, "and d =", model.action_size)


Computing the dynamics
*******************************************************************************************
The dynamics is specified as a symbolic expression
:func:`~irlc.ex04.control_model.ContiniousTimeSymbolicModel.f` is specified as a
``sympy`` object (:cite:t:`herlau`, Section 10.1).
However, the :class:`ControlModel` will automatically generate a numpy equivalent which
is what you will use in the exercises. In other words, you can evaluate it in two ways:

As a numpy expression:
    Use ``model.f``. An example:

    .. runblock:: pycon

        >>> from irlc.ex04.model_pendulum import PendulumModel
        >>> model = PendulumModel()
        >>> x = model.x0_bound().low # Get the initial state
        >>> u = [1]                    # A torque of unit
        >>> model.f(x, u, t=0) # Evaluate the dynamics

As a sympy expression:
    Use ``model.sym_f``. In order to call this function, we must create symbolic variables ``x``, ``u`` and ``t`` of the right dimensions:

    .. runblock:: pycon

        >>> from irlc.ex04.model_pendulum import PendulumModel
        >>> import sympy as sym
        >>> model = PendulumModel()
        >>> x = sym.symbols('x0:2')  # Generates a list with 2 symbolic variables [x_0, x_1]
        >>> u = sym.symbols('u0:1')  # Create a list of one symbolic variable [u0]
        >>> t = sym.symbols('t')     # the time variable (required but not used)
        >>> model.sym_f(x, u, t)     # Returns a symbolic expression corresponding to f(x, u)cd
    .. tip::
        Note that we represent vectors as lists of symbolic expressions. The function will therefore return a length 2 list with 2 symbolic expressions, each corresponding to a coordinate of :math:`\dot f`.

The cost-function
***************************************************************************
The model must also specify a cost-function (:cite:t:`herlau`, Subsection 10.3.1). I have chosen to de-emphasize the cost-function in this course since
the treatment is similar (but conceptually simpler) than the implementation of the dynamics, but the implementation
tends to be more messy since the cost function will typically involve many terms. For now, simply note that you can access the cost function
and print the cost matrices, and the value of the cost function, as follows:

.. runblock:: pycon

    >>> from irlc.ex04.model_pendulum import PendulumModel
    >>> model = PendulumModel()
    >>> cost = model.get_cost() # This function returns a cost-object.
    >>> cost.Q # Get access to a single cost matrix.
    >>> print(cost)   # Print the cost function in a nicer format.


Background: Symbolic expressions and sympy
------------------------------------------------
Sympy is a symbolic math library for python. You can think about it as a greatly simplified version of pytorch, or a python-version of Maple. Mat1 will in fact replace Maple with sympy in 2023.
Using sympy involves this procedure:

- Specify a sympy expression using the ``sympy`` python-library
- Perform operations on the expression, such as substitutions, algebraic operations or computation of derivatives
- Transform the sympy expression into a python function. This function will take numerical values as input and return numerical values.

Let's look at an example. This code computes the derivative of :math:`x^3`:

.. runblock:: pycon

    >>> import sympy as sym
    >>> x = sym.symbols("x") # Create a sympy expression. Compare this to a torch.Tensor
    >>> y = x ** 3           # Note that y is also a sympy expression.
    >>> print(y)
    >>> dydx = sym.diff(y, x) # Compute the derivative; this too is a sympy expression.
    >>> print(dydx)

The final expression is still a sympy object. To turn it into a numpy expression we would do the following:

.. runblock:: pycon

    >>> import sympy as sym
    >>> x = sym.symbols("x")                        # Create a sympy expression. Compare this to a torch.Tensor
    >>> dydx = sym.diff(x**3, x)                    # Compute the derivative; this too is a sympy expression.
    >>> y_numpy = sym.lambdify(x, dydx)             # y_numpy is now a numpy function.
    >>> value = y_numpy(2)                          # Note: This function takes numbers as input and returns numbers.
    >>> print("dx^3/dx evaluated in x=2 is", value)

You might notice that the function ``sym.lambdify`` returns a function. This is called a **higher-order function** if you are curious.
The bottom line is that the above recipe (starting from sympy variables, somehow turning them into another sympy expression, and finally ending with ``sym.lambdify``) will be representative of how we use sympy throughout this course.


Classes and functions
-------------------------------------------------------------------------

.. autoclass:: irlc.ex03.control_model.ControlModel
  :members:

.. autoclass:: irlc.ex03.control_cost.SymbolicQRCost
  :members:


Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=95b4bc7c-1691-4ff7-83e5-b10f012616d6', 'Problem 3.5: Exact evaluation of DP', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=0f8a74f8-fffa-4c64-8977-b10f011b141a', 'Exercise 3.6: Toy 2d control', True) }}


{% if show_solution[week] %}

.. admonition:: Solution to problem 1
    :class: dropdown

    **Part a:** First insert and integrate to get

    .. math::

        \dot x(t) = e^{-x(t)} u(t) = e^{-x(t)} 2t

    A standard strategy is to isolate the function we wish to solve for. This gives:

    .. math::

        2t = \dot x(t) e^{x(t)} = \frac{d}{dt} e^{x(t)}

    Integrating to to get:

    .. math::

        t^2 + k = e^{x(t)}

    The boundary condition gives :math:`x(t) = \log(t^2 + e)`

    **Part b:**  Inserting into the cost function and integrate gives:

    .. math::

        \int_{0}^{t_F} e^{x(t)} dt = \int_{0}^{t_F} (e + t^2) = t_F e + \frac{1}{3}t_F^3

    **Comment:** The purpose of this exercise is both to familiarize you with the notation, and more fundamentally to simply observe that for a given choice of
    actions, it is in principle possible to determine the cost-function exactly. From here on we could in principle try to optimize (solve) for the optimal action sequence.
    For instance, you can try to re-do the problem using a sequence :math:`u(t) = at` and try to determine the optimal value of :math:`a`.

{% endif %}

{% if show_solution[week] %}

.. admonition:: Solution to problem 4: Pen-and-paper DP
    :class: dropdown

    **Part a:** This expression is relevant because :math:`J_k(x_k) = \min_u Q(x_k, u)` -- and it was in fact the expression
    we computed when we implemented when we implemented the DP algorithm last week.
    Looking ahead, an expression :math:`Q` (with a slightly adapted definition) will play a  key role in reinforcement learning.

    **Part b:** If we insert the expressions for :math:`g_k` and :math:`f_k` we get:

    .. math::

        Q(x, u) & = \mathbb{E}[(x + w - u)^2 + \log 2 e^{(x+w-u)^2} ] \\
                & = \log 2 + 2\mathbb{E}[ (x + w - u)^2]

    When dealing with expectations, a good strategy is to expand and use linearity. If we expand the square we get:

    .. math::

        Q(x, u) & = \log 2 + 2\mathbb{E}[ (x - u)^2 + w^2 - 2w(x-u)]  \\
                & = \log 2 + 2 (x - u)^2 + 2\mathbb{E}[w^2] - 4(x-u) \mathbb{E}[w]

    Then simply recall that :math:`\mathbb{E}[w] = 0`
    and :math:`\mathbb{E}[w^2] = 1`
    since :math:`w` follows a standard normal distribution (mean 0 and unit variance) and we get:

    .. math::

        Q(x, u) = \log 2 + 2 + 2 (x - u)^2

    **Part c+d:**  The optimal policy is found by minimizing over actions -- clearly this occurs when :math:`x = u` so:

    .. math::

        \mu_{N-1}(x_{N-1}) = \arg\min_{u} Q(x_k, u) = x_k

    and

    .. math::

        J_{N-1}(x_{N-1}) = \min_{u} Q(x_k, u) = \log 2 + 2


    **Part f:** The only thing that would change is that :math:`\mathbb{E}[w^2] = \sigma^2`. Therefore

    .. math::

        Q_\sigma(x, u) = \log 2 + 2\sigma^2 + 2 (x - u)^2 = Q(x, u) - 2 + 2\sigma^2

    I.e., the only change is a constant factor that does not affect the optimal policy.

    This is perhaps rather strange -- adding noise, regardless of how much, does not change what we should do,
    it just jiggles us around a bit and affects the expected cost.

    **Comment:** We will re-do something similar to this calculation when we derive discrete LQR, but the math will be
    slighly more tedious. The last result, that noise does not affect the optimal policy, is mainly a consequence
    of the noise being symmetric around 0 -- and the same lesson will carry over when we consider LQR.

{% endif %}