{% import macros as m with context  %}

.. _inventory_game:


Week 1: The Inventory-control game
=============================================================================================================

{{ m.embed_game('week1_inventory') }}

.. topic:: Controls
    :class: margin

    :kbd:`0`, :kbd:`1`, :kbd:`2`
        Buy the given number of items.
    :kbd:`Space`
        Take a random action.
    :kbd:`p`
        Automatically take random actions.
    :kbd:`r`
        Reset the game.

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec01/lecture_01_inventory.py`



.. topic:: What you see

    The example showcases the inventory-control environment that you implemented in week 1. The cars are the items delivered (i.e., your actions),
    and the noise terms are the customers.
    You can select actions yourself, and the game will display the reward (recall that the reward is minus the cost, i.e. :math:`r_k = -g_k(x_k, u_k, w_k)`). Your task is to obtain as high an average reward as possible. If you press
    :kbd:`Space`, the game will buy random amounts of inventory and eventually compute the average reward for this random policy. Can you do better than that? (Better always means a higher average reward.)

    Here are the rules:

    - The inventory can hold 0, 1, or 2 items.
    - You can order 0, 1, or 2 items.
    - Customers buy 0, 1, or 2 items.
    - Excess inventory is discarded at the end of the day!

    The cost at each step is :math:`g_k(x_k, u_k, w_k) = u_k + (x_k + u_k - w_k)^2`.
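
The rules and the cost above can be written out as a few lines of Python. This is only a minimal sketch for checking numbers by hand: the function names are hypothetical, and the transition rule (unmet demand is lost, inventory is capped at 2) is an assumption based on the rules listed above, not a copy of the course implementation.

```python
def cost(x: int, u: int, w: int) -> int:
    """Stage cost g_k(x_k, u_k, w_k) = u_k + (x_k + u_k - w_k)^2."""
    return u + (x + u - w) ** 2


def reward(x: int, u: int, w: int) -> int:
    """The reward is minus the cost: r_k = -g_k(x_k, u_k, w_k)."""
    return -cost(x, u, w)


def next_inventory(x: int, u: int, w: int, capacity: int = 2) -> int:
    """Assumed transition: the inventory is clamped to the range 0..capacity."""
    return max(0, min(capacity, x + u - w))


# Worked example: one item in stock, one item ordered, two customers arrive.
x, u, w = 1, 1, 2
print(cost(x, u, w))            # 1 + (1 + 1 - 2)^2 = 1
print(reward(x, u, w))          # -1
print(next_inventory(x, u, w))  # 0
```

Comparing these values with the reward displayed by the game for the same inventory, order, and demand is one way to verify the cost by hand.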


Purpose
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The inventory-control problem is one we will study in depth during this course, and I invite you to compare it to Pacman. Note in particular:

- In Pacman, a state is the full configuration on the screen (i.e., Pacman's location and which pellets have been eaten). In the inventory-control game, the state is just a number (0, 1, or 2) denoting the size of the inventory.
- The visualization shows all the states that have been computed (by contrast, Pacman just shows a single state).
- Try to manually compute the cost function :math:`g_k` and verify that it is implemented correctly. The simulation shown above works exactly like the one you have seen in the lectures!
- A policy is in this case a function that accepts an integer (0, 1, or 2) and returns another integer (0, 1, or 2). Try a couple of policies multiple times (for instance, compare a policy that always presses 1 with one that always presses 2). Note that the average reward, computed over multiple episodes, differs between the policies. Your task is to find the policy with the highest average reward.
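
The policy comparison in the last point can also be carried out by simulation. The sketch below is an illustration under stated assumptions: the demand probabilities, the episode length of three days, the empty starting inventory, and the helper names ``step`` and ``average_reward`` are all hypothetical and are not taken from the game's source code.

```python
import random


def step(x, u, w):
    """One day: return (next inventory, reward), with reward = -g_k(x, u, w)."""
    reward = -(u + (x + u - w) ** 2)
    x_next = max(0, min(2, x + u - w))  # assumed: inventory clamped to 0..2
    return x_next, reward


def average_reward(policy, episodes=10_000, horizon=3, x0=0,
                   demand_probs=(0.1, 0.7, 0.2), seed=0):
    """Monte Carlo estimate of the total reward per episode under `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        x = x0
        for _ in range(horizon):
            u = policy(x)  # a policy maps an inventory level to an order size
            w = rng.choices((0, 1, 2), weights=demand_probs)[0]  # assumed demand
            x, r = step(x, u, w)
            total += r
    return total / episodes


def always_one(x):
    """Policy that orders one item regardless of the inventory."""
    return 1


def always_two(x):
    """Policy that orders two items regardless of the inventory."""
    return 2


print("always 1:", average_reward(always_one))
print("always 2:", average_reward(always_two))
```

Because a policy is just a function of the state, a state-dependent policy such as "order one item only when the inventory is empty" can be passed to ``average_reward`` in the same way.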