# Outlook
In this notebook, using BBRL, we code a simple version of the DQN algorithm
with neither a replay buffer nor a target network, so as to better understand
its inner mechanisms.

To understand this code, you first need to know more about
[the BBRL interaction model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md).
Then you should run [a didactic example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/02-multi_env_noautoreset.student.ipynb)
to see how agents interact in BBRL when autoreset=False.

The DQN algorithm is explained in [this
video](https://www.youtube.com/watch?v=CXwvOMJujZk) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/dqn.pdf).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf to manage the parameters. Thanks to it, we only need to
define the `EpisodicDQN` class and then a long `params = {...}` variable at
the bottom of this colab: the code is run with these parameters without calling
an explicit main.

More precisely, the code is run by calling

`dqn = EpisodicDQN(OmegaConf.create(params))`

`dqn.run()`

at the very bottom of the colab, after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [None]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

In [None]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime

OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [None]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [None]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input(
                f"Do you want to launch tensorboard in this notebook [y/n] "
            ).lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp

        print(
            f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}"
        )

## Definition of agents

The [DQN](https://daiwk.github.io/assets/dqn.pdf) algorithm is a critic-only algorithm.
Thus we just need a Critic agent (which is also used to output actions) and an Environment agent.

### The critic agent

The critic agent is an instance of the `DiscreteQAgent` class.
We first build a deterministic neural network that takes the state as input (so it has one input neuron per state variable)
and that outputs the Q-value of each action in that state (so it has one output neuron per action).

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list.
We also specify the activation function of neurons at each layer and optionally a different activation function for the final layer.

In [None]:
import torch.nn as nn
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    r"""Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)

Like any BBRL agent, the `DiscreteQAgent` has a `forward()` function that takes a time step `t` as input.
This `forward()` function reads the observation at that time step and writes the Q-values of all actions at the same time step into the workspace.
Choosing an action from these Q-values is left to separate agents, defined below.

In [None]:
class DiscreteQAgent(Agent):
    """BBRL agent (discrete actions) based on a MLP"""
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [action_dim], activation=nn.ReLU()
        )

    def forward(self, t: int, **kwargs):
        """An Agent can use self.workspace"""

        # Retrieves the observation from the environment at time t
        obs = self.get(("env/env_obs", t))

        # Computes the critic (Q) values for the observation
        q_values = self.model(obs)

        # ... and sets the q-values (one for each possible action)
        self.set(("q_values", t), q_values)

#### Greedily choosing the action

The ArgmaxActionSelector is in charge of choosing the action whose Q-value is the highest given the Q-values of all actions.
We may use it when we do not want to explore. 
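
As a small illustration, assuming a batch of Q-values stored as a (B x A) tensor with made-up numbers, greedy selection amounts to an argmax over the action dimension:

```python
import torch

# Illustrative Q-values for a batch of 3 environments and 2 actions (B x A)
q_values = torch.tensor([[0.1, 0.7],
                         [1.5, -0.2],
                         [0.3, 0.4]])

# Greedy choice: index of the largest Q-value along the action dimension
greedy_action = q_values.argmax(dim=1)
print(greedy_action.tolist())  # [1, 0, 1]
```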

In [None]:
class ArgmaxActionSelector(Agent):
    """BBRL agent that selects the best action based on Q(s,a)"""
    def forward(self, t: int, **kwargs):
        q_values = self.get(("q_values", t))
        action = q_values.argmax(1)
        self.set(("action", t), action)

### Creating an Exploration method

Like Q-learning, DQN needs some exploration to prevent premature convergence.
Here we use the simple $\epsilon$-greedy exploration method.
It is implemented as an agent which chooses an action based on the Q-values.
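
The rule itself can be sketched outside of BBRL as a plain function. The `epsilon_greedy` helper below is a hypothetical name used only for illustration; it assumes Q-values given as a (B x A) tensor.

```python
import torch


def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Pick a random action with probability epsilon, else the greedy one.

    q_values is a (B x A) tensor; the result is a (B,) long tensor.
    """
    batch_size, nb_actions = q_values.shape
    explore = torch.rand(batch_size) < epsilon  # rows that explore
    random_action = torch.randint(0, nb_actions, (batch_size,))
    greedy_action = q_values.argmax(dim=1)
    return torch.where(explore, random_action, greedy_action)


q_values = torch.tensor([[0.1, 0.7], [1.5, -0.2]])
# With epsilon = 0 the choice is always greedy
print(epsilon_greedy(q_values, epsilon=0.0).tolist())  # [1, 0]
```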

In [None]:
class EGreedyActionSelector(Agent):
    def __init__(self, epsilon):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, t: int, **kwargs):
        # Retrieves the q values 
        # (matrix nb. of episodes x nb. of actions)
        q_values = self.get(("q_values", t))
        size, nb_actions = q_values.size()

        # Flag 
        is_random = torch.rand(size).lt(self.epsilon).float()
        random_action = torch.randint(low=0, high=nb_actions, size=(size,))
        max_action = q_values.max(1)[1]

        # Choose the action based on the is_random flag
        action = is_random * random_action + (1 - is_random) * max_action

        # Sets the action at time t
        self.set(("action", t), action.long())

### The Logger class

The logger is in charge of collecting statistics during the training
process.

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the parameters part `params = { "logger":{ ...` below).

Note that the BBRL Logger is also saving the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.

In [None]:
from bbrl import instantiate_class

class Logger():

    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string, loss, steps):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having critic, actor and entropy losses
    def log_losses(self, critic_loss, entropy_loss, actor_loss, steps):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)

## Heart of the algorithm
### Computing the critic loss
The role of the `compute_critic_loss` function is to implement the Bellman
backup rule. In Q-learning, this rule was written:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r(s_t,a_t) + \gamma \max_a
Q(s_{t+1},a) - Q(s_t,a_t)]$$

In DQN, the update rule $Q \leftarrow Q + \alpha [\delta]$ is replaced by a
gradient descent step over the Q-network.

We first compute a target value $\text{target} = r(s_t,a_t) + \gamma \max_a
Q(s_{t+1},a)$ from a set of samples.

Then we get a TD error $\delta$ by subtracting $Q(s_t,a_t)$ from this target for these samples,
and we use the squared TD error as a loss function: $\text{loss} = (\text{target} -
Q(s_t,a_t))^2$.
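
To make the arithmetic concrete, here is a worked example for a single transition. All numbers are made up for illustration and do not come from a real run.

```python
# Illustrative numbers, not taken from a real run
gamma = 0.99          # discount factor
reward = 1.0          # r(s_t, a_t)
q_next = [0.5, 2.0]   # Q(s_{t+1}, a) for each action a
q_current = 1.5       # Q(s_t, a_t)

target = reward + gamma * max(q_next)  # 1.0 + 0.99 * 2.0 = 2.98
td_error = target - q_current          # 2.98 - 1.5 = 1.48
loss = td_error ** 2                   # 1.48 ** 2 = 2.1904
print(target, td_error, loss)
```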

To implement the above calculation in BBRL, the difficulty is to
properly deal with time indexes.

The `compute_critic_loss` function receives rewards, q_values and actions as
tensors that have been computed over a complete episode.

We need to take `reward[1:]`, which means all the rewards except the first one
because the reward from $(s_t, a_t)$ is $r_{t+1}$.
Similarly, to get $\max_a Q(s_{t+1}, a)$, we need to ignore the first of the
max_q values, using `max_q[1:]`.
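
The shift in time indexes can be checked on tensor shapes alone. The sketch below uses random Q-values and arbitrary sizes; only the slicing pattern matters.

```python
import torch

T, B, A = 4, 2, 3  # time steps, parallel episodes, actions
reward = torch.arange(T * B, dtype=torch.float32).reshape(T, B)
q_values = torch.rand(T, B, A)

max_q = q_values.max(dim=-1)[0]  # (T x B): best Q-value at each step

# The transition t -> t+1 pairs Q(s_t, .) with reward[t+1] and max_q[t+1]
reward_next = reward[1:]   # (T-1 x B): drops the reward stored at t=0
max_q_next = max_q[1:]     # (T-1 x B): max_a Q(s_{t+1}, a)
q_current = q_values[:-1]  # (T-1 x B x A): Q(s_t, .)

print(reward_next.shape, max_q_next.shape, q_current.shape)
```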

Do not forget to apply `.detach()` when computing the values of $\max_a Q(s_{t+1}, a)$, as **we do not
want to apply gradient descent on this $\max_a Q(s_{t+1}, a)$**; we only
apply gradient descent to $Q(s_t, a_t)$ according to this target value.
In practice, `x.detach()` returns a tensor that is detached from the computation graph,
so no gradient is propagated through it.
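
The effect can be seen on a toy example in which a single scalar `w` stands for the network weights. This is a sketch with made-up values, not the notebook's network.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # stands for a network weight
q_current = w * 3.0                        # plays the role of Q(s_t, a_t)
q_next = w * 5.0                           # plays the role of max_a Q(s_{t+1}, a)

# With detach(), the target is treated as a constant during backpropagation
target = 1.0 + 0.9 * q_next.detach()       # 1.0 + 0.9 * 10.0 = 10.0
loss = (target - q_current) ** 2           # (10.0 - 6.0) ** 2 = 16.0
loss.backward()

# d loss / d w = 2 * (target - q_current) * (-3) = 2 * 4 * (-3) = -24
print(w.grad)
```

Without the call to `detach()`, the gradient would also flow through `q_next` and its value would differ.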

The `must_bootstrap` tensor is used as a trick to deal with terminal states,
as explained [here](https://github.com/osigaud/bbrl/blob/master/docs/time_limits.md).
In practice, `must_bootstrap` is the logical negation of `terminated`.
In the autoreset=False version we use full episodes, thus `must_bootstrap` is
always True for all steps but the last one.
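
As a sketch of how this mask acts on the target, consider one episode of three steps that ends in a terminal state. The values are illustrative.

```python
import torch

gamma = 0.99
# One episode (B = 1) of 3 steps that ends in a terminal state at the last step
terminated = torch.tensor([[False], [False], [True]])
reward = torch.tensor([[0.0], [1.0], [1.0]])
max_q = torch.tensor([[5.0], [4.0], [3.0]])  # illustrative max_a Q(s_t, a)

must_bootstrap = ~terminated  # False only at the terminal step

# Where must_bootstrap is False, the target reduces to the reward alone
target = reward[1:] + gamma * max_q[1:] * must_bootstrap[1:].int()
print(target)
```

The first target is $1 + 0.99 \times 4 = 4.96$, while the second one is just the reward $1$ because the next state is terminal.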

To compute $Q(s_t,a_t)$ we use the `torch.gather()` function. This function is
a little tricky to use, see [this page](https://github.com/osigaud/bbrl/blob/master/docs/using_gather.md)
for useful explanations.
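
As a minimal sketch of the idea, assuming Q-values of shape (T x B x A) and actions of shape (T x B), `gather` picks the Q-value of the chosen action for every time step and episode:

```python
import torch

# Q-values for T = 2 time steps, B = 2 episodes, A = 3 actions
q_values = torch.tensor([[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                         [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]])
action = torch.tensor([[2, 0],
                       [1, 1]])  # (T x B) chosen actions

# gather needs an index with as many dimensions as q_values, so we add a
# trailing dimension, pick along the action axis, then remove that dimension
q_taken = q_values.gather(dim=-1, index=action.unsqueeze(-1)).squeeze(-1)
print(q_taken.tolist())  # [[3.0, 4.0], [8.0, 11.0]]
```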

In particular, the `qvals` tensor that we get has one more time step than the
target, since the last time step has no successor. Hence the need for
`qvals[:-1]` (we ignore the last time step). Finally we just need to compute
the difference `target - qvals[:-1]`, square it, take the mean and send it
back as the loss.
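
Putting these steps together, one possible way to assemble the loss is sketched below. The function name `compute_critic_loss_sketch` is hypothetical, it takes the discount factor directly instead of the configuration, and it is not claimed to be the reference solution of the exercise.

```python
import torch


def compute_critic_loss_sketch(gamma, reward, must_bootstrap, q_values, action):
    """Hypothetical sketch of the squared TD loss.

    Shapes: reward, must_bootstrap, action are (T x B); q_values is (T x B x A).
    `gamma` stands for the discount factor.
    """
    # max_a Q(s, a) at every step, detached so that the target is a constant
    max_q = q_values.max(dim=-1)[0].detach()
    # Target for the transition t -> t+1 (indices shifted by one)
    target = reward[1:] + gamma * max_q[1:] * must_bootstrap[1:].int()
    # Q(s_t, a_t) for the chosen actions
    qvals = q_values.gather(dim=-1, index=action.unsqueeze(-1)).squeeze(-1)
    # The last time step has no successor, so it is dropped
    td_error = target - qvals[:-1]
    return (td_error ** 2).mean()


# Tiny check with T = 2, B = 1, A = 2 and illustrative values
q_values = torch.tensor([[[1.0, 2.0]], [[3.0, 0.5]]], requires_grad=True)
reward = torch.tensor([[0.0], [1.0]])
must_bootstrap = torch.tensor([[True], [True]])
action = torch.tensor([[0], [1]])
loss = compute_critic_loss_sketch(0.9, reward, must_bootstrap, q_values, action)
print(loss)
```

Here the target is $1 + 0.9 \times 3 = 3.7$ and $Q(s_0, a_0) = 1$, so the loss is $2.7^2 = 7.29$.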

In [None]:
def compute_critic_loss(cfg, reward: torch.Tensor, must_bootstrap: torch.Tensor, q_values: torch.Tensor, action: torch.LongTensor) -> torch.Tensor:
    """Compute the temporal difference loss from a dataset to 
    update a critic

    For the tensor dimensions:
    
    - T = maximum number of time steps
    - B = number of episodes run in parallel 
    - A = action space dimension

    :param cfg: The configuration
    :param reward: A (T x B) tensor containing the rewards 
    :param must_bootstrap: a (T x B) tensor containing 0 if the episode is
        completed at time $t$ 
    :param q_values: a (T x B x A) tensor containing the Q-values at each
        time step
    :param action: a (T x B) long tensor containing the chosen action

    :return: The DQN loss
    """
    # We compute the max of Q-values over all actions and detach (so that
    # this part of the computation graph is not included in the gradient
    # backpropagation)

    # Compute the loss

    assert False, 'Not implemented yet'


    return critic_loss

## Main training loop

Note that everything about the shared workspace between all the agents is
completely hidden under the hood. This results in a gain of productivity, at
the expense of having to dig into the BBRL code if you want to understand the
details, change the multiprocessing model, etc.

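The next cell defines two helper functions: `setup_optimizer`, which builds an
optimizer over the parameters of a list of agents, and `copy_parameters`, which
copies the parameters of one model into another.

Further below, the `EpisodicDQN` class deals with the various parts of the
training loop:

- `__init__` takes care of initializing the train and evaluation policies
- `run` iterates over episodes, computes the critic loss and performs the
  gradient steps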

In [None]:
from bbrl import get_arguments, get_class
from itertools import chain

def setup_optimizer(cfg_optimizer, *agents):
    """Setup an optimizer for a list of agents"""
    optimizer_args = get_arguments(cfg_optimizer)
    parameters = [agent.parameters() for agent in agents]
    optimizer = get_class(cfg_optimizer)(chain(*parameters), **optimizer_args)
    return optimizer

def copy_parameters(model_a, model_b):
    """Copy parameters from model_a to model_b"""
    for model_a_p, model_b_p in zip(model_a.parameters(), model_b.parameters()):
        model_b_p.data.copy_(model_a_p)

### Learning environment

To setup a common learning environment for RL algorithms, we use the `RLBase`
class. This class:

1. Initializes the environment (random seed, logger, evaluation environment)
2. Defines an `evaluate` method that keeps the best agent so far
3. Defines a `visualize_best` method that displays the behavior of the best agent

Subclasses need to define `self.train_policy` and `self.eval_policy`, two
BBRL agents that respectively choose actions when training and evaluating.

The behavior of `RLBase` is controlled by the following configuration
variables:

- `base_dir` defines the directory subpath used when outputting losses during
  training as well as other outputs (serialized agent, global statistics, etc.)
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
- `gym_env` defines the gymnasium environment, and in particular
  `gym_env.env_name` the name of the gymnasium environment
- `logger` defines what type of logger is used to log the different values
  associated with learning
- `algorithm.eval_interval` defines the number of observed transitions between
  each evaluation of the agent

In [None]:
import numpy as np
from typing import Any
import logging
from abc import ABC
from functools import cached_property

class RLBase(ABC):
    """Base class for Reinforcement learning algorithms
    
    This class deals with common processing:

    - defines the logger, the train and evaluation agents
    - defines how to evaluate a policy
    """

    #: The configuration
    cfg: Any

    #: The evaluation environment deals with the last action, and produces a new
    # state of the environment
    eval_env: Agent

    #: The training policy
    train_policy: Agent

    #: The evaluation policy (if not defined, uses the training policy)
    eval_policy: Agent

    def __init__(self, cfg):
        # Basic initialization
        self.cfg = cfg
        torch.manual_seed(cfg.algorithm.seed)

        # Sets the base directory and logger directory
        base_dir = Path(self.cfg.base_dir)
        self.base_dir = Path("outputs") / Path(self.cfg.base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

        # Initialize the logger class
        if not hasattr(cfg.logger, "log_dir"):
            cfg.logger.log_dir = str(Path("outputs") / "tblogs" / base_dir)
        self.logger = Logger(cfg)

        # Subclasses have to define the training and eval policies
        self.train_policy = None
        self.eval_policy = None

        # Sets up the evaluation environment
        self.eval_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name), 
            cfg.algorithm.nb_evals
        ).seed(cfg.algorithm.seed)

        # Initialize values
        self.last_eval_step = 0
        self.nb_steps = 0
        self.best_policy = None
        self.best_reward = -torch.inf

        # Records the rewards
        self.eval_rewards = []

    @cached_property
    def train_agent(self):
        """Returns the training agent

        The agent is composed of a policy agent and the training environment.      
        This method supposes that `self.train_policy` has been setup
        """
        assert self.train_policy is not None, "The train_agent property is not defined before the policy is set"
        return TemporalAgent(Agents(self.train_env, self.train_policy))

    @cached_property
    def eval_agent(self):
        """Returns the evaluation agent 
        
        The agent is composed of a policy agent and the evaluation environment
        
        Uses `self.eval_policy` (or `self.train_policy` if not defined)

        """
        assert self.eval_policy is not None or self.train_policy is not None, "eval_agent property is not defined before the policy is set"
        return TemporalAgent(Agents(self.eval_env, self.eval_policy if self.eval_policy is not None else self.train_policy))

    def evaluate(self):
        """Evaluate the current policy `self.eval_policy`
        
        Evaluation is conducted every `cfg.algorithm.eval_interval` steps, and
        we keep a copy of the best agent so far in `self.best_policy`
        
        Returns True if the current policy is the best so far
        """
        if (self.nb_steps - self.last_eval_step) > self.cfg.algorithm.eval_interval:
            self.last_eval_step = self.nb_steps
            eval_workspace = Workspace() 
            self.eval_agent(
                eval_workspace,
                t=0,
                stop_variable="env/done"
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            self.logger.log_reward_losses(rewards, self.nb_steps)

            if getattr(self.cfg, "collect_stats", False):
                self.eval_rewards.append(rewards)

            rewards_mean = rewards.mean()
            if rewards_mean > self.best_reward:
                self.best_policy = copy.deepcopy(self.eval_policy)
                self.best_reward = rewards_mean
                return True

    def save_stats(self):
        """Save reward statistics into `stats.npy`"""
        if getattr(self.cfg, "collect_stats", False) and self.eval_rewards:
            data = torch.stack(self.eval_rewards, axis=-1) 
            with (self.base_dir / "stats.npy").open("wt") as fp:
                np.savetxt(fp, data.numpy())

    def visualize_best(self):
        """Visualize the best agent"""
        env = make_env(self.cfg.gym_env.env_name, render_mode="rgb_array")
        path = self.base_dir / "best_agent.mp4"
        print(f"Video of best agent recorded in {path}")
        record_video(env, self.best_policy, path)
        return video_display(str(path.absolute()))

The `EpisodicAlgo` class defines the environment when using episodes. In particular,
it defines `self.train_env`, which is the environment used for training. As
`algorithm.n_envs` environments are run in parallel, when an episode ends, we don't stop
the other episodes. To cater for this, once an episode has ended:

1. the workspace variable `env/done` is set to `True` for all the next time
   steps
2. the variable `env/reward` is set to 0 for all these steps
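
A consequence of this padding is that the number of observed transitions can be counted from `env/done` alone. The sketch below assumes a hand-written `done` tensor of shape (T x B) that follows the convention described above.

```python
import torch

# "env/done" for T = 4 time steps and B = 2 parallel environments.
# Environment 0 ends at t = 3, environment 1 ends at t = 1 and stays done.
done = torch.tensor([[False, False],
                     [False, True],
                     [False, True],
                     [True, True]])

# Each not-done entry is a state from which a transition was observed
nb_transitions = int((~done).sum())
print(nb_transitions)  # 4 (3 for environment 0, 1 for environment 1)
```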

The behavior of `EpisodicAlgo` is controlled by the following configuration
variables:

- `gym_env.env_name` defines the gymnasium environment
- `algorithm.n_envs` defines the number of parallel environments
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)

In [None]:
class EpisodicAlgo(RLBase):
    """Base class for RL experiments with full episodes"""
    def __init__(self, cfg, autoreset=False):
        super().__init__(cfg)

        self.train_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name, autoreset=autoreset), 
            cfg.algorithm.n_envs,
        ).seed(cfg.algorithm.seed)

`iter_episodes` and `iter_partial_episodes` (the latter for the autoreset case)
are generators: at each epoch, they collect samples with the training agent
and yield the resulting train workspace.
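
The control flow of such a generator can be sketched in plain Python. The `iter_epochs` function and the `log` list below are hypothetical and only mimic the collect, yield, report pattern.

```python
def iter_epochs(max_epochs):
    """Hypothetical generator mimicking the collect / yield / report pattern."""
    for epoch in range(max_epochs):
        samples = [epoch, epoch + 1]  # stands for the collected workspace
        yield samples                 # the caller performs its learning step here
        # Execution resumes here when the caller asks for the next item
        log.append(f"epoch {epoch} done")


log = []
for samples in iter_epochs(2):
    log.append(f"learning on {samples}")

print(log)
```

The code placed after `yield` runs only once the body of the caller's `for` loop has finished, which is why the progress bar is updated after each learning step.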

In [None]:
def iter_episodes(algo: EpisodicAlgo):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))

    train_workspace = Workspace()

    for algo.epoch in pbar:
        # Collect samples
        train_workspace = Workspace()
        algo.train_agent(train_workspace, t=0, stop_variable="env/done")

        # Update the number of steps
        algo.nb_steps += int((~train_workspace["env/done"]).sum())

        # Perform a learning step
        yield train_workspace

        # Update the progress bar
        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")


def iter_partial_episodes(algo: EpisodicAlgo, episode_steps: int):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))
    train_workspace = Workspace()

    for algo.epoch in pbar:
        if algo.epoch > 0:
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            algo.train_agent(
                train_workspace, t=1, n_steps=episode_steps-1, stochastic=True
            )
        else:
            algo.train_agent(
                train_workspace, t=0, n_steps=episode_steps, stochastic=True
            )

        algo.nb_steps += int((~train_workspace["env/done"]).sum())
        yield train_workspace

        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")

In [None]:
class EpisodicDQN(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)
            
        # Get the observation / action state space dimensions
        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()

        # Our discrete Q-Agent
        self.q_agent = DiscreteQAgent(obs_size, cfg.algorithm.architecture.hidden_size, act_size)

        # The e-greedy strategy (when training)
        explorer = EGreedyActionSelector(cfg.algorithm.epsilon)

        # The training agent combines the Q agent and the e-greedy explorer
        self.train_policy = Agents(self.q_agent, explorer)

        # The optimizer for the Q-Agent parameters
        self.optimizer = setup_optimizer(self.cfg.optimizer, self.q_agent)

        # ...and the evaluation policy (select the action with the highest Q-value)
        self.eval_policy = Agents(self.q_agent, ArgmaxActionSelector())

    def run(self):
        for train_workspace in iter_episodes(self):
            q_values, terminated, reward, action = train_workspace[
                "q_values", "env/terminated", "env/reward", "action"
            ]
        
            # Determines whether values of the critic should be propagated
            # True if the episode reached a time limit or if the task was not done
            # See https://github.com/osigaud/bbrl/blob/master/docs/time_limits.md
            must_bootstrap = ~terminated
        
            # Compute critic loss
            critic_loss = compute_critic_loss(self.cfg, reward, must_bootstrap, q_values, action)

            # Store the loss for tensorboard display
            self.logger.add_log("critic_loss", critic_loss, self.nb_steps)

            # Gradient step
            self.optimizer.zero_grad()
            critic_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                self.q_agent.parameters(), self.cfg.algorithm.max_grad_norm
            )
            self.optimizer.step()

            # Evaluate the current policy (if needed)
            self.evaluate()

In [None]:
# We setup tensorboard before running DQN
setup_tensorboard("./outputs/tblogs")

In [None]:
params={
  "save_best": False,
  "base_dir": "./outputs/${gym_env.env_name}/dqn-simple-S${algorithm.seed}_${current_time:}",
  "collect_stats": True,
  "logger": {
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./outputs/tblogs/${gym_env.env_name}/dqn-simple-S${algorithm.seed}_${current_time:}",
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },
  "algorithm":{
    "seed": 3,
    "max_grad_norm": 0.5,
    "epsilon": 0.02,
    "n_envs": 8,
    "eval_interval": 2000,
    "max_epochs": 500,
    "nb_evals": 10,
    "discount_factor": 0.99,
    "architecture":{"hidden_size": [128, 128]},
  },
  "gym_env":{
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 2e-3,
  }
}

dqn = EpisodicDQN(OmegaConf.create(params))

In [None]:
# Run and visualize the best agent
dqn.run()
dqn.visualize_best()

## What's next?

To get a full DQN, we need to do the following:
- Add a replay buffer. We can add a replay buffer independently from the
  target network. The version with a replay buffer and no target network
  corresponds to [the NFQ
  algorithm](https://link.springer.com/content/pdf/10.1007/11564096_32.pdf).
  This will be the aim of the next notebook.
- Before adding the replay buffer, we will first move to a version of DQN
  which uses the AutoResetGymAgent. This will be the aim of the next notebook
  too.
- We should also add a few extra mechanisms which are present in the full DQN
  version: starting to learn once the replay buffer is full enough, decreasing
  the exploration rate epsilon...
<!-- - We could also add visualization tools to visualize the learned Q network, by using the `plot_critic` function available in [`bbrl.visu.visu_critics`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/visu/visu_critics.py#L13) -->