 Copyright © Sorbonne University.

 This source code is licensed under the MIT license found in the LICENSE file
 in the root directory of this source tree.

# Outlook

In this notebook, we implement a simple version of the A2C algorithm using
BBRL.

To understand this code, you need to know more about [the BBRL interaction
model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md). Then you
should run [a didactic
example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/02-multi_env_noautoreset.student.ipynb)
to see how agents interact in BBRL when autoreset=False.

The A2C algorithm is explained in [this
video](https://www.youtube.com/watch?v=BUmsTlIgrBI) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/a2c.pdf). Here,
we use it with autoreset=False.

In [None]:
try:
    from easypip import easyimport
except ModuleNotFoundError:
    from subprocess import run

    assert (
        run(["pip", "install", "easypip"]).returncode == 0
    ), "Could not install easypip"
    from easypip import easyimport

easyimport("swig")
easyimport("bbrl_utils").setup()

import copy
import os

import torch
import torch.nn as nn
from bbrl.agents import Agent, Agents, TemporalAgent
from bbrl_utils.algorithms import EpisodicAlgo
from bbrl_utils.nn import build_mlp, setup_optimizer
from bbrl_utils.notebook import setup_tensorboard
from omegaconf import OmegaConf

# Learning environment

## Configuration

The learning environment is controlled by a configuration that defines a few
important things, as described in the example below. This configuration can
hold as much extra information as you need; the example below is the minimal
one.

```python
params = {
    # This defines a path for logs and saved models
    "base_dir": "${gym_env.env_name}/myalgo_${current_time:}",

    # The Gymnasium environment
    "gym_env": {
        "env_name": "CartPoleContinuous-v1",
    },

    # Algorithm
    "algorithm": {
        # Seed used for the random number generator
        "seed": 1023,

        # Number of parallel training environments
        "n_envs": 8,
                
        # Minimum number of steps between two evaluations
        "eval_interval": 500,
        
        # Number of parallel evaluation environments
        "nb_evals": 10,

        # Number of epochs (loops)
        "max_epochs": 40000,

    },
}

# Creates the configuration object, i.e. cfg.algorithm.nb_evals is 10
cfg = OmegaConf.create(params)
```

## The RL algorithm

In this notebook, the RL algorithm is based on `EpisodicAlgo`, which defines
the algorithm environment when using episodes. To use such an environment, we
just need to subclass `EpisodicAlgo` and define two things, namely the
`train_policy` and the `eval_policy`. Both are BBRL agents that, given the
environment state, select the action to perform.

```py
class MyAlgo(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)

        # Define the train and evaluation policies
        # (the agents compute the workspace `action` variable)
        self.train_policy = MyPolicyAgent(...)
        self.eval_policy = MyEvalAgent(...)

algo = MyAlgo(cfg)
```

The `EpisodicAlgo` defines useful objects:

- `algo.cfg` is the configuration
- `algo.nb_steps` (integer) is the number of steps since the training began
- `algo.logger` is a logger that can be used to collect statistics during training:
    - `algo.logger.add_log("critic_loss", critic_loss, algo.nb_steps)` registers the `critic_loss` value on tensorboard
- `algo.evaluate()` evaluates the current `eval_policy` if needed, and keeps the
  agent if it was the best so far (average cumulated reward)
- `algo.visualize_best()` runs the best agent on one episode, and displays the video



Besides, it also defines an `iter_episodes` method, whose use is simple:

```py
# With episodes
for workspace in algo.iter_episodes():
    # workspace is a workspace containing transitions
    # Episodes shorter than the longest one contain duplicated
    # transitions (with `env/done` set to true)
    ...
```

## Definition of agents

The [A2C](http://proceedings.mlr.press/v48/mniha16.pdf) algorithm is an
actor-critic algorithm. Thus we need an Actor agent, a Critic agent and an
Environment agent. The actor agents are built on an intermediate `ProbAgent`.
Two agents that use the output of `ProbAgent` are defined below:
- `ArgmaxActorAgent` that selects the action with the highest probability
- `StochasticActorAgent` that selects the action using the probability
  distribution
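
The difference between the two selection rules can be seen on a toy tensor of probabilities. This is a minimal sketch in plain PyTorch with made-up numbers, outside of any BBRL workspace; the variable names are chosen for the illustration.

```python
import torch

# Made-up action probabilities for a batch of 2 states and 3 actions
action_probs = torch.tensor([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])

# Greedy choice: index of the most probable action in each row
greedy_action = action_probs.argmax(dim=-1)

# Stochastic choice: one action sampled per row from the distribution
torch.manual_seed(0)
sampled_action = torch.distributions.Categorical(probs=action_probs).sample()
```

The greedy choice is deterministic, which suits evaluation, whereas sampling keeps some exploration during training.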

In [None]:
class ProbAgent(Agent):
    # Computes the distribution $p(a_t|s_t)$

    def __init__(self, state_dim, hidden_layers, n_action, name="prob_agent"):
        super().__init__(name)
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [n_action], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        # Get $s_t$
        observation = self.get(("env/env_obs", t))
        # Compute the distribution over actions
        scores = self.model(observation)
        action_probs = torch.softmax(scores, dim=-1)
        assert not torch.any(torch.isnan(action_probs)), "NaN Here"

        self.set(("action_probs", t), action_probs)
        entropy = torch.distributions.Categorical(action_probs).entropy()
        self.set(("entropy", t), entropy)


class StochasticActorAgent(Agent):
    """Sample an action according to $p(a_t|s_t)$"""

    def forward(self, t: int, **kwargs):
        probs = self.get(("action_probs", t))
        action = torch.distributions.Categorical(probs).sample()
        self.set(("action", t), action)


class ArgmaxActorAgent(Agent):
    """Choose an action $a$ that maximizes $p(a_t|s_t)$"""

    def forward(self, t: int, *, stochastic: bool = None, **kwargs):
        probs = self.get(("action_probs", t))
        action = probs.argmax(1)
        self.set(("action", t), action)

### CriticAgent

To implement the critic, A2C uses a value function $V(s)$. We thus call upon
the `CriticAgent` class. The CriticAgent below is a one hidden layer neural
network which takes an observation as input and whose output is the value of
this observation. It thus implements a $V(s)$ function.

It would be straightforward to define another CriticAgent (call it a
CriticQAgent by contrast to a CriticAgent) that would take an observation and
an action as input.
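
As an illustration of that remark, here is a rough sketch of what such a network could look like for discrete actions. It is a plain PyTorch module, not a BBRL agent: the class name `QCriticSketch` and the one-hot encoding of the action are assumptions made for this example, and wrapping it in an `Agent` that reads `env/env_obs` and `action` from the workspace is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QCriticSketch(nn.Module):
    """Hypothetical Q(s, a) network for discrete actions (plain PyTorch)."""

    def __init__(self, observation_size, n_actions, hidden_size):
        super().__init__()
        self.n_actions = n_actions
        self.model = nn.Sequential(
            nn.Linear(observation_size + n_actions, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, observation, action):
        # One-hot encode the action (shape B x A) and append it to the
        # observation (shape B x O)
        one_hot = F.one_hot(action, num_classes=self.n_actions).float()
        inputs = torch.cat([observation, one_hot], dim=-1)
        # B x 1 output squeezed to a vector of shape B
        return self.model(inputs).squeeze(-1)


q_critic = QCriticSketch(observation_size=4, n_actions=2, hidden_size=32)
q_values = q_critic(torch.zeros(5, 4), torch.tensor([0, 1, 0, 1, 1]))
```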

In [None]:
class CriticAgent(Agent):
    def __init__(self, observation_size, hidden_size):
        super().__init__()
        self.critic_model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, t, **kwargs):
        # Get the observation (shape B x O)
        observation = self.get(("env/env_obs", t))
        # The model outputs a matrix (shape B x 1) before squeeze transforms it
        # to a vector (shape B)
        critic = self.critic_model(observation).squeeze(-1)
        self.set(("critic", t), critic)

### Create the A2C environment

In the piece of code below, we define the agents and other objects needed
when training A2C (e.g. optimizers).

In [None]:
# Create the A2C Agent
class A2CAlgorithm(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)
        observation_size, n_actions = self.train_env.get_obs_and_actions_sizes()

        self.prob_agent = ProbAgent(
            observation_size, cfg.algorithm.architecture.actor_hidden_size, n_actions
        )
        self.critic_agent = CriticAgent(
            observation_size, cfg.algorithm.architecture.critic_hidden_size
        )

        # Define the train and evaluation agents
        self.train_policy = Agents(self.prob_agent, StochasticActorAgent())
        self.eval_policy = Agents(self.prob_agent, ArgmaxActorAgent())

### Compute critic loss

In this basic version, the critic loss is computed by estimating the advantage
as an expectation over the temporal difference error $\delta$. This is not
what the standard A2C algorithm does.
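
In formula form, a sketch of what this section computes, writing $r_t$ for the reward received on the transition from $s_t$ to $s_{t+1}$ and $\gamma$ for the discount factor (`discount_factor` in the parameters; the symbol names are a notational choice made here):

$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \mathcal{L}_{critic} = \mathbb{E}\left[ \delta_t^2 \right] $$

The code uses the mean of `delta**2` over the batch as an estimate of this expectation.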

You should use `.detach()` in the computation of the temporal difference
target. The idea is that we compute this target as a function of $V(s_{t+1})$,
but we do not want to apply gradient descent on this $V(s_{t+1})$; we will
only apply gradient descent to $V(s_t)$ according to this target value.

In practice, `x.detach()` returns a tensor that is detached from the
computation graph, so no gradient is back-propagated through it.
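
The effect of `.detach()` can be checked on a toy critic with a single parameter. This is a minimal sketch with made-up numbers, assuming a linear value function $V(s) = w \cdot s$ that is not part of the notebook; it compares the gradient obtained with and without detaching the target.

```python
import torch

# Toy critic V(s) = w * s with a single parameter w (made-up numbers)
w = torch.tensor(1.0, requires_grad=True)
s_t, s_next, reward, gamma = 1.0, 2.0, 0.5, 0.9

# With detach: the target is treated as a constant
target = reward + gamma * (w * s_next).detach()
loss = (target - w * s_t) ** 2
(grad_detached,) = torch.autograd.grad(loss, w)

# Without detach: the gradient also flows through V(s_{t+1})
target_attached = reward + gamma * (w * s_next)
loss_attached = (target_attached - w * s_t) ** 2
(grad_attached,) = torch.autograd.grad(loss_attached, w)
```

With the detached target, the gradient is $2 \delta \cdot (-s_t) = -2.6$; without it, the extra path through $V(s_{t+1})$ changes the gradient to $2.08$, so the update would no longer move $V(s_t)$ towards a fixed target.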

Note also the trick to deal with terminal states. If the state is terminal,
$V(s_{t+1})$ does not make sense. Thus we need to ignore this term. So we
multiply the term by `must_bootstrap`: if `must_bootstrap` is True (converted
into an int, it becomes a 1), we get the term. If `must_bootstrap` is False
(=0), we are at a terminal state, so we ignore the term. This trick is used in
many RL libraries, e.g. SB3.
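
The masking itself can be tried on two made-up transitions, one that continues and one that ends in a terminal state. This is only a toy illustration of the multiplication trick; the tensor names other than `must_bootstrap` are chosen for the example.

```python
import torch

gamma = 0.9
reward = torch.tensor([1.0, 1.0])
next_value = torch.tensor([5.0, 5.0])
# First transition continues, second one ends in a terminal state
must_bootstrap = torch.tensor([True, False])

# Booleans become 1 / 0, so the bootstrap term vanishes for terminal states
target = reward + gamma * next_value * must_bootstrap.int()
```

The first target keeps the bootstrap term ($1 + 0.9 \times 5 = 5.5$), while the second reduces to the reward alone.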

In [None]:
def compute_critic_loss(cfg, reward, must_bootstrap, critic):
    r"""Returns a couple (critic loss, TD(0) error $\delta_t$)

    :param cfg: The configuration
    :param reward: The reward (tensor 2xB)
    :param must_bootstrap: The must bootstrap flag (tensor 2xB)
    :param critic: The critic value (tensor 2xB)
    :return: A couple (critic loss, $\delta_t$)
    """
    # Compute temporal difference

    delta = ...
    assert False, 'Not implemented yet'


    # Compute critic loss
    critic_loss = (delta**2).mean()
    return critic_loss, delta

# Main training loop

This version works on complete episodes, i.e. with autoreset=False. If you
haven't done so yet, read
[the BBRL documentation](https://github.com/osigaud/bbrl/blob/master/docs/overview.md)
which explains a lot of details.

Note the `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`
lines. Several things need to be explained here.
- `optimizer.zero_grad()` is necessary to cancel all the gradients computed at
  the previous iterations
- note that we sum all the losses, both for the critic and the actor, before
  applying back-propagation with `loss.backward()`. At first glance, summing
  these losses may look weird, as the actor and the critic receive different
  updates with different parts of the loss. This mechanism relies on the central
  property of tensor manipulation libraries like TensorFlow and PyTorch. In
  PyTorch, each loss tensor comes with its own graph of computation for
  back-propagating the gradient, in such a way that when you back-propagate the
  loss, the adequate part of the loss is applied to the adequate parameters.
  These mechanisms are partly explained
  [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html).
- since the optimizer has been set to work with both the actor and critic
  parameters, `optimizer.step()` will optimize both agents and PyTorch ensures
  that each will receive its own part of the gradient.
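
The three calls can be observed on a toy problem where the "actor" and the "critic" are reduced to one scalar parameter each. This is a minimal sketch with made-up losses and a plain SGD optimizer, chosen so that the numbers are easy to check by hand; it is not the A2C loss.

```python
import torch

# One scalar parameter per "network" (made-up values)
actor_w = torch.tensor(1.0, requires_grad=True)
critic_w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([actor_w, critic_w], lr=0.1)

# Each loss only depends on its own parameter
actor_loss = 3.0 * actor_w
critic_loss = critic_w**2

optimizer.zero_grad()
loss = actor_loss + critic_loss
loss.backward()

# Each parameter only receives the gradient of its own term
actor_grad = actor_w.grad.item()  # d(3 w)/dw = 3
critic_grad = critic_w.grad.item()  # d(w^2)/dw = 2 * 2 = 4

# A single step updates both parameters
optimizer.step()
```

After the step, `actor_w` has moved to $1 - 0.1 \times 3 = 0.7$ and `critic_w` to $2 - 0.1 \times 4 = 1.6$: summing the losses did not mix the two updates.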

In [None]:
def run_a2c(a2c: A2CAlgorithm):
    cfg = a2c.cfg

    # Create the temporal critic agent to compute critic values over the workspace
    t_critic_agent = TemporalAgent(a2c.critic_agent)

    # Configure the optimizer over the a2c agent
    optimizer = setup_optimizer(cfg.optimizer, a2c.prob_agent, a2c.critic_agent)

    for train_workspace in a2c.iter_episodes():
        # Compute the critic value over the whole workspace
        t_critic_agent(train_workspace, n_steps=train_workspace.time_size())

        # Transform the episodes into transitions
        transition_workspace = train_workspace.get_transitions()

        # Get relevant tensors (sizes are T x B x ..., with T = 2 for transitions)
        critic, reward, action, action_probs, terminated = transition_workspace[
            "critic",
            "env/reward",
            "action",
            "action_probs",
            "env/terminated",
        ]

        # Determines whether values of the critic should be propagated
        # True if the episode reached a time limit or if the task was not done
        # See https://github.com/osigaud/bbrl/blob/master/docs/time_limits.md
        must_bootstrap = ~terminated

        # [[STUDENT]]...

        # Compute the A2C loss
        assert False, 'Not implemented yet'



        # Evaluate if needed
        a2c.evaluate()

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a
tensorboard visualisation.

In [None]:
params = {
    "base_dir": "${gym_env.env_name}/a2c-S${algorithm.seed}_${current_time:}",
    "algorithm": {
        "seed": 432,
        "n_envs": 10,
        "n_steps": 16,
        "nb_measures": 500,
        "nb_evals": 20,
        "discount_factor": 0.95,
        # Number of transitions between two evaluations
        "eval_interval": 1000,
        "max_epochs": 800,
        "critic_coef": 1.0,
        "actor_coef": 0.1,
        "architecture": {
            "actor_hidden_size": [32],
            "critic_hidden_size": 32,
        },
    },
    "gym_env": {
        "env_name": "CartPole-v1",
    },
    "optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 0.01,
    },
}

In [None]:
# Show tensorboard
setup_tensorboard("./outputs/tblogs")

In [None]:
a2c = A2CAlgorithm(OmegaConf.create(params))
run_a2c(a2c)
a2c.visualize_best()

With the parameters provided in this notebook, you should observe that the
reward collapses after 471 episodes.

## And now... add some entropy

To encourage the agent to explore more (or, said otherwise, to let the policy
converge less quickly), you can add some entropy-based regularization.

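$$ \mathcal{L}_{entropy} = \mathbb{E}_{s \sim \pi_s}\left[ \mathcal{H}\left( p(\cdot | s) \right) \right] = - \mathbb{E}_{s \sim \pi_s}\left[ \sum_a p(a | s) \log p(a | s) \right] $$

where $\pi_s$ corresponds to the stationary distribution according to the
current policy $\pi$ and the underlying MDP.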

You can use
[`torch.distributions.Categorical`](https://pytorch.org/docs/stable/distributions.html#categorical)
to quickly compute entropy for a categorical distribution.
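
As a quick check of what `Categorical(...).entropy()` returns, the sketch below compares it with the explicit sum $-\sum_a p(a|s) \log p(a|s)$ on made-up probabilities. The variable names are chosen for the illustration.

```python
import torch

# Two made-up distributions: a uniform one and a peaked one
probs = torch.tensor([[0.5, 0.5], [0.9, 0.1]])

# Entropy computed by PyTorch, one value per row
entropy = torch.distributions.Categorical(probs=probs).entropy()

# Same quantity written out by hand
manual_entropy = -(probs * probs.log()).sum(dim=-1)
```

The uniform distribution has the larger entropy ($\ln 2 \approx 0.693$), which is why rewarding entropy pushes the policy away from converging too quickly to a single action.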

In [None]:
# Modify A2C to add an entropic loss

assert False, 'Not implemented yet'
