# Outlook

In this notebook we code one version of the [Proximal Policy Optimization
(PPO)](https://arxiv.org/pdf/1707.06347.pdf) algorithm using BBRL. More
precisely, the version here is the one that uses the KL penalty as a
regularization term when optimizing the policy gradient.
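
For reference, the KL-penalized surrogate objective of the PPO paper can be written as follows, where $r_t(\theta)$ is the probability ratio between the current and the old policy, $\hat{A}_t$ is an advantage estimate and $\beta$ is the penalty coefficient (presumably the `beta` entry of the parameters defined at the end of this notebook):

$$
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right] \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
$$

This objective is maximized, so an implementation that minimizes a loss typically uses its negation.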

The PPO algorithm is briefly explained in [this
video](https://www.youtube.com/watch?v=uRNL93jV2HE) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/10_ppo.pdf).

It is also a good idea to have a look at the [Spinning Up
documentation](https://spinningup.openai.com/en/latest/algorithms/ppo.html).

This version of PPO works, but it incorrectly samples minibatches randomly
from the rollouts, without making sure that each sample is used once and only
once. See
https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for a
full description of all the coding tricks that should be integrated.
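
To illustrate what "each sample is used once and only once" means, here is a minimal sketch, independent of BBRL, that shuffles the transition indices once per optimization epoch and slices them into disjoint minibatches. The function name `iter_minibatch_indices` and its parameters are hypothetical; they are not part of this notebook or of BBRL.

```python
import random


def iter_minibatch_indices(n_samples, batch_size, rng=None):
    """Yield disjoint lists of indices that together cover range(n_samples) once."""
    rng = rng or random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)  # one random permutation per optimization epoch
    for start in range(0, n_samples, batch_size):
        # The last minibatch may be smaller when batch_size does not divide n_samples
        yield indices[start:start + batch_size]


# Hypothetical usage: 10 transitions split into minibatches of at most 4
batches = list(iter_minibatch_indices(10, 4, rng=random.Random(0)))
seen = sorted(i for batch in batches for i in batch)
```

Every index appears in exactly one minibatch, in contrast with independent random draws, where some transitions can be picked several times and others never.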

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf to manage the parameters: they are defined in a long
`params = {...}` variable at the bottom of this colab and converted into a
configuration object, so that the code is run with these parameters without
calling an explicit main.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`ppo=PPOPenalty(config)`

`run_ppo_penalty(ppo)`

at the very bottom of the colab, after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [None]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

In [None]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime

OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [None]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [None]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input(
                f"Do you want to launch tensorboard in this notebook [y/n] "
            ).lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp

        print(
            f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}"
        )

In [None]:
# Plot a policy and a critic as a 2D map
from bbrl.visu.plot_policies import plot_policy
from bbrl.visu.plot_critics import plot_critic

### The Logger class

The logger is in charge of collecting statistics during the training
process.

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the parameters part `params = { "logger":{ ...` below).

Note that the BBRL Logger also saves the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.
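
As a rough illustration of the idea behind this kind of mechanism, the sketch below builds an object from a dictionary that holds a `classname` key plus constructor arguments, which is the shape of the `logger` entry in the parameters of this notebook. It is a simplified, hypothetical stand-in written with the standard library only: `instantiate_from_config` is not the BBRL function, and the actual BBRL implementation may differ.

```python
import importlib


def instantiate_from_config(cfg):
    """Build an object from a dict with a 'classname' key and keyword arguments."""
    # Everything except 'classname' is passed to the constructor
    kwargs = {key: value for key, value in cfg.items() if key != "classname"}
    module_path, class_name = cfg["classname"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Example with a standard-library class instead of a BBRL logger
half = instantiate_from_config(
    {"classname": "fractions.Fraction", "numerator": 1, "denominator": 2}
)
```

The benefit of this pattern is that the class to use can be changed from the configuration alone, without touching the code that consumes the object.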

In [None]:
from bbrl import instantiate_class

class Logger():

    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string, loss, steps):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having critic, entropy and actor losses
    def log_losses(self, critic_loss, entropy_loss, actor_loss, steps):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)

In [None]:
from bbrl import get_arguments, get_class
from itertools import chain

def setup_optimizer(cfg_optimizer, *agents):
    """Setup an optimizer for a list of agents"""
    optimizer_args = get_arguments(cfg_optimizer)
    parameters = [agent.parameters() for agent in agents]
    optimizer = get_class(cfg_optimizer)(chain(*parameters), **optimizer_args)
    return optimizer

def copy_parameters(model_a, model_b):
    """Copy parameters from model_a to model_b"""
    for model_a_p, model_b_p in zip(model_a.parameters(), model_b.parameters()):
        model_b_p.data.copy_(model_a_p)

# Learning environment

To setup a common learning environment for RL algorithms, we use the `RLBase`
class. This class:

1. Initializes the environment (random seed, logger, evaluation environment)
2. Defines an `evaluate` method that keeps the best agent so far
3. Defines a `visualize_best` method that displays the behavior of the best agent

Subclasses need to define `self.train_policy` and `self.eval_policy`, two
BBRL agents that respectively choose actions when training and evaluating.

The behavior of `RLBase` is controlled by the following configuration
variables:

- `base_dir` defines the directory subpath used when outputting losses during
  training as well as other outputs (serialized agent, global statistics, etc.)
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
- `gym_env` defines the gymnasium environment, and in particular
  `gym_env.env_name` the name of the gymnasium environment
- `logger` defines what type of logger is used to log the different values
  associated with learning
- `algorithm.eval_interval` defines the number of observed transitions between
  each evaluation of the agent

In [None]:
import numpy as np
from typing import Any
import logging
from abc import ABC
from functools import cached_property

class RLBase(ABC):
    """Base class for Reinforcement learning algorithms
    
    This class deals with common processing:

    - defines the logger, the train and evaluation agents
    - defines how to evaluate a policy
    """

    #: The configuration
    cfg: Any

    #: The evaluation environment deals with the last action, and produces a new
    # state of the environment
    eval_env: Agent

    #: The training policy
    train_policy: Agent

    #: The evaluation policy (if not defined, uses the training policy)
    eval_policy: Agent

    def __init__(self, cfg):
        # Basic initialization
        self.cfg = cfg
        torch.manual_seed(cfg.algorithm.seed)

        # Sets the base directory and logger directory
        base_dir = Path(self.cfg.base_dir)
        self.base_dir = Path("outputs") / Path(self.cfg.base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

        # Initialize the logger class
        if not hasattr(cfg.logger, "log_dir"):
            cfg.logger.log_dir = str(Path("outputs") / "tblogs" / base_dir)
        self.logger = Logger(cfg)

        # Subclasses have to define the training and eval policies
        self.train_policy = None
        self.eval_policy = None

        # Sets up the evaluation environment
        self.eval_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name), 
            cfg.algorithm.nb_evals
        ).seed(cfg.algorithm.seed)

        # Initialize values
        self.last_eval_step = 0
        self.nb_steps = 0
        self.best_policy = None
        self.best_reward = -torch.inf

        # Records the rewards
        self.eval_rewards = []

    @cached_property
    def train_agent(self):
        """Returns the training agent

        The agent is composed of a policy agent and the training environment.      
        This method supposes that `self.train_policy` has been setup
        """
        assert self.train_policy is not None, "The train_policy property is not defined before the policy is set"
        return TemporalAgent(Agents(self.train_env, self.train_policy))

    @cached_property
    def eval_agent(self):
        """Returns the evaluation agent 
        
        The agent is composed of a policy agent and the evaluation environment
        
        Uses `self.eval_policy` (or `self.train_policy` if not defined)

        """
        assert self.eval_policy is not None or self.train_policy is not None, "eval_agent property is not defined before the policy is set"
        return TemporalAgent(Agents(self.eval_env, self.eval_policy if self.eval_policy is not None else self.train_policy))

    def evaluate(self):
        """Evaluate the current policy `self.eval_policy`
        
        Evaluation is conducted every `cfg.algorithm.eval_interval` steps, and
        we keep a copy of the best agent so far in `self.best_policy`
        
        Returns True if the current policy is the best so far
        """
        if (self.nb_steps - self.last_eval_step) > self.cfg.algorithm.eval_interval:
            self.last_eval_step = self.nb_steps
            eval_workspace = Workspace() 
            self.eval_agent(
                eval_workspace,
                t=0,
                stop_variable="env/done"
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            self.logger.log_reward_losses(rewards, self.nb_steps)

            if getattr(self.cfg, "collect_stats", False):
                self.eval_rewards.append(rewards)

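            rewards_mean = rewards.mean()
            if rewards_mean > self.best_reward:
                # Fall back to the training policy when no eval policy is defined,
                # as done in `eval_agent`
                policy = self.eval_policy if self.eval_policy is not None else self.train_policy
                self.best_policy = copy.deepcopy(policy)
                self.best_reward = rewards_mean
                return True

        return False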

    def save_stats(self):
        """Save reward statistics into `stats.npy`"""
        if getattr(self.cfg, "collect_stats", False) and self.eval_rewards:
            data = torch.stack(self.eval_rewards, axis=-1) 
            with (self.base_dir / "stats.npy").open("wt") as fp:
                np.savetxt(fp, data.numpy())

    def visualize_best(self):
        """Visualize the best agent"""
        env = make_env(self.cfg.gym_env.env_name, render_mode="rgb_array")
        path = self.base_dir / "best_agent.mp4"
        print(f"Video of best agent recorded in {path}")
        record_video(env, self.best_policy, path)
        return video_display(str(path.absolute()))

The `EpisodicAlgo` class defines the environment when using episodes. In
particular, it defines `self.train_env`, which is the environment used for
training. As `algorithm.n_envs` environments are used in parallel, when an
episode ends, we don't stop the other episodes. To cater for this:

1. The workspace variable `env/done` is set to `True` for all the next time
   steps
2. The variable `env/reward` is set to 0 for all these steps

The behavior of `EpisodicAlgo` is controlled by the following configuration
variables:

- `gym_env.env_name` defines the gymnasium environment
- `algorithm.n_envs` defines the number of parallel environments
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
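
With this padding convention, the number of transitions that were really collected can be recovered by counting the entries where `env/done` is `False`, which is what the iteration helpers of this notebook do with `(~train_workspace["env/done"]).sum()`. The sketch below mimics this count with plain Python lists; the flags are made-up data for two parallel environments.

```python
# done[t][e] is True once environment e has finished
# (made-up data: 4 time steps, 2 environments)
done = [
    [False, False],
    [False, False],
    [True, False],  # environment 0 ends at t=2
    [True, True],   # environment 0 stays "done", environment 1 ends at t=3
]

# Count the entries that are not flagged as done
nb_steps = sum(not flag for row in done for flag in row)
```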

In [None]:
class EpisodicAlgo(RLBase):
    """Base class for RL experiments with full episodes"""
    def __init__(self, cfg, autoreset=False):
        super().__init__(cfg)

        self.train_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name, autoreset=autoreset), 
            cfg.algorithm.n_envs,
        ).seed(cfg.algorithm.seed)

`iter_episodes` and `iter_partial_episodes` (for environments with autoreset)
are generators that collect samples with the training agent and yield the
resulting train workspace at each epoch.

In [None]:
def iter_episodes(algo: EpisodicAlgo):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))

    train_workspace = Workspace()

    for algo.epoch in pbar:
        # Collect samples
        train_workspace = Workspace()
        algo.train_agent(train_workspace, t=0, stop_variable="env/done")

        # Update the number of steps
        algo.nb_steps += int((~train_workspace["env/done"]).sum())

        # Perform a learning step
        yield train_workspace

        # Eval
        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")


def iter_partial_episodes(algo: EpisodicAlgo, episode_steps: int):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))
    train_workspace = Workspace()

    for algo.epoch in pbar:
        if algo.epoch > 0:
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            algo.train_agent(
                train_workspace, t=1, n_steps=episode_steps-1, stochastic=True
            )
        else:
            algo.train_agent(
                train_workspace, t=0, n_steps=episode_steps, stochastic=True
            )

        algo.nb_steps += int((~train_workspace["env/done"]).sum())
        yield train_workspace

        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")

## Functions to build networks

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list.
We also specify the activation function of the neurons at each layer and, optionally, a different activation function for the final layer.
The layers are initialized in an orthogonal way, which is known to provide better performance in PPO.

In [None]:
import numpy as np
import torch.nn as nn

def ortho_init(layer, std=np.sqrt(2), bias_const=0.0):
    """
    Function used for orthogonal initialization of the layers
    Taken from here in the cleanRL library: https://github.com/vwxyzjn/ppo-implementation-details/blob/main/ppo.py
    """
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

In [None]:
def build_ortho_mlp(sizes, activation, output_activation=nn.Identity()):
    r"""Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    with orthogonal initialization
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [ortho_init(nn.Linear(sizes[j], sizes[j + 1])), act]
    return nn.Sequential(*layers)

## Definition of PPO agents

### Critic agent

Like A2C, PPO uses a value function $V(s)$. We thus call upon the `VAgent`
class, which takes an observation as input and whose output is the value of
this observation.

In [None]:
class VAgent(Agent):
    def __init__(self, state_dim, hidden_layers, name="critic"):
        super().__init__()
        self.is_q_function = False
        self.model = build_ortho_mlp(
            [state_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )
        self.name = name

    def set_name(self, name):
        self.name = name
        return self

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        critic = self.model(observation).squeeze(-1)
        self.set((f"{self.name}/v_values", t), critic)

### KL penalty agent

When computing the KL penalty, we need to compute the KL divergence at every
time step. The KLAgent is specific to the KL regularization version of PPO. It
is used to compute the KL divergence between the current and the past policy.
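
For discrete actions, both policies are categorical distributions and the KL divergence has the closed form $\mathrm{KL}(p \,\|\, q) = \sum_a p(a) \log \frac{p(a)}{q(a)}$. The sketch below evaluates it in plain Python for a single state. The helper `categorical_kl` and the probabilities are made up for illustration, whereas the agent below relies on `torch.distributions.kl.kl_divergence`.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    # Terms with p(a) == 0 contribute nothing by convention
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


old_probs = [0.5, 0.5]  # hypothetical old policy in some state
new_probs = [0.9, 0.1]  # hypothetical current policy in the same state

kl_old_new = categorical_kl(old_probs, new_probs)
kl_new_old = categorical_kl(new_probs, old_probs)
```

The two values differ because the KL divergence is not symmetric, so the order in which the old and the current policies are passed matters.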

In [None]:
class KLAgent(Agent):
    def __init__(self, model_1, model_2):
        super().__init__()
        self.model_1 = model_1
        self.model_2 = model_2

    def forward(self, t, **kwargs):
        obs = self.get(("env/env_obs", t))
        dist_1 = self.model_1.dist(obs)
        dist_2 = self.model_2.dist(obs)
        kl = torch.distributions.kl.kl_divergence(dist_1, dist_2)
        self.set(("kl", t), kl)

### The DiscretePolicy

The `DiscretePolicy` was already used in A2C to deal with discrete actions, but
we have added the possibility to only predict the probability of an action
using the `predict_proba` variable in the `forward()` function. The
code is as follows.

In [None]:
class BasePolicy(Agent):
    def copy_parameters(self, other):
        """Copy parameters from other agent"""
        for self_p, other_p in zip(self.parameters(), other.parameters()):
            self_p.data.copy_(other_p)

In [None]:
class DiscretePolicy(BasePolicy):
    def __init__(self, state_dim, hidden_size, n_actions, name="policy"):
        super().__init__()
        self.model = build_ortho_mlp(
            [state_dim] + list(hidden_size) + [n_actions], activation=nn.ReLU()
        )
        self.set_name(name)

    def set_name(self, name):
        self.name = name
        
    def dist(self, obs):
        scores = self.model(obs)
        probs = torch.softmax(scores, dim=-1)
        return torch.distributions.Categorical(probs)

    def forward(self, t, *, stochastic=True, predict_proba=False, compute_entropy=False, **kwargs):
        """
        Compute the action given either a time step (looking into the workspace)
        or an observation (in kwargs)
        """
        if "observation" in kwargs:
            observation = kwargs["observation"]
        else:
            observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)

        if predict_proba:
            action = self.get(("action", t))
            log_prob = probs[torch.arange(probs.size()[0]), action].log()
            self.set((f"{self.name}/logprob_predict", t), log_prob)
        else:
            if stochastic:
                action = torch.distributions.Categorical(probs).sample()
            else:
                action = scores.argmax(1)

            log_probs = probs[torch.arange(probs.size()[0]), action].log()

            self.set(("action", t), action)
            self.set((f"{self.name}/action_logprobs", t), log_probs)

        if compute_entropy:
            entropy = torch.distributions.Categorical(probs).entropy()
            self.set(("entropy", t), entropy)

### Compute advantage function
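
Generalized Advantage Estimation (GAE) combines the one-step temporal-difference errors $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ into $\hat{A}_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}$, where $\gamma$ is `cfg.algorithm.discount_factor` and $\lambda$ is `cfg.algorithm.gae`. When a transition ends in a terminal state, the bootstrap term $\gamma V(s_{t+1})$ is dropped. The sketch below is a textbook version written with plain Python lists for a single trajectory; `gae_advantages` is a hypothetical helper, and the `gae` function of `bbrl.utils.functional` used in this notebook may differ in its details.

```python
def gae_advantages(rewards, values, next_values, must_bootstrap, gamma, lam):
    """Textbook GAE over one trajectory of transitions (plain Python lists)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        boot = 1.0 if must_bootstrap[t] else 0.0
        # One-step temporal-difference error of transition t
        delta = rewards[t] + gamma * next_values[t] * boot - values[t]
        # Discounted sum of the future TD errors, cut at terminal transitions
        running = delta + gamma * lam * boot * running
        advantages[t] = running
    return advantages


# Made-up trajectory of three transitions, the last one being terminal
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
next_values = [0.5, 0.5, 0.0]
must_bootstrap = [True, True, False]

td_only = gae_advantages(rewards, values, next_values, must_bootstrap, 0.9, 0.0)
monte_carlo = gae_advantages(rewards, values, next_values, must_bootstrap, 0.9, 1.0)
```

With $\lambda = 0$ the estimate reduces to the one-step TD error, and with $\lambda = 1$ it becomes the discounted return minus the value of the state, so $\lambda$ trades bias against variance.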

In [None]:
from bbrl.utils.functional import gae

def compute_advantage(cfg, reward, must_bootstrap, v_value):
    # Compute temporal difference with GAE
    reward = reward[1]
    next_val = v_value[1]
    current_val = v_value[0]
    advantage = gae(reward, next_val, must_bootstrap, current_val, cfg.algorithm.discount_factor, cfg.algorithm.gae)
    return advantage

### Main PPO agent

Create the PPO Agent

In [None]:
class PPOPenalty(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg, autoreset=True)
        
        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()
        self.train_policy = globals()[cfg.algorithm.policy_type](
            obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size, name="current_policy"
        )
        self.critic_agent = VAgent(obs_size, cfg.algorithm.architecture.critic_hidden_size)
        self.old_critic_agent = copy.deepcopy(self.critic_agent).set_name("old_critic")
        self.all_critics = TemporalAgent(Agents(self.critic_agent, self.old_critic_agent))

        self.old_policy = copy.deepcopy(self.train_policy)
        self.old_policy.set_name("old_policy")
        self.kl_agent = TemporalAgent(KLAgent(self.old_policy, self.train_policy))

### Compute critic loss

In [None]:
def compute_critic_loss(advantage):
    td_error = advantage**2
    critic_loss = td_error.mean()
    return critic_loss

### Compute penalty term in the policy loss

In [None]:
def compute_penalty_policy_loss(cfg, advantage, ratio, kl_loss):
    """Computes the PPO loss including KL regularization"""
    # Compute the policy loss

    assert False, 'Not implemented yet'

    return policy_loss

### Main loop

In [None]:
def run_ppo_penalty(ppo: PPOPenalty):
    cfg = ppo.cfg
    
    # The old_policy params must be wrapped into a TemporalAgent
    t_old_policy = TemporalAgent(ppo.old_policy)

    train_workspace = Workspace()
    optimizer = setup_optimizer(cfg.optimizer, ppo.train_agent, ppo.critic_agent)

    # Training loop (the progress bar is managed by iter_partial_episodes)
    first = True
    for train_workspace in iter_partial_episodes(ppo, cfg.algorithm.n_steps):
        # Handles continuation
        delta_t = 0
        if first:
            first = False
            delta_t = 1

        # Run the current policy and evaluate the proba of its action according to the old policy
        # The old_policy can be run after the train_agent on the same workspace
        # because it writes a logprob_predict and not an action.
        # That is, it does not determine the action of the old_policy,
        # it just determines the proba of the action of the current policy given its own probabilities

        with torch.no_grad():
            # Recompute
            ppo.train_agent(
                train_workspace,
                t=delta_t,
                n_steps=cfg.algorithm.n_steps,
                stochastic=True,
                predict_proba=False,
                compute_entropy=False,
            )
            t_old_policy(
                train_workspace,
                t=delta_t,
                n_steps=cfg.algorithm.n_steps,
                # Just computes the probability of the old policy's action
                # to get the ratio of probabilities
                predict_proba=True,
                compute_entropy=False,
            )

        # Compute the critic value over the whole workspace
        ppo.all_critics(train_workspace, t=delta_t, n_steps=cfg.algorithm.n_steps)

        transition_workspace = train_workspace.get_transitions()

        terminated, reward, action, v_value, old_v_value = transition_workspace[
            "env/terminated",
            "env/reward",
            "action",
            "critic/v_values",
            "old_critic/v_values",
        ]

        # Determines whether values of the critic should be propagated
        # See https://github.com/osigaud/bbrl/blob/master/docs/time_limits.md
        must_bootstrap = ~terminated[1]

        if cfg.algorithm.clip_range_vf > 0:
            # Clip the difference between old and new values
            # NOTE: this depends on the reward scaling
            v_value = old_v_value + torch.clamp(
                v_value - old_v_value,
                -cfg.algorithm.clip_range_vf,
                cfg.algorithm.clip_range_vf,
            )

        # then we compute the advantage using the clamped critic values
        advantage = compute_advantage(cfg, reward, must_bootstrap, v_value)

        # We store the advantage into the transition_workspace
        transition_workspace.set("advantage", 1, advantage)

        critic_loss = compute_critic_loss(advantage)
        loss_critic = cfg.algorithm.critic_coef * critic_loss

        optimizer.zero_grad()
        loss_critic.backward()
        torch.nn.utils.clip_grad_norm_(
            ppo.critic_agent.parameters(), cfg.algorithm.max_grad_norm
        )
        optimizer.step()

        # We start several optimization epochs on mini_batches
        for opt_epoch in range(cfg.algorithm.opt_epochs):
            if cfg.algorithm.batch_size > 0:
                sample_workspace = transition_workspace.select_batch_n(
                    cfg.algorithm.batch_size
                )
            else:
                sample_workspace = transition_workspace

            # Compute the policy loss

            # Compute the KL divergence
            kl = ...
            # Compute the probability of the played actions according to the current policy
            # We do not replay the action: we use the one stored into the dataset
            # Note that the policy is not wrapped into a TemporalAgent, but we use a single step
            # Compute the ratio of action probabilities
            # Compute the policy loss
            # (using compute_penalty_policy_loss)
            assert False, 'Not implemented yet'


            loss_policy = -cfg.algorithm.policy_coef * policy_loss

            # Entropy loss favors exploration
            # Note that the standard PPO algorithms do not have an entropy term, they don't need it
            # because the KL term is supposed to deal with exploration
            # So, to run the standard PPO algorithm, you should set cfg.algorithm.entropy_coef=0
            assert len(entropy) == 1, f"{entropy.shape}"
            entropy_loss = entropy[0].mean()
            loss_entropy = -cfg.algorithm.entropy_coef * entropy_loss

            # Store the losses for tensorboard display
            ppo.logger.log_losses(critic_loss, entropy_loss, policy_loss, ppo.nb_steps)
            ppo.logger.add_log("advantage", policy_advantage.mean(), ppo.nb_steps)

            loss = loss_policy + loss_entropy

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                ppo.train_policy.parameters(), cfg.algorithm.max_grad_norm
            )
            optimizer.step()

        ppo.old_policy.copy_parameters(ppo.train_policy)
        ppo.all_critics.agent.agents[1] = copy.deepcopy(ppo.critic_agent).set_name("old_critic")

        ppo.evaluate()

## Definition of the parameters

In [None]:
env_name = "CartPole-v1" # TODO: it is ugly to have to pull env_name out in order to reuse it in log_dir below...
base_dir = f"./tblogs/{env_name}/ppo-penalty/"

params={
    "save_best": False,
    "base_dir": base_dir,
    "logger":{
        "classname": "bbrl.utils.logger.TFLogger",
        "log_dir": f"{base_dir}" + str(time.time()),
        "cache_size": 10000,
        "every_n_seconds": 10,
        "verbose": False,    
    },

  "algorithm":{
      "seed": 12,
      "max_grad_norm": 0.5,
      "n_envs": 1,
      "n_steps": 50,
      "eval_interval": 1000,
      "nb_evals": 10,
      "gae": 0.95,
      "discount_factor": 0.9,
      "opt_epochs": 3,
      "batch_size": 16,
      "beta": 5.,
      "clip_range_vf": 0,
      "entropy_coef": 2e-7,
      "critic_coef": 0.4,
      "policy_coef": 1,
      "policy_type": "DiscretePolicy",
      "architecture":{
          "actor_hidden_size": [64, 64],
          "critic_hidden_size": [64, 64],
      },
  },
    "gym_env":{
        "env_name": "CartPole-v1",
    },
    "optimizer":
    {
        "classname": "torch.optim.Adam",
        "lr": 1e-3,
        "eps": 1e-5,
    }
}

params2={
    "save_best": False,
    "base_dir": base_dir,
    "logger":{
        "classname": "bbrl.utils.logger.TFLogger",
        "log_dir": f"{base_dir}" + str(time.time()),
        "cache_size": 10000,
        "every_n_seconds": 10,
        "verbose": False,    
    },

  "algorithm":{
      "seed": 12,
      "max_grad_norm": 0.5,
      "n_envs": 10,
      "n_steps": 250,
      "eval_interval": 4_000,
      "nb_evals": 10,
      "gae": 0.7,
      "discount_factor": 0.9,
      "opt_epochs": 3,
      "batch_size": 256,
      "beta": 10.,
      "clip_range_vf": 0,
      "entropy_coef": 2e-7,
      "critic_coef": 0.5,
      "policy_coef": 0.8,
      "policy_type": "DiscretePolicy",
      "architecture":{
          "actor_hidden_size": [256, 256],
          "critic_hidden_size": [256, 256],
      },
  },
    "gym_env":{
        "env_name": "LunarLander-v2",
    },
    "optimizer":
    {
        "classname": "torch.optim.Adam",
        "lr": 4e-3,
        "eps": 5e-5,
    }
}

### Launching tensorboard to visualize the results

In [None]:
setup_tensorboard("./tblogs")

In [None]:
config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
ppo=PPOPenalty(config)
run_ppo_penalty(ppo)

config=OmegaConf.create(params2)
torch.manual_seed(config.algorithm.seed)
ppo=PPOPenalty(config)
run_ppo_penalty(ppo)