# Outlook

In this notebook we code the Soft Actor-Critic (SAC) algorithm using BBRL.
This algorithm is described in [this
paper](http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf) and [this
paper](https://arxiv.org/pdf/1812.05905.pdf).

To understand this code, you need to know more about
[the BBRL interaction model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md).
Then you should run [a didactical example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/03-multi_env_autoreset.student.ipynb)
to see how agents interact in BBRL when `autoreset=True`.

The algorithm is explained in [this
video](https://www.youtube.com/watch?v=U20F-MvThjM) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/12_sac.pdf).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf so that, by just defining the `def run_sac(agents):`
function and then executing a long `params = {...}` variable at the bottom of
this colab, the code is run with these parameters without calling an explicit
main.

More precisely, the code is run by calling

`agents = SACAgents(OmegaConf.create(params))`

`run_sac(agents)`

at the very bottom of the colab, after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [None]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

In [None]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime

OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [None]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t-1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [None]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input(
                f"Do you want to launch tensorboard in this notebook [y/n] "
            ).lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp

        print(
            f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}"
        )

### Functions to build networks

We define a few utility functions to build neural networks.

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list.
We also specify the activation function of neurons at each layer and optionally a different activation function for the final layer.

In [None]:
import torch.nn as nn
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    r"""Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)
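
As a quick check of the layer layout, the sketch below repeats `build_mlp` so that it runs on its own, then builds a small network. The sizes `[4, 32, 32, 2]` and the batch of 5 random inputs are arbitrary values chosen for illustration only.

```python
import torch
import torch.nn as nn


def build_mlp(sizes, activation, output_activation=nn.Identity()):
    # Same logic as the helper above, repeated to keep this example standalone
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)


# Illustrative sizes: 4 inputs, two hidden layers of 32 units, 2 outputs
mlp = build_mlp([4, 32, 32, 2], activation=nn.ReLU())
outputs = mlp(torch.randn(5, 4))

print(mlp)            # Linear/ReLU pairs, then a final Linear followed by Identity
print(outputs.shape)  # one row per input, one column per output unit
```

Each `nn.Linear` is followed by an activation module, so a list of `n` sizes yields `2 * (n - 1)` modules in the resulting `nn.Sequential`.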

In [None]:
def build_backbone(sizes, activation):
    layers = []
    for j in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[j], sizes[j + 1]), activation]
    return layers

In [None]:
from bbrl import get_arguments, get_class
from itertools import chain

def setup_optimizer(cfg_optimizer, *agents):
    """Setup an optimizer for a list of agents"""
    optimizer_args = get_arguments(cfg_optimizer)
    parameters = [agent.parameters() for agent in agents]
    optimizer = get_class(cfg_optimizer)(chain(*parameters), **optimizer_args)
    return optimizer

def copy_parameters(model_a, model_b):
    """Copy parameters from model_a to model_b"""
    for model_a_p, model_b_p in zip(model_a.parameters(), model_b.parameters()):
        model_b_p.data.copy_(model_a_p)

### The Logger class

The logger is in charge of collecting statistics during the training
process.

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the parameters part `params = { "logger":{ ...` below).

Note that the BBRL Logger is also saving the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.

In [None]:
from bbrl import instantiate_class

class Logger():

    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string, loss, steps):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having a critic loss, an actor loss and an entropy loss
    def log_losses(self, critic_loss, entropy_loss, actor_loss, steps):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)

# Learning environment

To set up a common learning environment for RL algorithms, we use the `RLBase`
class. This class:

1. Initializes the environment (random seed, logger, evaluation environment)
2. Defines an `evaluate` method that keeps the best agent so far
3. Defines a `visualize_best` method that displays the behavior of the best agent

Subclasses need to define `self.train_policy` and `self.eval_policy`, two
BBRL agents that respectively choose actions when training and evaluating.

The behavior of `RLBase` is controlled by the following configuration
variables:

- `base_dir` defines the directory subpath used when outputting losses during
training as well as other outputs (serialized agent, global statistics, etc.)
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
- `gym_env` defines the gymnasium environment, and in particular
`gym_env.env_name` the name of the gymnasium environment
- `logger` defines what type of logger is used to log the different values
associated with learning
- `algorithm.eval_interval` defines the number of observed transitions between
each evaluation of the agent

In [None]:
import numpy as np
from typing import Any
import logging
from abc import ABC
from functools import cached_property

class RLBase(ABC):
    """Base class for Reinforcement learning algorithms
    
    This class deals with common processing:

    - defines the logger, the train and evaluation agents
    - defines how to evaluate a policy
    """

    #: The configuration
    cfg: Any

    #: The evaluation environment deals with the last action, and produces a new
    # state of the environment
    eval_env: Agent

    #: The training policy
    train_policy: Agent

    #: The evaluation policy (if not defined, uses the training policy)
    eval_policy: Agent

    def __init__(self, cfg):
        # Basic initialization
        self.cfg = cfg
        torch.manual_seed(cfg.algorithm.seed)

        # Sets the base directory and logger directory
        base_dir = Path(self.cfg.base_dir)
        self.base_dir = Path("outputs") / Path(self.cfg.base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

        # Initialize the logger class
        if not hasattr(cfg.logger, "log_dir"):
            cfg.logger.log_dir = str(Path("outputs") / "tblogs" / base_dir)
        self.logger = Logger(cfg)

        # Subclasses have to define the training and eval policies
        self.train_policy = None
        self.eval_policy = None

        # Sets up the evaluation environment
        self.eval_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name), 
            cfg.algorithm.nb_evals
        ).seed(cfg.algorithm.seed)

        # Initialize values
        self.last_eval_step = 0
        self.nb_steps = 0
        self.best_policy = None
        self.best_reward = -torch.inf

        # Records the rewards
        self.eval_rewards = []

    @cached_property
    def train_agent(self):
        """Returns the training agent

        The agent is composed of a policy agent and the training environment.      
        This method supposes that `self.train_policy` has been setup
        """
        assert self.train_policy is not None, "The train_policy property is not defined before the policy is set"
        return TemporalAgent(Agents(self.train_env, self.train_policy))

    @cached_property
    def eval_agent(self):
        """Returns the evaluation agent 
        
        The agent is composed of a policy agent and the evaluation environment
        
        Uses `self.eval_policy` (or `self.train_policy` if not defined)

        """
        assert self.eval_policy is not None or self.train_policy is not None, "eval_agent property is not defined before the policy is set"
        return TemporalAgent(Agents(self.eval_env, self.eval_policy if self.eval_policy is not None else self.train_policy))

    def evaluate(self):
        """Evaluate the current policy `self.eval_policy`
        
        Evaluation is conducted every `cfg.algorithm.eval_interval` steps, and
        we keep a copy of the best agent so far in `self.best_policy`
        
        Returns True if the current policy is the best so far
        """
        if (self.nb_steps - self.last_eval_step) > self.cfg.algorithm.eval_interval:
            self.last_eval_step = self.nb_steps
            eval_workspace = Workspace() 
            self.eval_agent(
                eval_workspace,
                t=0,
                stop_variable="env/done"
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            self.logger.log_reward_losses(rewards, self.nb_steps)

            if getattr(self.cfg, "collect_stats", False):
                self.eval_rewards.append(rewards)

            rewards_mean = rewards.mean()
            if rewards_mean > self.best_reward:
                self.best_policy = copy.deepcopy(self.eval_policy)
                self.best_reward = rewards_mean
                return True

    def save_stats(self):
        """Save reward statistics into `stats.npy`"""
        if getattr(self.cfg, "collect_stats", False) and self.eval_rewards:
            data = torch.stack(self.eval_rewards, axis=-1) 
            with (self.base_dir / "stats.npy").open("wt") as fp:
                np.savetxt(fp, data.numpy())

    def visualize_best(self):
        """Visualize the best agent"""
        env = make_env(self.cfg.gym_env.env_name, render_mode="rgb_array")
        path = self.base_dir / "best_agent.mp4"
        print(f"Video of best agent recorded in {path}")
        record_video(env, self.best_policy, path)
        return video_display(str(path.absolute()))

The `EpochBasedAlgo` defines the environment when using replay buffers. In
particular, it defines `self.train_env` which is the environment used for
training. This environment uses *autoreset*, i.e. when reaching a terminal
state, the environment is automatically reset and a new episode starts.

`EpochBasedAlgo` stores all the transitions $(s_t, a_t, r_t, s_{t+1}, ...)$
into `self.replay_buffer`.

The behavior of `EpochBasedAlgo` is controlled by the following configuration
variables:

- `gym_env.env_name` defines the gymnasium environment
- `algorithm.n_envs` defines the number of parallel environments
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
- `algorithm.buffer_size` is the maximum number of transitions stored into the
replay buffer

In [None]:
class EpochBasedAlgo(RLBase):
    """Base class for RL experiments using transition (replay) buffers"""

    train_agent: TemporalAgent

    def __init__(self, cfg):
        super().__init__(cfg)

        # The training environment uses autoreset
        self.train_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name, autoreset=True), 
            cfg.algorithm.n_envs
        ).seed(cfg.algorithm.seed)

        # Create the replay buffer that stores the collected transitions
        self.replay_buffer = ReplayBuffer(max_size=cfg.algorithm.buffer_size)

`iter_replay_buffers` provides easy access to the replay buffer when
learning. Its behavior depends on several configuration values:

- `cfg.algorithm.max_epochs` defines the number of times the agent is used to
collect transitions
- `cfg.algorithm.learning_starts` defines the number of transitions before
learning starts
 
Using `iter_replay_buffers` is simple:

```py
  class MyAlgo(EpochBasedAlgo):
      def __init__(self, cfg):
          super().__init__(cfg)

          # Define the train and evaluation policies
          # (the agents compute the workspace `action` variable)
          self.train_policy = ...
          self.eval_policy = ...

  rl_algo = MyAlgo(cfg)
  for rb in iter_replay_buffers(rl_algo):
      # rb is a workspace containing transitions
      ...
```

In [None]:
def iter_replay_buffers(algo: EpochBasedAlgo):
    """Loop over transition buffers"""
    train_workspace = Workspace()

    epochs_pb = tqdm(range(algo.cfg.algorithm.max_epochs))
    for epoch in epochs_pb:
        
        # This is the tricky part with transition buffers. The difficulty lies in the
        # copy of the last step and the way to deal with the n_steps return.
        #
        # The call to `train_agent(workspace, t=1, n_steps=cfg.algorithm.n_steps -
        # 1, stochastic=True)` makes the agent run a number of steps in the workspace.
        # In practice, it calls the
        # [`__call__(...)`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/agents/agent.py#L59)
        # function which makes a forward pass of the agent network using the workspace
        # data and updates the workspace accordingly.
        #
        # Now, if we start at the first epoch (`epoch=0`), we start from the first step
        # (`t=0`). But when subsequently we perform the next epochs (`epoch>0`), we must
        # not forget to cover the transition at the border between the previous epoch
        # and the current epoch. To avoid this risk, we copy the information from the
        # last time step of the previous epoch into the first time step of the next
        # epoch. This is explained in more detail in [a previous
        # notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).
        if epoch == 0:
            # First run: we start from scratch
            algo.train_agent(
                train_workspace, t=0, n_steps=algo.cfg.algorithm.n_steps, stochastic=True
            )
        else:
            # Other runs: we copy the last step and start from there
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            algo.train_agent(
                train_workspace, t=1, n_steps=algo.cfg.algorithm.n_steps-1, stochastic=True
            )
        
        algo.nb_steps += algo.cfg.algorithm.n_steps * algo.cfg.algorithm.n_envs

        # Add transitions to buffer
        transition_workspace = train_workspace.get_transitions()
        algo.replay_buffer.put(transition_workspace)
        if algo.replay_buffer.size() > algo.cfg.algorithm.learning_starts:
            yield algo.replay_buffer

        # Update the progress bar description
        epochs_pb.set_description(
            f"nb_steps: {algo.nb_steps}, "
            f"best reward: {algo.best_reward:.2f}"
        )
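
The role of the copied last step can be illustrated without BBRL. The sketch below is a simplified model, not the BBRL implementation: time steps are plain integers, a workspace is a list of consecutive steps, and `consecutive_pairs` is a hypothetical helper standing in for `get_transitions()`. It shows that, without the copy, the transition at the border between two epochs would never be collected.

```python
def consecutive_pairs(steps):
    """Return the (s_t, s_{t+1}) pairs found inside one workspace"""
    return list(zip(steps[:-1], steps[1:]))


# Epoch 0 starts from scratch and collects n_steps = 4 time steps
epoch_0 = [0, 1, 2, 3]

# Without the copy, the next epoch would simply continue at step 4
epoch_1_no_copy = [4, 5, 6, 7]

# With the copy, step 3 becomes the first step and n_steps - 1 new steps follow
epoch_1_copy = epoch_0[-1:] + [4, 5, 6]

pairs_no_copy = consecutive_pairs(epoch_0) + consecutive_pairs(epoch_1_no_copy)
pairs_with_copy = consecutive_pairs(epoch_0) + consecutive_pairs(epoch_1_copy)

# The border transition (3, 4) is only present when the last step is copied
missing_without_copy = (3, 4) not in pairs_no_copy
print(missing_without_copy)
print(pairs_with_copy)
```

This is also why the epochs after the first one run the agent for `n_steps - 1` steps starting at `t=1`: the workspace keeps the same length, and its first slot is already filled by the copied step.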

### Soft parameter updates

To update the target critic, one uses the following equation:
$\theta' \leftarrow \tau \theta + (1- \tau) \theta'$
where $\theta$ is the vector of parameters of the critic, and $\theta'$ is the vector of parameters of the target critic.
The `soft_update_params(...)` function is in charge of performing this soft update.

In [None]:
def soft_update_params(net, target_net, tau):
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
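
To get a feeling for the effect of $\tau$, the sketch below applies the soft update to two single-weight networks. The function is repeated so that the example runs on its own, and the values (`tau = 0.5`, a critic weight of 1 and a target weight of 0) are arbitrary choices for illustration. In practice $\tau$ is usually much smaller.

```python
import torch
import torch.nn as nn


def soft_update_params(net, target_net, tau):
    # Same update as above: theta' <- tau * theta + (1 - tau) * theta'
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)


# Two one-weight "networks" so that the update can be followed by hand
net = nn.Linear(1, 1, bias=False)
target_net = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    net.weight.fill_(1.0)
    target_net.weight.fill_(0.0)

tau = 0.5
history = []
for _ in range(3):
    soft_update_params(net, target_net, tau)
    history.append(target_net.weight.item())

print(history)  # the target weight moves a fraction tau closer to 1.0 each time
```

With a fixed critic, the target weight after $k$ updates is $1 - (1 - \tau)^k$ in this example, so the target network tracks the critic with an exponential delay controlled by $\tau$.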

## The SquashedGaussianActor

SAC works better with a Squashed Gaussian actor, which enables the
reparametrization trick. Note that our attempts to use a
`TunableVarianceContinuousActor` as we did for instance in the [notebook about
PPO](http://master-dac.isir.upmc.fr/rld/rl/07-1-ppo_penalty.student.ipynb)
completely failed. Such failure is also documented in the [OpenAI spinning up
documentation page about
SAC](https://spinningup.openai.com/en/latest/algorithms/sac.html).

The code of the `SquashedGaussianActor` actor is below.

The fact that we use the reparametrization trick is hidden inside the code of
this distribution. 

In [None]:
from torch.distributions import Normal, Independent, TransformedDistribution, TanhTransform

class SquashedGaussianActor(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim, min_std=1e-4):
        """Creates a new Squashed Gaussian actor

        :param state_dim: The dimension of the state space
        :param hidden_layers: Hidden layer sizes
        :param action_dim: The dimension of the action space
        :param min_std: The minimum standard deviation, defaults to 1e-4
        """
        super().__init__()
        self.min_std = min_std
        backbone_dim = [state_dim] + list(hidden_layers)
        self.layers = build_backbone(backbone_dim, activation=nn.ReLU())
        self.backbone = nn.Sequential(*self.layers)
        self.last_mean_layer = nn.Linear(hidden_layers[-1], action_dim)
        self.last_std_layer = nn.Linear(hidden_layers[-1], action_dim)
        # std must be positive
        self.softplus = nn.Softplus()
        # cache_size avoids numerical infinities or NaNs when
        # computing log probabilities
        self.tanh_transform = TanhTransform(cache_size=1)

    def normal_dist(self, obs: torch.Tensor):
        # Compute normal distribution given observation(s)
        backbone_output = self.backbone(obs)
        mean = self.last_mean_layer(backbone_output)
        std_out = self.last_std_layer(backbone_output)
        std = self.softplus(std_out) + self.min_std
        # Independent ensures that we have a multivariate 
        # Gaussian with a diagonal covariance matrix (given as
        # a vector `std`)
        return Independent(Normal(mean, std), 1)

    def forward(self, t, stochastic=True):
        normal_dist = self.normal_dist(self.get(("env/env_obs", t)))
        action_dist = TransformedDistribution(normal_dist, [self.tanh_transform])
        if stochastic:
            # Uses the re-parametrization trick
            action = action_dist.rsample()
        else:
            action = self.tanh_transform(normal_dist.mode)
            
        log_prob = action_dist.log_prob(action)
        # This line allows to deepcopy the actor...
        self.tanh_transform._cached_x_y = [None, None]
        self.set(("action", t), action)
        self.set(("action_logprobs", t), log_prob)
    
class DeterministicWrapper(Agent):
    """Use the above agent with a deterministic policy"""
    def __init__(self, actor: SquashedGaussianActor):
        super().__init__()
        self.actor = actor
        
    def forward(self, t, **args):
        self.actor(self.workspace, t=t, stochastic=False)
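
The log probability returned by the squashed distribution includes a correction for the `tanh` transform: if $\mathbf{u}$ is the Gaussian sample and $\mathbf{a} = \tanh(\mathbf{u})$, then $\log \pi(\mathbf{a}) = \log \mathcal{N}(\mathbf{u}) - \sum_i \log(1 - a_i^2)$. The sketch below checks this numerically on a standalone distribution. The mean, standard deviation and batch size are arbitrary illustration values, and `torch.atanh` is only used here to recover $\mathbf{u}$ for the manual computation.

```python
import torch
from torch.distributions import (
    Normal,
    Independent,
    TransformedDistribution,
    TanhTransform,
)

torch.manual_seed(0)

# A batch of 3 diagonal Gaussians over 2-dimensional actions (illustrative values)
mean = torch.zeros(3, 2)
std = torch.ones(3, 2)
base = Independent(Normal(mean, std), 1)
squashed = TransformedDistribution(base, [TanhTransform(cache_size=1)])

# Reparametrized sample: gradients could flow back to mean and std
action = squashed.rsample()
log_prob = squashed.log_prob(action)

# Manual change of variables: log N(u) - sum_i log(1 - tanh(u_i)^2)
u = torch.atanh(action)
manual_log_prob = base.log_prob(u) - torch.log(1 - action.pow(2)).sum(dim=-1)

print(action.abs().max())           # all actions lie strictly inside (-1, 1)
print(log_prob, manual_log_prob)    # both computations agree up to rounding
```

The manual version becomes numerically fragile when an action gets very close to -1 or 1, which is one reason to rely on the cached pre-squash sample as the actor above does.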

### CriticAgent

As critics and target critics, SAC uses several instances of the `ContinuousQAgent`
class, as do DDPG and TD3.
See the [DDPG notebook](http://master-dac.isir.upmc.fr/rld/rl/04-ddpg-td3.student.ipynb) for details.

In [None]:
class ContinuousQAgent(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.is_q_function = True
        self.model = build_mlp(
            [state_dim + action_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )

    def forward(self, t):
        obs = self.get(("env/env_obs", t))
        action = self.get(("action", t))
        obs_act = torch.cat((obs, action), dim=1)
        q_value = self.model(obs_act).squeeze(-1)
        self.set((f"{self.prefix}q_value", t), q_value)

### Building the complete training and evaluation agents

In the code below we create the Squashed Gaussian actor, two critics and the
corresponding target critics. Beforehand, we checked that the environment
takes continuous actions (otherwise we would need a different code).

In [None]:
# Create the SAC Agent
class SACAgents(EpochBasedAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)
        
        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()
        assert (
            self.train_env.is_continuous_action()
        ), "SAC code dedicated to continuous actions"

        # We need an actor
        self.actor= SquashedGaussianActor(
            obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size
        )

        # Builds the critics
        self.critic_1 = ContinuousQAgent(
            obs_size, cfg.algorithm.architecture.critic_hidden_size, act_size, 
        ).with_prefix("critic-1/")
        self.target_critic_1 = copy.deepcopy(self.critic_1).with_prefix("target-critic-1/")
        
        self.critic_2 = ContinuousQAgent(
            obs_size, cfg.algorithm.architecture.critic_hidden_size, act_size,
        ).with_prefix("critic-2/")
        self.target_critic_2 = copy.deepcopy(self.critic_2).with_prefix("target-critic-2/")
        
        # Train and evaluation policies
        self.train_policy = self.actor
        self.eval_policy = DeterministicWrapper(self.actor)

For the entropy coefficient optimizer, the code is as follows. Note the trick
which consists in using the log of this entropy coefficient. This trick was
taken from the Stable baselines3 implementation of SAC, which is explained in
[this
notebook](https://colab.research.google.com/drive/12LER1_ShWOa_UhOL1nlX-LX_t5KQK9LV?usp=sharing).

Tuning $\alpha$ in SAC is an option. To choose to tune it, the `entropy_mode`
argument in the parameters should be `auto`. The initial value is given
through the `init_entropy_coef` parameter. For any other value than `auto`, the
value of $\alpha$ will stay constant and correspond to the `init_entropy_coef`
parameter.

In [None]:
def setup_entropy_optimizers(cfg):
    if cfg.algorithm.entropy_mode == "auto":
        entropy_coef_optimizer_args = get_arguments(cfg.entropy_coef_optimizer)
        # Note: we optimize the log of the entropy coef which is slightly different from the paper
        # as discussed in https://github.com/rail-berkeley/softlearning/issues/37
        # Comment and code taken from the SB3 version of SAC
        log_entropy_coef = torch.log(
            torch.ones(1) * cfg.algorithm.init_entropy_coef
        ).requires_grad_(True)
        entropy_coef_optimizer = get_class(cfg.entropy_coef_optimizer)(
            [log_entropy_coef], **entropy_coef_optimizer_args
        )
        return entropy_coef_optimizer, log_entropy_coef
    else:
        return None, None
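
To see what optimizing the log of the coefficient looks like, here is a minimal sketch of one gradient step, assuming a loss written in the style of the Stable Baselines3 implementation. The update actually used in this notebook is not shown in this section, so the loss below, the SGD optimizer, the learning rate, the target entropy and the batch of log probabilities are all assumptions made for illustration.

```python
import torch

# Hypothetical values, for illustration only
init_entropy_coef = 0.2
target_entropy = -2.0  # e.g. minus the action dimension

# Same construction as above: the optimized variable is log(alpha)
log_entropy_coef = torch.log(
    torch.ones(1) * init_entropy_coef
).requires_grad_(True)
optimizer = torch.optim.SGD([log_entropy_coef], lr=0.1)

# Hypothetical log probabilities: the policy entropy (about 3) is above the target
action_logprobs = torch.full((4,), -3.0)

# SB3-style loss (assumption): the detached term acts as a constant weight
entropy_coef_loss = -(
    log_entropy_coef * (action_logprobs + target_entropy).detach()
).mean()

optimizer.zero_grad()
entropy_coef_loss.backward()
optimizer.step()

new_entropy_coef = log_entropy_coef.exp().item()
print(new_entropy_coef)  # smaller than the initial coefficient
```

Because the entropy is above the target in this example, the step lowers $\alpha$. Working on $\log \alpha$ also guarantees that $\alpha = \exp(\log \alpha)$ stays positive whatever the gradient steps are.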

### Compute the critic loss

With the notations of my slides, the equation corresponding to Eq. (5) and (6)
in [this paper](https://arxiv.org/pdf/1812.05905.pdf) becomes:

$$ loss_Q({\boldsymbol{\theta}}) = {\rm I\!E}_{(\mathbf{s}_t, \mathbf{a}_t,
\mathbf{s}_{t+1}) \sim \mathcal{D}}\left[\left( r(\mathbf{s}_t, \mathbf{a}_t)
+ \gamma {\rm I\!E}_{\mathbf{a} \sim
\pi_{\boldsymbol{\theta}}(.|\mathbf{s}_{t+1})}\left[\hat{Q}^{\pi_{\boldsymbol{\theta}}}_{\boldsymbol{\phi}}(\mathbf{s}_{t+1},
\mathbf{a}) - \alpha
\log{\pi_{\boldsymbol{\theta}}(\mathbf{a}|\mathbf{s}_{t+1})} \right] -
\hat{Q}^{\pi_{\boldsymbol{\theta}}}_{\boldsymbol{\phi}}(\mathbf{s}_t,
\mathbf{a}_t) \right)^2 \right] $$

An important piece of information in the above equation and in the one about
the actor loss below is the subscript of the expectations. These subscripts
tell us where the data should be taken from. In the above equation, one can
see that the outer expectation is over samples taken from the replay buffer,
whereas in the inner expectation we consider actions from the current actor at
the next state.

Thus, to compute the inner expectation, one needs to determine what actions
the current actor would take in the next state of each sample. This is what
the line 

`t_actor(rb_workspace, t=1, n_steps=1, stochastic=True)`

does. The parameter `t=1` (instead of 0) ensures that we consider the next
state.

Once we have determined these actions, we can determine their Q-values and
their log probabilities, to compute the inner expectation.

Note that at this stage, we only determine the log probabilities corresponding
to actions taken at the next time step, by contrast with what we do for the
actor in the `compute_actor_loss(...)` function later on.

Finally, once we have computed
$\hat{Q}^{\pi_{\boldsymbol{\theta}}}_{\boldsymbol{\phi}}(\mathbf{s}_{t+1}, \mathbf{a})$
for both critics, we take the min and store it into
`post_q_values`. By contrast, the Q-values corresponding to the last term of
the equation are taken from the replay buffer, they are computed in the
beginning of the function by applying the Q agents to the replay buffer
*before* changing the action to that of the current actor.

An important remark is that, if the entropy coefficient $\alpha$ corresponding
to the `ent_coef` variable is set to 0, then we retrieve exactly the critic
loss computation function of the TD3 algorithm. As we will see later, this is
also true of the actor loss computation.

This remark proved very useful in debugging the SAC code. We have set
`ent_coef` to 0 and ensured the behavior was strictly the same as the behavior
of TD3.
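
The arithmetic of the equation above can be sketched on plain tensors, independently of BBRL workspaces. This is only an illustration under stated assumptions: `gamma` is a hypothetical name for the discount factor, all numbers are made up, the squared difference is averaged over the batch with a mean squared error, and the `must_bootstrap` mask (which does not appear in the equation) is assumed to cancel the bootstrap term for terminal transitions.

```python
import torch
import torch.nn.functional as F

gamma = 0.99      # hypothetical discount factor
ent_coef = 0.2    # entropy coefficient alpha

# A batch of 2 made-up transitions
reward = torch.tensor([1.0, 0.0])
must_bootstrap = torch.tensor([True, False])  # second transition is terminal
q_rb_1 = torch.tensor([2.0, 1.0])     # critic 1 on (s_t, a_t) from the buffer
q_rb_2 = torch.tensor([2.5, 0.5])     # critic 2 on (s_t, a_t) from the buffer
next_q_1 = torch.tensor([3.0, 4.0])   # target critic 1 on (s_{t+1}, a)
next_q_2 = torch.tensor([2.0, 5.0])   # target critic 2 on (s_{t+1}, a)
next_logprob = torch.tensor([-1.0, -1.0])  # log pi(a | s_{t+1})

with torch.no_grad():
    # Inner term: min over target critics minus alpha * log pi
    soft_v_next = torch.min(next_q_1, next_q_2) - ent_coef * next_logprob
    # Target: r + gamma * (inner term), without bootstrap on terminal states
    target = reward + gamma * must_bootstrap.float() * soft_v_next

critic_loss_1 = F.mse_loss(q_rb_1, target)
critic_loss_2 = F.mse_loss(q_rb_2, target)
print(target, critic_loss_1, critic_loss_2)
```

Setting `ent_coef` to 0 in this sketch removes the log probability term and leaves the clipped double-Q target of TD3, in line with the remark above.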

In [None]:
def compute_critic_loss(
    cfg, reward, must_bootstrap,
    t_actor, 
    q_agents, 
    target_q_agents, 
    rb_workspace,
    ent_coef
):
    """Computes the critic loss for a set of $S$ transition samples

    Args:
        cfg: The experimental configuration
        reward: Tensor (2xS) of rewards
        must_bootstrap: Tensor (S) of indicators
        t_actor: The actor agent (as a TemporalAgent)
        q_agents: The critics (as a TemporalAgent)
        target_q_agents: The target of the critics (as a TemporalAgent)
        rb_workspace: The transition workspace
        ent_coef: The entropy coefficient

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: The two critic losses (scalars)
    """

    # Compute q_values from both critics with the actions present in the buffer:
    # at t, we have Q(s,a) from the (s,a) in the RB
    q_agents(rb_workspace, t=0, n_steps=1)

    with torch.no_grad():
        # Replay the current actor on the replay buffer to get actions of the
        # current actor
        t_actor(rb_workspace, t=1, n_steps=1, stochastic=True)
        action_logprobs_next = rb_workspace["action_logprobs"]

        # Compute target q_values from both target critics: at t+1, we have
        # Q(s+1,a+1) from the (s+1,a+1) where a+1 has been replaced in the RB
        target_q_agents(rb_workspace, t=1, n_steps=1)

    q_values_rb_1, q_values_rb_2, post_q_values_1, post_q_values_2 = rb_workspace[
        "critic-1/q_value", "critic-2/q_value", "target-critic-1/q_value", "target-critic-2/q_value"
    ]

    # Compute temporal difference

    assert False, 'Not implemented yet'


    return critic_loss_1, critic_loss_2

### Compute the actor Loss

With the notations of my slides, the equation of the actor loss corresponding
to Eq. (7) in [this paper](https://arxiv.org/pdf/1812.05905.pdf) becomes:

$$ loss_\pi({\boldsymbol{\theta}}) = {\rm I\!E}_{\mathbf{s}_t \sim
\mathcal{D}}\left[ {\rm I\!E}_{\mathbf{a}_t\sim
\pi_{\boldsymbol{\theta}}(.|\mathbf{s}_t)} \left[ \alpha
\log{\pi_{\boldsymbol{\theta}}(\mathbf{a}_t|\mathbf{s}_t)} -
\hat{Q}^{\pi_{\boldsymbol{\theta}}}_{\boldsymbol{\phi}}(\mathbf{s}_t,
\mathbf{a}_t) \right] \right] $$

Note that [the paper](https://arxiv.org/pdf/1812.05905.pdf) mistakenly writes
$Q_\theta(s_t,s_t)$.

As for the critic loss, we have two expectations, one over the states from the
replay buffer, and one over the actions of the current actor. Thus we need to
apply again the current actor to the content of the replay buffer.

But this time, we consider the current state, thus we parametrize it with
`t=0` and `n_steps=1`. This way, we get the log probabilities and Q-values at
the current step.

A nice thing is that this way, there is no overlap between the log probability
data used to update the critic and the actor, which avoids having to 'retain'
the computation graph so that it can be reused for the actor and the critic.

Getting this small trick right is one of the most difficult parts of coding
SAC.

Again, once we have computed the Q values over both critics, we take the min
and put it into `current_q_values`.

As with the critic loss, if we set `ent_coef` to 0, we recover the actor loss
function of DDPG and TD3, which simply tries to get actions that maximize the
Q-values (by minimizing -Q).
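
As an illustration of the formula and of the `ent_coef = 0` special case, here is a minimal sketch in plain PyTorch on toy tensors. The function name `sac_actor_loss` and the numerical values are hypothetical. In the notebook, the log probabilities and Q-values would come from running the actor and the critics on the replay buffer workspace, which this sketch does not do.

```python
import torch


def sac_actor_loss(ent_coef, action_logprobs, q_values_1, q_values_2):
    """Hypothetical helper: mean over the batch of alpha * log pi(a|s) - min(Q1, Q2)."""
    current_q_values = torch.min(q_values_1, q_values_2)
    return (ent_coef * action_logprobs - current_q_values).mean()


# Toy batch of two states, with actions sampled from the current actor
action_logprobs = torch.tensor([-1.0, -2.0])
q_values_1 = torch.tensor([3.0, 5.0])
q_values_2 = torch.tensor([4.0, 1.0])

loss_with_entropy = sac_actor_loss(0.5, action_logprobs, q_values_1, q_values_2)

# With a zero entropy coefficient, only the -Q term of DDPG and TD3 remains
loss_without_entropy = sac_actor_loss(0.0, action_logprobs, q_values_1, q_values_2)
minus_q = -torch.min(q_values_1, q_values_2).mean()
```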

In [None]:
def compute_actor_loss(ent_coef, t_actor, q_agents, rb_workspace):
    r"""
    Actor loss computation
    :param ent_coef: The entropy coefficient $\alpha$
    :param t_actor: The actor agent (temporal agent)
    :param q_agents: The critics (as temporal agent)
    :param rb_workspace: The replay buffer (2 time steps, $t$ and $t+1$)
    :return: The actor loss, averaged over the batch
    """
    
    # Recompute the Q-values from the current actor, not from the actions in the buffer

    # Recompute the action with the current actor (at time step $t$)

    assert False, 'Not implemented yet'


    # Compute Q-values

    assert False, 'Not implemented yet'

    current_q_values = torch.min(q_values_1, q_values_2)

    # Compute the actor loss

    actor_loss =
    assert False, 'Not implemented yet'


    return actor_loss.mean()

## Main training loop

In [None]:
import numpy as np

def run_sac(agents: SACAgents):
    cfg = agents.cfg
    logger = agents.logger

    # init_entropy_coef is the initial value of the entropy coef alpha. 
    ent_coef = cfg.algorithm.init_entropy_coef
    tau = cfg.algorithm.tau_target

    # Creates the temporal actors
    t_actor = TemporalAgent(agents.train_policy)
    q_agents = TemporalAgent(Agents(agents.critic_1, agents.critic_2))
    target_q_agents = TemporalAgent(Agents(agents.target_critic_1, agents.target_critic_2))

    # Configure the optimizer
    actor_optimizer = setup_optimizer(cfg.actor_optimizer, agents.actor)
    critic_optimizer = setup_optimizer(cfg.critic_optimizer, agents.critic_1, agents.critic_2)
    entropy_coef_optimizer, log_entropy_coef = setup_entropy_optimizers(cfg)

    # If entropy_mode is not auto, the entropy coefficient ent_coef will remain fixed
    if cfg.algorithm.entropy_mode == "auto":
        # target_entropy is \mathcal{H}_0 in the SAC and Applications paper.
        target_entropy = -np.prod(agents.train_env.action_space.shape).astype(np.float32)

    for rb in iter_replay_buffers(agents):
        rb_workspace = rb.get_shuffled(cfg.algorithm.batch_size)

        terminated, reward = rb_workspace[
            "env/terminated", "env/reward"
        ]
        if entropy_coef_optimizer is not None:
            ent_coef = torch.exp(log_entropy_coef.detach())

        # Critic update part ###############################
        critic_optimizer.zero_grad()

        (
            critic_loss_1, critic_loss_2
        ) = compute_critic_loss(
            cfg, 
            reward, 
            ~terminated[1],
            t_actor,
            q_agents,
            target_q_agents,
            rb_workspace,
            ent_coef
        )

        logger.add_log("critic_loss_1", critic_loss_1, agents.nb_steps)
        logger.add_log("critic_loss_2", critic_loss_2, agents.nb_steps)
        critic_loss = critic_loss_1 + critic_loss_2
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            agents.critic_1.parameters(), cfg.algorithm.max_grad_norm
        )
        torch.nn.utils.clip_grad_norm_(
            agents.critic_2.parameters(), cfg.algorithm.max_grad_norm
        )
        critic_optimizer.step()


        # Actor update part ###############################
        actor_optimizer.zero_grad()
        actor_loss = compute_actor_loss(
            ent_coef, t_actor, q_agents, rb_workspace
        )
        logger.add_log("actor_loss", actor_loss, agents.nb_steps)
        actor_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            agents.actor.parameters(), cfg.algorithm.max_grad_norm
        )
        actor_optimizer.step()

        # Entropy coef update part #####################################################
        if entropy_coef_optimizer is not None:
            # See Eq. (17) of the SAC and Applications paper
            # log. probs have been computed when computing
            # the actor loss
            action_logprobs_rb = rb_workspace[
                "action_logprobs"
            ].detach()
            entropy_coef_loss = -(
                log_entropy_coef.exp() * (action_logprobs_rb + target_entropy)
            ).mean()
            entropy_coef_optimizer.zero_grad()
            entropy_coef_loss.backward()
            entropy_coef_optimizer.step()
            logger.add_log("entropy_coef_loss", entropy_coef_loss, agents.nb_steps)
            logger.add_log("entropy_coef", ent_coef, agents.nb_steps)

        ####################################################

        # Soft update of target q function
        soft_update_params(agents.critic_1, agents.target_critic_1, tau)
        soft_update_params(agents.critic_2, agents.target_critic_2, tau)
        # soft_update_params(actor, target_actor, tau)
        
        agents.evaluate()
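
The entropy coefficient update in the loop above can be isolated in a few lines of plain PyTorch. The sketch below assumes that `log_entropy_coef` is a learnable scalar optimized with Adam. How `setup_entropy_optimizers` actually builds it is not shown in this section, so that part, like the toy action shape and log probabilities, is an assumption.

```python
import numpy as np
import torch

# Hypothetical 2-dimensional action space: the target entropy is -dim(A)
action_shape = (2,)
target_entropy = -float(np.prod(action_shape))

# Learn log(alpha) rather than alpha, so that alpha = exp(log_alpha) stays positive
log_entropy_coef = torch.nn.Parameter(torch.log(torch.tensor(0.1)))
optimizer = torch.optim.Adam([log_entropy_coef], lr=3e-4)

# Toy log probabilities: the entropy estimate -mean(log pi) is 1.0,
# which is above the target entropy of -2.0
action_logprobs = torch.tensor([-0.5, -1.0, -1.5])

initial_log_coef = log_entropy_coef.item()
entropy_coef_loss = -(
    log_entropy_coef.exp() * (action_logprobs + target_entropy)
).mean()
optimizer.zero_grad()
entropy_coef_loss.backward()
optimizer.step()

# The policy is more random than required, so alpha decreases
updated_log_coef = log_entropy_coef.item()
```

When the entropy estimate falls below the target instead, the sign of the gradient flips and the coefficient increases, which pushes the actor towards more exploration.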

## Definition of the parameters

In [None]:
params = {
    "save_best": True,
    "base_dir": "${gym_env.env_name}/sac-S${algorithm.seed}_${current_time:}",
    "logger": {
        "classname": "bbrl.utils.logger.TFLogger",
        "cache_size": 10000,
        "every_n_seconds": 10,
        "verbose": False,
    },
    "algorithm": {
        "seed": 1,
        "n_envs": 8,
        "n_steps": 32,
        "buffer_size": 1e6,
        "batch_size": 256,
        "max_grad_norm": 0.5,
        "nb_evals": 16,
        "eval_interval": 2_000,
        "learning_starts": 10_000,
        "max_epochs": 2_000,
        "discount_factor": 0.98,
        "entropy_mode": "auto",  # "auto" or "fixed"
        "init_entropy_coef": 2e-7,
        "tau_target": 0.05,
        "architecture": {
            "actor_hidden_size": [64, 64],
            "critic_hidden_size": [256, 256],
        },
    },
    "gym_env": {
        "env_name": "CartPoleContinuous-v1",
    },
    "actor_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
    "critic_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
    "entropy_coef_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
}

## Launching tensorboard to visualize the results

In [None]:
setup_tensorboard("./outputs/tblogs")

In [None]:
agents = SACAgents(OmegaConf.create(params))
run_sac(agents)

In [None]:
# Visualize the best policy
agents.visualize_best()

## Exercises

- Use the same code on the Pendulum-v1 environment, which is harder to
  tune. Get the hyperparameters from the
  [rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo) and see if
  you manage to get SAC working on Pendulum.