# Outlook

In this notebook, we will implement the REINFORCE algorithm using BBRL.

To understand this code, you first need to know more about
[the BBRL interaction model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md).
Then you should run [a didactic example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/02-multi_env_noautoreset.student.ipynb)
to see how agents interact in BBRL when `autoreset=False`.

The REINFORCE algorithm is explained in a series of 3 videos: [video
1](https://www.youtube.com/watch?v=R7ULMBXOQtE), [video
2](https://www.youtube.com/watch?v=dKUWto9B9WY) and [video
3](https://www.youtube.com/watch?v=GcJ9hl3T6x8). You can also read the
corresponding slides:
[slides1](http://pages.isir.upmc.fr/~sigaud/teach/ps/3_pg_derivation1.pdf),
[slides2](http://pages.isir.upmc.fr/~sigaud/teach/ps/4_pg_derivation2.pdf),
[slides3](http://pages.isir.upmc.fr/~sigaud/teach/ps/5_pg_derivation3.pdf).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf to manage the hyperparameters: we just define the training
function `run(...)` and a long `params = {...}` variable further down in this
colab, and the code is run with these parameters without calling an explicit
main.

More precisely, each experiment is run by calling

`reinforce_sum = Reinforce(OmegaConf.create({**params, "variant": "sum"}))`

`run(reinforce_sum, apply_sum)`

after the definition of `params` and after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [None]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

In [None]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime

OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [None]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [None]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input(
                f"Do you want to launch tensorboard in this notebook [y/n] "
            ).lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp

        print(
            f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}"
        )

### The Logger class

The logger is in charge of collecting statistics during the training
process.

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the parameters part `params = { "logger":{ ...` below).

Note that the BBRL Logger is also saving the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.

In [None]:
from bbrl import instantiate_class

class Logger():

    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string, loss, steps):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having a critic loss, an actor loss and an entropy loss
    def log_losses(self, critic_loss, entropy_loss, actor_loss, steps):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)

In [None]:
from bbrl import get_arguments, get_class
from itertools import chain

def setup_optimizer(cfg_optimizer, *agents):
    """Setup an optimizer for a list of agents"""
    optimizer_args = get_arguments(cfg_optimizer)
    parameters = [agent.parameters() for agent in agents]
    optimizer = get_class(cfg_optimizer)(chain(*parameters), **optimizer_args)
    return optimizer

def copy_parameters(model_a, model_b):
    """Copy parameters from model_a to model_b"""
    for model_a_p, model_b_p in zip(model_a.parameters(), model_b.parameters()):
        model_b_p.data.copy_(model_a_p)

## Definition of agents

The [REINFORCE](https://link.springer.com/content/pdf/10.1007/BF00992696.pdf)
algorithm uses a stochastic policy and a baseline which is the value function.
Thus we need an Actor agent, a Critic agent and an Environment agent. The actor
agent is built on an intermediate ProbAgent which writes the probability of each action.
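
To make the role of the probability agent concrete, here is a minimal sketch on hand-written scores (the tensor `scores` is a made-up stand-in for the output of a network): a softmax turns scores into action probabilities, and `torch.distributions.Categorical` gives the entropy of the resulting distribution.

```python
import math
import torch

# Made-up scores for a batch of 2 states and 2 actions
scores = torch.tensor([[0.0, 0.0], [2.0, 0.0]])

# Softmax over the last dimension gives one distribution per state
action_probs = torch.softmax(scores, dim=-1)

# Entropy of each distribution: maximal (log 2) when both actions are equally likely
entropy = torch.distributions.Categorical(probs=action_probs).entropy()

print(action_probs)
print(entropy)
```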

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list.
We also specify the activation function of the neurons at each layer and, optionally, a different activation function for the final layer.

In [None]:
import torch.nn as nn
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    r"""Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)
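
As an illustration, assuming 4-dimensional observations, one hidden layer of 32 units and 2 outputs (arbitrary example sizes), calling `build_mlp([4, 32, 2], activation=nn.ReLU())` yields a network equivalent to the one written by hand below:

```python
import torch
import torch.nn as nn

# Hand-written equivalent of build_mlp([4, 32, 2], activation=nn.ReLU()):
# each linear layer is followed by an activation, the last one being the identity
mlp = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 2), nn.Identity(),
)

# A batch of 5 observations is mapped to 5 vectors of 2 scores
batch = torch.zeros(5, 4)
out = mlp(batch)
print(out.shape)  # torch.Size([5, 2])
```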

In [None]:
class ProbAgent(Agent):
    # Computes the distribution $p(a_t|s_t)$
    
    def __init__(self, state_dim, hidden_layers, n_action, name="prob_agent"):
        super().__init__(name)
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [n_action], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        # Get $s_t$
        observation = self.get(("env/env_obs", t))
        # Compute the distribution over actions
        scores = self.model(observation)
        action_probs = torch.softmax(scores, dim=-1)
        assert not torch.any(torch.isnan(action_probs)), "NaN Here"
        
        self.set(("action_probs", t), action_probs)
        entropy = torch.distributions.Categorical(action_probs).entropy()
        self.set(("entropy", t), entropy)

In [None]:
class ActorAgent(Agent):
    # Chooses an action, either by sampling from $p(a_t|s_t)$ when stochastic is True,
    # or by taking the argmax when it is False.
    
    def __init__(self, stochastic: bool=False):
        super().__init__()
        self.stochastic = stochastic

    def forward(self, t: int, *, stochastic: bool=None, **kwargs):
        probs = self.get(("action_probs", t))
        stochastic = stochastic if stochastic is not None else self.stochastic
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)

        self.set(("action", t), action)
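
The two action-selection modes can be compared on a hand-written distribution (the probabilities below are made up for illustration): sampling may return any action with non-zero probability, whereas the argmax always returns the most probable one.

```python
import torch

torch.manual_seed(0)

# Made-up action probabilities for a batch of 3 states and 2 actions
probs = torch.tensor([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])

# Stochastic choice: one action index sampled per state
sampled = torch.distributions.Categorical(probs=probs).sample()

# Deterministic choice: the most probable action of each state
greedy = probs.argmax(dim=1)

print(sampled.shape, greedy)
```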

### VAgent

The VAgent is a neural network which takes an observation as input and whose
output is the value $V(s)$ of this observation.

In [None]:
class VAgent(Agent):
    def __init__(self, state_dim, hidden_layers):
        super().__init__()
        self.is_q_function = False
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        # The `squeeze(-1)` removes the last dimension of the tensor.
        # (since this is a scalar, we want to ignore this dimension since
        # the target values will also be scalars)
        critic = self.model(observation).squeeze(-1)
        self.set(("v_value", t), critic)

## Learning environment

To setup a common learning environment for RL algorithms, we use the `RLBase`
class. This class:

1. Initializes the environment (random seed, logger, evaluation environment)
2. Defines an `evaluate` method that keeps the best agent so far
3. Defines a `visualize_best` method that displays the behavior of the best agent

Subclasses need to define `self.train_policy` and `self.eval_policy`, two
BBRL agents that respectively choose actions when training and evaluating.

The behavior of `RLBase` is controlled by the following configuration
variables:

- `base_dir` defines the directory subpath used when outputting losses during
  training as well as other outputs (serialized agent, global statistics, etc.)
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)
- `gym_env` defines the gymnasium environment, and in particular
  `gym_env.env_name`, the name of the gymnasium environment
- `logger` defines what type of logger is used to log the different values
  associated with learning
- `algorithm.eval_interval` defines the number of observed transitions between
  each evaluation of the agent

In [None]:
import numpy as np
from typing import Any
import logging
from abc import ABC
from functools import cached_property

class RLBase(ABC):
    """Base class for Reinforcement learning algorithms
    
    This class deals with common processing:

    - defines the logger, the train and evaluation agents
    - defines how to evaluate a policy
    """

    #: The configuration
    cfg: Any

    #: The evaluation environment deals with the last action, and produces a new
    # state of the environment
    eval_env: Agent

    #: The training policy
    train_policy: Agent

    #: The evaluation policy (if not defined, uses the training policy)
    eval_policy: Agent

    def __init__(self, cfg):
        # Basic initialization
        self.cfg = cfg
        torch.manual_seed(cfg.algorithm.seed)

        # Sets the base directory and logger directory
        base_dir = Path(self.cfg.base_dir)
        self.base_dir = Path("outputs") / Path(self.cfg.base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

        # Initialize the logger class
        if not hasattr(cfg.logger, "log_dir"):
            cfg.logger.log_dir = str(Path("outputs") / "tblogs" / base_dir)
        self.logger = Logger(cfg)

        # Subclasses have to define the training and eval policies
        self.train_policy = None
        self.eval_policy = None

        # Sets up the evaluation environment
        self.eval_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name), 
            cfg.algorithm.nb_evals
        ).seed(cfg.algorithm.seed)

        # Initialize values
        self.last_eval_step = 0
        self.nb_steps = 0
        self.best_policy = None
        self.best_reward = -torch.inf

        # Records the rewards
        self.eval_rewards = []

    @cached_property
    def train_agent(self):
        """Returns the training agent

        The agent is composed of a policy agent and the training environment.      
        This method supposes that `self.train_policy` has been setup
        """
        assert self.train_policy is not None, "The train_policy property is not defined before the policy is set"
        return TemporalAgent(Agents(self.train_env, self.train_policy))

    @cached_property
    def eval_agent(self):
        """Returns the evaluation agent 
        
        The agent is composed of a policy agent and the evaluation environment
        
        Uses `self.eval_policy` (or `self.train_policy` if not defined)

        """
        assert self.eval_policy is not None or self.train_policy is not None, "eval_agent property is not defined before the policy is set"
        return TemporalAgent(Agents(self.eval_env, self.eval_policy if self.eval_policy is not None else self.train_policy))

    def evaluate(self):
        """Evaluate the current policy `self.eval_policy`
        
        Evaluation is conducted every `cfg.algorithm.eval_interval` steps, and
        we keep a copy of the best agent so far in `self.best_policy`
        
        Returns True if the current policy is the best so far
        """
        if (self.nb_steps - self.last_eval_step) > self.cfg.algorithm.eval_interval:
            self.last_eval_step = self.nb_steps
            eval_workspace = Workspace() 
            self.eval_agent(
                eval_workspace,
                t=0,
                stop_variable="env/done"
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            self.logger.log_reward_losses(rewards, self.nb_steps)

            if getattr(self.cfg, "collect_stats", False):
                self.eval_rewards.append(rewards)

            rewards_mean = rewards.mean()
            if rewards_mean > self.best_reward:
                self.best_policy = copy.deepcopy(self.eval_policy)
                self.best_reward = rewards_mean
                return True

    def save_stats(self):
        """Save reward statistics into `stats.npy`"""
        if getattr(self.cfg, "collect_stats", False) and self.eval_rewards:
            data = torch.stack(self.eval_rewards, axis=-1) 
            with (self.base_dir / "stats.npy").open("wt") as fp:
                np.savetxt(fp, data.numpy())

    def visualize_best(self):
        """Visualize the best agent"""
        env = make_env(self.cfg.gym_env.env_name, render_mode="rgb_array")
        path = self.base_dir / "best_agent.mp4"
        print(f"Video of best agent recorded in {path}")
        record_video(env, self.best_policy, path)
        return video_display(str(path.absolute()))

The `EpisodicAlgo` class defines the environment when using episodes. In particular,
it defines `self.train_env` which is the environment used for training. As
`algorithm.n_envs` environments are used in parallel, when an episode ends, we don't stop
the other episodes. To cater for this:

1. the workspace variable `env/done` is set to `True` for all the following time
steps of the finished episode
2. the variable `env/reward` is set to 0 for all these steps
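
The effect of this convention can be sketched on a hand-written `env/done` tensor (made-up values, with time as the first dimension and environments as the second): counting the entries where `done` is `False` gives the number of transitions actually collected, which is how `nb_steps` is updated in `iter_episodes` below.

```python
import torch

# Made-up `env/done` values for T=4 time steps and B=2 environments:
# the first episode ends at t=1 (and stays done), the second one at t=3
done = torch.tensor([
    [False, False],
    [True, False],
    [True, False],
    [True, True],
])

# Number of steps where the episode was still running
nb_steps = int((~done).sum())
print(nb_steps)  # 4
```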

The behavior of `EpisodicAlgo` is controlled by the following configuration
variables:

- `gym_env.env_name` defines the gymnasium environment
- `algorithm.n_envs` defines the number of parallel environments
- `algorithm.seed` defines the random seed used (to initialize the agent and
  the environment)

In [None]:
class EpisodicAlgo(RLBase):
    """Base class for RL experiments with full episodes"""
    def __init__(self, cfg, autoreset=False):
        super().__init__(cfg)

        self.train_env = ParallelGymAgent(
            partial(make_env, cfg.gym_env.env_name, autoreset=autoreset), 
            cfg.algorithm.n_envs,
        ).seed(cfg.algorithm.seed)

`iter_episodes` and `iter_partial_episodes` (the latter for use with autoreset)
are generators: at each epoch, they collect new samples with the training agent
and yield the resulting train workspace.

In [None]:
def iter_episodes(algo: EpisodicAlgo):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))

    train_workspace = Workspace()

    for algo.epoch in pbar:
        # Collect samples
        train_workspace = Workspace()
        algo.train_agent(train_workspace, t=0, stop_variable="env/done")

        # Update the number of steps
        algo.nb_steps += int((~train_workspace["env/done"]).sum())

        # Perform a learning step
        yield train_workspace

        # Update the progress bar
        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")


def iter_partial_episodes(algo: EpisodicAlgo, episode_steps: int):
    pbar = tqdm(range(algo.cfg.algorithm.max_epochs))
    train_workspace = Workspace()

    for algo.epoch in pbar:
        if algo.epoch > 0:
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            algo.train_agent(
                train_workspace, t=1, n_steps=episode_steps-1, stochastic=True
            )
        else:
            algo.train_agent(
                train_workspace, t=0, n_steps=episode_steps, stochastic=True
            )

        algo.nb_steps += int((~train_workspace["env/done"]).sum())
        yield train_workspace

        pbar.set_description(f"nb_steps: {algo.nb_steps}, best reward: {algo.best_reward:.2f}")

### RL environment

In the next cell, we define the `Reinforce` learning environment. It is based on `EpisodicAlgo`
since learning uses full episodes.

In [None]:
class Reinforce(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)

        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()

        # Training policy: action probabilities followed by a stochastic action choice
        self.proba_agent = ProbAgent(obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size)
        self.train_policy = Agents(self.proba_agent, ActorAgent(stochastic=True))

        # The critic (if used)
        self.critic_agent = TemporalAgent(
            VAgent(obs_size, cfg.algorithm.architecture.critic_hidden_size)
        )

        # Evaluation
        self.eval_policy = Agents(self.proba_agent, ActorAgent(stochastic=False))

        # Setup the optimizer
        self.optimizer = setup_optimizer(cfg.optimizer, self.proba_agent, self.critic_agent)

The next cell describes the two main functions passed as arguments
to the training function `run`:
- `compute_policy_reward` computes the reward at each time step
- `compute_critic_loss` computes the loss of the critic (if we use one)

In [None]:
def compute_policy_reward(cfg, reward, v_value) -> torch.Tensor:
    """Computes the reward for each episode step

    :param reward: The rewards from the environment (tensor TxB)
    :param v_value: The values $V(s)$ computed by the critic (tensor TxB) (or None)
    :returns: The reward (tensor TxB)
    """        
    ...
    
def compute_critic_loss(cfg, reward, must_bootstrap, v_value) -> torch.Tensor:
    """Compute the critic loss

    :param reward: The reward from the environment (TxB)
    :param must_bootstrap: Whether the critic should be bootstrapped
    :param v_value: The v value computed by the critic
    :return: The scalar loss
    """
    # By default, we don't have any critic
    ...

In [None]:
def run(reinforce: Reinforce, compute_policy_reward, compute_critic_loss=None):
    for train_workspace in iter_episodes(reinforce):
        # Get relevant tensors (sizes are timestep x n_envs x ...)
        terminated, action_probs, reward, action = train_workspace[
            "env/terminated",
            "action_probs",
            "env/reward",
            "action",
        ]
        must_bootstrap = ~terminated

        # Use the critic to recompute V values
        reinforce.critic_agent(train_workspace, stop_variable="env/done")
        v_value = train_workspace["v_value"]

        # Computes the critic loss
        if compute_critic_loss is not None:
            critic_loss = compute_critic_loss(reinforce.cfg, reward, must_bootstrap, v_value) 
        else:
            critic_loss = 0

        # Computes the reward
        reward = compute_policy_reward(reinforce.cfg, reward, v_value)

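        # Take the log probability of the actions performed
        # (action_probs is T x B x n_actions, action is T x B); squeezing only
        # the last dimension stays correct when T or B is equal to 1
        action = action.unsqueeze(-1)
        action_logp = torch.gather(action_probs, dim=2, index=action).squeeze(-1).log()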
        
        # Compute the policy gradient loss based on the log probability of the actions performed
        actor_loss = action_logp * reward.detach() * must_bootstrap.int()
        actor_loss = actor_loss.sum() / must_bootstrap.sum()

        # Log losses
        if compute_critic_loss is not None:
            reinforce.logger.add_log("critic_loss", critic_loss, reinforce.nb_steps)
    
        reinforce.logger.add_log("actor_loss", actor_loss, reinforce.nb_steps)

        # Performs one gradient step
        reinforce.optimizer.zero_grad()
        loss = (
            reinforce.cfg.algorithm.critic_coef * critic_loss
            - reinforce.cfg.algorithm.actor_coef * actor_loss
        )
        loss.backward()
        reinforce.optimizer.step()

        reinforce.evaluate()
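
The `torch.gather` step of `run` can be checked on a small hand-written example (made-up probabilities, with T=2 time steps, B=1 environment and 2 actions): it picks, for each time step and environment, the probability of the action that was actually performed.

```python
import torch

# Made-up action probabilities, shape T x B x n_actions = 2 x 1 x 2
action_probs = torch.tensor([[[0.7, 0.3]], [[0.4, 0.6]]])
# Actions actually performed, shape T x B
action = torch.tensor([[0], [1]])

# gather needs an index with the same number of dimensions as action_probs
index = action.unsqueeze(-1)  # shape T x B x 1
chosen = torch.gather(action_probs, dim=2, index=index).squeeze(-1)  # T x B
action_logp = chosen.log()

print(chosen)  # probabilities 0.7 and 0.6
```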

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation.

In [None]:
env_name = "CartPole-v1"
params={
   "base_dir": "${gym_env.env_name}/reinforce-${variant}-S${algorithm.seed}_${current_time:}",
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    # Number of transitions between two evaluations
    "eval_interval": 1000,
    "seed": 1,
    "n_envs": 8,
    "nb_evals": 10,
    "max_epochs": 700,
    "discount_factor": 0.99,
    "critic_coef": 1.0,
    "actor_coef": 1.0,
    "architecture":{
        "actor_hidden_size": [32],
        "critic_hidden_size": [36],
    },
  },

  "gym_env":{
    "env_name": env_name,
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 0.001,
  }
}

In [None]:
setup_tensorboard("./outputs/tblogs")

### First algorithm: summing all the rewards along an episode

The most basic variant of the Policy Gradient algorithms just sums all the
rewards along an episode.

This is implemented with the `apply_sum` function below.

In [None]:
def apply_sum(cfg, reward, *args):
    reward_sum = reward.sum(axis=0)
    for i in range(len(reward)):
        reward[i] = reward_sum
    return reward

In [None]:
# Run and visualize
reinforce_sum = Reinforce(OmegaConf.create({**params, "variant": "sum"}))
run(reinforce_sum, apply_sum)
reinforce_sum.visualize_best()

## Exercises

### First algorithm: summing discounted rewards

As explained in the [second
video](https://www.youtube.com/watch?v=dKUWto9B9WY) and [the corresponding
slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/4_pg_derivation2.pdf),
using a discounted reward after the current step and ignoring the rewards
before the current step results in lower variance.

By taking inspiration from the `apply_sum()` function above, code a function
`apply_discounted_sum()` that computes the sum of discounted rewards from
immediate rewards.

Two hints:
- you should proceed backwards, starting from the final step of the episode
  and storing the previous sum into a register
- you need the discount factor as an input to your function.
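
As a small worked example of the backward recursion suggested by the first hint (the rewards and the discount factor are arbitrary illustration values), write $G_t$ for the discounted sum of rewards from step $t$ on. With three rewards $r_0 = r_1 = r_2 = 1$ and $\gamma = 0.5$:

$$G_t = r_t + \gamma G_{t+1}, \qquad G_2 = 1, \quad G_1 = 1 + 0.5 \times 1 = 1.5, \quad G_0 = 1 + 0.5 \times 1.5 = 1.75$$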

In [None]:
def apply_discounted_sum(cfg, reward, v_value):
    # Implement the function (ignore v_value)

    assert False, 'Not implemented yet'


In [None]:
reinforce_dsum = Reinforce(OmegaConf.create({**params, "variant": "dsum"}))
run(reinforce_dsum, apply_discounted_sum)

In [None]:
# Visualization

reinforce_dsum.visualize_best()

### Second algorithm: Baseline with Temporal Differences

Here, we aim at computing a baseline using temporal differences. You will
implement the computation of the critic loss below.

When computing the temporal difference target, use `critic[1:].detach()`.
The idea is that we compute this target as a function of $V(s_{t+1})$,
but we do not want to apply gradient descent on this $V(s_{t+1})$, we will
only apply gradient descent to the $V(s_t)$ according to this target value.

In practice, `x.detach()` returns a tensor that is detached from the computation
graph, so no gradient is propagated through it.

Note also the trick to deal with terminal states. If the state is terminal,
$V(s_{t+1})$ does not make sense. Thus we need to ignore this term. So we
multiply the term by `must_bootstrap`: if `must_bootstrap` is True (converted
into an int, it becomes a 1), we get the term. If `must_bootstrap` is False
(=0), we are at a terminal state, so we ignore the term. This trick is used in
many RL libraries, e.g. SB3.
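
The two points above can be sketched on hand-written tensors. This is only an illustrative sketch, not the reference solution: the values are made up, the function name `td_loss_sketch` is hypothetical, and it assumes that the reward stored at time $t+1$ is the one received for the transition from $s_t$ to $s_{t+1}$.

```python
import torch

def td_loss_sketch(reward, must_bootstrap, critic, discount_factor):
    """Mean squared temporal difference error (tensors of shape T x B)"""
    # Target r + gamma * V(s_{t+1}); detach() blocks gradients through V(s_{t+1}),
    # and must_bootstrap zeroes the bootstrap term for terminal states
    target = (
        reward[1:]
        + discount_factor * critic[1:].detach() * must_bootstrap[1:].int()
    )
    td = target - critic[:-1]
    return (td**2).mean()

# Made-up episode with T=3 time steps and B=1 environment, terminal at t=2
reward = torch.tensor([[0.0], [1.0], [1.0]])
must_bootstrap = torch.tensor([[True], [True], [False]])
critic = torch.tensor([[0.5], [1.0], [3.0]], requires_grad=True)

loss = td_loss_sketch(reward, must_bootstrap, critic, discount_factor=0.9)
loss.backward()
print(loss.item(), critic.grad)
```

In this toy episode, the value predicted for the terminal state (3.0) plays no role in the loss, and the gradient only flows through the $V(s_t)$ terms.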

(1) Code an `apply_discounted_sum_minus_baseline()` function, using the critic
learned simultaneously with the policy.

In [None]:
def apply_discounted_sum_minus_baseline(cfg, reward, v_value):
    # Implement the function

    assert False, 'Not implemented yet'


(2) Code a `compute_td_critic_loss()` function using temporal differences (bootstrapped)

In [None]:
def compute_td_critic_loss(cfg, reward, must_bootstrap, critic):
    # To be completed...

    assert False, 'Not implemented yet'


    return critic_loss

In [None]:
reinforce_td = Reinforce(OmegaConf.create({**params, "variant": "td"}))
run(reinforce_td, apply_discounted_sum_minus_baseline, compute_td_critic_loss)

In [None]:
# Visualization

reinforce_td.visualize_best()

### Third algorithm: Monte-Carlo Baseline

The `compute_td_critic_loss()` function above uses the Temporal Difference
approach to critic estimation. In this part, we will compare it to the
Monte Carlo estimation approach.

As explained in [this video](https://www.youtube.com/watch?v=GcJ9hl3T6x8) and
[these
slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/5_pg_derivation3.pdf), the
MC estimation approach uses the following equation:

$$\phi_{j+1} = \mathop{\mathrm{argmin}}_{\phi_j} \frac{1}{m\times
   H}\sum_{i=1}^m \sum_{t=1}^H \left( \left(\sum_{k=t}^H \gamma^{k-t}
   r(s_k^{(i)},a_k^{(i)}) \right) - \hat{V}^\pi_{\phi_j}(s_t^{(i)}) \right)^2
       $$

The innermost sum of discounted rewards exactly corresponds to the computation
of the `apply_discounted_sum()` function. The rest just consists in computing
the mean of the squared differences (also known as the Mean Squared Error, or MSE) over
the $m \times H$ samples ($m$ episodes of length $H$) that we have collected.
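
On hand-written tensors, this loss could be sketched as follows (made-up values; the names `returns` and `v_value` are illustrative, and `returns` is assumed to already hold the discounted sums of rewards):

```python
import torch

# Made-up discounted returns and critic predictions, shape T x B = 3 x 1
returns = torch.tensor([[1.75], [1.5], [1.0]])
v_value = torch.tensor([[1.25], [1.5], [2.0]])

# Mean squared error between the Monte Carlo returns and the predicted values
mc_loss = ((returns - v_value) ** 2).mean()
print(mc_loss.item())
```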

From the above information, create a `compute_critic_loss_mc()` function.

In [None]:
def compute_critic_loss_mc(cfg, reward, must_bootstrap, critic):
    # To be completed...

    assert False, 'Not implemented yet'


In [None]:
reinforce_mc = Reinforce(OmegaConf.create({**params, "variant": "mc"}))
run(reinforce_mc, apply_discounted_sum_minus_baseline, compute_critic_loss_mc)

In [None]:
# Visualization

reinforce_mc.visualize_best()

Most probably, this will not work well, as initially the learned critic is a
poor estimate of the true $V(s)$. Instead, load an already trained critic that
you have saved after convergence from a previous run, and see if it works
better.

Loading and saving a network or a BBRL agent can easily be performed using
`agent.save(filename)` and `agent.load(filename)`.

Warning: Be cautious with the use of ProbAgent with just a hidden layer,
ProbAgent with build_mlp, and DiscreteActor. Try to be progressive...