Note

You can download this example as a Jupyter notebook or try it out directly in Google Colab.

4.1 Reinforcement Learning Algorithm Tutorial#

Introduction

This comprehensive tutorial introduces users to the MATD3 implementation in ASSUME and demonstrates how we leverage reinforcement learning (RL) within the framework.

The primary goal is to ensure participants understand the essential steps required to equip ASSUME with an RL algorithm. This tutorial delves one level deeper than the basic RL application example in the user guide, providing detailed insights into the algorithm’s implementation.

Note: This tutorial focuses on understanding the implementation rather than hands-on coding tasks. The knowledge presented here is optional if you plan to use the pre-configured algorithms in ASSUME.

Tutorial Structure

  1. From Single Simulation to Learning Episodes

    • Understanding the transition from one simulation year to multiple learning episodes

    • How RL changes the simulation flow

  2. The Role of Learning Roles

    • Understanding the purpose and functionality of learning roles

    • Key responsibilities and interactions

  3. MATD3 Algorithm Deep Dive

    • Detailed examination of the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient algorithm

    • Implementation specifics and key characteristics

Additional Resources

  • RL Background: For a comprehensive refresher on reinforcement learning concepts, visit our documentation

  • Implementation Details: The algorithm explained here is designed as a plug-and-play solution within the ASSUME framework

Important: This tutorial provides an overview and explanation of reinforcement learning implementation for those interested in modifying the underlying learning algorithm. It does not include practical coding exercises.


0. Get ASSUME Running#

0.1 Installation Process#

We’ll start by installing ASSUME in this environment. For Google Colab, we install the ASSUME core package via pip.

For comprehensive installation instructions, please refer to our official documentation.

Note: If you’re working in Colab, creating a virtual environment (venv) is not necessary, as the environment is already isolated.

[ ]:
import importlib.util

# Check whether notebook is run in google colab
IN_COLAB = importlib.util.find_spec("google.colab") is not None

if IN_COLAB:
    !pip install assume-framework[learning]
    # Colab currently has issues with pyomo version 6.8.2, causing the notebook to crash
    # Installing an older version resolves this issue. This should only be considered a temporary fix.
    !pip install pyomo==6.8.0

Dashboard Limitations in Colab

⚠️ Important: In Google Colab, we cannot access Docker-dependent functionalities, which means:

  • Pre-defined dashboards are not available

  • For full dashboard access, please install Docker and ASSUME on your local machine

0.2 Repository Setup#

To access predefined simulation scenarios, clone the ASSUME repository (Colab only):

[ ]:
if IN_COLAB:
    !git clone --depth=1 https://github.com/assume-framework/assume.git assume-repo

First Simulation Run#

Now you can run your first simulation in ASSUME! The following code will:

  1. Navigate to the ASSUME directory

  2. Start the simulation using example example_01b

  3. Use a local SQLite database

Command Line Equivalent#

When running locally, you can execute this command in your shell:

assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db"
[ ]:
if IN_COLAB:
    !cd assume-repo && assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db"

0.3 Input Path Configuration#

We define the path to input files depending on whether you’re in Colab or working locally. This variable will be used to load configuration and scenario files throughout the tutorial.

[ ]:
colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"

inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path

1. From Single Simulation to Learning Episodes#

In traditional simulations without reinforcement learning, we execute the time horizon only once. However, RL introduces a fundamental change:

  • Learning Requirement: RL agents must learn strategies through repeated interactions

  • Experience Gathering: Agents need to encounter the same situations multiple times

To enable this learning process, we implement a run_learning() function that:

  1. Repeats the entire simulation horizon multiple times

  2. Activates when learning mode is enabled in the configuration

  3. Manages the learning process across episodes (see the conceptual sketch after this list)
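
As a rough conceptual sketch in plain Python (not actual ASSUME code), the shift can be pictured as wrapping the usual one-off simulation in an outer episode loop; the actual implementation is the run_learning() function shown later in this section.

[ ]:
# Conceptual sketch only: learning repeats the same simulation horizon several times
training_episodes = 3  # hypothetical value

for episode in range(1, training_episodes + 1):
    # in ASSUME, world.run() would execute the full simulation horizon here,
    # experience would be collected in the buffer, and the policies updated
    print(f"Episode {episode} finished")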

Required Imports#

First, let’s import all necessary libraries and modules for our RL implementation:

[ ]:
import logging
import os
import shutil
from collections import defaultdict
from pathlib import Path

import numpy as np
import torch as th
import yaml
from torch.nn import functional as F
from tqdm import tqdm

from assume.common.exceptions import AssumeException
from assume.reinforcement_learning.algorithms.base_algorithm import RLAlgorithm
from assume.reinforcement_learning.algorithms.matd3 import TD3
from assume.reinforcement_learning.buffer import ReplayBuffer
from assume.reinforcement_learning.learning_role import Learning
from assume.reinforcement_learning.learning_utils import polyak_update
from assume.scenario.loader_csv import (
    load_scenario_folder,
    setup_world,
)
from assume.world import World

logger = logging.getLogger(__name__)

The following flowchart provides an overview of the key stages involved in the run_learning function, which trains Deep Reinforcement Learning (DRL) agents within a simulated market environment. The process is divided into five main steps:

  1. Initialization of the Learning Process: The function begins by setting up the environment, initializing policies, and configuring necessary settings such as logging and buffer allocation. It ensures that no existing policies are overwritten without confirmation.

  2. Training Loop: This is the outer loop where multiple training episodes are executed. For each episode, the world simulation is completely re-initialized and reset after execution, meaning the simulation environment is essentially killed after each episode. Crucially, all necessary information that must persist across episodes—such as collected experience stored in the buffer—is maintained in the inter-episodic data. This data is key to ensuring the continuity of the learning process as it allows the DRL agents to build knowledge over time.

  3. Evaluation Loop: Nested within the training loop, the evaluation loop periodically assesses the performance of the learned policies. Based on average rewards, the best-performing policies are saved, and the function determines if further training is necessary.

  4. Terminate Learning: At the end of the training phase, the function saves the final version of the learned policies, ensuring that the results are stored for future use.

  5. Final Evaluation Run: A final evaluation run is conducted using the best policies from the training phase, providing a benchmark for overall performance.

The flowchart visually represents the interaction between the training and evaluation loops, highlighting the progression through these key stages.

[ ]:
from IPython.display import Image, display

image_path = Path("assume-repo/docs/source/img/Assume_run_learning_loop.png")
alt_image_path = Path("../../docs/source/img/Assume_run_learning_loop.png")

if image_path.exists():
    display(Image(image_path, width=400))
elif alt_image_path.exists():
    display(Image(alt_image_path, width=400))
[ ]:
def run_learning(
    world: World,
    verbose: bool = False,
) -> None:
    """
    Train Deep Reinforcement Learning (DRL) agents to act in a simulated market environment.

    This function runs multiple episodes of simulation to train DRL agents, performs evaluation, and saves the best runs. It maintains the buffer and learned agents in memory to avoid resetting them with each new run.

    Args:
        world (World): An instance of the World class representing the simulation environment.
        verbose (bool, optional): A flag indicating whether to enable verbose logging. Defaults to False.

    Note:
        - The function uses a ReplayBuffer to store experiences for training the DRL agents.
        - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
        - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
        - Upon completion of training, the function performs an evaluation run using the last policy learned during training.
        - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
    """

    if not verbose:
        logger.setLevel(logging.WARNING)
        logging.getLogger("assume.scenario.loader_csv").setLevel(logging.WARNING)

    # remove csv path so that nothing is written while learning
    temp_csv_path = world.export_csv_path
    world.export_csv_path = ""

    # initialize policies already here to set the obs_dim and act_dim in the learning role
    world.learning_role.rl_algorithm.initialize_policy()

    # check if we already stored policies for this simulation
    save_path = world.learning_config.trained_policies_save_path

    if Path(save_path).is_dir() and not world.learning_config.continue_learning:
        # we are in learning mode and about to train new policies, which might overwrite existing ones
        accept = input(
            f"{save_path=} exists - should we overwrite current learned strategies? (y/N) "
        )
        if accept.lower().startswith("y"):
            # remove existing policies
            if os.path.exists(save_path):
                shutil.rmtree(save_path, ignore_errors=True)

        else:
            # stop here - do not start learning or save anything
            raise AssumeException(
                "Simulation aborted by user not to overwrite existing learned strategies. You can use 'simulation_id' parameter in the config to start a new simulation."
            )

    # also remove tensorboard logs
    tensorboard_path = f"tensorboard/{world.scenario_data['simulation_id']}"
    if os.path.exists(tensorboard_path):
        shutil.rmtree(tensorboard_path, ignore_errors=True)

    # -----------------------------------------
    # Information that needs to be stored across episodes, aka one simulation run
    inter_episodic_data = {
        "buffer": ReplayBuffer(
            buffer_size=world.learning_role.learning_config.replay_buffer_size,
            obs_dim=world.learning_role.rl_algorithm.obs_dim,
            act_dim=world.learning_role.rl_algorithm.act_dim,
            n_rl_units=len(world.learning_role.rl_strats),
            device=world.learning_role.device,
            float_type=world.learning_role.float_type,
        ),
        "actors_and_critics": None,
        "max_eval": defaultdict(lambda: -1e9),
        "all_eval": defaultdict(list),
        "avg_all_eval": [],
        "episodes_done": 0,
        "eval_episodes_done": 0,
    }

    world.learning_role.load_inter_episodic_data(inter_episodic_data)

    # -----------------------------------------

    validation_interval = min(
        world.learning_role.learning_config.training_episodes,
        world.learning_role.learning_config.validation_episodes_interval,
    )

    eval_episode = 1

    for episode in tqdm(
        range(1, world.learning_role.learning_config.training_episodes + 1),
        desc="Training Episodes",
    ):
        # -----------------------------------------
        # Give the newly initialized learning role the needed information across episodes
        if episode != 1:
            setup_world(
                world=world,
                episode=episode,
            )
            world.learning_role.load_inter_episodic_data(inter_episodic_data)

        world.run()

        world.learning_role.tensor_board_logger.update_tensorboard()

        # -----------------------------------------
        # Store updated information across episodes
        inter_episodic_data = world.learning_role.get_inter_episodic_data()
        inter_episodic_data["episodes_done"] = episode

        # evaluation run:
        if (
            episode % validation_interval == 0
            and episode
            >= world.learning_role.learning_config.episodes_collecting_initial_experience
            + validation_interval
        ):
            world.reset()

            # load evaluation run
            setup_world(
                world=world,
                evaluation_mode=True,
                episode=episode,
                eval_episode=eval_episode,
            )

            world.learning_role.load_inter_episodic_data(inter_episodic_data)

            world.run()

            world.learning_role.tensor_board_logger.update_tensorboard()

            total_rewards = world.output_role.get_sum_reward(episode=eval_episode)

            if len(total_rewards) == 0:
                raise AssumeException("No rewards were collected during evaluation run")

            avg_reward = np.mean(total_rewards)

            # check reward improvement in evaluation run
            # and store best run in eval folder
            terminate = world.learning_role.compare_and_save_policies(
                {"avg_reward": avg_reward}
            )

            inter_episodic_data["eval_episodes_done"] = eval_episode

            # if we have not improved in the last x evaluations, we stop loop
            if terminate:
                break

            eval_episode += 1

        world.reset()

        # save the policies after each episode in case the simulation is stopped or crashes
        if (
            episode
            >= world.learning_role.learning_config.episodes_collecting_initial_experience
            + validation_interval
        ):
            world.learning_role.rl_algorithm.save_params(
                directory=f"{world.learning_role.learning_config.trained_policies_save_path}/last_policies"
            )

    # container shutdown implicitly with new initialisation
    logger.info("################")
    logger.info("Training finished, Start evaluation run")
    world.export_csv_path = temp_csv_path

    world.reset()

    # Point 'trained_policies_load_path' to the best policies saved during training
    # so that the final evaluation run uses them, even if previous strategies were
    # loaded from an external source. This is useful when continuing from a previous learning session.
    world.scenario_data["config"]["learning_config"]["trained_policies_load_path"] = (
        f"{world.learning_role.learning_config.trained_policies_save_path}/avg_reward_eval_policies"
    )

    # load scenario for evaluation
    setup_world(
        world=world,
        terminate_learning=True,
    )
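
To make the evaluation trigger in run_learning more concrete, the small check below uses hypothetical configuration values to list the training episodes after which an evaluation run would be started.

[ ]:
# Hypothetical values, mirroring the evaluation condition in run_learning
training_episodes = 20
episodes_collecting_initial_experience = 5
validation_interval = 5

eval_trigger_episodes = [
    episode
    for episode in range(1, training_episodes + 1)
    if episode % validation_interval == 0
    and episode >= episodes_collecting_initial_experience + validation_interval
]
print(eval_trigger_episodes)  # [10, 15, 20]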

2. The Role of the Learning Role#

The LearningRole class in learning_role.py is a central component of the reinforcement learning framework. It manages configurations, device settings, and early stopping of the learning process, and it initializes the RL strategies, the algorithm, and the buffers. This class ensures that the RL agent can be trained or evaluated effectively, leveraging the available hardware and adhering to the specified configurations. The parameters of the learning process are also described in the documentation (Read the Docs) under learning_algorithms.

2.1 Learning Data Management#

One key feature of the LearningRole class is its ability to load and manage the inter-episodic data. This involves storing experiences and the training progress, and retrieving this data to train the RL agent. By efficiently handling episodic data, the LearningRole class enables the agent to learn from past experiences and improve its performance over time.

[ ]:
class Learning(Learning):
    """
    This class manages the learning process of reinforcement learning agents, including initializing key components such as
    neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on
    the provided learning configuration.

    Args:
        simulation_start (datetime.datetime): The start of the simulation.
        simulation_end (datetime.datetime): The end of the simulation.
        learning_config (LearningConfig): The configuration for the learning process.

    """

    def load_inter_episodic_data(self, inter_episodic_data):
        """
        Load the inter-episodic data from the dict stored across simulation runs.

        Args:
            inter_episodic_data (dict): The inter-episodic data to be loaded.

        """
        self.episodes_done = inter_episodic_data["episodes_done"]
        self.eval_episodes_done = inter_episodic_data["eval_episodes_done"]
        self.max_eval = inter_episodic_data["max_eval"]
        self.rl_eval = inter_episodic_data["all_eval"]
        self.avg_rewards = inter_episodic_data["avg_all_eval"]
        self.buffer = inter_episodic_data["buffer"]

        # if enough initial experience was collected according to specifications in learning config
        # turn off initial exploration and go into full learning mode
        if (
            self.episodes_done
            >= self.learning_config.episodes_collecting_initial_experience
        ):
            self.turn_off_initial_exploration()

        self.set_noise_scale(inter_episodic_data["noise_scale"])

        self.initialize_policy(inter_episodic_data["actors_and_critics"])

    def get_inter_episodic_data(self):
        """
        Dump the inter-episodic data to a dict for storing across simulation runs.

        Returns:
            dict: The inter-episodic data to be stored.
        """

        return {
            "episodes_done": self.episodes_done,
            "eval_episodes_done": self.eval_episodes_done,
            "max_eval": self.max_eval,
            "all_eval": self.rl_eval,
            "avg_all_eval": self.avg_rewards,
            "buffer": self.buffer,
            "actors_and_critics": self.rl_algorithm.extract_policy(),
            "noise_scale": self.get_noise_scale(),
        }

The metrics in inter_episodic_data are stored for the following reasons (a short inspection sketch follows this list):

  • episodes_done and eval_episodes_done: Monitoring Progress
    Keeping track of the number of episodes completed.
  • max_eval, all_eval, avg_all_eval: Evaluating Performance
    Storing evaluation scores and average rewards to assess the agent’s performance across episodes.
  • buffer: Experience Replay
    Using a replay buffer to learn from past experiences and improve data efficiency.
  • noise_scale: Policy Exploration
    The noise is used to add exploration to the policy. It may be decayed over the course of the episodes, and we store the current noise value so that the decay can continue in future episodes.
  • actors_and_critics: Policy Initialization
    Initializing the policy with actors and critics (self.initialize_policy()) ensures that the agent starts with the pre-defined strategy from the previous episode and can improve upon it through learning.
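
As a short illustration, the cell below shows how this inter-episodic data could be inspected once a World with learning enabled has been created and run; the keys match the dictionary returned by get_inter_episodic_data above. This is purely for orientation and assumes that a world object exists.

[ ]:
# Illustrative inspection only (assumes a world with learning mode enabled exists)
data = world.learning_role.get_inter_episodic_data()

print(data["episodes_done"], data["eval_episodes_done"])  # training and evaluation progress
print(len(data["avg_all_eval"]))  # average evaluation rewards collected so far
print(data["buffer"])  # the replay buffer carried across episodes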

2.2 Learning Algorithm#

If learning is used, the learning role initializes a learning algorithm, which is the heart of the learning process. Currently, only MATD3 is implemented, but we are working on PPO implementations as well. If you would like to add an algorithm, it would be integrated here (a sketch of how the dispatch could be extended follows the cell below).

[ ]:
class Learning(Learning):
    def create_learning_algorithm(self, algorithm: RLAlgorithm):
        """
        Create and initialize the reinforcement learning algorithm.

        This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm
        is associated with the learning role and configured with relevant hyperparameters.

        Args:
            algorithm (RLAlgorithm): The name of the reinforcement learning algorithm.
        """
        if algorithm == "matd3":
            self.rl_algorithm = TD3(learning_role=self)
        else:
            logger.error(f"Learning algorithm {algorithm} not implemented!")
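
The dispatch above is also where a new algorithm would be hooked in. As a purely hypothetical sketch (MyPPO is a made-up class name, not part of ASSUME), the extension could look like this:

[ ]:
# Hypothetical sketch: adding a new algorithm means extending the dispatch in
# create_learning_algorithm (MyPPO is a made-up class and does not exist in ASSUME)
#
# if algorithm == "matd3":
#     self.rl_algorithm = TD3(learning_role=self)
# elif algorithm == "my_ppo":
#     self.rl_algorithm = MyPPO(learning_role=self)
# else:
#     logger.error(f"Learning algorithm {algorithm} not implemented!")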

3. Learning Algorithm Flow in ASSUME#

The following graph illustrates the structure and flow of the learning algorithm within the reinforcement learning framework.

[ ]:
# use the cloned repository path in Colab, the relative docs path otherwise
display(Image(f"{'assume-repo' if IN_COLAB else '../..'}/docs/source/img/TD3_algorithm.jpeg", width=400))

Within the algorithm, we distinguish three different steps that are translated into ASSUME in the following way:

  1. Initialization: This is the first step where all necessary components such as the actors, critics, and buffer are set up.

  2. Experience Collection: The second step, represented in the flowchart above within the loop, involves the collection of experience. This includes choosing an action, observing a reward, and storing the transition tuple in the buffer.

  3. Policy Update: The third step is the actual policy update, which is also performed within the loop, allowing the agent to improve its performance over time.

3.1 Initialization#

The initialization of the actors, critics, and the buffer is handled via the learning_role and the inter_episodic_data, as described earlier. The create_learning_algorithm function triggers their initialization in initialize_policy. At the beginning of the training process, they are initialized with new random settings. In subsequent episodes, they are initialized with pre-learned data, ensuring that previous learning is retained and built upon.

[ ]:
class TD3(TD3):
    def initialize_policy(self, actors_and_critics: dict = None) -> None:
        """
        Create actor and critic networks for reinforcement learning.

        If `actors_and_critics` is None, this method creates new actor and critic networks.
        If `actors_and_critics` is provided, it assigns existing networks to the respective attributes.

        Args:
            actors_and_critics (dict): The actor and critic networks to be assigned.

        """
        if actors_and_critics is None:
            self.create_actors()
            self.create_critics()

        else:
            for u_id, strategy in self.learning_role.rl_strats.items():
                strategy.actor = actors_and_critics["actors"][u_id]
                strategy.actor_target = actors_and_critics["actor_targets"][u_id]

                strategy.critics = actors_and_critics["critics"][u_id]
                strategy.target_critics = actors_and_critics["target_critics"][u_id]

            self.obs_dim = actors_and_critics["obs_dim"]
            self.act_dim = actors_and_critics["act_dim"]
            self.unique_obs_dim = actors_and_critics["unique_obs_dim"]

Please also note that the critics and target critics are handled differently from the actors and target actors, as you can observe in the initialize_policy function. The critics are centralized: they receive information about all agents and are used jointly across the different actors. In contrast, the actors are assigned to specific unit strategies: each learning unit, such as a power plant, has one learning strategy and therefore an individual actor, while the critics remain centralized.

This distinction means that, even if learning is not active, we still need the actors to perform the entire simulation using pre-trained policies, for example when running a simulation with previously learned strategies.
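
To make the distinction between centralized critics and per-unit actors more tangible, the small calculation below uses hypothetical dimensions to show what a critic receives as input compared to an actor; it mirrors how all_states and all_actions are assembled in update_policy further below.

[ ]:
# Hypothetical dimensions for illustration only
obs_dim, unique_obs_dim, act_dim, n_rl_agents = 50, 2, 1, 4

actor_input_dim = obs_dim  # each actor only sees its own unit's observation
critic_obs_input_dim = obs_dim + (n_rl_agents - 1) * unique_obs_dim  # own obs plus unique obs of all other agents
critic_act_input_dim = n_rl_agents * act_dim  # actions of all agents

print(actor_input_dim, critic_obs_input_dim, critic_act_input_dim)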

3.2 Experience Collection#

Within the loop, selecting an action with exploration noise, observing the new reward and state, and storing this transition tuple in the buffer are all handled within the bidding strategy.

This specific process is covered in more detail in tutorial 04b.
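
As a minimal standalone illustration in plain PyTorch (not the actual bidding-strategy code), adding exploration noise to an action and clipping it to the valid range could look as follows:

[ ]:
import torch as th

# Standalone illustration only: perturb a (random stand-in) actor output with
# Gaussian exploration noise and clip it to the valid action range [-1, 1]
action = th.rand(2) * 2 - 1          # stand-in for an actor's output
noise = th.randn_like(action) * 0.1  # exploration noise with sigma = 0.1
noisy_action = (action + noise).clamp(-1, 1)
print(noisy_action)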

3.3 Policy Update#

The core of the algorithm, which comprises all other steps, is embodied by the assume.reinforcement_learning.algorithms.matd3.TD3.update_policy function. Here, the critics and the actors are updated according to the algorithm.

[ ]:
class TD3(TD3):
    def update_policy(self):
        """
        Update the policy of the reinforcement learning agent using the Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithm.

        Note:
            This function performs the policy update step, which involves updating the actor (policy) and critic (Q-function) networks
            using TD3 algorithm. It iterates over the specified number of gradient steps and performs the following steps for each
            learning strategy:

            1. Sample a batch of transitions from the replay buffer.
            2. Calculate the next actions with added noise using the actor target network.
            3. Compute the target Q-values based on the next states, rewards, and the target critic network.
            4. Compute the critic loss as the mean squared error between current Q-values and target Q-values.
            5. Optimize the critic network by performing a gradient descent step.
            6. Optionally, update the actor network if the specified policy delay is reached.
            7. Apply Polyak averaging to update target networks.

            This function implements the TD3 algorithm's key step for policy improvement and exploration.
        """

        logger.debug("Updating Policy")

        n_rl_agents = len(self.learning_role.rl_strats)

        for _ in range(self.gradient_steps):
            self.n_updates += 1

            transitions = self.learning_role.buffer.sample(
                self.learning_config.batch_size
            )
            states, actions, next_states, rewards = (
                transitions.observations,
                transitions.actions,
                transitions.next_observations,
                transitions.rewards,
            )

            with th.no_grad():
                # Select action according to policy and add clipped noise
                noise = (
                    th.randn_like(actions) * self.learning_config.target_policy_noise
                )
                noise = noise.clamp(
                    -self.learning_config.target_noise_clip,
                    self.learning_config.target_noise_clip,
                )

                next_actions = th.stack(
                    [
                        (
                            strategy.actor_target(next_states[:, i, :]) + noise[:, i, :]
                        ).clamp(-1, 1)
                        for i, strategy in enumerate(
                            self.learning_role.rl_strats.values()
                        )
                    ]
                )

                next_actions = next_actions.transpose(0, 1).contiguous()
                next_actions = next_actions.view(-1, n_rl_agents * self.act_dim)

            all_actions = actions.view(self.learning_config.batch_size, -1)

            # Precompute unique observation parts for all agents
            unique_obs_from_others = states[
                :, :, self.obs_dim - self.unique_obs_dim :
            ].reshape(self.learning_config.batch_size, n_rl_agents, -1)
            next_unique_obs_from_others = next_states[
                :, :, self.obs_dim - self.unique_obs_dim :
            ].reshape(self.learning_config.batch_size, n_rl_agents, -1)

            # Loop over all agents and update their actor and critic networks
            for i, strategy in enumerate(self.learning_role.rl_strats.values()):
                actor = strategy.actor
                critic = strategy.critics
                critic_target = strategy.target_critics

                # Efficiently extract unique observations from all other agents
                other_unique_obs = th.cat(
                    (unique_obs_from_others[:, :i], unique_obs_from_others[:, i + 1 :]),
                    dim=1,
                )
                other_next_unique_obs = th.cat(
                    (
                        next_unique_obs_from_others[:, :i],
                        next_unique_obs_from_others[:, i + 1 :],
                    ),
                    dim=1,
                )

                # Construct final state representations
                all_states = th.cat(
                    (
                        states[:, i, :].reshape(self.learning_config.batch_size, -1),
                        other_unique_obs.reshape(self.learning_config.batch_size, -1),
                    ),
                    dim=1,
                )
                all_next_states = th.cat(
                    (
                        next_states[:, i, :].reshape(
                            self.learning_config.batch_size, -1
                        ),
                        other_next_unique_obs.reshape(
                            self.learning_config.batch_size, -1
                        ),
                    ),
                    dim=1,
                )

                with th.no_grad():
                    # Compute the next Q-values: min over all critics targets
                    next_q_values = th.cat(
                        critic_target(all_next_states, next_actions), dim=1
                    )
                    next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True)
                    target_Q_values = (
                        rewards[:, i].unsqueeze(1)
                        + self.learning_config.gamma * next_q_values
                    )

                # Get current Q-values estimates for each critic network
                current_Q_values = critic(all_states, all_actions)

                # Compute critic loss
                critic_loss = sum(
                    F.mse_loss(current_q, target_Q_values)
                    for current_q in current_Q_values
                )

                # Optimize the critics
                critic.optimizer.zero_grad(set_to_none=True)
                critic_loss.backward()
                # Clip the gradients to avoid exploding gradients and stabilize training
                th.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
                critic.optimizer.step()

                # Delayed policy updates
                if self.n_updates % self.learning_config.policy_delay == 0:
                    # Compute actor loss
                    state_i = states[:, i, :]
                    action_i = actor(state_i)

                    all_actions_clone = actions.clone().detach()
                    all_actions_clone[:, i, :] = action_i

                    # calculate actor loss
                    actor_loss = -critic.q1_forward(
                        all_states,
                        all_actions_clone.view(self.learning_config.batch_size, -1),
                    ).mean()

                    actor.optimizer.zero_grad(set_to_none=True)
                    actor_loss.backward()
                    # Clip the gradients to avoid exploding gradients and stabilize training
                    th.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
                    actor.optimizer.step()

            # Perform batch-wise Polyak update at the end (instead of inside the loop)
            if self.n_updates % self.learning_config.policy_delay == 0:
                all_critic_params = []
                all_target_critic_params = []

                all_actor_params = []
                all_target_actor_params = []

                for strategy in self.learning_role.rl_strats.values():
                    all_critic_params.extend(strategy.critics.parameters())
                    all_target_critic_params.extend(
                        strategy.target_critics.parameters()
                    )

                    all_actor_params.extend(strategy.actor.parameters())
                    all_target_actor_params.extend(strategy.actor_target.parameters())

                # Perform batch-wise Polyak update (NO LOOPS)
                polyak_update(
                    all_critic_params,
                    all_target_critic_params,
                    self.learning_config.tau,
                )
                polyak_update(
                    all_actor_params, all_target_actor_params, self.learning_config.tau
                )

The other functions within the reinforcement learning algorithm are primarily there to store, update, and save the new policies. These functions either write the updated policies to a designated location or save them into the inter_episodic_data.

If you would like to make a change to this algorithm, the most likely modification would be to the update_policy function, as it plays a central role in the learning process. The other functions would only need adjustments if specific algorithm features differ, such as the use of target critics or the critic architecture.
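
The polyak_update call at the end of update_policy performs a soft update of the target networks. A minimal standalone illustration of this rule, target = tau * source + (1 - tau) * target, with made-up parameter tensors is shown below:

[ ]:
import torch as th

# Standalone illustration of a Polyak (soft) update with made-up tensors:
# target = tau * source + (1 - tau) * target
tau = 0.005
source_params = [th.ones(3)]
target_params = [th.zeros(3)]

with th.no_grad():
    for source, target in zip(source_params, target_params):
        target.mul_(1 - tau).add_(source, alpha=tau)

print(target_params[0])  # tensor([0.0050, 0.0050, 0.0050])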

3.5 Start the Simulation#

We are almost done with all the changes needed to make ASSUME learn in this notebook. If you would rather load our pretrained strategies, we need a function for loading parameters, which can be found below.

To control the learning process, the config file determines the parameters of the learning algorithm. As we want to adjust these values in the notebook, we will overwrite the learning config in the next cell and then load it into our world.

[ ]:
learning_config = {
    "learning_mode": True,
    "continue_learning": False,
    "trained_policies_save_path": None,
    "max_bid_price": 100,
    "algorithm": "matd3",
    "learning_rate": 0.001,
    "training_episodes": 100,
    "episodes_collecting_initial_experience": 5,
    "train_freq": "24h",
    "gradient_steps": 24,
    "batch_size": 256,
    "gamma": 0.99,
    "device": "cpu",
    "noise_sigma": 0.1,
    "noise_scale": 1,
    "noise_dt": 1,
    "validation_episodes_interval": 5,
}
[ ]:
# Read the YAML file
with open(f"{inputs_path}/example_02a/config.yaml") as file:
    data = yaml.safe_load(file)

# store our modifications to the config file
data["base"]["learning_config"] = learning_config

# Write the modified data back to the file
with open(f"{inputs_path}/example_02a/config.yaml", "w") as file:
    yaml.safe_dump(data, file)

To let the simulation run with the integrated learning, we need to adapt the main script that runs it in the following way.

[ ]:
import os

from assume.strategies.learning_strategies import EnergyLearningStrategy

log = logging.getLogger(__name__)

csv_path = "./outputs"
os.makedirs("./local_db", exist_ok=True)

if __name__ == "__main__":
    """
    Available examples:
    - local_db: without database and grafana
    - timescale: with database and grafana (note: you need docker installed)
    """
    data_format = "local_db"  # "local_db" or "timescale"

    if data_format == "local_db":
        db_uri = "sqlite:///./local_db/assume_db.db"
    elif data_format == "timescale":
        db_uri = "postgresql://assume:assume@localhost:5432/assume"

    input_path = inputs_path
    scenario = "example_02a"
    study_case = "base"

    # create world
    world = World(database_uri=db_uri, export_csv_path=csv_path)

    # we add our learning bidding strategy class to the world's bidding strategies
    # in the provided example files, the learning bidding strategy in the input csv is named "powerplant_energy_learning"
    # hence we map this name to our learning strategy class
    world.bidding_strategies["powerplant_energy_learning"] = EnergyLearningStrategy

    # then we load the scenario specified above from the respective input files
    load_scenario_folder(
        world,
        inputs_path=input_path,
        scenario=scenario,
        study_case=study_case,
    )

    # run learning if learning mode is enabled
    # needed as we simulate the modeling horizon multiple times to train the reinforcement learning agents

    if world.learning_mode:
        run_learning(world)

    # after the learning is done we make a normal run of the simulation, which equals a test run
    world.run()

Summary and Outlook#

Well done! You have now completed your first RL simulation with ASSUME. This tutorial provided the necessary background on the RL structure and implementation so that you can modify the learning algorithm.

The results of the simulation are stored in the local_db as well as in .csv files in the output folder ./outputs. Feel free to check the outcomes after the simulation has finished.

If you’d like to understand how the actual bids of market participants as DRL learning agents are derived, the following tutorial will dive deep into the design of adaptive bidding strategies.

Next up: 4.2 Designing Adaptive Bidding Strategies in ASSUME using Reinforcement Learning