Note
You can download this example as a Jupyter notebook or try it out directly in Google Colab.
4.2 Designing Adaptive Bidding Strategies in ASSUME using Reinforcement Learning#
Introduction
This tutorial introduces the integration of reinforcement learning (RL) into the ASSUME simulation framework, with a focus on developing and deploying learning-based bidding strategies for electricity market participants.
The tutorial is designed to walk you through the essential components required to transform a conventional market participant into an RL agent. Rather than concentrating on the underlying algorithmic infrastructure—such as training loops, buffers, or learning roles—this tutorial emphasizes how to define a bidding strategy that interfaces with the ASSUME learning backend. You will learn how to construct observation spaces, define action mappings, and design reward functions that guide agent behavior.
Each core concept is addressed in a dedicated chapter, accompanied by exercises that allow you to apply the material directly. These hands-on tasks culminate in a final integration chapter where you will run a complete simulation and train your first learning agent.
Tutorial Structure
The tutorial is divided into the following chapters:
Get ASSUME Running
Instructions for installing ASSUME and preparing your environment, whether locally or in Google Colab.
ASSUME & Learning Basics
A conceptual overview of RL within the ASSUME framework, including actor-critic architectures, centralized training, and multi-agent design principles.
Defining the Observation Space
Explanation and coding tasks for constructing shared and individual observations used by agents to make decisions.
Action Selection and Exploration
Retrieving the agent's actions based on the observed environment, and why it is important to explore beyond the policy's output values.
From Observation to Action to Bids
How to convert actor network outputs into economically meaningful bid prices and apply exploration during training.
Reward Function Design
Techniques for shaping agent behavior using profit- and regret-based reward signals. Includes a task to define your own reward logic.
Training and Evaluating Your First Learning Agent
Integration of the previously implemented components into a complete simulation run, demonstrating end-to-end learning behavior in a market setting.
Analyzing Strategic Bidding Behavior
Investigate how an RL agent exploits its market power by learning strategic bidding behavior, as an example of more realistic market simulations.
Summary and Outlook
Wraps up the contents of this tutorial and offers further ideas for which components of the learning process can be tweaked.
Learning Outcomes
By completing this tutorial, you will be able to:
Implement RL-compatible bidding strategies within the ASSUME framework.
Define observation inputs for learning agents.
Map actor outputs to valid market actions and manage exploration.
Construct reward functions that combine economic incentives with strategic signals.
Train and evaluate a basic RL agent in a multi-agent electricity market simulation.
1. Get ASSUME Running#
This chapter walks you through setting up the ASSUME framework in your environment and preparing the required input files. At the end, you will verify that the installation was successful and that the framework is ready for use.
1.1 Installation#
In Google Colab#
Google Colab already includes most scientific computing libraries (e.g., numpy, torch). You only need to install the ASSUME core framework:
[ ]:
# Only run this cell if you are using Google Colab
import importlib.util
IN_COLAB = importlib.util.find_spec("google.colab") is not None
if IN_COLAB:
    !pip install assume-framework
Note: After installation, Colab may prompt you to restart the session due to dependency changes. To do so, click “Runtime” → “Restart session…” in the menu bar, then re-run the cells above.
On Your Local Machine#
To install ASSUME with all learning-related dependencies, run the following in your terminal:
pip install 'assume-framework[learning]'
This will install the simulation framework and the packages required for RL.
1.2 Repository Setup#
To access predefined simulation scenarios, clone the ASSUME repository (Colab only):
[ ]:
# Only run this cell if you are using Google Colab
if IN_COLAB:
    !git clone --depth=1 https://github.com/assume-framework/assume.git assume-repo
Local users may skip this step if input files are already available in the project directory.
1.3 Input Path Configuration#
We define the path to input files depending on whether you’re in Colab or working locally. This variable will be used to load configuration and scenario files throughout the tutorial.
[ ]:
colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"
inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path
1.4 Installation Check#
Use the following cell to ensure the installation was successful and that essential components are available. This test ensures that the simulation engine and RL strategy base class are accessible before continuing.
[ ]:
try:
    from assume import World
    from assume.strategies.learning_strategies import TorchLearningStrategy

    print("✅ ASSUME framework is installed and functional.")
except ImportError as e:
    print("❌ Failed to import essential components:", e)
    print(
        "Please review the installation instructions and ensure all dependencies are installed."
    )
1.5 Limitations in Colab#
Colab does not support Docker, so dashboard visualizations included in some ASSUME workflows will not be available. However, simulation runs and RL training can still be fully executed.
In Colab: Training and basic plotting are supported.
In Local environments with Docker: Full access, including dashboards.
1.6 Core Imports#
In this section, we import the core modules that will be used throughout the tutorial. Each import is explained to clarify its role.
[ ]:
# Standard Python modules
import logging  # For logging messages during simulation and debugging
import os  # For operating system interactions
from datetime import timedelta  # To handle market time resolutions (e.g., hourly steps)

import matplotlib.pyplot as plt

# Scientific and data processing libraries
import numpy as np  # Numerical operations and array handling
import pandas as pd  # Data manipulation and analysis
import yaml  # Parsing YAML configuration files

# Database and visualization libraries
from sqlalchemy import create_engine

# ASSUME framework components
from assume import World  # Core simulation container that manages markets and agents
from assume.scenario.loader_csv import (  # Functions to load and execute scenarios
    load_scenario_folder,
    run_learning,
)
from assume.strategies.learning_strategies import (
    MinMaxStrategy,  # Abstract class for powerplant-like strategies
    TorchLearningStrategy,  # Abstract base for RL bidding strategies
)
These imports are used for:
Defining RL bidding strategies.
Managing input/output data.
Executing and analyzing simulations.
At this point, you are ready to begin building your RL bidding agent. In the next chapter, we will define how agents perceive the market by constructing their observation vectors.
2. ASSUME & Learning Basics#
2.1 The ASSUME Framework#
ASSUME is a simulation framework designed for researchers, utilities, and planners to model and understand market dynamics in electricity systems. It allows for agent-based modeling of market participants in a modular and configurable environment.
The core structure of the framework consists of:
Markets (on the left of the architecture diagram): Where electricity products are traded.
Market Participants / Units (on the right): Each agent represents a physical or virtual unit bidding into the market.
Orders: The main communication channel between units and markets.
Learning Agents: Highlighted in yellow in the architecture, these are agents using RL strategies.
The image below illustrates the high-level architecture of ASSUME. Focus on the yellow components—these are the parts involved in the learning process.
[ ]:
from pathlib import Path

from IPython.display import SVG, display

image_path = Path("assume-repo/docs/source/img/architecture.svg")
alt_image_path = Path("../../docs/source/img/architecture.svg")

if image_path.exists():
    display(SVG(image_path))
elif alt_image_path.exists():
    display(SVG(alt_image_path))
2.2 Introduction to Learning in ASSUME#
The current implementation of RL in ASSUME models electricity markets as partially observable Markov games, allowing multiple agents to operate under individual observations and reward structures.
If you are unfamiliar with RL, refer to the following links for background material:
Central Concepts:
Policy: The strategy used by an agent to select actions based on observations.
Actor-Critic Architecture: A method where the “actor” chooses actions and the “critic” evaluates them.
Learning Strategy: Defines how a unit transforms observations into bids using a trainable model.
Step Functions: The typical RL cycle of Observe → Act → Reward → Update is split across several methods in ASSUME, as described in Section 3.
2.3 Single-Agent RL#
In a single-agent setup, the agent attempts to maximize its reward over time by learning from interaction with the environment. It does so by making multiple steps in the environment. In RL, each interaction step includes:
Observation of the current state.
Action selection based on policy.
Reward from the environment.
Policy Update to improve behavior.
In ASSUME, this step cycle is modularized:
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | create_observation() / get_individual_observations() | Constructs the observation vector. |
| Step 2 | get_actions() and calculate_bids() | Maps observations to bid prices. |
| Step 3 | calculate_reward() | Computes the reward signal. |
| Step 4 | Handled by the learning role | Updates the model and manages the replay buffer. |
Actor-Critic Structure: To increase learning stability, actor-critic methods are commonly used. They divide the tasks as follows:
Actor: Learns a deterministic policy for choosing actions. Uses policy gradient methods to maximize expected reward.
Critic: Learns a value function using Temporal Difference (TD) learning. Provides feedback to the actor based on action quality.
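To make the actor's role more concrete, the following minimal sketch shows what such a network could look like in PyTorch. It is not the architecture used inside ASSUME (layer sizes and structure are assumptions); it only illustrates how an observation vector is mapped to actions bounded in \([-1, 1]\), matching the bid scaling used later in this tutorial.

import torch
import torch.nn as nn


class MinimalActor(nn.Module):
    """Illustrative actor: maps an observation vector to actions in [-1, 1].
    The actual ASSUME network may differ; this is only a conceptual sketch."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
            nn.Tanh(),  # bounds the output to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


# Example: a 38-dimensional observation mapped to a single bid-price action
actor = MinimalActor(obs_dim=38, act_dim=1)
print(actor(torch.zeros(38)))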
2.4 Multi-Agent RL#
Real-world electricity markets involve multiple agents acting simultaneously, which introduces interdependencies and non-stationarity. The latter refers to the fact that the continuous adaptation of other agents makes the environment change and therefore less predictable from the perspective of a single agent. As a result, multi-agent learning requires additional considerations.
Challenges:
Actions by one agent influence the environment experienced by others.
The state transitions and rewards become non-stationary.
Solution: Centralized Training with Decentralized Execution (CTDE)
To address these challenges, ASSUME employs the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) framework with CTDE:
Centralized Training: A critic with access to all agents’ states and actions is used during training to stabilize learning. Note that the critic is only used to update the actor network, so it is needed only during training.
Decentralized Execution: During simulation, the actual actor of each agent relies only on its own observations and learned policy.
Each agent trains two critic networks to mitigate overestimation bias, uses target noise for robustness, and relies on deterministic policy gradients for the actor update.
3. Defining the Observation Space#
In this chapter, you will define what information your RL agent perceives about the environment and itself at each decision point. This is a critical component of the agent’s behavior, as the observation vector forms the basis for all future actions and learning.
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | create_observation() / get_individual_observations() | Constructs the observation vector. |
3.2 Observation Structure in ASSUME#
Observations are composed of two parts:
1. Global Observations
These are shared across all agents and constructed by the base class method create_observation(). They include:
Forecasted residual load over the foresight horizon.
Forecasted market price over the foresight horizon.
Historical market price over a specified window.
These are normalized by maximum demand and maximum bid price for stability. These values are generated by a forecasting role and made available to all agents before each market cycle.
For this tutorial you do not need to modify this part. However, if you want to equip new unit types with learning or extend the simulation with new concepts, additional global information might be needed.
2. Individual Observations
These are unit-specific and must be implemented by you. The purpose is to provide the agent with private, operational information that may help improve bidding decisions. Each agent appends this information to the end of the shared observation vector.
This is done via the method get_individual_observations(unit, start, end).
3.3 Defining the Strategy Class and Constructor#
To enable learning, we define a custom class that extends TorchLearningStrategy and initializes key dimensions for the model:
[ ]:
class EnergyLearningSingleBidStrategy(TorchLearningStrategy, MinMaxStrategy):
    """
    A simple reinforcement learning bidding strategy.
    """

    def __init__(self, *args, **kwargs):
        # Forecast horizon (in timesteps) used for market and residual load forecasts
        foresight = kwargs.pop("foresight", 12)

        act_dim = kwargs.pop("act_dim", 1)  # One action: bid price
        unique_obs_dim = kwargs.pop("unique_obs_dim", 2)  # Number of individual obs

        super().__init__(
            foresight=foresight,
            act_dim=act_dim,
            unique_obs_dim=unique_obs_dim,
            *args,
            **kwargs,
        )
With your chosen foresight range, the global observations are assembled in the base-class method create_observation. Based on the chosen foresight, the observation-space dimension is calculated automatically as self.obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim, as defined in the base class. If you want to change this rationale, you need to override it in the learning strategy itself.
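As a quick sanity check, the dimension formula can be evaluated by hand. Assuming three global time series (residual load forecast, price forecast, and historical prices, as listed above) and the default values from the constructor, the observation vector has 38 entries:

# Illustration of the observation-dimension formula from the base class.
# num_timeseries_obs_dim = 3 is an assumption based on the three global
# time series described above.
num_timeseries_obs_dim = 3
foresight = 12       # forecast horizon chosen in the constructor
unique_obs_dim = 2   # number of individual observations

obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim
print(obs_dim)  # 3 * 12 + 2 = 38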
3.4 Exercise 1: Define Individual Observations#
Now you will implement the following method:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_individual_observations(self, unit, start, end):
        """
        Define custom unit-specific observations for the RL agent.

        Parameters
        ----------
        unit : SupportsMinMax
            The unit representing the power plant.
        start : datetime.datetime
            Start time of the market product.
        end : datetime.datetime
            End time of the market product.

        Returns
        -------
        np.ndarray
            Normalized 1D array of individual observations.
        """
        # Your implementation here
This method must return a NumPy array of length unique_obs_dim.
What Should Be in an Individual Observation?
The key principle is to include values that are:
Known only to the unit itself.
Relevant for market bidding.
Reflective of the unit’s technical or economic constraints.
Here are some good candidate features and how to compute them using ASSUME:
| Feature | Description | Access via |
|---|---|---|
| Current output | How much power the unit is currently producing | unit.get_output_before(start) |
| Marginal cost | Cost to produce current output | unit.calculate_marginal_cost(start, current_volume) |
| Max capacity | Upper generation limit | unit.max_power |
| Max bid price | Maximum price at market | self.max_bid_price |
| Start-up/shut-down state | May be encoded in dispatch history | infer from the unit's dispatch history |
| Ramp limit | Maximum change in output allowed | ramping attributes of the unit |
| Efficiency or fuel cost factors | If applicable | custom attributes per unit model |
Solution#
[ ]:
# Solution Exercise 1: Define Individual Observations

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_individual_observations(self, unit, start, end):
        # --- Current volume & marginal cost ---
        current_volume = unit.get_output_before(start)
        current_costs = unit.calculate_marginal_cost(start, current_volume)

        scaled_total_dispatch = current_volume / unit.max_power
        scaled_marginal_cost = current_costs / self.max_bid_price

        individual_observations = np.array(
            [scaled_total_dispatch, scaled_marginal_cost]
        )

        return individual_observations
3.5 Summary#
Observations in ASSUME combine shared global forecasts and custom individual data.
The base class handles forecasted residual load and price, as well as historical price signals.
These observations directly affect agent behavior and learning convergence—thoughtful design matters.
In the next chapter, you will define how the agent selects actions based on its observations, and how exploration is introduced during initial training to populate the learning buffer.
4. Action Selection and Exploration#
Once an observation is formed, the next step is for the agent to decide how to act. In this context, the action determines the bid price submitted by the agent to the electricity market.
This chapter focuses on how actions are derived from the agent’s policy and how exploration is handled—especially during the early training phase when experience is sparse.
4.1 Action Selection in RL#
In RL, the policy defines the agent’s behavior: it maps observations to actions. In the actor-critic architecture used by ASSUME, this policy is represented by the actor neural network.
However, to enable exploration, especially in the early stages of training, agents must not always follow the policy exactly. They need to try out a variety of actions—even suboptimal ones—to collect diverse experiences and learn effectively.
This is done by adding noise to the actions suggested by the policy network.
Note: The exploration noise presented here is specific to the deterministic policy gradient algorithm used in ASSUME (MATD3). Other algorithms, such as PPO, use a different exploration mechanism.
4.2 Understanding get_actions()#
The method get_actions(next_observation) in TorchLearningStrategy defines how actions are computed in different modes of operation.
Here is a simplified overview of the logic:
def get_actions(self, next_observation):
    if self.learning_mode and not self.evaluation_mode:
        if self.collect_initial_experience_mode:
            # Initial exploration: use pure noise as action
            noise = self.action_noise.noise(...)
            curr_action = noise
        else:
            # Regular exploration: add noise to policy output
            curr_action = self.actor(next_observation).detach()
            noise = self.action_noise.noise(...)
            curr_action += noise
    else:
        # Evaluation or deterministic policy use
        curr_action = self.actor(next_observation).detach()
        noise = zeros_like(curr_action)

    return curr_action, noise
Modes of Operation:
learning_mode: Indicates that the agent is being trained (vs. used for evaluation).

evaluation_mode: Disables noise; used to assess the performance of a learned policy.

collect_initial_experience_mode: Special sub-phase during early episodes where we rely heavily on randomized exploration to populate the replay buffer with diverse samples.
4.3 What Is Initial Experience Collection Mode?#
The initial experience collection mode refers to the first N episodes of training where agents fill their learning buffers purely through exploration. No learned policy is used at this stage.
The purpose is to:
Cover a broad region of the action space.
Enable agents to observe the outcome of many different bidding decisions.
By default, the action in this mode is pure noise, sampled from a Gaussian distribution.
4.4 Improving Exploration with Prior Knowledge#
While random actions help explore broadly, we can use economic and technical knowledge to make exploration more guided.
What would be a good starting point for a conventional generator? Exploring in a region around this value is far more productive than exploring arbitrarily.
Thus, instead of using random noise alone, we can shift the noisy action around a known good starting point so that exploration begins from a plausible economic baseline.
4.5 Exercise 2: Guided Exploration#
Your task is to modify the get_actions() method to implement a better initial exploration mechanism.
Objective:
During the collect_initial_experience_mode, instead of using pure noise, base the exploration around a known signal from the observation vector.
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_actions(self, next_observation):
        """
        Compute actions based on the current observation, optionally applying noise for exploration.

        Args
        ----
        next_observation : torch.Tensor
            The current observation, where the last element is assumed to be the marginal cost.

        Returns
        -------
        tuple of torch.Tensor
            - Action (with or without noise)
            - The applied noise
        """
        # Get the base action and associated noise from the parent implementation
        curr_action, noise = super().get_actions(next_observation)

        if self.learning_mode and not self.evaluation_mode:
            if self.collect_initial_experience_mode:
                # TODO: extract a relevant reference value from next_observation
                # TODO: shift the noisy action around this value
                pass  # replace this with your implementation

        return curr_action, noise
Solution#
[ ]:
# Solution Exercise 2: Improve Initial Exploration

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_actions(self, next_observation):
        # Get the base action and associated noise from the parent implementation
        curr_action, noise = super().get_actions(next_observation)

        if self.learning_mode and not self.evaluation_mode:
            if self.collect_initial_experience_mode:
                # Assumes the last dimension of the observation corresponds to marginal cost
                marginal_cost = next_observation[-1].detach()

                # Center the noisy action around the marginal cost
                curr_action += marginal_cost

        return curr_action, noise
This strategy anchors exploration to a meaningful economic quantity, improving the quality of early experiences and accelerating convergence.
4.6 Summary#
The get_actions() method controls how agents choose actions under different modes.

During training, actions include noise to enable exploration.
Initial exploration can be enhanced by guiding actions toward domain-relevant baselines (e.g., marginal cost).
You implemented a strategy to anchor exploration using part of the observation vector.
In the next chapter, we will transform the action values into actual bids by applying domain knowledge.
5. From Observation to Action to Bids#
In the previous chapters, we explored how an agent perceives its environment through observations and how it selects actions using its policy, optionally enriched with exploration noise. In this short chapter, we show how these two steps come together inside the calculate_bids() method.
5.1 The Role of calculate_bids()#
The calculate_bids() method defines how a market participant formulates its bid at each market interval. It brings together two crucial operations:
Generating Observations: Calls create_observation() to construct the full input vector (including both global and individual components).

Choosing an Action: Passes the observation to get_actions(), which invokes the actor network (and optionally adds noise) to return an action vector.
This forms the agent’s internal decision pipeline.
5.2 Action Normalization and Scaling#
The neural network policy outputs normalized actions—typically bounded in the range \([-1, 1]\). To convert these to meaningful bid prices, the raw action is scaled by a predefined constant:
bid_price = actions[0] * self.max_bid_price
For example, if self.max_bid_price = 100, the resulting bid prices will fall between \(-100\) and \(100\). This reflects a design choice that bounds the agent’s economic behavior in a defined domain.
By modifying max_bid_price in the learning config, you directly influence the economic aggressiveness of the policy.
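As a small numerical illustration (assuming max_bid_price = 100 as above), an actor output of 0.6 translates into a bid of 60 €/MWh, while an output of -1.0 maps to the lower bound of -100 €/MWh:

max_bid_price = 100.0
action = 0.6                        # raw actor output in [-1, 1]
bid_price = action * max_bid_price  # 60.0 €/MWh

print(bid_price)
print(-1.0 * max_bid_price)         # -100.0 €/MWh, the lower bound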
5.3 Bid Structure#
Each bid submitted to the market follows a defined structure, encapsulated as a dictionary:
{
"start_time": start,
"end_time": end,
"price": bid_price,
"volume": max_power,
"node": unit.node,
}
Key aspects:
price: Determined from the scaled output of the policy.
volume: Set to the full technical capacity of the unit.
node: Locational identifier (used for zonal/nodal pricing and congestion modeling).
Note that max_power is positive, as this strategy models a generator offering energy. For a consumer or demand bid, the volume would be negative to reflect load withdrawal.
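For comparison, a hypothetical demand-side bid could look as follows. The structure is identical; only the sign of the volume changes. All values, including the timestamps and the node name, are illustrative and not taken from the tutorial scenario.

from datetime import datetime

# Illustrative demand-side bid: negative volume represents load withdrawal.
demand_bid = {
    "start_time": datetime(2019, 3, 1, 12, 0),
    "end_time": datetime(2019, 3, 1, 13, 0),
    "price": 75.0,     # maximum willingness to pay in €/MWh
    "volume": -200.0,  # consumption of 200 MW
    "node": "north",   # illustrative locational identifier
}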
5.4 Controlling Action Dimensions#
By changing the act_dim in the strategy constructor, you can control the number of outputs returned by the actor network:
act_dim = kwargs.pop("act_dim", 1)
This allows for richer bidding logic. For instance:
1 action: Bid price for total capacity.
2 actions: Bid prices for flexible vs. inflexible portions.
3 actions: Add directionality or reserve offers.
However, it is important to note that RL performance deteriorates with high-dimensional action spaces, especially in continuous domains.
If you decide to increase act_dim, ensure that your calculate_bids() method is updated accordingly to interpret and transform all action elements correctly.
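As a hedged sketch of what a two-dimensional action space could look like, the helper below scales two normalized actions into separate prices for an inflexible (must-run) portion and a flexible portion of the unit's capacity. The split between min_power and max_power, as well as the function name, are assumptions for illustration; a real implementation would live inside calculate_bids() and follow your unit model.

def actions_to_two_bids(actions, max_bid_price, min_power, max_power, start, end, node):
    # Scale both normalized actions from [-1, 1] to bid prices.
    inflex_price = actions[0] * max_bid_price  # price for the must-run portion
    flex_price = actions[1] * max_bid_price    # price for the flexible portion

    return [
        {"start_time": start, "end_time": end, "price": inflex_price,
         "volume": min_power, "node": node},
        {"start_time": start, "end_time": end, "price": flex_price,
         "volume": max_power - min_power, "node": node},
    ]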
5.5 Full Code Implementation#
Here is the complete calculate_bids() implementation:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_bids(self, unit, market_config, product_tuples, **kwargs):
        start = product_tuples[0][0]
        end = product_tuples[0][1]

        # get technical bounds for the unit output from the unit
        _, max_power = unit.calculate_min_max_power(start, end)
        max_power = max_power[0]

        # =============================================================================
        # 1. Get the observations, which are the basis of the action decision
        # =============================================================================
        next_observation = self.create_observation(
            unit=unit, market_id=market_config.market_id, start=start, end=end
        )

        # =============================================================================
        # 2. Get the actions, based on the observations
        # =============================================================================
        actions, noise = self.get_actions(next_observation)

        # =============================================================================
        # 3. Transform actions into bids
        # =============================================================================
        # actions are in the range [-1,1], we need to transform them into actual bids
        # we can use our domain knowledge to guide the bid formulation
        bid_price = actions[0] * self.max_bid_price

        # actually formulate bids in orderbook format
        bids = [
            {
                "start_time": start,
                "end_time": end,
                "price": bid_price,
                "volume": max_power,
                "node": unit.node,
            },
        ]

        if self.learning_mode:
            self.learning_role.add_actions_to_cache(self.unit_id, start, actions, noise)

        return bids
In the next chapter, we will define how to compute the reward associated with each bid outcome, which completes the agent’s learning cycle.
6. Reward Function Design#
The reward function is the central learning signal in any RL environment. It defines the objective the agent is trying to maximize and serves as the only feedback mechanism from the environment to the agent.
In market-based simulations such as ASSUME, designing the reward function is a delicate balance between:
Capturing realistic economic goals (e.g., profit maximization),
Enabling learning stability and convergence, and
Leaving room for the agent to discover unexpected, valid strategies.
It’s tempting to hard-code your preferred behavior into the reward function. However, this often leads to agents that are overly adapted to a specific scenario and perform poorly in general.
6.1 When Is the Reward Computed?#
In ASSUME, the reward is computed after the market clears, in the calculate_reward() method. At this point, the agent receives information about:
Which portion of its bid was accepted,
At what price,
And what operational costs it incurred.
This allows us to calculate realized profit, which is the most direct economic reward signal.
6.2 Exercise 3: Implement Profit-Based Reward#
Your first task is to implement a profit-based reward. This is mandatory.
Use the following simplified formula:

\[
\text{Profit}_{i,t} = P^{\text{conf}}_{i,t} \cdot \left( M_t - mc_{i,t} \right) \cdot dt
\]

Where:
\(P^\text{conf}\): Confirmed volume (accepted by market),
\(M_t\): Market clearing price,
\(mc_{i,t}\): Marginal generation cost,
\(dt\): Time resolution in hours.
You can access these quantities via:
accepted_volume = order["accepted_volume"]
market_clearing_price = order["accepted_price"]
marginal_cost = unit.calculate_marginal_cost(start, unit.outputs[marketconfig.product_type].at[start])
Use the duration in hours:
duration = (end - start) / timedelta(hours=1)
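As a worked example with illustrative numbers, a unit that sells 500 MW for one hour at a clearing price of 60 €/MWh with marginal costs of 40 €/MWh earns a profit of 10,000 €:

from datetime import datetime, timedelta

accepted_volume = 500.0        # MW confirmed by the market
market_clearing_price = 60.0   # €/MWh
marginal_cost = 40.0           # €/MWh

start = datetime(2019, 3, 1, 12, 0)
end = datetime(2019, 3, 1, 13, 0)
duration = (end - start) / timedelta(hours=1)  # 1.0 hour

profit = accepted_volume * (market_clearing_price - marginal_cost) * duration
print(profit)  # 10000.0 €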
6.4 Exercise 3 (optional): Thinking Beyond Profit#
While profit is a good starting point, agents trained solely on profit may struggle in competitive environments or when there is limited dispatch. In real-world operations, generators also consider missed opportunities—what could have been earned but wasn’t due to poor bidding or conservative behavior.
What other signal could guide the agent to bid more strategically?
What do real power plants look at when evaluating their bidding success—even when they were not dispatched?
Use your economic intuition or power system experience to answer this.
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_reward(self, unit, marketconfig, orderbook):
        """
        Reward function: implement profit and (optionally) opportunity cost.

        Instructions:
        - Fill in the lines marked as YOUR CODE.
        - Compute profit as the primary reward signal.
        - Optionally define the opportunity cost as a regret term.
        """
        start = orderbook[0]["start_time"]
        end = orderbook[0]["end_time"]
        duration = (end - start) / timedelta(hours=1)
        end_excl = end - unit.index.freq

        order = orderbook[0]

        market_clearing_price = None  # YOUR CODE HERE
        accepted_volume = None  # YOUR CODE HERE

        marginal_cost = unit.calculate_marginal_cost(
            start, unit.outputs[marketconfig.product_type].at[start]
        )

        # === Required: compute profit ===
        order_income = None  # YOUR CODE HERE
        order_cost = None  # YOUR CODE HERE
        order_profit = None  # YOUR CODE HERE

        # === Optional: compute opportunity cost ===
        opportunity_cost = None  # YOUR CODE HERE
        regret_scale = 0.1 if accepted_volume > unit.min_power else 0.5

        # === Normalize reward to ~[-1, 1] ===
        scaling = 1 / (self.max_bid_price * unit.max_power)
        reward = scaling * (order_profit - regret_scale * opportunity_cost)
        regret = regret_scale * opportunity_cost

        # Store results in unit outputs
        # Note: these are not learning-specific results but stored for all units for analysis
        unit.outputs["profit"].loc[start:end_excl] += order_profit
        unit.outputs["total_costs"].loc[start:end_excl] += order_cost

        # write rl-rewards to buffer
        if self.learning_mode:
            self.learning_role.add_reward_to_cache(
                unit.id, start, reward, regret, order_profit
            )
💡Hint Optional Extension: Opportunity Cost
The concept of opportunity cost captures the lost profit from unused capacity. If the market price exceeds marginal cost and the unit wasn’t dispatched fully, that represents a missed opportunity.
This can be used as a regret term to penalize under-utilization of profitable bids.
Mathematically:

\[
C^{\text{opp}}_{i,t} = \max\left( \left( M_t - mc_{i,t} \right) \cdot \left( P^{\max}_i - P^{\text{conf}}_{i,t} \right) \cdot dt,\; 0 \right)
\]

where \(P^{\max}_i\) is the unit's maximum capacity and the remaining symbols are defined as in the profit formula above.
A good reward function combines profit and opportunity cost, allowing agents to learn from both actual performance and missed potential.
6.6 Reward Scaling and Learning Stability#
Scaling the reward to a narrow and consistent range is crucial for stable RL. This is particularly important in continuous-action settings like bidding, where one overly large reward spike can skew the policy updates significantly.
1. Why scale?
Stabilizes gradients during actor-critic training.
Makes different time steps comparable in magnitude.
Prevents the agent from overfitting to rare but extreme events.
2. What can go wrong?
If your scaling factor is too small:
Rewards become indistinguishable from noise.
If your scaling factor is too large:
A single high-reward event (e.g., bidding into a rare price spike) can dominate learning, making the agent try to reproduce that event rather than learn a general policy.
Tip: Use conservative scaling based on maximum realistic bid × capacity:
scaling = 1 / (self.max_bid_price * unit.max_power)
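To get a feeling for the magnitudes involved, assume max_bid_price = 100 €/MWh and max_power = 1000 MW (illustrative values). The scaling factor is then 1e-5, and an hourly profit of 50,000 € maps to a reward of 0.5, comfortably inside the target range:

max_bid_price = 100.0   # €/MWh (illustrative)
max_power = 1000.0      # MW (illustrative)
scaling = 1 / (max_bid_price * max_power)  # 1e-05

order_profit = 50_000.0  # € earned in one hour (illustrative)
reward = scaling * order_profit
print(reward)  # 0.5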
3. Recommended Practice
Before committing to training:
Plot the distribution of rewards across time steps for a few sample runs.
Check for outliers, saturation, or skewness.
If needed, adjust scaling or cap outliers in reward postprocessing.
This diagnostic step can save hours of failed training runs.
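A minimal sketch of such a diagnostic is shown below. It assumes a (short) training run has already written results to the local database; the table and column names follow the queries used later in this tutorial, and the simulation and unit names are examples.

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("sqlite:///local_db/assume_db.db")

# Load the rewards of one unit and plot their distribution.
rewards = pd.read_sql(
    "SELECT reward FROM rl_params "
    "WHERE simulation = 'example_02a_base' AND unit = 'pp_6'",
    engine,
)

rewards["reward"].hist(bins=50)
plt.xlabel("Reward")
plt.ylabel("Frequency")
plt.title("Reward distribution (check for outliers and saturation)")
plt.show()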
Solution#
[ ]:
# Solution Exercise 3: Implement Reward Function

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_reward(
        self,
        unit,
        marketconfig,
        orderbook,
    ):
        """
        Calculates the reward for the unit based on profits, costs, and opportunity costs from market transactions.

        The reward is computed by combining the following:
        - **Profit**: Income from accepted bids minus marginal and start-up costs.
        - **Opportunity Cost**: Penalty for underutilizing capacity, calculated as potential lost income.
        - **Regret Term**: A scaled regret term penalizes high opportunity costs to guide effective bidding.

        The reward is scaled and stored along with other outputs in the unit’s data to support learning.
        """
        start = orderbook[0]["start_time"]
        end = orderbook[0]["end_time"]
        duration = (end - start) / timedelta(hours=1)

        # `end_excl` marks the last product's start time by subtracting one frequency interval.
        end_excl = end - unit.index.freq

        order = orderbook[0]  # Assuming a single order for simplicity
        market_clearing_price = order["accepted_price"]
        accepted_volume = order["accepted_volume"]

        # Depending on how the unit calculates marginal costs, retrieve cost values.
        marginal_cost = unit.calculate_marginal_cost(
            start, unit.outputs[marketconfig.product_type].at[start]
        )

        # Calculate profit as income minus operational cost for this event.
        order_income = market_clearing_price * accepted_volume * duration
        order_cost = marginal_cost * accepted_volume * duration

        # Accumulate income and operational cost for all orders.
        order_profit = order_income - order_cost

        # Opportunity cost: The income lost due to not operating at full capacity.
        opportunity_cost = (
            (market_clearing_price - marginal_cost)
            * (unit.max_power - accepted_volume)
            * duration
        )

        # If opportunity cost is negative, no income was lost, so we set it to zero.
        opportunity_cost = max(opportunity_cost, 0)

        # Dynamic regret scaling:
        # - If accepted volume is positive, apply lower regret (0.1) to avoid punishment for being on the edge of the merit order.
        # - If no dispatch happens, apply higher regret (0.5) to discourage idle behavior, if it could have been profitable.
        regret_scale = 0.1 if accepted_volume > unit.min_power else 0.5

        # --------------------
        # 4.1 Calculate Reward
        # Instead of directly setting reward = profit, we incorporate a regret term (opportunity cost penalty).
        # This guides the agent toward strategies that maximize accepted bids while minimizing lost opportunities.

        # scaling factor to normalize the reward to the range [-1,1]
        scaling = 1 / (self.max_bid_price * unit.max_power)
        reward = scaling * (order_profit - regret_scale * opportunity_cost)
        regret = regret_scale * opportunity_cost

        # Store results in unit outputs
        # Note: these are not learning-specific results but stored for all units for analysis
        unit.outputs["profit"].loc[start:end_excl] += order_profit
        unit.outputs["total_costs"].loc[start:end_excl] += order_cost

        # write rl-rewards to buffer
        if self.learning_mode:
            self.learning_role.add_reward_to_cache(
                unit.id, start, reward, regret, order_profit
            )
6.7 Summary#
The reward function is the core signal guiding agent learning—design it carefully.
Start with profit as the primary reward.
Consider adding opportunity cost as a regret penalty to improve bidding behavior.
Always normalize your reward to maintain training stability.
Analyze your reward distribution empirically before training large-scale agents.
In the next chapter, we will bring together all the components—observation, action, and reward—and simulate a full training run using your custom learning strategy.
7. Training and Evaluating Your First Learning Agent#
You have now implemented all essential components of a learning bidding strategy in ASSUME:
Observations
Actions and exploration
Reward function
In this chapter, you will connect your strategy to a simulation scenario, configure the learning algorithm, and evaluate the agent’s training progress.
7.1 Load and Inspect the Learning Configuration#
Each simulation scenario in ASSUME has an associated YAML configuration file. This file contains the learning configuration, which determines how the RL algorithm is executed.
[ ]:
scenario = "base"
# Read the YAML file
with open(f"{inputs_path}/example_02a/config.yaml") as file:
config = yaml.safe_load(file)
# Print the learning config
print(f"Learning config for scenario '{scenario}':")
display(config[scenario]["learning_config"])
Explanation of Learning Configuration Parameters
| Parameter | Description |
|---|---|
| learning_mode | If true, the simulation runs in training mode and agents update their policies. |
| continue_learning | If true, training continues from previously saved policies instead of starting from scratch. |
| trained_policies_save_path | File path where trained policies will be saved. |
| trained_policies_load_path | Path to pre-trained policies to load. |
| max_bid_price | Used to scale action outputs to economic bid prices. |
| algorithm | Learning algorithm used (e.g., matd3). |
| learning_rate | Step size for policy and critic updates. |
| training_episodes | Number of simulation episodes (repetitions of the time horizon) used for training. |
| episodes_collecting_initial_experience | Number of episodes during which agents collect experience using guided exploration. |
| train_freq | Time between training updates (e.g., every 24 simulated hours). |
| gradient_steps | Number of gradient descent steps per update. |
| batch_size | Size of experience batch used for training. |
| gamma | Discount factor for future rewards (\(0 < \gamma \leq 1\)). |
| device | Compute device used for training (CPU or GPU). |
| action_noise_schedule | How the action noise evolves over time (e.g., a decay schedule). |
| noise_sigma | Standard deviation of exploration noise. |
| noise_scale | Global multiplier for noise. |
| noise_dt | Discretization interval for noise time series. |
| validation_episodes_interval | How often (in episodes) to evaluate the current policy without exploration. |
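For orientation, a learning configuration with the parameters above could look roughly like the following Python dictionary (mirroring the YAML structure). All values are illustrative placeholders, not the settings of the tutorial scenario; always check the scenario's config.yaml for the values actually used.

# Illustrative learning_config; every value here is an example only.
example_learning_config = {
    "learning_mode": True,
    "continue_learning": False,
    "trained_policies_save_path": "learned_strategies/my_scenario",  # illustrative path
    "max_bid_price": 100,
    "algorithm": "matd3",
    "learning_rate": 1e-3,
    "training_episodes": 50,
    "episodes_collecting_initial_experience": 5,
    "train_freq": "24h",
    "gradient_steps": 1,
    "batch_size": 256,
    "gamma": 0.99,
    "device": "cpu",
    "noise_sigma": 0.1,
    "noise_scale": 1.0,
    "noise_dt": 1.0,
    "validation_episodes_interval": 5,
}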
7.2 Run the Simulation and Train the Agent#
The simulation environment and learning strategy are connected and executed as follows:
Hint: In Google Colab, long-running training sessions may occasionally crash or disconnect if the output console is flooded — for example, by verbose progress bars or print statements. To prevent this, you can suppress output during training using the following approach.

Import the required tools:

from contextlib import redirect_stdout, redirect_stderr
import os

Wrap the training phase with output redirection. Insert the following lines just before Step 4: Run the training phase:

# Suppress output for the entire training process
with open(os.devnull, 'w') as devnull:
    with redirect_stdout(devnull), redirect_stderr(devnull):
        # Your training function call goes here
        train_agents(...)

✅ This redirects all stdout and stderr to /dev/null, preventing Colab from being overwhelmed by output and improving session stability.
[ ]:
log = logging.getLogger(__name__)

csv_path = "outputs"
os.makedirs("local_db", exist_ok=True)

np.random.seed(42)  # Set a random seed for reproducibility

if __name__ == "__main__":
    db_uri = "sqlite:///local_db/assume_db.db"

    scenario = "example_02a"
    study_case = "base"

    # 1. Create simulation world
    world = World(database_uri=db_uri, export_csv_path=csv_path)

    # 2. Register your learning strategy
    world.bidding_strategies["pp_learning"] = EnergyLearningSingleBidStrategy

    # 3. Load scenario and case
    load_scenario_folder(
        world,
        inputs_path=inputs_path,
        scenario=scenario,
        study_case=study_case,
    )

    # 4. Run the training phase
    if world.learning_mode:
        run_learning(world)

    # 5. Execute final evaluation run (no exploration)
    world.run()
This script will:
Train the agent using your defined strategy.
Periodically evaluate the agent using a noise-free policy.
Save training data into the database for post-analysis.
7.3 Analyze Learning Performance#
Once training is complete, we can evaluate the learning progress of your RL agent using data from the simulation database. ASSUME stores detailed training metrics in the rl_params table, which includes rewards for each time step, grouped by episode, unit, and whether the agent was in evaluation mode.
In this case, we are interested in the performance of a specific generator: ``pp_6``, within the simulation ``example_02a_base``.
We’ll extract the recorded rewards for this unit, group them by episode, and plot the average reward over time for both training and evaluation phases.
Instead of accessing the training results via the database, you can also use the TensorBoard integration, which can be launched from the console:
tensorboard --logdir tensorboard
[ ]:
# Connect to the simulation database
engine = create_engine("sqlite:///local_db/assume_db.db")
# Query rewards for specific simulation and unit
sql = """
SELECT
datetime,
unit,
reward,
simulation,
evaluation_mode,
episode
FROM rl_params
WHERE simulation = 'example_02a_base'
AND unit = 'pp_6'
ORDER BY datetime
"""
# Load query results
rewards_df = pd.read_sql(sql, engine)
# Rename column for consistency
rewards_df.rename(columns={"evaluation_mode": "evaluation"}, inplace=True)
# --- Separate plots for training and evaluation ---
fig, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=False)
# Plot training rewards (evaluation == 0)
train_df = rewards_df[rewards_df["evaluation"] == 0]
train_grouped = train_df.groupby("episode")["reward"].mean()
axes[0].plot(train_grouped.index, train_grouped.values, color="tab:blue")
axes[0].set_title("Training Reward per Episode (Unit: pp_6)")
axes[0].set_ylabel("Average Reward")
axes[0].grid(True)
# Plot evaluation rewards (evaluation == 1)
eval_df = rewards_df[rewards_df["evaluation"] == 1]
eval_grouped = eval_df.groupby("episode")["reward"].mean()
axes[1].plot(eval_grouped.index, eval_grouped.values, color="tab:green")
axes[1].set_title("Evaluation Reward per Episode (Unit: pp_6)")
axes[1].set_xlabel("Episode")
axes[1].set_ylabel("Average Reward")
axes[1].grid(True)
plt.tight_layout()
plt.show()
What This Shows
Training curve: Captures learning progress with exploration noise.
Evaluation curve: Tracks the performance of the evaluation/validation run without noise, which is performed every validation_episodes_interval episodes, as defined in the learning_config.
This plot provides insight into:
How well the agent is improving over time.
Whether learning has converged or stagnated.
7.4 Summary#
You have now run your first complete training loop in ASSUME.
The learning configuration defines all key training parameters—review them carefully.
After training, rewards from rl_params allow you to inspect and validate agent behavior.

The separation of training and evaluation rewards is key to understanding generalization.
In the next chapter, you may proceed to analyze simulation outcomes in greater detail (e.g., market prices, total costs, capacity dispatch), or compare different agent configurations.
8. Analyzing Strategic Bidding Behavior#
Now that your agent has completed training, we shift our focus to a critical and more insightful question:
What did the agent actually learn?
This chapter analyzes the actual bids submitted by the agent and evaluates whether the agent developed a strategic bidding behavior—especially in the context of market power.
8.1. Background: Market Setup#
This simulation is based on example case 1 from the following study:
[1] Harder, N.; Qussous, R.; Weidlich, A. Fit for purpose: Modeling wholesale electricity markets realistically with multi-agent deep reinforcement learning. Energy and AI, 2023, 14:100295. https://doi.org/10.1016/j.egyai.2023.100295
In this case:
The market contains one large RL agent: pp_6.

The agent has enough capacity to influence the market clearing price.

It is allowed to bid freely to maximize its own reward (profit, adjusted by regret).
Marginal Cost Structure:
| Unit | Marginal Cost (€/MWh) |
|---|---|
| pp_6 | 55.7 |
| Next unit | 85.7 |
A profit-maximizing agent with market power would learn to bid just below the next most expensive unit—in this case, somewhere just below 85.7 €/MWh.
8.2. Extract and Plot the Agent’s Bids#
We will extract the bids submitted by pp_6 from the market_orders table and plot them over time.
[ ]:
# Connect to database
engine = create_engine("sqlite:///local_db/assume_db.db")
# Query bids from pp_6 in simulation example_02a_base and market EOM
sql = """
SELECT
start_time AS time,
price,
accepted_price,
unit_id,
simulation
FROM market_orders
WHERE simulation = 'example_02a_base'
AND unit_id = 'pp_6'
AND market_id = 'EOM'
ORDER BY start_time
"""
# Load results into DataFrame
bids_df = pd.read_sql(sql, engine)
bids_df["time"] = pd.to_datetime(bids_df["time"])
# Define marginal cost boundaries
mc_pp6 = 55.7
mc_next = 85.7
plt.figure(figsize=(14, 6))
plt.plot(bids_df["time"], bids_df["price"], label="pp_6 Bid Price", color="tab:blue")
# Reference lines for marginal cost and competitive threshold
plt.axhline(
    mc_pp6,
    color="gray",
    linestyle="--",
    linewidth=2,
    label="pp_6 Marginal Cost (55.7 €)",
)
plt.axhline(
    mc_next,
    color="red",
    linestyle="--",
    linewidth=2,
    label="Next Unit's Marginal Cost (85.7 €)",
)

plt.plot(
    bids_df["time"],
    bids_df["accepted_price"],
    label="Accepted Price",
    color="tab:orange",
)
plt.title("Bidding Behavior of RL Agent (pp_6)")
plt.xlabel("Time")
plt.ylabel("Bid Price (€/MWh)")
plt.legend()
plt.ylim(30, 100)
plt.grid(True)
plt.tight_layout()
plt.show()
8.4. What Does This Show?#
The plot typically reveals:
The agent almost never bids at its own marginal cost.
Instead, its bid prices cluster below 85.7 €/MWh, indicating that it has learned to:
Underbid the next unit to secure dispatch.
Exploit its market position to maximize profit rather than behave as a price-taker.
This is consistent with strategic bidding behavior in oligopolistic market settings.
This outcome aligns with the findings from [1], confirming that deep RL agents can learn to exercise market power when not explicitly restricted.
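To go beyond visual inspection, a quick follow-up check can quantify this clustering using the bids_df DataFrame queried above. The 5 €/MWh band width below is an arbitrary choice for illustration:

# Share of bids placed just below the next unit's marginal cost,
# and share of bids above pp_6's own marginal cost.
band_low, band_high = mc_next - 5.0, mc_next
share_in_band = bids_df["price"].between(band_low, band_high).mean() * 100
share_above_own_mc = (bids_df["price"] > mc_pp6).mean() * 100

print(f"Bids within [{band_low:.1f}, {band_high:.1f}] €/MWh: {share_in_band:.1f}%")
print(f"Bids above pp_6's own marginal cost: {share_above_own_mc:.1f}%")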
8.5. Summary#
The RL agent did not simply mimic marginal cost bidding—it learned to optimize strategically.
The bid curve confirms that market power was exercised by bidding just under the next marginal unit.
This is a core feature of realistic market modeling, and shows the value of RL in economic simulations.
9. Summary and Outlook#
9.1. What You Built#
Over the course of this tutorial, you developed a complete RL bidding strategy for an electricity market agent in the ASSUME framework. You constructed and trained a fully functional learning agent that can:
Observe the market and its own internal state.
Make strategic bidding decisions based on learned policy.
Receive reward signals and adapt its behavior accordingly.
Exploit market dynamics, including market power, when permitted.
9.2 What You Learned#
Throughout the tutorial, you explored the full learning pipeline in a realistic electricity market context:
How to construct observations from both global forecasts and unit-specific state.
How to define actions and handle exploration, including guided exploration around meaningful economic baselines.
How to design and normalize a reward function that balances realized profit with opportunity cost.
How to run a simulation using multi-agent DRL and analyze its outcomes.
How to evaluate bidding behavior and interpret economic strategies emerging from the agent’s learning process.
9.3 What You Can Try Next#
Your implementation is modular and extensible. Here are several directions you can explore on your own:
Adjust Learning Parameters
Experiment with:
learning_rate, gamma, noise_sigma, episodes_collecting_initial_experience, validation_episodes_interval, train_freq, or gradient_steps
Observe how these changes affect convergence, stability, and bidding behavior.
Try Different Scenarios
Run ``example_02b`` or ``example_02c``:

02b: Introduces moderate competition with several learning agents.

02c: Contains many learning agents, simulating a highly competitive environment.
Compare bidding behavior and reward dynamics across settings.
Dive into Other Tutorials
If you are interested in the underlying multi-agent RL algorithm and how it is integrated into ASSUME, look into 04a_RL_algorithm_example.
In this small example we could see what a good bidding behavior of the agent might look like and, hence, judge learning easily. But what if we model many agents in new simulations? We provide explainable RL mechanisms in another tutorial for you to dive into: 09_example_Sim_and_xRL.