Note
You can download this example as a Jupyter notebook or try it out directly in Google Colab.
4.2 Designing Adaptive Bidding Strategies in ASSUME using Reinforcement Learning#
Introduction
This tutorial introduces the integration of reinforcement learning (RL) into the ASSUME simulation framework, with a focus on developing and deploying learning-based bidding strategies for electricity market participants.
The tutorial is designed to walk you through the essential components required to transform a conventional market participant into an RL agent. Rather than concentrating on the underlying algorithmic infrastructure—such as training loops, buffers, or learning roles—this tutorial emphasizes how to define a bidding strategy that interfaces with the ASSUME learning backend. You will learn how to construct observation spaces, define action mappings, and design reward functions that guide agent behavior.
Each core concept is addressed in a dedicated chapter, accompanied by exercises that allow you to apply the material directly. These hands-on tasks culminate in a final integration chapter where you will run a complete simulation and train your first learning agent.
Tutorial Structure
The tutorial is divided into the following chapters:
Get ASSUME Running
Instructions for installing ASSUME and preparing your environment, whether locally or in Google Colab.
ASSUME & Learning Basics
A conceptual overview of RL within the ASSUME framework, including actor-critic architectures, centralized training, and multi-agent design principles.
Defining the Observation Space
Explanation and coding tasks for constructing shared and individual observations used by agents to make decisions.
Action Selection and Exploration
Retrieving the agent's actions based on the observed environment, and why it is important to explore beyond the policy's output values.
From Observation to Action to Bids
How to convert actor network outputs into economically meaningful bid prices and apply exploration during training.
Reward Function Design
Techniques for shaping agent behavior using profit- and regret-based reward signals. Includes a task to define your own reward logic.
Training and Evaluating Your First Learning Agent
Integration of the previously implemented components into a complete simulation run, demonstrating end-to-end learning behavior in a market setting.
Analyzing Strategic Bidding Behavior
Investigate how an RL agent exploits its market power by learning strategic bidding behavior, as an example of more realistic market simulations.
Summary and Outlook
Wraps up the contents of this tutorial and offers further ideas for which components of the learning process can be tweaked.
Learning Outcomes
By completing this tutorial, you will be able to:
Implement RL-compatible bidding strategies within the ASSUME framework.
Define observation inputs for learning agents.
Map actor outputs to valid market actions and manage exploration.
Construct reward functions that combine economic incentives with strategic signals.
Train and evaluate a basic RL agent in a multi-agent electricity market simulation.
1. Get ASSUME Running#
This chapter walks you through setting up the ASSUME framework in your environment and preparing the required input files. At the end, you will verify that the installation was successful and that the framework is ready for use.
1.1 Installation#
In Google Colab#
Google Colab already includes most scientific computing libraries (e.g., numpy, torch). You only need to install the ASSUME core framework:
[ ]:
# Only run this cell if you are using Google Colab
import importlib.util
IN_COLAB = importlib.util.find_spec("google.colab") is not None
if IN_COLAB:
    !pip install assume-framework
Note: After installation, Colab may prompt you to restart the session due to dependency changes. To do so, click “Runtime” → “Restart session…” in the menu bar, then re-run the cells above.
On Your Local Machine#
To install ASSUME with all learning-related dependencies, run the following in your terminal:
pip install 'assume-framework[learning]'
This will install the simulation framework and the packages required for RL.
1.2 Repository Setup#
To access predefined simulation scenarios, clone the ASSUME repository (Colab only):
[ ]:
# Only run this cell if you are using Google Colab
if IN_COLAB:
    !git clone --depth=1 https://github.com/assume-framework/assume.git assume-repo
Local users may skip this step if input files are already available in the project directory.
1.3 Input Path Configuration#
We define the path to input files depending on whether you’re in Colab or working locally. This variable will be used to load configuration and scenario files throughout the tutorial.
[ ]:
colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"
inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path
1.4 Installation Check#
Use the following cell to ensure the installation was successful and that essential components are available. This test ensures that the simulation engine and RL strategy base class are accessible before continuing.
[ ]:
try:
    from assume import World
    from assume.strategies.learning_strategies import TorchLearningStrategy

    print("✅ ASSUME framework is installed and functional.")
except ImportError as e:
    print("❌ Failed to import essential components:", e)
    print(
        "Please review the installation instructions and ensure all dependencies are installed."
    )
1.5 Limitations in Colab#
Colab does not support Docker, so dashboard visualizations included in some ASSUME workflows will not be available. However, simulation runs and RL training can still be fully executed.
In Colab: Training and basic plotting are supported.
In Local environments with Docker: Full access, including dashboards.
1.6 Core Imports#
In this section, we import the core modules that will be used throughout the tutorial. Each import is explained to clarify its role.
[ ]:
# Standard Python modules
import logging  # For logging messages during simulation and debugging
import os  # For operating system interactions
from datetime import timedelta  # To handle market time resolutions (e.g., hourly steps)

import matplotlib.pyplot as plt

# Scientific and data processing libraries
import numpy as np  # Numerical operations and array handling
import pandas as pd  # Data manipulation and analysis
import yaml  # Parsing YAML configuration files

# Database and visualization libraries
from sqlalchemy import create_engine

# ASSUME framework components
from assume import World  # Core simulation container that manages markets and agents
from assume.scenario.loader_csv import (  # Functions to load and execute scenarios
    load_scenario_folder,
    run_learning,
)
from assume.strategies.learning_strategies import (
    MinMaxStrategy,  # Abstract class for powerplant-like strategies
    TorchLearningStrategy,  # Abstract base for RL bidding strategies
)
These imports are used for:
Defining RL bidding strategies.
Managing input/output data.
Executing and analyzing simulations.
At this point, you are ready to begin building your RL bidding agent. In the next chapter, we will define how agents perceive the market by constructing their observation vectors.
2. ASSUME & Learning Basics#
2.1 The ASSUME Framework#
ASSUME is a simulation framework designed for researchers, utilities, and planners to model and understand market dynamics in electricity systems. It allows for agent-based modeling of market participants in a modular and configurable environment.
The core structure of the framework consists of:
Markets (on the left of the architecture diagram): Where electricity products are traded.
Market Participants / Units (on the right): Each agent represents a physical or virtual unit bidding into the market.
Orders: The main communication channel between units and markets.
Learning Agents: Highlighted in yellow in the architecture, these are agents using RL strategies.
The image below illustrates the high-level architecture of ASSUME. Focus on the yellow components—these are the parts involved in the learning process.
[ ]:
from pathlib import Path

from IPython.display import SVG, display

image_path = Path("assume-repo/docs/source/img/architecture.svg")
alt_image_path = Path("../../docs/source/img/architecture.svg")

if image_path.exists():
    display(SVG(image_path))
elif alt_image_path.exists():
    display(SVG(alt_image_path))
2.2 Introduction to Learning in ASSUME#
The current implementation of RL in ASSUME models electricity markets as partially observable Markov games, allowing multiple agents to operate under individual observations and reward structures.
If you are unfamiliar with RL, refer to the following links for background material:
Central Concepts:
Policy: The strategy used by an agent to select actions based on observations.
Actor-Critic Architecture: A method where the “actor” chooses actions and the “critic” evaluates them.
Learning Strategy: Defines how a unit transforms observations into bids using a trainable model.
Step Functions: The typical RL cycle of Observe → Act → Reward → Update is split across several methods in ASSUME, as described in Section 3.
2.3 Single-Agent RL#
In a single-agent setup, the agent attempts to maximize its reward over time by learning from interaction with the environment. It does so by making multiple steps in the environment. In RL, each interaction step includes:
Observation of the current state.
Action selection based on policy.
Reward from the environment.
Policy Update to improve behavior.
In ASSUME, this step cycle is modularized:
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | create_observation() / get_individual_observations() | Constructs the observation vector. |
| Step 2 | get_actions() and calculate_bids() | Maps observations to bid prices. |
| Step 3 | calculate_reward() | Computes the reward signal. |
| Step 4 | Handled by the learning role | Updates the model and manages the replay buffer. |
Actor-Critic Structure: To increase learning stability, actor-critic methods are commonly used. They divide the tasks as follows:
Actor: Learns a deterministic policy for choosing actions. Uses policy gradient methods to maximize expected reward.
Critic: Learns a value function using Temporal Difference (TD) learning. Provides feedback to the actor based on action quality.
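To make the actor's role more concrete, the following minimal sketch shows what such a network could look like in PyTorch. It is not the architecture used inside ASSUME (layer sizes and structure are assumptions); it only illustrates how an observation vector is mapped to actions bounded in \([-1, 1]\), matching the bid scaling used later in this tutorial.

import torch
import torch.nn as nn


class MinimalActor(nn.Module):
    """Illustrative actor: maps an observation vector to actions in [-1, 1].
    The actual ASSUME network may differ; this is only a conceptual sketch."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
            nn.Tanh(),  # bounds the output to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


# Example: a 38-dimensional observation mapped to a single bid-price action
actor = MinimalActor(obs_dim=38, act_dim=1)
print(actor(torch.zeros(38)))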
2.4 Multi-Agent RL#
Real-world electricity markets involve multiple agents acting simultaneously, which introduces interdependencies and non-stationarity. The latter refers to the fact that the continuous adaptation of other agents makes the environment change and therefore less predictable from the perspective of a single agent. As a result, multi-agent learning requires additional considerations.
Challenges:
Actions by one agent influence the environment experienced by others.
The state transitions and rewards become non-stationary.
Solution: Centralized Training with Decentralized Execution (CTDE)
To address these challenges, ASSUME employs the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) framework with CTDE:
Centralized Training: A critic with access to all agents’ states and actions is used during training to stabilize learning. Note that the critic is only used to update the actor network, so it is needed only during training.
Decentralized Execution: During simulation, the actual actor of each agent relies only on its own observations and learned policy.
Each agent trains two critic networks to mitigate overestimation bias, uses target noise for robustness, and relies on deterministic policy gradients for the actor update.
3. Defining the Observation Space#
In this chapter, you will define what information your RL agent perceives about the environment and itself at each decision point. This is a critical component of the agent’s behavior, as the observation vector forms the basis for all future actions and learning.
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | create_observation() / get_individual_observations() | Constructs the observation vector. |
3.2 Observation Structure in ASSUME#
Observations are composed of two parts:
1. Global Observations
These are shared across all agents and constructed by the base class method create_observation(). They include:
Forecasted residual load over the foresight horizon.
Forecasted market price over the foresight horizon.
Historical market price over a specified window.
These are normalized by maximum demand and maximum bid price for stability. These values are generated by a forecasting role and made available to all agents before each market cycle.
For this tutorial you do not need to modify this part. However, if you want to equip new unit types with learning or extend the simulation with new concepts, additional global information might be needed.
2. Individual Observations
These are unit-specific and must be implemented by you. The purpose is to provide the agent with private, operational information that may help improve bidding decisions. Each agent appends this information to the end of the shared observation vector.
This is done via the method get_individual_observations(unit, start, end).
3.3 Defining the Strategy Class and Constructor#
To enable learning, we define a custom class that extends TorchLearningStrategy and initializes key dimensions for the model:
[ ]:
class EnergyLearningSingleBidStrategy(TorchLearningStrategy, MinMaxStrategy):
    """
    A simple reinforcement learning bidding strategy.
    """

    def __init__(self, *args, **kwargs):
        # Forecast horizon (in timesteps) used for market and residual load forecasts
        foresight = kwargs.pop("foresight", 12)

        act_dim = kwargs.pop("act_dim", 1)  # One action: bid price
        unique_obs_dim = kwargs.pop("unique_obs_dim", 2)  # Number of individual obs

        super().__init__(
            foresight=foresight,
            act_dim=act_dim,
            unique_obs_dim=unique_obs_dim,
            *args,
            **kwargs,
        )
With your chosen foresight range, the global observations are assembled in the base-class method create_observation. Based on the chosen foresight, the observation-space dimension is calculated automatically as self.obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim, as defined in the base class. If you want to change this rationale, you need to override it in the learning strategy itself.
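As a quick sanity check, the dimension formula can be evaluated by hand. Assuming three global time series (residual load forecast, price forecast, and historical prices, as listed above) and the default values from the constructor, the observation vector has 38 entries:

# Illustration of the observation-dimension formula from the base class.
# num_timeseries_obs_dim = 3 is an assumption based on the three global
# time series described above.
num_timeseries_obs_dim = 3
foresight = 12       # forecast horizon chosen in the constructor
unique_obs_dim = 2   # number of individual observations

obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim
print(obs_dim)  # 3 * 12 + 2 = 38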
3.4 Exercise 1: Define Individual Observations#
Now you will implement the following method:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_individual_observations(self, unit, start, end):
        """
        Define custom unit-specific observations for the RL agent.

        Parameters
        ----------
        unit : SupportsMinMax
            The unit representing the power plant.
        start : datetime.datetime
            Start time of the market product.
        end : datetime.datetime
            End time of the market product.

        Returns
        -------
        np.ndarray
            Normalized 1D array of individual observations.
        """
        # Your implementation here
This method must return a NumPy array of length unique_obs_dim.
What Should Be in an Individual Observation?
The key principle is to include values that are:
Known only to the unit itself.
Relevant for market bidding.
Reflective of the unit’s technical or economic constraints.
Here are some good candidate features and how to compute them using ASSUME:
| Feature | Description | Access via |
|---|---|---|
| Current output | How much power the unit is currently producing | unit.get_output_before(start) |
| Marginal cost | Cost to produce current output | unit.calculate_marginal_cost(start, current_volume) |
| Max capacity | Upper generation limit | unit.max_power |
| Max bid price | Maximum price at market | self.max_bid_price |
| Start-up/shut-down state | May be encoded in dispatch history | infer from the unit's dispatch history |
| Ramp limit | Maximum change in output allowed | ramping attributes of the unit |
| Efficiency or fuel cost factors | If applicable | custom attributes per unit model |
Solution#
[ ]:
# Solution Exercise 1: Define Individual Observations

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_individual_observations(self, unit, start, end):
        # --- Current volume & marginal cost ---
        current_volume = unit.get_output_before(start)
        current_costs = unit.calculate_marginal_cost(start, current_volume)

        scaled_total_dispatch = current_volume / unit.max_power
        scaled_marginal_cost = current_costs / self.max_bid_price

        individual_observations = np.array(
            [scaled_total_dispatch, scaled_marginal_cost]
        )

        return individual_observations
3.5 Summary#
Observations in ASSUME combine shared global forecasts and custom individual data.
The base class handles forecasted residual load and price, as well as historical price signals.
These observations directly affect agent behavior and learning convergence—thoughtful design matters.
In the next chapter, you will define how the agent selects actions based on its observations, and how exploration is introduced during initial training to populate the learning buffer.
4. Action Selection and Exploration#
Once an observation is formed, the next step is for the agent to decide how to act. In this context, the action determines the bid price submitted by the agent to the electricity market.
This chapter focuses on how actions are derived from the agent’s policy and how exploration is handled—especially during the early training phase when experience is sparse.
4.1 Action Selection in RL#
In RL, the policy defines the agent’s behavior: it maps observations to actions. In the actor-critic architecture used by ASSUME, this policy is represented by the actor neural network.
However, to enable exploration, especially in the early stages of training, agents must not always follow the policy exactly. They need to try out a variety of actions—even suboptimal ones—to collect diverse experiences and learn effectively.
This is done by adding noise to the actions suggested by the policy network.
Note: The exploration noise presented here is specific to the deterministic policy gradient algorithm used in ASSUME (MATD3). Other algorithms, such as PPO, use a different exploration mechanism.
4.2 Understanding get_actions()#
The method get_actions(next_observation) in TorchLearningStrategy defines how actions are computed in different modes of operation.
Here is a simplified overview of the logic:
def get_actions(self, next_observation):
    if self.learning_mode and not self.evaluation_mode:
        if self.collect_initial_experience_mode:
            # Initial exploration: use pure noise as action
            noise = self.action_noise.noise(...)
            curr_action = noise
        else:
            # Regular exploration: add noise to policy output
            curr_action = self.actor(next_observation).detach()
            noise = self.action_noise.noise(...)
            curr_action += noise
    else:
        # Evaluation or deterministic policy use
        curr_action = self.actor(next_observation).detach()
        noise = zeros_like(curr_action)

    return curr_action, noise
Modes of Operation:
learning_mode: Indicates that the agent is being trained (vs. used for evaluation).

evaluation_mode: Disables noise; used to assess the performance of a learned policy.

collect_initial_experience_mode: Special sub-phase during early episodes where we rely heavily on randomized exploration to populate the replay buffer with diverse samples.
4.3 What Is Initial Experience Collection Mode?#
The initial experience collection mode refers to the first N episodes of training where agents fill their learning buffers purely through exploration. No learned policy is used at this stage.
The purpose is to:
Cover a broad region of the action space.
Enable agents to observe the outcome of many different bidding decisions.
By default, the action in this mode is pure noise, sampled from a Gaussian distribution.
4.4 Improving Exploration with Prior Knowledge#
While random actions help explore broadly, we can use economic and technical knowledge to make exploration more guided.
What would be a good starting point for a conventional generator? Exploring in a region around this value is far more productive than exploring arbitrarily.
Thus, instead of using random noise alone, we can shift the noisy action around a known good starting point so that exploration begins from a plausible economic baseline.
4.5 Exercise 2: Guided Exploration#
Your task is to modify the get_actions() method to implement a better initial exploration mechanism.
Objective:
During the collect_initial_experience_mode, instead of using pure noise, base the exploration around a known signal from the observation vector.
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_actions(self, next_observation):
        """
        Compute actions based on the current observation, optionally applying noise for exploration.

        Args
        ----
        next_observation : torch.Tensor
            The current observation, where the last element is assumed to be the marginal cost.

        Returns
        -------
        tuple of torch.Tensor
            - Action (with or without noise)
            - The applied noise
        """
        # Get the base action and associated noise from the parent implementation
        curr_action, noise = super().get_actions(next_observation)

        if self.learning_mode and not self.evaluation_mode:
            if self.collect_initial_experience_mode:
                # TODO: extract a relevant reference value from next_observation
                # TODO: shift the noisy action around this value
                pass  # replace this with your implementation

        return curr_action, noise
Solution#
[ ]:
# Solution Exercise 2: Improve Initial Exploration

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def get_actions(self, next_observation):
        # Get the base action and associated noise from the parent implementation
        curr_action, noise = super().get_actions(next_observation)

        if self.learning_mode and not self.evaluation_mode:
            if self.collect_initial_experience_mode:
                # Assumes the last dimension of the observation corresponds to marginal cost
                marginal_cost = next_observation[-1].detach()

                # Center the noisy action around the marginal cost
                curr_action += marginal_cost

        return curr_action, noise
This strategy anchors exploration to a meaningful economic quantity, improving the quality of early experiences and accelerating convergence.
4.6 Summary#
The get_actions() method controls how agents choose actions under different modes.

During training, actions include noise to enable exploration.
Initial exploration can be enhanced by guiding actions toward domain-relevant baselines (e.g., marginal cost).
You implemented a strategy to anchor exploration using part of the observation vector.
In the next chapter, we will transform the action values into actual bids by applying domain knowledge.
5. From Observation to Action to Bids#
In the previous chapters, we explored how an agent perceives its environment through observations and how it selects actions using its policy, optionally enriched with exploration noise. In this short chapter, we show how these two steps come together inside the calculate_bids() method.
5.1 The Role of calculate_bids()#
The calculate_bids() method defines how a market participant formulates its bid at each market interval. It brings together two crucial operations:
Generating Observations: Calls create_observation() to construct the full input vector (including both global and individual components).

Choosing an Action: Passes the observation to get_actions(), which invokes the actor network (and optionally adds noise) to return an action vector.
This forms the agent’s internal decision pipeline.
5.2 Action Normalization and Scaling#
The neural network policy outputs normalized actions—typically bounded in the range \([-1, 1]\). To convert these to meaningful bid prices, the raw action is scaled by a predefined constant:
bid_price = actions[0] * self.max_bid_price
For example, if self.max_bid_price = 100, the resulting bid prices will fall between \(-100\) and \(100\). This reflects a design choice that bounds the agent’s economic behavior in a defined domain.
By modifying max_bid_price in the learning config, you directly influence the economic aggressiveness of the policy.
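As a small numerical illustration (assuming max_bid_price = 100 as above), an actor output of 0.6 translates into a bid of 60 €/MWh, while an output of -1.0 maps to the lower bound of -100 €/MWh:

max_bid_price = 100.0
action = 0.6                        # raw actor output in [-1, 1]
bid_price = action * max_bid_price  # 60.0 €/MWh

print(bid_price)
print(-1.0 * max_bid_price)         # -100.0 €/MWh, the lower bound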
5.3 Bid Structure#
Each bid submitted to the market follows a defined structure, encapsulated as a dictionary:
{
"start_time": start,
"end_time": end,
"price": bid_price,
"volume": max_power,
"node": unit.node,
}
Key aspects:
price: Determined from the scaled output of the policy.
volume: Set to the full technical capacity of the unit.
node: Locational identifier (used for zonal/nodal pricing and congestion modeling).
Note that max_power is positive, as this strategy models a generator offering energy. For a consumer or demand bid, the volume would be negative to reflect load withdrawal.
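For comparison, a hypothetical demand-side bid could look as follows. The structure is identical; only the sign of the volume changes. All values, including the timestamps and the node name, are illustrative and not taken from the tutorial scenario.

from datetime import datetime

# Illustrative demand-side bid: negative volume represents load withdrawal.
demand_bid = {
    "start_time": datetime(2019, 3, 1, 12, 0),
    "end_time": datetime(2019, 3, 1, 13, 0),
    "price": 75.0,     # maximum willingness to pay in €/MWh
    "volume": -200.0,  # consumption of 200 MW
    "node": "north",   # illustrative locational identifier
}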
5.4 Controlling Action Dimensions#
By changing the act_dim in the strategy constructor, you can control the number of outputs returned by the actor network:
act_dim = kwargs.pop("act_dim", 1)
This allows for richer bidding logic. For instance:
1 action: Bid price for total capacity.
2 actions: Bid prices for flexible vs. inflexible portions.
3 actions: Add directionality or reserve offers.
However, it is important to note that RL performance deteriorates with high-dimensional action spaces, especially in continuous domains.
If you decide to increase act_dim, ensure that your calculate_bids() method is updated accordingly to interpret and transform all action elements correctly.
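As a hedged sketch of what a two-dimensional action space could look like, the helper below scales two normalized actions into separate prices for an inflexible (must-run) portion and a flexible portion of the unit's capacity. The split between min_power and max_power, as well as the function name, are assumptions for illustration; a real implementation would live inside calculate_bids() and follow your unit model.

def actions_to_two_bids(actions, max_bid_price, min_power, max_power, start, end, node):
    # Scale both normalized actions from [-1, 1] to bid prices.
    inflex_price = actions[0] * max_bid_price  # price for the must-run portion
    flex_price = actions[1] * max_bid_price    # price for the flexible portion

    return [
        {"start_time": start, "end_time": end, "price": inflex_price,
         "volume": min_power, "node": node},
        {"start_time": start, "end_time": end, "price": flex_price,
         "volume": max_power - min_power, "node": node},
    ]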
5.5 Full Code Implementation#
Here is the complete calculate_bids() implementation:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_bids(self, unit, market_config, product_tuples, **kwargs):
        start = product_tuples[0][0]
        end = product_tuples[0][1]

        # get technical bounds for the unit output from the unit
        _, max_power = unit.calculate_min_max_power(start, end)
        max_power = max_power[0]

        # =============================================================================
        # 1. Get the observations, which are the basis of the action decision
        # =============================================================================
        next_observation = self.create_observation(
            unit=unit, market_id=market_config.market_id, start=start, end=end
        )

        # =============================================================================
        # 2. Get the actions, based on the observations
        # =============================================================================
        actions, noise = self.get_actions(next_observation)

        # =============================================================================
        # 3. Transform actions into bids
        # =============================================================================
        # actions are in the range [-1,1], we need to transform them into actual bids
        # we can use our domain knowledge to guide the bid formulation
        bid_price = actions[0] * self.max_bid_price

        # actually formulate bids in orderbook format
        bids = [
            {
                "start_time": start,
                "end_time": end,
                "price": bid_price,
                "volume": max_power,
                "node": unit.node,
            },
        ]

        if self.learning_mode:
            self.learning_role.add_actions_to_cache(self.unit_id, start, actions, noise)

        return bids
In the next chapter, we will define how to compute the reward associated with each bid outcome, which completes the agent’s learning cycle.
6. Reward Function Design#
The reward function is the central learning signal in any RL environment. It defines the objective the agent is trying to maximize and serves as the only feedback mechanism from the environment to the agent.
In market-based simulations such as ASSUME, designing the reward function is a delicate balance between:
Capturing realistic economic goals (e.g., profit maximization),
Enabling learning stability and convergence, and
Leaving room for the agent to discover unexpected, valid strategies.
It’s tempting to hard-code your preferred behavior into the reward function. However, this often leads to agents that are overly adapted to a specific scenario and perform poorly in general.
6.1 When Is the Reward Computed?#
In ASSUME, the reward is computed after the market clears, in the calculate_reward() method. At this point, the agent receives information about:
Which portion of its bid was accepted,
At what price,
And what operational costs it incurred.
This allows us to calculate realized profit, which is the most direct economic reward signal.
6.2 Exercise 3: Implement Profit-Based Reward#
Your first task is to implement a profit-based reward. This is mandatory.
Use the following simplified formula:

\[
\text{Profit}_{i,t} = P^{\text{conf}}_{i,t} \cdot \left( M_t - mc_{i,t} \right) \cdot dt
\]

Where:
\(P^\text{conf}\): Confirmed volume (accepted by market),
\(M_t\): Market clearing price,
\(mc_{i,t}\): Marginal generation cost,
\(dt\): Time resolution in hours.
You can access these quantities via:
accepted_volume = order["accepted_volume"]
market_clearing_price = order["accepted_price"]
marginal_cost = unit.calculate_marginal_cost(start, unit.outputs[marketconfig.product_type].at[start])
Use the duration in hours:
duration = (end - start) / timedelta(hours=1)
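As a worked example with illustrative numbers, a unit that sells 500 MW for one hour at a clearing price of 60 €/MWh with marginal costs of 40 €/MWh earns a profit of 10,000 €:

from datetime import datetime, timedelta

accepted_volume = 500.0        # MW confirmed by the market
market_clearing_price = 60.0   # €/MWh
marginal_cost = 40.0           # €/MWh

start = datetime(2019, 3, 1, 12, 0)
end = datetime(2019, 3, 1, 13, 0)
duration = (end - start) / timedelta(hours=1)  # 1.0 hour

profit = accepted_volume * (market_clearing_price - marginal_cost) * duration
print(profit)  # 10000.0 €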
6.4 Exercise 3 (optional): Thinking Beyond Profit#
While profit is a good starting point, agents trained solely on profit may struggle in competitive environments or when there is limited dispatch. In real-world operations, generators also consider missed opportunities—what could have been earned but wasn’t due to poor bidding or conservative behavior.
What other signal could guide the agent to bid more strategically?
What do real power plants look at when evaluating their bidding success—even when they were not dispatched?
Use your economic intuition or power system experience to answer this.
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_reward(self, unit, marketconfig, orderbook):
        """
        Reward function: implement profit and (optionally) opportunity cost.

        Instructions:
        - Fill in the lines marked as YOUR CODE.
        - Compute profit as the primary reward signal.
        - Optionally define the opportunity cost as a regret term.
        """
        start = orderbook[0]["start_time"]
        end = orderbook[0]["end_time"]
        duration = (end - start) / timedelta(hours=1)
        end_excl = end - unit.index.freq

        order = orderbook[0]

        market_clearing_price = None  # YOUR CODE HERE
        accepted_volume = None  # YOUR CODE HERE

        marginal_cost = unit.calculate_marginal_cost(
            start, unit.outputs[marketconfig.product_type].at[start]
        )

        # === Required: compute profit ===
        order_income = None  # YOUR CODE HERE
        order_cost = None  # YOUR CODE HERE
        order_profit = None  # YOUR CODE HERE

        # === Optional: compute opportunity cost ===
        opportunity_cost = None  # YOUR CODE HERE
        regret_scale = 0.1 if accepted_volume > unit.min_power else 0.5

        # === Normalize reward to ~[-1, 1] ===
        scaling = 1 / (self.max_bid_price * unit.max_power)
        reward = scaling * (order_profit - regret_scale * opportunity_cost)
        regret = regret_scale * opportunity_cost

        # Store results in unit outputs
        # Note: these are not learning-specific results but stored for all units for analysis
        unit.outputs["profit"].loc[start:end_excl] += order_profit
        unit.outputs["total_costs"].loc[start:end_excl] += order_cost

        # write rl-rewards to buffer
        if self.learning_mode:
            self.learning_role.add_reward_to_cache(
                unit.id, start, reward, regret, order_profit
            )
💡Hint Optional Extension: Opportunity Cost
The concept of opportunity cost captures the lost profit from unused capacity. If the market price exceeds marginal cost and the unit wasn’t dispatched fully, that represents a missed opportunity.
This can be used as a regret term to penalize under-utilization of profitable bids.
Mathematically:

\[
C^{\text{opp}}_{i,t} = \max\left( \left( M_t - mc_{i,t} \right) \cdot \left( P^{\max}_i - P^{\text{conf}}_{i,t} \right) \cdot dt,\; 0 \right)
\]

where \(P^{\max}_i\) is the unit's maximum capacity and the remaining symbols are defined as in the profit formula above.
A good reward function combines profit and opportunity cost, allowing agents to learn from both actual performance and missed potential.
6.6 Reward Scaling and Learning Stability#
Scaling the reward to a narrow and consistent range is crucial for stable RL. This is particularly important in continuous-action settings like bidding, where one overly large reward spike can skew the policy updates significantly.
1. Why scale?
Stabilizes gradients during actor-critic training.
Makes different time steps comparable in magnitude.
Prevents the agent from overfitting to rare but extreme events.
2. What can go wrong?
If your scaling factor is too small:
Rewards become indistinguishable from noise.
If your scaling factor is too large:
A single high-reward event (e.g., bidding into a rare price spike) can dominate learning, making the agent try to reproduce that event rather than learn a general policy.
Tip: Use conservative scaling based on maximum realistic bid × capacity:
scaling = 1 / (self.max_bid_price * unit.max_power)
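To get a feeling for the magnitudes involved, assume max_bid_price = 100 €/MWh and max_power = 1000 MW (illustrative values). The scaling factor is then 1e-5, and an hourly profit of 50,000 € maps to a reward of 0.5, comfortably inside the target range:

max_bid_price = 100.0   # €/MWh (illustrative)
max_power = 1000.0      # MW (illustrative)
scaling = 1 / (max_bid_price * max_power)  # 1e-05

order_profit = 50_000.0  # € earned in one hour (illustrative)
reward = scaling * order_profit
print(reward)  # 0.5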
3. Recommended Practice
Before committing to training:
Plot the distribution of rewards across time steps for a few sample runs.
Check for outliers, saturation, or skewness.
If needed, adjust scaling or cap outliers in reward postprocessing.
This diagnostic step can save hours of failed training runs.
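A minimal sketch of such a diagnostic is shown below. It assumes a (short) training run has already written results to the local database; the table and column names follow the queries used later in this tutorial, and the simulation and unit names are examples.

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("sqlite:///local_db/assume_db.db")

# Load the rewards of one unit and plot their distribution.
rewards = pd.read_sql(
    "SELECT reward FROM rl_params "
    "WHERE simulation = 'example_02a_base' AND unit = 'pp_6'",
    engine,
)

rewards["reward"].hist(bins=50)
plt.xlabel("Reward")
plt.ylabel("Frequency")
plt.title("Reward distribution (check for outliers and saturation)")
plt.show()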
Solution#
[ ]:
# Solution Exercise 3: Implement Reward Function

# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files


class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):
    def calculate_reward(
        self,
        unit,
        marketconfig,
        orderbook,
    ):
        """
        Calculates the reward for the unit based on profits, costs, and opportunity costs from market transactions.

        The reward is computed by combining the following:
        - **Profit**: Income from accepted bids minus marginal and start-up costs.
        - **Opportunity Cost**: Penalty for underutilizing capacity, calculated as potential lost income.
        - **Regret Term**: A scaled regret term penalizes high opportunity costs to guide effective bidding.

        The reward is scaled and stored along with other outputs in the unit’s data to support learning.
        """
        start = orderbook[0]["start_time"]
        end = orderbook[0]["end_time"]
        duration = (end - start) / timedelta(hours=1)

        # `end_excl` marks the last product's start time by subtracting one frequency interval.
        end_excl = end - unit.index.freq

        order = orderbook[0]  # Assuming a single order for simplicity
        market_clearing_price = order["accepted_price"]
        accepted_volume = order["accepted_volume"]

        # Depending on how the unit calculates marginal costs, retrieve cost values.
        marginal_cost = unit.calculate_marginal_cost(
            start, unit.outputs[marketconfig.product_type].at[start]
        )

        # Calculate profit as income minus operational cost for this event.
        order_income = market_clearing_price * accepted_volume * duration
        order_cost = marginal_cost * accepted_volume * duration

        # Accumulate income and operational cost for all orders.
        order_profit = order_income - order_cost

        # Opportunity cost: The income lost due to not operating at full capacity.
        opportunity_cost = (
            (market_clearing_price - marginal_cost)
            * (unit.max_power - accepted_volume)
            * duration
        )

        # If opportunity cost is negative, no income was lost, so we set it to zero.
        opportunity_cost = max(opportunity_cost, 0)

        # Dynamic regret scaling:
        # - If accepted volume is positive, apply lower regret (0.1) to avoid punishment for being on the edge of the merit order.
        # - If no dispatch happens, apply higher regret (0.5) to discourage idle behavior, if it could have been profitable.
        regret_scale = 0.1 if accepted_volume > unit.min_power else 0.5

        # --------------------
        # 4.1 Calculate Reward
        # Instead of directly setting reward = profit, we incorporate a regret term (opportunity cost penalty).
        # This guides the agent toward strategies that maximize accepted bids while minimizing lost opportunities.

        # scaling factor to normalize the reward to the range [-1,1]
        scaling = 1 / (self.max_bid_price * unit.max_power)
        reward = scaling * (order_profit - regret_scale * opportunity_cost)
        regret = regret_scale * opportunity_cost

        # Store results in unit outputs
        # Note: these are not learning-specific results but stored for all units for analysis
        unit.outputs["profit"].loc[start:end_excl] += order_profit
        unit.outputs["total_costs"].loc[start:end_excl] += order_cost

        # write rl-rewards to buffer
        if self.learning_mode:
            self.learning_role.add_reward_to_cache(
                unit.id, start, reward, regret, order_profit
            )
6.7 Summary#
The reward function is the core signal guiding agent learning—design it carefully.
Start with profit as the primary reward.
Consider adding opportunity cost as a regret penalty to improve bidding behavior.
Always normalize your reward to maintain training stability.
Analyze your reward distribution empirically before training large-scale agents.
In the next chapter, we will bring together all the components—observation, action, and reward—and simulate a full training run using your custom learning strategy.
7. Training and Evaluating Your First Learning Agent#
You have now implemented all essential components of a learning bidding strategy in ASSUME:
Observations
Actions and exploration
Reward function
In this chapter, you will connect your strategy to a simulation scenario, configure the learning algorithm, and evaluate the agent’s training progress.
7.1 Load and Inspect the Learning Configuration#
Each simulation scenario in ASSUME has an associated YAML configuration file. This file contains the learning configuration, which determines how the RL algorithm is executed.
[ ]:
scenario = "base"
# Read the YAML file
with open(f"{inputs_path}/example_02a/config.yaml") as file:
config = yaml.safe_load(file)
# Print the learning config
print(f"Learning config for scenario '{scenario}':")
display(config[scenario]["learning_config"])
Explanation of Learning Configuration Parameters
| Parameter | Description |
|---|---|
| learning_mode | If true, the simulation runs in training mode and agents update their policies. |
| continue_learning | If true, training continues from previously saved policies instead of starting from scratch. |
| trained_policies_save_path | File path where trained policies will be saved. |
| trained_policies_load_path | Path to pre-trained policies to load. |
| max_bid_price | Used to scale action outputs to economic bid prices. |
| algorithm | Learning algorithm used (e.g., matd3). |
| learning_rate | Step size for policy and critic updates. |
| training_episodes | Number of simulation episodes (repetitions of the time horizon) used for training. |
| episodes_collecting_initial_experience | Number of episodes during which agents collect experience using guided exploration. |
| train_freq | Time between training updates (e.g., every 24 simulated hours). |
| gradient_steps | Number of gradient descent steps per update. |
| batch_size | Size of experience batch used for training. |
| gamma | Discount factor for future rewards (\(0 < \gamma \leq 1\)). |
| device | Compute device used for training (CPU or GPU). |
| action_noise_schedule | How the action noise evolves over time (e.g., a decay schedule). |
| noise_sigma | Standard deviation of exploration noise. |
| noise_scale | Global multiplier for noise. |
| noise_dt | Discretization interval for noise time series. |
| validation_episodes_interval | How often (in episodes) to evaluate the current policy without exploration. |
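For orientation, a learning configuration with the parameters above could look roughly like the following Python dictionary (mirroring the YAML structure). All values are illustrative placeholders, not the settings of the tutorial scenario; always check the scenario's config.yaml for the values actually used.

# Illustrative learning_config; every value here is an example only.
example_learning_config = {
    "learning_mode": True,
    "continue_learning": False,
    "trained_policies_save_path": "learned_strategies/my_scenario",  # illustrative path
    "max_bid_price": 100,
    "algorithm": "matd3",
    "learning_rate": 1e-3,
    "training_episodes": 50,
    "episodes_collecting_initial_experience": 5,
    "train_freq": "24h",
    "gradient_steps": 1,
    "batch_size": 256,
    "gamma": 0.99,
    "device": "cpu",
    "noise_sigma": 0.1,
    "noise_scale": 1.0,
    "noise_dt": 1.0,
    "validation_episodes_interval": 5,
}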
7.2 Run the Simulation and Train the Agent#
The simulation environment and learning strategy are connected and executed as follows:
Hint: In Google Colab, long-running training sessions may occasionally crash or disconnect if the output console is flooded — for example, by verbose progress bars or print statements. To prevent this, you can suppress output during training using the following approach.

Import the required tools:

from contextlib import redirect_stdout, redirect_stderr
import os

Wrap the training phase with output redirection. Insert the following lines just before Step 4: Run the training phase:

# Suppress output for the entire training process
with open(os.devnull, 'w') as devnull:
    with redirect_stdout(devnull), redirect_stderr(devnull):
        # Your training function call goes here
        train_agents(...)

✅ This redirects all stdout and stderr to /dev/null, preventing Colab from being overwhelmed by output and improving session stability.
[ ]:
log = logging.getLogger(__name__)

csv_path = "outputs"
os.makedirs("local_db", exist_ok=True)

np.random.seed(42)  # Set a random seed for reproducibility

if __name__ == "__main__":
    db_uri = "sqlite:///local_db/assume_db.db"

    scenario = "example_02a"
    study_case = "base"

    # 1. Create simulation world
    world = World(database_uri=db_uri, export_csv_path=csv_path)

    # 2. Register your learning strategy
    world.bidding_strategies["pp_learning"] = EnergyLearningSingleBidStrategy

    # 3. Load scenario and case
    load_scenario_folder(
        world,
        inputs_path=inputs_path,
        scenario=scenario,
        study_case=study_case,
    )

    # 4. Run the training phase
    if world.learning_mode:
        run_learning(world)

    # 5. Execute final evaluation run (no exploration)
    world.run()
This script will:
Train the agent using your defined strategy.
Periodically evaluate the agent using a noise-free policy.
Save training data into the database for post-analysis.
7.3 Analyze Learning Performance#
Once training is complete, we can evaluate the learning progress of your RL agent using data from the simulation database. ASSUME stores detailed training metrics in the rl_params table, which includes rewards for each time step, grouped by episode, unit, and whether the agent was in evaluation mode.
In this case, we are interested in the performance of a specific generator: ``pp_6``, within the simulation ``example_02a_base``.
We’ll extract the recorded rewards for this unit, group them by episode, and plot the average reward over time for both training and evaluation phases.
Instead of accessing the training results via the database, you can also use the TensorBoard integration, which can be launched from the console:
tensorboard --logdir tensorboard
[ ]:
# Connect to the simulation database
engine = create_engine("sqlite:///local_db/assume_db.db")
# Query rewards for specific simulation and unit
sql = """
SELECT
datetime,
unit,
reward,
simulation,
evaluation_mode,
episode
FROM rl_params
WHERE simulation = 'example_02a_base'
AND unit = 'pp_6'
ORDER BY datetime
"""
# Load query results
rewards_df = pd.read_sql(sql, engine)
# Rename column for consistency
rewards_df.rename(columns={"evaluation_mode": "evaluation"}, inplace=True)
# --- Separate plots for training and evaluation ---
fig, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=False)
# Plot training rewards (evaluation == 0)
train_df = rewards_df[rewards_df["evaluation"] == 0]
train_grouped = train_df.groupby("episode")["reward"].mean()
axes[0].plot(train_grouped.index, train_grouped.values, color="tab:blue")
axes[0].set_title("Training Reward per Episode (Unit: pp_6)")
axes[0].set_ylabel("Average Reward")
axes[0].grid(True)
# Plot evaluation rewards (evaluation == 1)
eval_df = rewards_df[rewards_df["evaluation"] == 1]
eval_grouped = eval_df.groupby("episode")["reward"].mean()
axes[1].plot(eval_grouped.index, eval_grouped.values, color="tab:green")
axes[1].set_title("Evaluation Reward per Episode (Unit: pp_6)")
axes[1].set_xlabel("Episode")
axes[1].set_ylabel("Average Reward")
axes[1].grid(True)
plt.tight_layout()
plt.show()
What This Shows
Training curve: Captures learning progress with exploration noise.
Evaluation curve: Tracks the performance of the evaluation/validation run without noise, which is performed every validation_episodes_interval episodes, as defined in the learning_config.
This plot provides insight into:
How well the agent is improving over time.
Whether learning has converged or stagnated.
7.4 Summary#
You have now run your first complete training loop in ASSUME.
The learning configuration defines all key training parameters—review them carefully.
After training, rewards from rl_params allow you to inspect and validate agent behavior.

The separation of training and evaluation rewards is key to understanding generalization.
In the next chapter, you may proceed to analyze simulation outcomes in greater detail (e.g., market prices, total costs, capacity dispatch), or compare different agent configurations.
8. Analyzing Strategic Bidding Behavior#
Now that your agent has completed training, we shift our focus to a critical and more insightful question:
What did the agent actually learn?
This chapter analyzes the actual bids submitted by the agent and evaluates whether the agent developed a strategic bidding behavior—especially in the context of market power.
8.1. Background: Market Setup#
This simulation is based on example case 1 from the following study:
[1] Harder, N.; Qussous, R.; Weidlich, A. Fit for purpose: Modeling wholesale electricity markets realistically with multi-agent deep reinforcement learning. Energy and AI, 2023, 14:100295. https://doi.org/10.1016/j.egyai.2023.100295
In this case:
The market contains one large RL agent: pp_6.

The agent has enough capacity to influence the market clearing price.

It is allowed to bid freely to maximize its own reward (profit, adjusted by regret).
Marginal Cost Structure:
| Unit | Marginal Cost (€/MWh) |
|---|---|
| pp_6 | 55.7 |
| Next unit | 85.7 |
A profit-maximizing agent with market power would learn to bid just below the next most expensive unit—in this case, somewhere just below 85.7 €/MWh.
8.2. Extract and Plot the Agent’s Bids#
We will extract the bids submitted by pp_6 from the market_orders table and plot them over time.
[ ]:
# Connect to database
engine = create_engine("sqlite:///local_db/assume_db.db")
# Query bids from pp_6 in simulation example_02a_base and market EOM
sql = """
SELECT
start_time AS time,
price,
accepted_price,
unit_id,
simulation
FROM market_orders
WHERE simulation = 'example_02a_base'
AND unit_id = 'pp_6'
AND market_id = 'EOM'
ORDER BY start_time
"""
# Load results into DataFrame
bids_df = pd.read_sql(sql, engine)
bids_df["time"] = pd.to_datetime(bids_df["time"])
# Define marginal cost boundaries
mc_pp6 = 55.7
mc_next = 85.7
plt.figure(figsize=(14, 6))
plt.plot(bids_df["time"], bids_df["price"], label="pp_6 Bid Price", color="tab:blue")
# Reference lines for marginal cost and competitive threshold
plt.axhline(
    mc_pp6,
    color="gray",
    linestyle="--",
    linewidth=2,
    label="pp_6 Marginal Cost (55.7 €)",
)
plt.axhline(
    mc_next,
    color="red",
    linestyle="--",
    linewidth=2,
    label="Next Unit's Marginal Cost (85.7 €)",
)

plt.plot(
    bids_df["time"],
    bids_df["accepted_price"],
    label="Accepted Price",
    color="tab:orange",
)
plt.title("Bidding Behavior of RL Agent (pp_6)")
plt.xlabel("Time")
plt.ylabel("Bid Price (€/MWh)")
plt.legend()
plt.ylim(30, 100)
plt.grid(True)
plt.tight_layout()
plt.show()
8.4. What Does This Show?#
The plot typically reveals:
The agent almost never bids at its own marginal cost.
Instead, its bid prices cluster below 85.7 €/MWh, indicating that it has learned to:
Underbid the next unit to secure dispatch.
Exploit its market position to maximize profit rather than behave as a price-taker.
This is consistent with strategic bidding behavior in oligopolistic market settings.
This outcome aligns with the findings from [1], confirming that deep RL agents can learn to exercise market power when not explicitly restricted.
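To go beyond visual inspection, a quick follow-up check can quantify this clustering using the bids_df DataFrame queried above. The 5 €/MWh band width below is an arbitrary choice for illustration:

# Share of bids placed just below the next unit's marginal cost,
# and share of bids above pp_6's own marginal cost.
band_low, band_high = mc_next - 5.0, mc_next
share_in_band = bids_df["price"].between(band_low, band_high).mean() * 100
share_above_own_mc = (bids_df["price"] > mc_pp6).mean() * 100

print(f"Bids within [{band_low:.1f}, {band_high:.1f}] €/MWh: {share_in_band:.1f}%")
print(f"Bids above pp_6's own marginal cost: {share_above_own_mc:.1f}%")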
8.5. Summary#
The RL agent did not simply mimic marginal cost bidding—it learned to optimize strategically.
The bid curve confirms that market power was exercised by bidding just under the next marginal unit.
This is a core feature of realistic market modeling, and shows the value of RL in economic simulations.
9. Summary and Outlook#
9.1. What You Built#
Over the course of this tutorial, you developed a complete RL bidding strategy for an electricity market agent in the ASSUME framework. You constructed and trained a fully functional learning agent that can:
Observe the market and its own internal state.
Make strategic bidding decisions based on learned policy.
Receive reward signals and adapt its behavior accordingly.
Exploit market dynamics, including market power, when permitted.
9.2 What You Learned#
Throughout the tutorial, you explored the full learning pipeline in a realistic electricity market context:
How to construct observations from both global forecasts and unit-specific state.
How to define actions and handle exploration, including guided exploration around meaningful economic baselines.
How to design and normalize a reward function that balances realized profit with opportunity cost.
How to run a simulation using multi-agent DRL and analyze its outcomes.
How to evaluate bidding behavior and interpret economic strategies emerging from the agent’s learning process.
9.3 What You Can Try Next#
Your implementation is modular and extensible. Here are several directions you can explore on your own:
Adjust Learning Parameters
Experiment with:
learning_rate, gamma, noise_sigma, episodes_collecting_initial_experience, validation_episodes_interval, train_freq, or gradient_steps
Observe how these changes affect convergence, stability, and bidding behavior.
Try Different Scenarios
Run ``example_02b`` or ``example_02c``:

02b: Introduces moderate competition with several learning agents.

02c: Contains many learning agents, simulating a highly competitive environment.
Compare bidding behavior and reward dynamics across settings.
Dive into Other Tutorials
If you are interested in the underlying multi-agent RL algorithm and how it is integrated into ASSUME, look into 04a_RL_algorithm_example.
In this small example we could see what a good bidding behavior of the agent might look like and, hence, judge learning easily. But what if we model many agents in new simulations? We provide explainable RL mechanisms in another tutorial for you to dive into: 09_example_Sim_and_xRL.