Note
You can download this example as a Jupyter notebook or try it out directly in Google Colab.
4.3 Storage Units: Learning Temporal Bidding Strategies#
Introduction
In this tutorial, we extend the reinforcement learning (RL) framework to storage units, such as batteries, which face a unique decision structure: they must buy energy (charge) when prices are low and sell it (discharge) when prices are high, across multiple time steps.
This creates a temporal coupling of actions and rewards. Unlike generators, which can profit immediately from selling electricity and are only time-coupled through technical constraints (ramping, shutdown, etc.), storage units must learn to plan ahead, accepting short-term costs in anticipation of future gains. As such, both the observation space and the reward design need to reflect this temporal structure.
The tutorial is again designed to walk you through the essential components required to transform a conventional storage unit into an RL agent. Rather than concentrating on the underlying algorithmic infrastructure—such as training loops, buffers, or learning roles—this tutorial emphasizes how to define a bidding strategy that interfaces with the ASSUME learning backend. You will learn how to construct observation spaces, define action mappings, and design reward functions that guide agent behavior, this time for a storage unit. This tutorial is designed to stand on its own, but if you have already completed 04b_RL_example, some parts will be familiar; in that case, chapters 1 (Get ASSUME Running) and 2 (ASSUME & Learning Basics) can largely be skipped.
Each core concept is addressed in a dedicated chapter, accompanied by exercises that allow you to apply the material directly. These hands-on tasks culminate in a final integration chapter where you will run a complete simulation and train your first learning agent.
Tutorial Structure
The tutorial is divided into the following chapters:
Get ASSUME Running
Instructions for installing ASSUME and preparing your environment, whether locally or in Google Colab.
ASSUME & Learning Basics
A conceptual overview of RL within the ASSUME framework, including actor-critic architectures, centralized training, and multi-agent design principles.
Defining the Observation Space for Storages
Explanation and coding tasks for constructing shared and individual observations used by agents to make decisions.
Action Selection
How to convert actor network outputs into economically meaningful bid prices.
Reward Function Design
Techniques for shaping agent behavior using profit- and regret-based reward signals. Includes a task to define your own reward logic.
Training Your First Learning Agent
Integration of the previously implemented components into a complete simulation run, demonstrating end-to-end learning behavior in a market setting.
Learning Outcomes
By completing this tutorial, you will be able to:
Implement RL-compatible bidding strategies for storages within the ASSUME framework.
Define observation inputs for learning agents, with an emphasis on the cost of stored energy.
Map actor outputs to valid market actions and manage exploration.
Construct reward functions that combine economic incentives for charging and discharging.
Train and evaluate a basic RL agent in a multi-agent electricity market simulation.
1. Get ASSUME Running#
This chapter walks you through setting up the ASSUME framework in your environment and preparing the required input files. At the end, you will confirm that the installation was successful and ready for use.
1.1 Installation#
In Google Colab#
Google Colab already includes most scientific computing libraries (e.g., numpy, torch). You only need to install the ASSUME core framework:
[ ]:
# Only run this cell if you are using Google Colab
import importlib.util
IN_COLAB = importlib.util.find_spec("google.colab") is not None
if IN_COLAB:
!pip install assume-framework
Note: After installation, Colab may prompt you to restart the session due to dependency changes. To do so, click “Runtime” → “Restart session…” in the menu bar, then re-run the cells above.
On Your Local Machine#
To install ASSUME with all learning-related dependencies, run the following in your terminal:
pip install 'assume-framework[learning]'
This will install the simulation framework and the packages required for RL.
1.2 Repository Setup#
To access predefined simulation scenarios, clone the ASSUME repository (Colab only):
[ ]:
# Only run this cell if you are using Google Colab
if IN_COLAB:
!git clone --depth=1 https://github.com/assume-framework/assume.git assume-repo
Local users may skip this step if input files are already available in the project directory.
1.3 Input Path Configuration#
We define the path to input files depending on whether you’re in Colab or working locally. This variable will be used to load configuration and scenario files throughout the tutorial.
[ ]:
colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"
inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path
1.4 Installation Check#
Use the following cell to ensure the installation was successful and that essential components are available. This test ensures that the simulation engine and RL strategy base class are accessible before continuing.
[ ]:
try:
from assume import World
from assume.strategies.learning_strategies import TorchLearningStrategy
print("✅ ASSUME framework is installed and functional.")
except ImportError as e:
print("❌ Failed to import essential components:", e)
print(
"Please review the installation instructions and ensure all dependencies are installed."
)
1.5 Limitations in Colab#
Colab does not support Docker, so dashboard visualizations included in some ASSUME workflows will not be available. However, simulation runs and RL training can still be fully executed.
In Colab: Training and basic plotting are supported.
In Local environments with Docker: Full access, including dashboards.
1.6 Core Imports#
In this section, we import the core modules that will be used throughout the tutorial. Each import is explained to clarify its role.
[ ]:
# Standard Python modules
import logging # For logging messages during simulation and debugging
import os # For operating system interactions
from datetime import timedelta # To handle market time resolutions (e.g., hourly steps)
import matplotlib.pyplot as plt
# Scientific and data processing libraries
import numpy as np # Numerical operations and array handling
import pandas as pd # Data manipulation and analysis
import yaml # Parsing YAML configuration files
# Database and visualization libraries
from sqlalchemy import create_engine
# ASSUME framework components
from assume import World # Core simulation container that manages markets and agents
from assume.scenario.loader_csv import ( # Functions to load and execute scenarios
load_scenario_folder,
run_learning,
)
from assume.strategies.learning_strategies import (
MinMaxChargeStrategy, # Abstract class for storage-like strategies
TorchLearningStrategy, # Abstract base for RL bidding strategies
)
These imports are used for:
Defining RL bidding strategies.
Managing input/output data.
Executing and analyzing simulations.
At this point, you are ready to begin building your RL bidding agent. In the next chapter, we will define how agents perceive the market by constructing their observation vectors.
2. ASSUME & Learning Basics#
2.1 The ASSUME Framework#
ASSUME is a simulation framework designed for researchers, utilities, and planners to model and understand market dynamics in electricity systems. It allows for agent-based modeling of market participants in a modular and configurable environment.
The core structure of the framework consists of:
Markets (on the left of the architecture diagram): Where electricity products are traded.
Market Participants / Units (on the right): Each agent represents a physical or virtual unit bidding into the market.
Orders: The main communication channel between units and markets.
Learning Agents: Highlighted in yellow in the architecture, these are agents using RL strategies.
The image below illustrates the high-level architecture of ASSUME. Focus on the yellow components—these are the parts involved in the learning process.
[ ]:
from pathlib import Path
from IPython.display import SVG, display
image_path = Path("assume-repo/docs/source/img/architecture.svg")
alt_image_path = Path("../../docs/source/img/architecture.svg")
if image_path.exists():
display(SVG(image_path))
elif alt_image_path.exists():
display(SVG(alt_image_path))
2.2 Introduction to Learning in ASSUME#
The current implementation of RL in ASSUME models electricity markets as partially observable Markov games, allowing multiple agents to operate under individual observations and reward structures.
If you are unfamiliar with RL, refer to the following links for background material:
Central Concepts:
Policy: The strategy used by an agent to select actions based on observations.
Actor-Critic Architecture: A method where the “actor” chooses actions and the “critic” evaluates them.
Learning Strategy: Defines how a unit transforms observations into bids using a trainable model.
Step Functions: The typical RL cycle of Observe → Act → Reward → Update is split across several methods in ASSUME, as described in Section 3.
2.3 Single-Agent RL#
In a single-agent setup, the agent attempts to maximize its reward over time by learning from interaction with the environment. It does so by making multiple steps in the environment. In RL, each interaction step includes:
Observation of the current state.
Action selection based on policy.
Reward from the environment.
Policy Update to improve behavior.
In ASSUME, this step cycle is modularized:
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | `create_observation()` / `get_individual_observations()` | Constructs the observation vector. |
| Step 2 | `calculate_bids()` / `get_actions()` | Maps observations to bid prices. |
| Step 3 | `calculate_reward()` | Computes the reward signal. |
| Step 4 | Handled by the learning role | Updates the model and manages the replay buffer. |
Actor-Critic Structure: To increase learning stability actor-critic methods are commonly used. They divide the tasks in the following way:
Actor: Learns a deterministic policy for choosing actions. Uses policy gradient methods to maximize expected reward.
Critic: Learns a value function using Temporal Difference (TD) learning. Provides feedback to the actor based on action quality.
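As a minimal sketch (not the exact network used in ASSUME), a deterministic actor can be a small feed-forward network whose tanh output keeps actions in the range \([-1, 1]\), matching the action scaling used later in this tutorial:

```python
import torch
import torch.nn as nn


class SimpleActor(nn.Module):
    """Minimal deterministic actor: maps an observation vector to actions in [-1, 1]."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
            nn.Tanh(),  # bounded output, later scaled to bid prices
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


# Example: an observation vector of length 74 mapped to a single bid-price action
actor = SimpleActor(obs_dim=74, act_dim=1)
action = actor(torch.zeros(74))
```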
2.4 Multi-Agent RL#
Real-world electricity markets involve multiple agents acting simultaneously, which introduces interdependencies and non-stationarity. As a result, multi-agent learning requires additional considerations.
Challenges:
Actions by one agent influence the environment experienced by others.
The state transitions and rewards become non-stationary.
Solution: Centralized Training with Decentralized Execution (CTDE)
To address these challenges, ASSUME employs the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) framework with CTDE:
Centralized Training: A critic with access to all agents’ states and actions is used during training to stabilize learning. Note the critic is only there to update the actor network, so it is only necessary while training.
Decentralized Execution: During simulation, the actual actor of each agent relies only on its own observations and learned policy.
Each agent trains two critic networks to mitigate overestimation bias, uses target noise for robustness, and relies on deterministic policy gradients for the actor update.
3. Defining the Observation Space for Storages#
In this chapter, you will define what information your RL agent perceives about the environment and itself at each decision point. This is a critical component of the agent’s behavior, as the observation vector forms the basis for all future actions and learning.
| RL Step | Implemented via | Description |
|---|---|---|
| Step 1 | `create_observation()` / `get_individual_observations()` | Constructs the observation vector. |
3.2 Observation Structure in ASSUME#
Observations are composed of two parts:
1. Global Observations
These are shared across all agents and constructed by the base class method create_observation(). They include:
Forecasted residual load over the foresight horizon.
Forecasted market price over the foresight horizon.
Historical market price over a specified window.
These are normalized by maximum demand and maximum bid price for stability. These values are generated by a forecasting role and made available to all agents before each market cycle.
For this tutorial you do not need to modify this part. However, if you want to equip new unit types with learning or extend the simulation with new concepts, additional global information might be needed.
2. Individual Observations
These are unit-specific and must be implemented by you. The purpose is to provide the agent with private, operational information that may help improve bidding decisions. Each agent appends this information to the end of the shared observation vector.
This is done via the method get_individual_observations(unit, start).
3.3 Exercise 1: Choose a Suitable Foresight for Storage Agents#
To enable learning for storage units, we define a custom strategy class that extends TorchLearningStrategy. This class specifies key dimensions such as the size of the observation and action spaces. One crucial parameter you need to define is the foresight—how many future time steps the agent considers when making decisions.
Unlike dispatchable power plants, storage units face temporally coupled decisions: they must charge at one point in time and discharge at another, often hours later. This delay between cost and profit means that storage agents require a longer foresight than units that act on short-term signals.
Define the foresight of your agent by updating the self.foresight attribute inside the constructor:
Hint: Typical power plants operate well with a foresight of 12 hours.
[ ]:
class StorageEnergyLearningStrategy(TorchLearningStrategy, MinMaxChargeStrategy):
"""
A simple reinforcement learning bidding strategy.
"""
def __init__(self, *args, **kwargs):
# Forecast horizon (in timesteps) used for market and residual load forecasts
foresight = None # Your Choice here
act_dim = kwargs.pop("act_dim", 1) # One action: bid price
unique_obs_dim = kwargs.pop("unique_obs_dim", 2) # Number of individual obs
# all further calculations are handled in the parent classes
# like the general observation calculation based on the foresight
super().__init__(
foresight=foresight,
act_dim=act_dim,
unique_obs_dim=unique_obs_dim,
*args,
**kwargs,
)
Solution Exercise 1
[ ]:
# @title Solution Exercise 1
class StorageEnergyLearningStrategy(TorchLearningStrategy, MinMaxChargeStrategy):
"""
A simple reinforcement learning bidding strategy.
"""
def __init__(self, *args, **kwargs):
# Forecast horizon (in timesteps) used for market and residual load forecasts
foresight = 24 # Your implementation here
act_dim = kwargs.pop("act_dim", 1) # One action: bid price
unique_obs_dim = kwargs.pop("unique_obs_dim", 2) # Number of individual obs
super().__init__(
foresight=foresight,
act_dim=act_dim,
unique_obs_dim=unique_obs_dim,
*args,
**kwargs,
)
For storages, we recommend a foresight of 24 hours, which aligns with standard industry practice and allows for daily charge/discharge cycles. Note that longer foresight increases the size of the observation space, as each forecasted time series (e.g., price, residual load) is extended accordingly. If you’re designing seasonal storage agents (e.g., hydrogen or pumped hydro), you may consider even longer horizons—but beware the combinatorial explosion of the input space.
With this foresight, the global observations are defined in the function create_observation of the base class. Based on the chosen foresight, the observation space dimension is calculated automatically following self.obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim, as defined in the base class. If you want to change that rationale, you need to override it in the learning strategy itself. In the next section, we focus on the individual observations.
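As a quick sanity check of the resulting dimension (assuming three global time series of length foresight each, as described above; the exact number is set in the base class):

```python
num_timeseries_obs_dim = 3  # residual load forecast, price forecast, historical prices (assumed)
foresight = 24
unique_obs_dim = 2          # state of charge, cost of stored energy

obs_dim = num_timeseries_obs_dim * foresight + unique_obs_dim
print(obs_dim)  # 74
```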
3.4 Exercise 2: Define Individual Observations#
The storage agent receives the standard set of global observations, including price and load forecasts over a 24-hour foresight window. However, two individual features are added to reflect its internal state:
State of Charge (SoC): How full the battery is, scaled between 0 and 1.
Energy Cost: The volume-weighted average procurement cost of the currently stored energy, scaled by the maximum bid price.
These individual features are returned by the method:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def get_individual_observations(self, unit, start, end):
"""
Define custom unit-specific observations for the RL agent.
Parameters
----------
        unit : SupportsMinMaxCharge
            The storage unit providing the observations.
        start : datetime.datetime
            Start time of the market product.
        end : datetime.datetime
            End time of the market product.
Returns
-------
np.ndarray
Normalized 1D array of individual observations.
"""
# get the current soc and energy cost value
soc = unit.outputs["soc"].at[start]
energy_cost_scaled = unit.outputs["energy_cost"].at[start] / self.max_bid_price
individual_observations = np.array([soc, energy_cost_scaled])
return individual_observations
This method must return a NumPy array of length unique_obs_dim.
When a storage unit charges or discharges, the cost of its stored energy must be updated to reflect the new energy mix. This cost is defined as the volume-weighted average procurement cost and plays a key role when deciding whether selling is profitable.
The update depends on the type of action:
When charging, the cost of stored energy is updated.
If discharging or inactive, the cost remains unchanged.
The energy cost update depends on:
accepted_volume: How much energy was bought (negative) or sold (positive).
accepted_price and marginal_cost: Cost components when buying.
duration_hours: How long the bid covers.
current_soc and next_soc: Storage level before and after the bid.
Implement the following logic in the update_cost_stored_energy function:
```python
if next_soc < 1:
    cost_next = 0
elif accepted_volume < 0:  # Charging
    cost_next = (
        old_cost * current_soc - (price + marginal_cost) * volume * duration
    ) / next_soc
elif accepted_volume > 0:  # Discharging
    cost_next = (old_cost * (current_soc - volume * duration)) / next_soc
else:  # No accepted action
    cost_next = old_cost
```
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def update_cost_stored_energy(
self,
unit,
start,
next_time,
current_soc,
next_soc,
accepted_volume,
accepted_price,
marginal_cost,
duration_hours,
max_bid_price,
):
"""
Updates the cost of stored energy based on accepted market actions.
"""
# TODO: Replace this with your own logic
        if next_soc < 1:
            cost = None  # Your implementation here
        elif accepted_volume < 0:  # Charging
            cost = None  # Your implementation here
        else:  # Discharging or no accepted action
            cost = None  # Your implementation here

        unit.outputs["cost_stored_energy"].at[next_time] = np.clip(
            cost, -max_bid_price, max_bid_price
        )
Why do we clip the energy cost? In rare cases, especially during initial learning or under extreme prices, the calculated energy cost can get very high. Clipping the value ensures numerical stability for the observation space and keeps the input to the neural network within a realistic and learnable range (between -max_bid_price and +max_bid_price), which is also the bound we chose for scaling.
Solution Exercise 2
[ ]:
# @title Solution Exercise 2
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def update_cost_stored_energy(
self,
unit,
start,
next_time,
current_soc,
next_soc,
accepted_volume,
accepted_price,
marginal_cost,
duration_hours,
max_bid_price,
):
"""
Updates the cost of stored energy based on accepted market actions.
"""
# Calculate and clip the energy cost for the start time
# cost_stored_energy = average volume weighted procurement costs of the currently stored energy
if next_soc < 1:
unit.outputs["cost_stored_energy"].at[next_time] = 0
elif accepted_volume < 0:
# increase costs of current SoC by price for buying energy
unit.outputs["cost_stored_energy"].at[next_time] = (
unit.outputs["cost_stored_energy"].at[start] * current_soc
- (accepted_price + marginal_cost) * accepted_volume * duration_hours
) / next_soc
else:
unit.outputs["cost_stored_energy"].at[next_time] = unit.outputs[
"cost_stored_energy"
].at[start]
unit.outputs["cost_stored_energy"].at[next_time] = np.clip(
unit.outputs["cost_stored_energy"].at[next_time],
-max_bid_price,
max_bid_price,
)
Note for advanced users: The environment for storage units is not fully Markovian. Future rewards depend on past actions — particularly the prices at which energy was charged. To mitigate this partial observability, we augment the observation space with the average cost of stored energy. This acts as a memory proxy, helping the agent assess whether selling at a given price is profitable. This approach is a form of state augmentation, commonly used in reinforcement learning to approximate Markovian behavior in partially observable environments (POMDPs).
3.5 Summary#
Observations in ASSUME combine shared global forecasts and custom individual data.
The base class handles forecasted residual load and price, as well as historical price signals.
For storage units, individual observations include the state of charge and the cost of stored energy, which reflects past purchase prices and is updated over time.
You implemented the logic for updating this cost after market actions—this is crucial for enabling the agent to assess profitability when selling energy.
These observations directly affect agent behavior and learning convergence—thoughtful design matters.
In the next chapter, you will define how the agent selects actions based on its observations, and how exploration is introduced during initial training to populate the learning buffer.
4. Action Selection and Exploration#
Once an observation is formed, the next step is for the agent to decide how to act. In this context, the action determines the bid consisting of a price-volume pair that the agent submits to the electricity market.
This chapter focuses on how actions are derived from the agent’s policy and how exploration is handled—especially during the early training phase when experience is sparse.
4.1 Action Selection in RL#
In RL, the policy defines the agent’s behavior: it maps observations to actions. In the actor-critic architecture used by ASSUME, this policy is represented by the actor neural network.
However, to enable exploration, especially in the early stages of training, agents must not always follow the policy exactly. They need to try out a variety of actions—even suboptimal ones—to collect diverse experiences and learn effectively.
This is done by adding noise to the actions suggested by the policy network.
Note: The implementation of noise we present here is specific to the algorithm used in ASSUME (MATD3). Other algorithms, such as PPO, use a different mechanism for exploration.
4.2 Understanding get_actions()#
The method get_actions(next_observation) in BaseLearningStrategy defines how actions are computed in different modes of operation.
Here is a simplified overview of the logic:
def get_actions(self, next_observation):
if self.learning_mode and not self.evaluation_mode:
if self.collect_initial_experience_mode:
# Initial exploration: use pure noise as action
noise = self.action_noise.noise(...)
curr_action = noise
else:
# Regular exploration: add noise to policy output
curr_action = self.actor(next_observation).detach()
noise = self.action_noise.noise(...)
curr_action += noise
else:
# Evaluation or deterministic policy use
curr_action = self.actor(next_observation).detach()
noise = zeros_like(curr_action)
return curr_action, noise
Modes of Operation:
learning_mode: Indicates that the agent is being trained.
evaluation_mode: Disables noise; used to assess the performance of a learned policy.
collect_initial_experience_mode: Special sub-phase during early episodes of training where we rely heavily on randomized exploration to populate the replay buffer with diverse samples.
The output of get_actions is then transformed into the actual bids by the calculate_bids function, which we will look at in the next chapter.
5. From Observation to Action to Bids#
In the previous chapters, we explored how an agent perceives its environment through observations and how it selects actions using its policy, optionally enriched with exploration noise. In this short chapter, we show how these two steps come together inside the calculate_bids() method.
There is no task in this chapter; we just walk you through the function.
5.1 The Role of calculate_bids()#
The calculate_bids() method defines how a market participant formulates its bid at each market interval. It brings together two crucial operations:
Generating Observations: Calls create_observation() to construct the full input vector (including both global and individual components).
Choosing an Action: Passes the observation to get_actions(), which invokes the actor network (and optionally adds noise) to return an action vector.
This forms the agent’s internal decision pipeline.
5.2 Action Normalization and Scaling#
The actor network produces a single continuous output between \(-1\) and \(+1\), which we interpret in two ways:
The sign of the output determines the bid direction:
action < 0: The agent wants to buy energy (i.e., charge the battery).
action ≥ 0: The agent wants to sell energy (i.e., discharge the battery).
The magnitude (absolute value) of the output determines the bid price, scaled by the agent's max_bid_price:
\(\text{bid\_price} = |\text{action}| \times \text{max\_bid\_price}\)
For example, an action of -0.7 translates to a buy bid with price 0.7 × max_bid_price, while +0.4 becomes a sell bid at 0.4 × max_bid_price. By modifying max_bid_price in the learning config, you directly influence the economic “aggressiveness” of the policy.
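As a small illustration of this mapping (action_to_bid is a hypothetical helper written for this tutorial, not part of ASSUME):

```python
def action_to_bid(action: float, max_bid_price: float) -> tuple[str, float]:
    """Map a single actor output in [-1, 1] to a bid direction and price."""
    direction = "buy" if action < 0 else "sell"
    bid_price = abs(action) * max_bid_price
    return direction, bid_price


print(action_to_bid(-0.7, 100))  # ('buy', 70.0)
print(action_to_bid(0.4, 100))   # ('sell', 40.0)
```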
5.3 Bid Structure#
Each bid submitted to the market follows a defined structure, encapsulated as a dictionary:
{
"start_time": start,
"end_time": end,
"price": bid_price,
"volume": max_power,
"node": unit.node,
}
Key aspects:
price: Determined from the scaled output of the policy.
volume: Set to the full technical charging or discharging power of the storage, considering the current state of charge.
node: Locational identifier (used for zonal/nodal pricing and congestion modeling).
Note that the volume of a sell (discharge) bid is positive, while the volume of a buy (charge) bid is negative to reflect load withdrawal.
5.4 Controlling Action Dimensions#
By changing the act_dim in the strategy constructor, you can control the number of outputs returned by the actor network:
act_dim = kwargs.pop("act_dim", 1)
This allows for richer bidding logic. For instance:
1 action: Bid price for total capacity.
2 actions: Bid price and bid volume.
3 actions: Add directionality or reserve offers.
However, it is important to note that RL performance deteriorates with high-dimensional action spaces, especially in continuous domains.
If you decide to increase act_dim, ensure that your calculate_bids() method is updated accordingly to interpret and transform all action elements correctly.
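As a hypothetical sketch of how two actions could be interpreted (the first output sets price and direction, the second scales the offered volume); this helper is illustrative only and not part of the strategy above:

```python
def split_actions(actions, max_bid_price, max_charge, max_discharge):
    """Hypothetical mapping for act_dim = 2: price/direction plus a volume fraction."""
    bid_price = abs(actions[0]) * max_bid_price
    volume_fraction = (actions[1] + 1) / 2             # map [-1, 1] to [0, 1]
    if actions[0] < 0:
        bid_volume = volume_fraction * max_charge      # buy bid: max_charge is negative
    else:
        bid_volume = volume_fraction * max_discharge   # sell bid: positive volume
    return bid_price, bid_volume


# Example: buy at 0.6 * max_bid_price with 50% of the maximum charging power
print(split_actions([-0.6, 0.0], 100, -80, 80))  # (60.0, -40.0)
```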
5.5 Full Code Implementation#
Here is the complete calculate_bids() implementation:
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def calculate_bids(
self,
unit,
market_config,
product_tuples,
**kwargs,
):
"""
Generates market bids based on the unit's current state and observations.
Args
----
unit : SupportsMinMaxCharge
The storage unit with information on charging/discharging capacity.
market_config : MarketConfig
Configuration of the energy market.
product_tuples : list[Product]
List of market products to bid on, each containing start and end times.
**kwargs : Additional keyword arguments.
Returns
-------
Orderbook
Structured bids including price, volume, and bid direction.
Notes
-----
Observations are used to calculate bid actions, which are then scaled and processed
into bids for submission in the market.
"""
start = product_tuples[0][0]
end_all = product_tuples[-1][1]
next_observation = self.create_observation(
unit=unit,
market_id=market_config.market_id,
start=start,
end=end_all,
)
# =============================================================================
# Get the Actions, based on the observations
# =============================================================================
actions, noise = self.get_actions(next_observation)
# =============================================================================
# 3. Transform Actions into bids
# =============================================================================
# the absolute value of the action determines the bid price
bid_price = abs(actions[0]) * self.max_bid_price
# the sign of the action determines the bid direction
if actions[0] < 0:
bid_direction = "buy"
elif actions[0] >= 0:
bid_direction = "sell"
_, max_discharge = unit.calculate_min_max_discharge(start, end_all)
_, max_charge = unit.calculate_min_max_charge(start, end_all)
bid_quantity_supply = max_discharge[0]
bid_quantity_demand = max_charge[0]
bids = []
if bid_direction == "sell":
bids.append(
{
"start_time": start,
"end_time": end_all,
"only_hours": None,
"price": bid_price,
"volume": bid_quantity_supply,
"node": unit.node,
}
)
elif bid_direction == "buy":
bids.append(
{
"start_time": start,
"end_time": end_all,
"only_hours": None,
"price": bid_price,
"volume": bid_quantity_demand, # negative value for demand
"node": unit.node,
}
)
if self.learning_mode:
self.learning_role.add_actions_to_cache(self.unit_id, start, actions, noise)
return bids
In the next chapter, we will define how to compute the reward associated with each bid outcome, which completes the agent’s learning cycle.
6. Reward Function Design#
The reward function is the central learning signal in any RL environment. It defines the objective the agent is trying to maximize and serves as the only feedback mechanism from the environment to the agent.
Designing the reward function is a delicate balance between:
Capturing realistic economic goals (e.g., profit maximization),
Enabling learning stability and convergence, and
Leaving room for the agent to discover unexpected, valid strategies.
It’s tempting to hard-code your preferred behavior into the reward function. However, this often leads to agents that are overly adapted to a specific scenario and perform poorly in general.
6.1 When Is the Reward Computed?#
In ASSUME, the reward is computed after the market clears, in the calculate_reward() method. At this point, the agent receives information about:
Which portion of its bid was accepted,
at what price,
and what operational costs it incurred, if any.
This allows us to calculate realized profit, which is the most direct economic reward signal.
6.2 RL Theory for Temporally Distributed Reward#
In RL, the agent’s goal is to maximise the expected sum of discounted future rewards:
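With \(r_{t+k}\) denoting the reward received \(k\) steps in the future, this objective can be written in the standard form \(G_t = \mathbb{E}\big[\sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k}\big]\).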
The discount factor \(\gamma\) (typically between 0.95 and 0.99) controls how much future rewards are valued compared to immediate ones.
For storage units, this matters a lot: charging leads to short-term losses, while discharging later yields profits. The agent must therefore learn to delay gratification and value future gains.
Choosing a high discount factor (e.g. \(\gamma = 0.999\)) is essential so the agent connects today’s cost with tomorrow’s profit.
If \(\gamma\) is too low, the agent may avoid charging altogether, failing to discover arbitrage opportunities.
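To get a feeling for these numbers, consider a profit that is realised 12 hours after the corresponding charging cost:

```python
# Weight applied to a profit that arrives 12 hourly steps after the charging cost
for gamma in (0.95, 0.99, 0.999):
    print(f"gamma={gamma}: weight after 12 steps = {gamma**12:.3f}")
# gamma=0.95  -> 0.540
# gamma=0.99  -> 0.886
# gamma=0.999 -> 0.988
```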
6.3 Exercise 3: Implement Profit-Based Reward for Storage Units#
Your next task is to implement a profit-based reward for a storage unit.
Unlike generators, storage units may buy energy (at a cost) and sell it later (for a profit). So, their reward must reflect charging costs and discharging revenue, depending on the action taken.
Use the following formula:
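Based on the variable definitions below (and matching the solution code), the reward is the realised profit of the accepted order:

\(\text{Profit} = M_t \cdot P^\text{conf} \cdot dt - \left| mc_{i,t} \cdot P^\text{conf} \cdot dt \right|\)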
Where:
\(P^\text{conf}\): Confirmed volume (positive for discharge / sell, negative for charge / buy),
\(M_t\): Accepted market price,
\(mc_{i,t}\): Marginal cost of charging or discharging, if any,
\(dt\): Duration in hours.
You can access the required quantities using:
```python
accepted_volume = order["accepted_volume"]
accepted_price = order["accepted_price"]
duration = (end - start) / timedelta(hours=1)

marginal_cost = unit.calculate_marginal_cost(
    start, unit.outputs[marketconfig.product_type].at[start]
)
marginal_cost += unit.get_starting_costs(int(duration))
```
[ ]:
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def calculate_reward(self, unit, marketconfig, orderbook):
"""
Reward function: implement profit-based reward for storage agents.
"""
order = orderbook[0]
start = order["start_time"]
end = order["end_time"]
end_excl = end - unit.index.freq
next_time = start + unit.index.freq
duration = (end - start) / timedelta(hours=1)
# === Access values ===
accepted_price = None # YOUR CODE
accepted_volume = None # YOUR CODE
marginal_cost = unit.calculate_marginal_cost(
start, unit.outputs[marketconfig.product_type].at[start]
)
marginal_cost += unit.get_starting_costs(int(duration))
# === Compute profit ===
order_profit = None # YOUR CODE
order_cost = None # YOUR CODE
profit = None # YOUR CODE
# === Scale reward ===
scaling = 1 / (self.max_bid_price * unit.max_power_discharge)
reward = scaling * profit
# === Update stored energy cost ===
self.update_cost_stored_energy(
unit=unit,
start=start,
next_time=next_time,
current_soc=unit.outputs["soc"].at[start],
next_soc=unit.outputs["soc"].at[next_time],
accepted_volume=accepted_volume,
accepted_price=accepted_price,
marginal_cost=marginal_cost,
duration_hours=duration,
max_bid_price=self.max_bid_price,
)
# === Store results ===
# Note: these are not learning-specific results but stored for all units for analysis
unit.outputs["profit"].loc[start:end_excl] += profit
unit.outputs["total_costs"].loc[start:end_excl] += order_cost
# write rl-rewards to buffer
if self.learning_mode:
self.learning_role.add_reward_to_cache(unit.id, start, reward, 0, profit)
6.4 Why Just the Profit as Feedback?#
In contrast to other agent types, storage units use only realised profit as a reward signal — without including opportunity cost or regret terms.
For storage, defining missed opportunities is difficult:
Profit depends on temporal strategies (charge now, discharge later).
Simple heuristics often provide misleading incentives.
Unlike generators, there’s no clear rule like “produce if price > cost” that works somewhat reliably.
In theory, we could compute opportunity costs by comparing the agent’s profit to a hindsight optimal schedule.
But this is:
Computationally infeasible at every step,
Non-scalable in multi-agent settings.
6.6 Reward Scaling and Learning Stability#
Scaling the reward to a narrow and consistent range is crucial for stable RL. This is particularly important in continuous-action settings like bidding, where one overly large reward spike can skew the policy updates significantly.
1. Why scale?
Stabilizes gradients during actor-critic training.
Makes different time steps comparable in magnitude.
Prevents the agent from overfitting to rare but extreme events.
2. What can go wrong?
If your scaling factor is too small:
Rewards become indistinguishable from noise.
If your scaling factor is too large:
A single high-reward event (e.g., bidding into a rare price spike) can dominate learning, making the agent try to reproduce that event rather than learn a general policy.
Tip: Use conservative scaling based on maximum realistic bid × capacity:
scaling = 1 / (self.max_bid_price * unit.max_power_discharge)
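For a rough sense of the magnitudes (the numbers below are hypothetical and not taken from the scenario):

```python
# Hypothetical values: 100 €/MWh maximum bid price and 100 MW discharge capacity
max_bid_price = 100.0
max_power_discharge = 100.0

scaling = 1 / (max_bid_price * max_power_discharge)  # 1e-4
profit = 2_000.0  # € earned by one accepted hourly order
print(scaling * profit)  # 0.2
```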
3. Recommended Practice
Before committing to training:
Plot the distribution of rewards across time steps for a few sample runs.
Check for outliers, saturation, or skewness.
If needed, adjust scaling or cap outliers in reward postprocessing (see the sketch below).
This diagnostic step can save hours of failed training runs.
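A minimal diagnostic sketch for this check (assuming a short sample run has already written results to the rl_params table introduced in chapter 7; adjust the database path to your setup):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Load per-timestep rewards from a short sample run and inspect their distribution
engine = create_engine("sqlite:///../local_db/assume_db.db")
rewards = pd.read_sql("SELECT reward FROM rl_params", engine)

print(rewards["reward"].describe())  # spread, min/max, potential outliers
rewards["reward"].hist(bins=50)
plt.xlabel("Reward")
plt.ylabel("Frequency")
plt.title("Reward distribution across time steps")
plt.show()
```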
Solution Exercise 3
[ ]:
# @title Solution Exercise 3: Implement Reward Function
# we define the class again and inherit from the initial class just to add the additional method to the original class
# this is a workaround to have different methods of the class in different cells
# which is good for the purpose of this tutorial
# however, you should have all functions in a single class when using this example in .py files
class StorageEnergyLearningStrategy(StorageEnergyLearningStrategy):
def calculate_reward(self, unit, marketconfig, orderbook):
"""
Reward function: implement profit-based reward for storage agents.
"""
order = orderbook[0]
start = order["start_time"]
end = order["end_time"]
end_excl = end - unit.index.freq
next_time = start + unit.index.freq
duration = (end - start) / timedelta(hours=1)
# === Access values ===
accepted_price = order["accepted_price"]
accepted_volume = order["accepted_volume"]
marginal_cost = unit.calculate_marginal_cost(
start, unit.outputs[marketconfig.product_type].at[start]
)
marginal_cost += unit.get_starting_costs(int(duration))
# === Compute profit ===
order_profit = accepted_price * accepted_volume * duration
order_cost = abs(marginal_cost * accepted_volume * duration)
profit = order_profit - order_cost
# === Scale reward ===
scaling = 1 / (self.max_bid_price * unit.max_power_discharge)
reward = scaling * profit
# === Update stored energy cost ===
self.update_cost_stored_energy(
unit=unit,
start=start,
next_time=next_time,
current_soc=unit.outputs["soc"].at[start],
next_soc=unit.outputs["soc"].at[next_time],
accepted_volume=accepted_volume,
accepted_price=accepted_price,
marginal_cost=marginal_cost,
duration_hours=duration,
max_bid_price=self.max_bid_price,
)
# === Store results ===
# Note: these are not learning-specific results but stored for all units for analysis
unit.outputs["profit"].loc[start:end_excl] += profit
unit.outputs["total_costs"].loc[start:end_excl] += order_cost
# write rl-rewards to buffer
if self.learning_mode:
self.learning_role.add_reward_to_cache(unit.id, start, reward, 0, profit)
6.7 Summary#
The reward function is the core signal guiding agent learning—design it carefully.
For storage units, use realised profit as the primary reward.
Avoid opportunity cost terms unless you can compute them reliably—storage bidding is temporally coupled and hard to benchmark heuristically.
Always normalize your reward to maintain training stability.
Analyze your reward distribution empirically before training large-scale agents.
In the next chapter, we will bring together all the components—observation, action, and reward—and simulate a full training run using your custom learning strategy.
7. Training and Evaluating Your First Learning Agent#
You have now implemented all essential components of a learning bidding strategy in ASSUME:
Observations
Actions and exploration
Reward function
In this chapter, you will connect your strategy to a simulation scenario, configure the learning algorithm, and evaluate the agent’s training progress.
7.1 Load and Inspect the Learning Configuration#
Each simulation scenario in ASSUME has an associated YAML configuration file. This file contains the learning configuration, which determines how the RL algorithm is executed.
[ ]:
scenario = "base"
# Read the YAML file
with open(f"{inputs_path}/example_02e/config.yaml") as file:
config = yaml.safe_load(file)
# Print the learning config
print(f"Learning config for scenario '{scenario}':")
display(config[scenario]["learning_config"])
Explanation of Learning Configuration Parameters
| Parameter | Description |
|---|---|
| `learning_mode` | If `True`, the agents are trained during the simulation. |
| `continue_learning` | If `True`, training continues from previously saved policies. |
| `trained_policies_save_path` | File path where trained policies will be saved. |
| `trained_policies_load_path` | Path to pre-trained policies to load. |
| `max_bid_price` | Used to scale action outputs to economic bid prices. |
| `algorithm` | Learning algorithm used (e.g., MATD3). |
| `learning_rate` | Step size for policy and critic updates. |
| `training_episodes` | Number of simulation episodes (repetitions of the time horizon) used for training. |
| `episodes_collecting_initial_experience` | Number of episodes during which agents collect experience using guided exploration. |
| `train_freq` | Time between training updates. |
| `gradient_steps` | Number of gradient descent steps per update. |
| `batch_size` | Size of the experience batch used for training. |
| `gamma` | Discount factor for future rewards (\(0 < \gamma \leq 1\)). |
| `device` | Computation device used for training (CPU or GPU). |
| `action_noise_schedule` | How the action noise evolves over time (decay schedule). |
| `noise_sigma` | Standard deviation of exploration noise. |
| `noise_scale` | Global multiplier for noise. |
| `noise_dt` | Discretization interval for noise time series. |
| `validation_episodes_interval` | How often (in episodes) to evaluate the current policy without exploration. |
7.2 Run the Simulation and Train the two Agents#
The simulation environment and learning strategy are connected and executed as follows:
Hint: In Google Colab, long-running training sessions may occasionally crash or disconnect if the output console is flooded — for example, by verbose progress bars or print statements. To prevent this, you can suppress output during training using the following approach.
Import the required tools:
```python
from contextlib import redirect_stdout, redirect_stderr
import os
```
Wrap the training phase with output redirection. Insert the following lines just before Step 4 (Run the training phase):
```python
# Suppress output for the entire training process
with open(os.devnull, "w") as devnull:
    with redirect_stdout(devnull), redirect_stderr(devnull):
        # Your training function call goes here
        train_agents(...)
```
✅ This redirects all `stdout` and `stderr` to `/dev/null`, preventing Colab from being overwhelmed by output and improving session stability.
[ ]:
log = logging.getLogger(__name__)
csv_path = "outputs"
os.makedirs("local_db", exist_ok=True)
np.random.seed(42) # Set a random seed for reproducibility
if __name__ == "__main__":
db_uri = "sqlite:///../local_db/assume_db.db"
scenario = "example_02e"
study_case = "base"
# 1. Create simulation world
world = World(database_uri=db_uri, export_csv_path=csv_path)
# 2. Register your learning strategy
world.bidding_strategies["storage_energy_learning"] = StorageEnergyLearningStrategy
# 3. Load scenario and case
load_scenario_folder(
world,
inputs_path=inputs_path,
scenario=scenario,
study_case=study_case,
)
# 4. Run the training phase
if world.learning_mode:
run_learning(world)
# 5. Execute final evaluation run (no exploration)
world.run()
This script will:
Train the agent using your defined strategy.
Periodically evaluate the agent using a noise-free policy.
Save training data into the database for post-analysis.
7.3 Analyze Learning Performance#
Once training is complete, we can evaluate the learning progress of your RL agent using data from the simulation database. ASSUME stores detailed training metrics in the rl_params table, which includes rewards for each time step, grouped by episode, unit, and whether the agent was in evaluation mode.
In this case, we are interested in the performance of the learning storage units within the simulation ``example_02e_base``.
We’ll extract the recorded rewards, group them by episode, and plot the average reward over time for both training and evaluation phases.
Instead of accessing the training evaluation via the database, you can also use the TensorBoard integration, which can be launched from the console:
tensorboard --logdir tensorboard
[ ]:
# Connect to the simulation database
engine = create_engine("sqlite:///../local_db/assume_db.db")
# Query rewards for specific simulation and unit
sql = """
SELECT
datetime,
unit,
reward,
simulation,
evaluation_mode,
episode
FROM rl_params
WHERE simulation = 'example_02e_base'
ORDER BY datetime
"""
# Load query results
rewards_df = pd.read_sql(sql, engine)
# Rename column for consistency
rewards_df.rename(columns={"evaluation_mode": "evaluation"}, inplace=True)
# --- Separate plots for training and evaluation ---
fig, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=False)
# Plot training rewards (evaluation == 0)
train_df = rewards_df[rewards_df["evaluation"] == 0]
train_grouped = train_df.groupby("episode")["reward"].mean()
axes[0].plot(train_grouped.index, train_grouped.values, "s-", color="tab:blue")
axes[0].set_title("Training Reward per Episode")
axes[0].set_ylabel("Average Reward")
axes[0].grid(True)
# Plot evaluation rewards (evaluation == 1)
eval_df = rewards_df[rewards_df["evaluation"] == 1]
eval_grouped = eval_df.groupby("episode")["reward"].mean()
axes[1].plot(eval_grouped.index, eval_grouped.values, "s-", color="tab:green")
axes[1].set_title("Evaluation Reward per Episode")
axes[1].set_xlabel("Episode")
axes[1].set_ylabel("Average Reward")
axes[1].grid(True)
plt.tight_layout()
plt.show()
What This Shows
Training curve: Captures learning progress with exploration noise.
Evaluation curve: Tracks the performance of the evaluation/validation run without noise, which is performed every validation_episodes_interval episodes, as defined in the learning_config.
7.4 Exercise 4: Get a feeling for the Reward#
Why is the reward so small? To better understand why the observed reward is so small, revisit the definition of the reward function. Pay particular attention to the aggregation function used in the plotting cell above. What exactly is being plotted?
Hint: Think about whether the plotted value reflects individual time steps or aggregated performance, and how inactive hours might affect the result.
What happens to the individual units? The plot shows the average reward across both learning storage units. What would you expect the reward curve of a single storage unit to look like? Will it follow a similar trend, or could there be differences? Consider what impact one unit being more active or successful than the other might have on the average.
Take a moment to think through this before checking the explanation below.
Solution Exercise 4: The reward is based on the scaled profit of each storage unit:
scaling = 1 / (self.max_bid_price * unit.max_power_discharge)
reward = scaling * profit
This scaling ensures that the reward stays within a small, consistent numerical range. Additionally, what is plotted is the average reward per time step across an entire episode — typically defined as one month, i.e., all hours within that month.
In many of these hours, the agent does not act (i.e., the profit is zero), which pulls the average down. If you change the aggregation function from mean to sum, you’ll notice that the overall reward becomes larger, and you’ll see how the zero-reward hours affect the average.
By understanding this, you can better interpret agent performance and evaluate whether learning is actually occurring — even if the average reward appears small.
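For example, switching the aggregation in the plotting cell above from the per-timestep mean to the episode total is a one-line change:

```python
# Total reward per episode instead of the per-timestep average
train_grouped = train_df.groupby("episode")["reward"].sum()
```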
Let’s look at individual units
[ ]:
# Connect to the simulation database
engine = create_engine("sqlite:///../local_db/assume_db.db")
# Query rewards for specific simulation and unit
sql = """
SELECT
datetime,
unit,
reward,
simulation,
evaluation_mode,
episode
FROM rl_params
WHERE simulation = 'example_02e_base'
ORDER BY datetime
"""
# Load query results
rewards_df = pd.read_sql(sql, engine)
# Rename column for consistency
rewards_df.rename(columns={"evaluation_mode": "evaluation"}, inplace=True)
# Get unique units
units = rewards_df["unit"].unique()
# --- Separate plots for training and evaluation ---
fig, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=False)
# Plot training rewards (evaluation == 0) - one line per unit
train_df = rewards_df[rewards_df["evaluation"] == 0]
for unit in units:
unit_data = train_df[train_df["unit"] == unit]
train_grouped = unit_data.groupby("episode")["reward"].mean()
axes[0].plot(
train_grouped.index, train_grouped.values, "o-", label=f"{unit}", alpha=0.7
)
axes[0].set_title("Training Reward per Episode (by Unit)")
axes[0].set_ylabel("Average Reward")
axes[0].legend()
axes[0].grid(True)
# Plot evaluation rewards (evaluation == 1) - one line per unit
eval_df = rewards_df[rewards_df["evaluation"] == 1]
for unit in units:
unit_data = eval_df[eval_df["unit"] == unit]
eval_grouped = unit_data.groupby("episode")["reward"].mean()
axes[1].plot(
eval_grouped.index, eval_grouped.values, "s-", label=f"{unit}", alpha=0.7
)
axes[1].set_title("Evaluation Reward per Episode (by Unit)")
axes[1].set_xlabel("Episode")
axes[1].set_ylabel("Average Reward")
axes[1].legend()
axes[1].grid(True)
plt.tight_layout()
plt.show()
As seen in the plots above, one storage unit (Storage 2) learns a profitable strategy, while the other (Storage 1) shows little to no improvement over time. This divergence is due to differences in initialisation and learning dynamics.
One key factor is that Storage 2 quickly learns that its actions can influence the market price — it becomes a price setter in certain hours. This feedback between its bidding strategy and the resulting price allows it to understand the reward signal more clearly and improve faster. In contrast, Storage 1 rarely becomes price-setting and thus finds it harder to link its actions to outcomes. Without this feedback loop, learning is significantly slower or even stagnant. However, the slight increase in the evaluation reward at the end indicates that Storage 1 might still recover.
To mitigate this, we often use a warm start strategy in practice: agents are initialised with policies that have already learned basic behavioral patterns, such as charging before discharging, or how to bid in a stationary environment. This helps agents reach the price-setting regime more quickly and facilitates meaningful learning, especially in multi-agent setups.
7.5 Summary#
You have now run your first complete training loop in ASSUME.
The learning configuration defines all key training parameters—review them carefully.
After training, rewards from rl_params allow you to inspect and validate agent behavior.
The separation of training and evaluation rewards is key to understanding generalization.
Agent performance may vary significantly due to initialisation and the ability to influence market prices.
To support learning, agents are often warm-started with strategies that already capture basic bidding logic.
In the next chapter, you may proceed to analyze simulation outcomes in greater detail (e.g., market prices, total costs, capacity dispatch), or compare different agent configurations.
8. Analyzing Bidding Behavior#
Now that your agent has completed training, we shift our focus to a critical and more insightful question:
What did the agent actually learn?
This chapter analyzes the actual bids submitted by the agent and evaluates whether the agent developed a reasonable bidding behavior.
8.2. Extract and Plot the Agent’s Bids#
We will extract the bids submitted by Storage 2 from the market_orders table and plot them over time.
[ ]:
# Connect to database
engine = create_engine("sqlite:///../local_db/assume_db.db")
# Query bids from Storage 2 in simulation example_02e_base and market EOM
sql = """
SELECT
start_time AS time,
accepted_price,
unit_id,
simulation,
accepted_volume
FROM market_orders
WHERE simulation = 'example_02e_base'
AND unit_id = 'Storage 2'
AND market_id = 'EOM'
ORDER BY start_time
"""
bids_df = pd.read_sql(sql, engine)
bids_df["time"] = pd.to_datetime(bids_df["time"])
buy_bids = bids_df[bids_df["accepted_volume"] < 0].copy()
sell_bids = bids_df[bids_df["accepted_volume"] > 0].copy()
# plot sell and buy bids
plt.figure(figsize=(14, 7))
plt.plot(
sell_bids["time"],
sell_bids["accepted_price"],
"o",
label="Sell Bids",
color="tab:orange",
)
plt.plot(
buy_bids["time"],
buy_bids["accepted_price"],
"o",
label="Buy Bids",
color="tab:blue",
)
plt.title("Storage Bidding Behavior: Charging vs Discharging")
plt.xlabel("Time")
# just plot one day
plt.xlim(bids_df["time"].max() - pd.Timedelta(days=1), bids_df["time"].max())
plt.ylabel("Accepted Market Price (€/MWh)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
8.3. What Does This Show?#
The plot above shows the accepted market prices for storage bids over time, separated into buy bids (charging) and sell bids (discharging). Each point represents a successful bid that was accepted by the market at a specific time.
Blue dots indicate charging actions (buy bids), where the storage unit purchases electricity at lower prices.
Orange dots represent discharging actions (sell bids), where electricity is sold back to the market at higher prices.
From the visual distribution, we can observe a typical storage behavior:
Charging occurs during low-price hours, typically at night or early morning.
Discharging is concentrated in higher-price hours, typically in the afternoon or evening.
This indicates that the agent has learned a basic arbitrage strategy — charge low, discharge high — which aligns with economic incentives. The spread between buy and sell prices is essential for generating profit and, consequently, positive rewards.
While this gives insight into the agent’s strategy and confidence, it’s important to compare this behavior against a benchmark to judge its effectiveness.
For a single storage unit, a common benchmark is a perfect-foresight optimization, which computes the best possible charging and discharging schedule based on known future prices. This shows how close the RL agent gets to the theoretical optimum.
However, if you have multiple storage agents, their actions can influence market prices. In this case, the environment becomes strategic and interdependent, and a simple optimization no longer reflects the true benchmark.
For such settings, the appropriate comparison is a Mathematical Program with Equilibrium Constraints (MPEC) or other game-theoretic models. These account for the fact that agents anticipate their own market impact, and provide a consistent equilibrium benchmark.
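As a sketch of such a perfect-foresight benchmark for a single, price-taking storage unit, the following linear program (using scipy.optimize.linprog) maximises arbitrage profit over a day of hypothetical prices; the unit parameters and the efficiency handling are simplified assumptions, not values from the scenario:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical hourly prices for one day (€/MWh)
prices = np.array([30, 25, 20, 18, 22, 35, 50, 60, 55, 45, 40, 38,
                   36, 34, 33, 37, 48, 65, 70, 62, 50, 42, 35, 32], dtype=float)
T = len(prices)
p_max, e_max, eta = 100.0, 400.0, 0.9  # MW limit, MWh capacity, efficiency (assumed)

# Variables: x = [charge_0..charge_{T-1}, discharge_0..discharge_{T-1}], all >= 0
# Objective: minimise purchase cost minus sales revenue, i.e. maximise profit
c = np.concatenate([prices, -prices])

# State of charge after hour t: soc_t = sum_{k<=t} (eta * charge_k - discharge_k / eta)
lower_tri = np.tril(np.ones((T, T)))
A_soc = np.hstack([eta * lower_tri, -lower_tri / eta])
A_ub = np.vstack([A_soc, -A_soc])                       # enforce 0 <= soc_t <= e_max
b_ub = np.concatenate([np.full(T, e_max), np.zeros(T)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, p_max)] * (2 * T), method="highs")
charge, discharge = res.x[:T], res.x[T:]
print("Perfect-foresight profit (€):", round(-res.fun, 2))
```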
8.4. Summary#
You analyzed the learned bidding strategy of the storage unit.
Investigating how RL design choices impact market modeling results is crucial.
Don’t forget to benchmark your results against other models, e.g., optimization or game-theoretic models.
9. Summary and Outlook#
9.1 What You Built#
Over the course of this tutorial, you developed a complete RL bidding strategy for a storage unit in the ASSUME framework. You constructed and trained a learning agent that can:
Observe market signals and internal state-of-charge dynamics.
Decide when to charge or discharge based on learned economic strategies.
Receive profit-based reward signals and adapt behavior over time.
React to changing market conditions with temporally coordinated actions.
9.2 What You Learned#
Throughout the tutorial, you explored the end-to-end learning pipeline for storage units in a realistic market setting:
How to construct observation spaces that reflect temporal coupling and internal energy cost.
How to define bid direction and price using a compact action space.
Why realized profit is used as the reward signal, and why opportunity cost is avoided for storage.
How to scale rewards and update the cost of stored energy after market interactions.
How to train and evaluate a storage agent’s behavior in multi-agent DRL simulations.
9.3 What You Can Try Next#
Your implementation is modular and extensible. Here are several directions you can explore on your own:
Adjust Learning Parameters
Experiment with:
learning_rate, gamma, noise_sigma, episodes_collecting_initial_experience, validation_episodes_interval, train_freq, or gradient_steps
Observe how these changes affect convergence, stability, and bidding behavior.
Try Different Scenarios
Adjust the scenario inputs of example 02e:
Remove the second storage unit from the storage_units.csv file.
Add many learning agents, simulating a highly competitive environment.
Compare bidding behavior and reward dynamics across settings.
Dive into other tutorials
If you are interested in the general learning algorithm and how it is integrated into ASSUME, look into 04a_RL_algorithm_example.
In this small example we could see what good bidding behavior of the agent might look like and, hence, can judge learning easily. But what if we model many agents in new simulations? We provide explainable RL mechanisms for you to dive into in another tutorial: 09_example_Sim_and_xRL.