Reinforcement learning algorithms#
Submodules#
assume.reinforcement_learning.buffer module#
- class assume.reinforcement_learning.buffer.ReplayBuffer(buffer_size: int, obs_dim: int, act_dim: int, n_rl_units: int, device: str, float_type)#
Bases:
object
- add(obs: ndarray, actions: ndarray, reward: ndarray)#
Adds an observation, action, and reward of all agents to the replay buffer.
- Parameters:
obs (numpy.ndarray) – The observation to add.
actions (numpy.ndarray) – The actions to add.
reward (numpy.ndarray) – The reward to add.
- sample(batch_size: int) ReplayBufferSamples#
Samples a random batch of experiences from the replay buffer.
- size()#
Return the current size of the buffer (i.e. number of transitions stored in the buffer).
- Returns:
The current size of the buffer
- Return type:
int
- to_torch(array: array, copy=True)#
Converts a numpy array to a PyTorch tensor. Note: It copies the data by default.
- Parameters:
array (numpy.ndarray) – The numpy array to convert.
copy (bool, optional) – Whether to copy the data or not (may be useful to avoid changing things by reference). Defaults to True.
- Returns:
The converted PyTorch tensor.
- Return type:
torch.Tensor
- class assume.reinforcement_learning.buffer.ReplayBufferSamples(observations, actions, next_observations, rewards)#
Bases:
NamedTuple
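A minimal usage sketch of the buffer classes above. The dimensions, the array shapes passed to add(), and the use of a torch dtype for float_type are assumptions for illustration; the documented signatures only specify the argument names and numpy.ndarray types.

```python
import numpy as np
import torch as th

from assume.reinforcement_learning.buffer import ReplayBuffer

# Hypothetical dimensions for illustration only.
n_rl_units, obs_dim, act_dim = 4, 50, 1

buffer = ReplayBuffer(
    buffer_size=1000,
    obs_dim=obs_dim,
    act_dim=act_dim,
    n_rl_units=n_rl_units,
    device="cpu",
    float_type=th.float32,  # assumed to accept a torch dtype
)

# Add a few transitions; each call stores one step for all learning units at once.
# The per-argument shapes below are assumptions, not part of the documented API.
for _ in range(10):
    obs = np.random.rand(n_rl_units, obs_dim)
    actions = np.random.rand(n_rl_units, act_dim)
    rewards = np.random.rand(n_rl_units, 1)
    buffer.add(obs, actions, rewards)

# sample() returns a ReplayBufferSamples named tuple of torch tensors
# (observations, actions, next_observations, rewards).
samples = buffer.sample(batch_size=4)
print(samples.observations.shape)
```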
assume.reinforcement_learning.learning_role module#
- class assume.reinforcement_learning.learning_role.Learning(learning_config: LearningConfig, start: datetime, end: datetime)#
Bases:
Role
This class manages the learning process of reinforcement learning agents, including initializing key components such as neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on the provided learning configuration.
- Parameters:
learning_config (LearningConfig) – The configuration for the learning process.
start (datetime.datetime) – The start datetime for the simulation.
end (datetime.datetime) – The end datetime for the simulation.
- add_actions_to_cache(unit_id, start, action, noise) None#
Add the action and noise to the cache dict, per unit_id.
- Parameters:
unit_id (str) – The id of the unit.
action (torch.Tensor) – The action to be added.
noise (torch.Tensor) – The noise to be added.
- add_observation_to_cache(unit_id, start, observation) None#
Add the observation to the cache dict, per unit_id.
- Parameters:
unit_id (str) – The id of the unit.
observation (torch.Tensor) – The observation to be added.
- add_reward_to_cache(unit_id, start, reward, regret, profit) None#
Add the reward to the cache dict, per unit_id.
- compare_and_save_policies(metrics: dict) bool#
Compare evaluation metrics and save policies based on the best achieved performance according to the metrics calculated.
This method compares the evaluation metrics, such as reward, profit, and regret, and saves the policies if they achieve the best performance in their respective categories. It iterates through the specified modes, compares the current evaluation value with the previous best, and updates the best value if necessary. If an improvement is detected, it saves the policy and associated parameters.
metrics contains a metric key, such as "reward", and its current value. This function stores the policies with the highest metric value, so if a metric should be minimized, pass its negated value instead, for example "minus_regret", which is then maximized.
- Returns:
True if the early stopping criterion is triggered.
- Return type:
bool
Note
This method is typically used during the evaluation phase to save policies that achieve superior performance. The best evaluation metric is still being assessed by the development team; preliminarily, the average reward is used.
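A hedged usage sketch of the metric convention described above: since compare_and_save_policies maximizes each metric, a quantity that should be minimized (such as regret) is passed with a flipped sign. The variable names and values are placeholders.

```python
# Placeholder evaluation results for illustration.
avg_reward = 120.5
avg_regret = 3.2

metrics = {
    "reward": avg_reward,          # higher is better, passed as-is
    "minus_regret": -avg_regret,   # negated so that maximizing it minimizes regret
}

# learning_role is assumed to be an initialized Learning instance:
# if learning_role.compare_and_save_policies(metrics):
#     print("Early stopping criterion triggered.")
```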
- create_learning_algorithm(algorithm: RLAlgorithm)#
Create and initialize the reinforcement learning algorithm.
This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm is associated with the learning role and configured with relevant hyperparameters.
- Parameters:
algorithm (RLAlgorithm) – The name of the reinforcement learning algorithm.
- determine_validation_interval() int#
Compute and validate validation_interval.
- Returns:
validation_interval (int)
- Raises:
ValueError – If training_episodes is too small.
- get_inter_episodic_data()#
Dump the inter-episodic data to a dict for storing across simulation runs.
- Returns:
The inter-episodic data to be stored.
- Return type:
dict
- init_logging(simulation_id: str, episode: int, eval_episode: int, db_uri: str, output_agent_addr: str, train_start: str)#
Initialize the logging for the reinforcement learning agent.
This method initializes the TensorBoard logger for the reinforcement learning agent. It also initializes the parameters required for sending data to the output role.
- Parameters:
simulation_id (str) – The unique identifier for the simulation.
episode (int) – The current training episode number.
eval_episode (int) – The current evaluation episode number.
db_uri (str) – URI for connecting to the database.
output_agent_addr (str) – The address of the output agent.
train_start (str) – The start time of the simulation.
- initialize_policy(actors_and_critics: dict = None) None#
Initialize the policy of the reinforcement learning agent considering the respective algorithm.
This method initializes the policy (actor) of the reinforcement learning agent. It checks whether the learning process should continue with stored policies from a previous training run. If so, it loads the policies from the specified directory; otherwise, it initializes new policies.
- load_inter_episodic_data(inter_episodic_data)#
Load the inter-episodic data from the dict stored across simulation runs.
- Parameters:
inter_episodic_data (dict) – The inter-episodic data to be loaded.
- on_ready()#
Set up the learning role for reinforcement learning training.
Notes
This method prepares the learning role for the reinforcement learning training process. It subscribes to relevant messages for handling the training process and schedules recurrent tasks for policy updates based on the specified training frequency. This cannot happen in the init, since the context (cf. mango agents) is not yet available there. To avoid inconsistent replay buffer states (e.g. an observation and action have been stored but not the reward), this slightly shifts the timing of the buffer updates.
- register_strategy(strategy: LearningStrategy) None#
Register a learning strategy with this learning role.
- Parameters:
strategy (LearningStrategy) – The learning strategy to register.
- sync_train_freq_with_simulation_horizon() str | None#
Ensure that self.train_freq evenly divides the simulation length. If it does not, adjust self.train_freq in place and return the new frequency string; otherwise, return None. Uses self.start_datetime/self.end_datetime when available, otherwise falls back to timestamp fields.
- turn_off_initial_exploration(loaded_only=False) None#
Disable initial exploration mode.
If loaded_only=True, only turn off exploration for strategies that were loaded (used in continue_learning mode). If loaded_only=False, turn it off for all strategies.
- Parameters:
loaded_only (bool) – Whether to disable exploration only for loaded strategies.
- write_rl_grad_params_to_output(learning_rate: float, unit_params_list: list[dict]) None#
Writes learning parameters and critic losses to output at specified time intervals.
This function processes training metrics for each critic over multiple time steps and sends them to a database for storage. It tracks the learning rate and critic losses across training iterations, associating each record with a timestamp.
- write_rl_params_to_output(cache)#
Sends the current rl_strategy update to the output agent.
- Parameters:
cache (dict) – The cached RL strategy data to be sent to the output agent.
assume.reinforcement_learning.learning_utils module#
- class assume.reinforcement_learning.learning_utils.NormalActionNoise(action_dimension, mu=0.0, sigma=0.1, scale=1.0, dt=0.9998)#
Bases:
object
A Gaussian action noise that supports direct tensor creation on a given device.
- noise(device=None, dtype=torch.float32)#
Generates noise using torch.normal(), ensuring efficient execution on GPU if needed.
- Parameters:
device (torch.device, optional) – Target device (e.g., 'cuda' or 'cpu').
dtype (torch.dtype, optional) – Data type of the tensor. Defaults to torch.float32.
- Returns:
Noise tensor on the specified device.
- Return type:
torch.Tensor
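A short usage sketch, assuming the noise is added to an actor output during exploration. The action dimension, the placeholder action tensor, and the assumption that noise() returns a tensor of shape (action_dimension,) are illustrative only.

```python
import torch as th

from assume.reinforcement_learning.learning_utils import NormalActionNoise

action_noise = NormalActionNoise(action_dimension=2, mu=0.0, sigma=0.1, scale=1.0)

# Placeholder for an actor output on some device.
curr_action = th.zeros(2)

# noise() creates the tensor directly on the requested device via torch.normal();
# the returned shape is assumed to match the action dimension.
noisy_action = curr_action + action_noise.noise(device=curr_action.device, dtype=th.float32)
```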
- class assume.reinforcement_learning.learning_utils.OUNoise(action_dimension, mu=0, sigma=0.5, theta=0.15, dt=0.01)#
Bases:
object
A class that implements Ornstein-Uhlenbeck noise.
- noise()#
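The noise() method is not documented here; the following is a minimal sketch of a standard Ornstein-Uhlenbeck process using the documented default parameters. It is a hypothetical illustration of the technique, not the library's actual implementation.

```python
import numpy as np

class OUNoiseSketch:
    """Hypothetical minimal Ornstein-Uhlenbeck noise with the documented defaults."""

    def __init__(self, action_dimension, mu=0.0, sigma=0.5, theta=0.15, dt=0.01):
        self.action_dimension = action_dimension
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.state = np.full(action_dimension, mu, dtype=np.float64)

    def noise(self):
        # Standard OU update: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = self.theta * (self.mu - self.state) * self.dt
        dx += self.sigma * np.sqrt(self.dt) * np.random.standard_normal(self.action_dimension)
        self.state = self.state + dx
        return self.state
```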
- assume.reinforcement_learning.learning_utils.constant_schedule(val: float) Callable[[float], float]#
Create a function that returns a constant value. This is useful for learning rate schedules (to avoid code duplication).
- Parameters:
val – constant value
- Returns:
Constant schedule function.
Note
From SB3: DLR-RM/stable-baselines3
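A minimal sketch matching the described behavior (not the library code itself): the returned function ignores the training progress and always yields the same value.

```python
from typing import Callable

def make_constant_schedule(val: float) -> Callable[[float], float]:
    def func(progress_remaining: float) -> float:
        return val  # training progress is ignored
    return func

lr_schedule = make_constant_schedule(1e-4)
assert lr_schedule(1.0) == lr_schedule(0.0) == 1e-4
```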
- assume.reinforcement_learning.learning_utils.copy_layer_data(dst, src)#
- assume.reinforcement_learning.learning_utils.linear_schedule_func(start: float, end: float = 0, end_fraction: float = 1) Callable[[float], float]#
Create a function that interpolates linearly between start and end between progress_remaining = 1 and progress_remaining = 1 - end_fraction.
- Parameters:
start – value to start with if progress_remaining = 1.
end – value to end with if progress_remaining = 0.
end_fraction – fraction of progress_remaining where end is reached, e.g. 0.1 means end is reached after 10% of the complete training process.
- Returns:
Linear schedule function.
Note
Adapted from SB3: DLR-RM/stable-baselines3
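A sketch of the interpolation described above, following the SB3 pattern this function is adapted from: while less than end_fraction of training has elapsed, the value is interpolated linearly; afterwards it stays at end. The function name is chosen for illustration and does not mirror the library's internals.

```python
from typing import Callable

def linear_schedule(start: float, end: float = 0.0, end_fraction: float = 1.0) -> Callable[[float], float]:
    def func(progress_remaining: float) -> float:
        # Once more than end_fraction of training has elapsed, stay at `end`.
        if (1 - progress_remaining) > end_fraction:
            return end
        # Otherwise interpolate linearly between start and end.
        return start + (1 - progress_remaining) * (end - start) / end_fraction
    return func

lr_schedule = linear_schedule(start=1e-3, end=1e-5, end_fraction=0.5)
print(lr_schedule(1.0), lr_schedule(0.5), lr_schedule(0.0))  # 1e-3, 1e-5, 1e-5
```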
- assume.reinforcement_learning.learning_utils.polyak_update(params, target_params, tau: float)#
Perform a Polyak average update on target_params using params: target parameters are slowly updated towards the main parameters. tau, the soft update coefficient, controls the interpolation: tau=1 corresponds to copying the parameters to the target ones, whereas nothing happens when tau=0. The Polyak update is done in place, with no_grad, and therefore does not create intermediate tensors or a computation graph, reducing memory cost and improving performance. We scale the target params by 1 - tau (in place), add the new weights scaled by tau, and store the result of the sum in the target params (in place). See DLR-RM/stable-baselines3#93
- Parameters:
params – parameters to use to update the target params
target_params – parameters to update
tau – the soft update coefficient (“Polyak update”, between 0 and 1)
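A sketch of the in-place update described above (scale the target by 1 - tau, then add the main parameters scaled by tau), written as a standalone helper rather than the library's own function.

```python
import torch as th

def polyak_update_sketch(params, target_params, tau: float) -> None:
    """In-place soft update: target <- (1 - tau) * target + tau * param."""
    with th.no_grad():
        for param, target_param in zip(params, target_params):
            target_param.data.mul_(1 - tau)
            target_param.data.add_(param.data, alpha=tau)

# Example: soft-update a target network towards its main network.
net = th.nn.Linear(4, 2)
target_net = th.nn.Linear(4, 2)
polyak_update_sketch(net.parameters(), target_net.parameters(), tau=0.005)
```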
- assume.reinforcement_learning.learning_utils.transfer_weights(model: Module, loaded_state: dict, loaded_id_order: list[str], new_id_order: list[str], obs_base: int, act_dim: int, unique_obs: int) dict | None#
Transfer weights from a loaded model to a new model. Only the observation and action slices for matching IDs are copied; new IDs keep their original (random) weights. This function only works if the neural network architecture remained stable apart from the input layer, i.e. the hidden layers are the same.
- Parameters:
model (th.nn.Module) – The model to transfer weights to.
loaded_state (dict) – The state dictionary of the loaded model.
loaded_id_order (list[str]) – The list of unit IDs from the loaded model, indicating the order of units.
new_id_order (list[str]) – The list of unit IDs from the new model, which may contain different agents than the loaded model.
obs_base (int) – The base observation size.
act_dim (int) – The action dimension size.
unique_obs (int) – The unique observation size per agent; smaller than obs_base, which also includes shared observation values.
- Returns:
The updated state dictionary with transferred weights, or None if architecture mismatch.
- Return type:
dict | None
- assume.reinforcement_learning.learning_utils.transform_buffer_data(nested_dict: dict, device: device) ndarray#
Transform a nested dict {datetime -> {unit_id -> [values]}} into a torch tensor of shape (timesteps, powerplants, values), compatible with buffer storage. Moves tensors from GPU to CPU.
- Parameters:
nested_dict – Dict with structure {datetime -> {unit_id -> list[tensor]}}
- Returns:
Shape (n_timesteps, n_powerplants, feature_dim)
- Return type:
th.Tensor
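A hedged sketch of the transformation described above, assuming that each inner list of per-unit tensors is concatenated into one feature vector and that units are ordered consistently per timestep; the actual feature layout in the library may differ.

```python
import torch as th

def transform_buffer_data_sketch(nested_dict: dict, device: th.device) -> th.Tensor:
    """Turn {datetime -> {unit_id -> list[tensor]}} into a (timesteps, units, features) tensor on CPU."""
    timestep_rows = []
    for dt in sorted(nested_dict):                  # iterate timesteps in chronological order
        unit_rows = []
        for unit_id in sorted(nested_dict[dt]):     # assumed: consistent unit ordering per timestep
            values = nested_dict[dt][unit_id]
            unit_rows.append(th.cat([v.reshape(-1) for v in values]))
        timestep_rows.append(th.stack(unit_rows))
    # `device` mirrors the documented signature; the result is moved to CPU for buffer storage.
    return th.stack(timestep_rows).to("cpu")

# Example input: two timesteps, two units, one scalar tensor each.
data = {
    "2024-01-01T00:00": {"pp_1": [th.tensor([1.0])], "pp_2": [th.tensor([2.0])]},
    "2024-01-01T01:00": {"pp_1": [th.tensor([3.0])], "pp_2": [th.tensor([4.0])]},
}
print(transform_buffer_data_sketch(data, th.device("cpu")).shape)  # torch.Size([2, 2, 1])
```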
assume.reinforcement_learning.algorithms.base_algorithm module#
- class assume.reinforcement_learning.algorithms.base_algorithm.RLAlgorithm(learning_role)#
Bases:
object
The base RL model class. To implement your own RL algorithm, you need to subclass this class and implement the update_policy method.
- Parameters:
learning_role (Learning) – The learning role object that manages the learning process.
- load_obj(directory: str)#
Load an object from a specified directory.
This method loads an object, typically saved as a checkpoint file, from the specified directory and returns it. It uses the torch.load function and specifies the device for loading.
- load_params(directory: str) None#
Load learning parameters. Abstract method to be implemented by the learning algorithm.
- update_learning_rate(optimizers: list[Optimizer] | Optimizer, learning_rate: float) None#
Update the optimizers' learning rate using the current learning rate schedule and the current progress remaining (from 1 to 0).
- Parameters:
optimizers (List[th.optim.Optimizer] | th.optim.Optimizer) – An optimizer or a list of optimizers.
Note
Adapted from SB3: DLR-RM/stable-baselines3
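A sketch of the common PyTorch pattern behind such an update: setting the new learning rate on each optimizer's parameter groups. The helper name, network, and optimizer are placeholders for illustration.

```python
import torch as th

def set_learning_rate(optimizers, learning_rate: float) -> None:
    """Apply a new learning rate to one optimizer or a list of optimizers."""
    if not isinstance(optimizers, list):
        optimizers = [optimizers]
    for optimizer in optimizers:
        for param_group in optimizer.param_groups:
            param_group["lr"] = learning_rate

# Example with a hypothetical network and Adam optimizer.
net = th.nn.Linear(8, 1)
optimizer = th.optim.Adam(net.parameters(), lr=1e-3)
set_learning_rate(optimizer, 5e-4)
```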
- update_policy()#
assume.reinforcement_learning.algorithms.matd3 module#
- class assume.reinforcement_learning.algorithms.matd3.TD3(learning_role)#
Bases:
RLAlgorithm
Twin Delayed Deep Deterministic Policy Gradients (TD3). Addressing Function Approximation Error in Actor-Critic Methods. TD3 is a direct successor of DDPG and improves on it using three major tricks: clipped double Q-learning, delayed policy updates, and target policy smoothing.
OpenAI Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/td3.html
Original paper: https://arxiv.org/pdf/1802.09477.pdf
- check_strategy_dimensions() None#
Iterate over all learning strategies and check if the dimensions of observations and actions are the same. Also check if the unique observation dimensions are the same. If not, raise a ValueError. This is important for the TD3 algorithm, as it uses a centralized critic that requires consistent dimensions across all agents.
- create_actors() None#
Create actor networks for reinforcement learning for each unit strategy.
This method initializes actor networks and their corresponding target networks for each unit strategy. The actors are designed to map observations to actions in a reinforcement learning setting.
The created actor networks are associated with each unit strategy and stored as attributes.
Note
The observation dimensions need to be the same due to the centralized critic that all actors share. If you have units with different observation dimensions, they need to have different critics and hence different learning roles.
- create_critics() None#
Create critic networks for reinforcement learning.
This method initializes critic networks for each agent in the reinforcement learning setup.
Note
The observation dimensions need to be the same due to the centralized critic that all actors share. If you have units with different observation dimensions, they need to have different critics and hence different learning roles.
- extract_policy() dict#
Extract actor and critic networks.
This method extracts the actor and critic networks associated with each learning strategy and organizes them into a dictionary structure. The extracted networks include actors, actor_targets, critics, and target_critics. The resulting dictionary is typically used for saving and sharing these networks.
- Returns:
The extracted actor and critic networks.
- Return type:
dict
- initialize_policy(actors_and_critics: dict = None) None#
Create actor and critic networks for reinforcement learning.
If actors_and_critics is None, this method creates new actor and critic networks. If actors_and_critics is provided, it assigns existing networks to the respective attributes.
- Parameters:
actors_and_critics (dict) – The actor and critic networks to be assigned.
- load_actor_params(directory: str) None#
Load the parameters of actor networks from a specified directory.
This method loads the parameters of actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the actor and target actor networks accordingly.
- Parameters:
directory (str) – The directory from which the parameters should be loaded.
- load_critic_params(directory: str) None#
Load critic, target_critic, and optimizer states for each agent strategy. If the agent count differs between the saved and current model, performs a weight transfer for both networks.
- Parameters:
directory (str) – The directory from which the parameters should be loaded.
- load_params(directory: str) None#
Load the parameters of both actor and critic networks.
This method loads the parameters of both the actor and critic networks associated with the learning role from the specified directory. It uses the load_critic_params and load_actor_params methods to load the respective parameters.
- Parameters:
directory (str) – The directory from which the parameters should be loaded.
- save_actor_params(directory)#
Save the parameters of actor networks.
This method saves the parameters of the actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the actor associated with each learning strategy.
- Parameters:
directory (str) – The base directory for saving the parameters.
- save_critic_params(directory)#
Save the parameters of critic networks.
This method saves the parameters of the critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the critic associated with each learning strategy.
- Parameters:
directory (str) – The base directory for saving the parameters.
- save_params(directory)#
Save the parameters of both actor and critic networks.
This method saves the parameters of both the actor and critic networks associated with the learning role. It organizes the saved parameters into separate directories for critics and actors within the specified base directory.
- Parameters:
directory (str) – The base directory for saving the parameters.
- update_policy()#
Update the policy of the reinforcement learning agent using the Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithm.
Note
This function performs the policy update step, which involves updating the actor (policy) and critic (Q-function) networks using the TD3 algorithm. It iterates over the specified number of gradient steps and performs the following steps for each learning strategy:
1. Sample a batch of transitions from the replay buffer.
2. Calculate the next actions with added noise using the actor target network.
3. Compute the target Q-values based on the next states, rewards, and the target critic network.
4. Compute the critic loss as the mean squared error between current Q-values and target Q-values.
5. Optimize the critic network by performing a gradient descent step.
6. Update the actor network if the specified policy delay is reached.
7. Apply Polyak averaging to update the target networks.
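A simplified, single-agent sketch of the steps listed above, for orientation only. The network, optimizer, and batch objects are placeholders (the real implementation is multi-agent with a centralized critic), and the twin-critic interface (a critic returning both Q-values and exposing a q1_forward helper) is an assumption, not the library's actual API.

```python
import torch as th
import torch.nn.functional as F

def td3_update_step(
    actor, actor_target, critic, critic_target,
    actor_optimizer, critic_optimizer,
    batch,                    # (observations, actions, next_observations, rewards), as in ReplayBufferSamples
    gamma=0.99, tau=0.005,
    policy_noise=0.2, noise_clip=0.5,
    update_actor=False,       # set True only every `policy_delay` gradient steps
):
    obs, actions, next_obs, rewards = batch

    with th.no_grad():
        # Step 2: target policy smoothing -- clipped noise on the target action.
        noise = (th.randn_like(actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (actor_target(next_obs) + noise).clamp(-1.0, 1.0)

        # Step 3: clipped double Q-learning -- take the minimum of both target critics.
        q1_next, q2_next = critic_target(next_obs, next_actions)   # assumed twin-critic interface
        target_q = rewards + gamma * th.min(q1_next, q2_next)

    # Step 4: critic loss as MSE between current and target Q-values.
    q1, q2 = critic(obs, actions)
    critic_loss = F.mse_loss(q1, target_q) + F.mse_loss(q2, target_q)

    # Step 5: gradient step on the critic.
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    if update_actor:
        # Step 6: delayed policy update -- maximize Q1 under the current policy.
        actor_loss = -critic.q1_forward(obs, actor(obs)).mean()    # q1_forward is an assumed helper
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        # Step 7: Polyak averaging of the target networks.
        with th.no_grad():
            for p, tp in zip(critic.parameters(), critic_target.parameters()):
                tp.data.mul_(1 - tau).add_(p.data, alpha=tau)
            for p, tp in zip(actor.parameters(), actor_target.parameters()):
                tp.data.mul_(1 - tau).add_(p.data, alpha=tau)
```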