1. Introduction
Today, with advances in information technology, smart grids have emerged in the electricity market. In a smart grid, intelligent technologies monitor electricity consumption and supply in real time, enabling energy and cost savings through optimization algorithms. Energy storage systems (ESS) are widely used in smart grids, allowing electricity to be stored in batteries and used when needed. In particular, in a market where real-time electricity prices fluctuate, electricity can be purchased when prices are low and sold or consumed when prices rise, reducing the cost burden on consumers. Moreover, since electricity prices are determined by supply and demand, optimizing for cost reduction naturally shifts load away from peak hours, benefiting both electricity suppliers and consumers.
Many studies have addressed this optimization problem [1-3]. However, traditional algorithms struggle to incorporate complex environmental information and, compared to reinforcement learning, adapt poorly to dynamic environments. Reinforcement learning has therefore been introduced to the ESS optimization problem in smart grids, where dynamic factors such as real-time load and prices must be reflected, allowing the system to learn adaptive behavior. In this study, we apply a Markov decision process (MDP) model to determine the charging/discharging amounts of an ESS. Based on this model, we compare various reinforcement learning techniques.
The contributions of this study are as follows:
· We design an MDP to extend battery life and reduce costs.
· We build two data scenarios with different consumption patterns to demonstrate that the agent operates according to the goals.
· We compare the behavior and performance of various reinforcement learning agents.
This paper is structured as follows. Section 2 introduces research applying reinforcement learning in smart grid environments. Section 3 explains the foundational techniques of deep deterministic policy gradient (DDPG) [4], twin delayed deep deterministic policy gradient (TD3) [5], and soft actor-critic (SAC) [6] for the experiments, covering learning objectives, environment setup, and MDP design. Section 4 compares and analyzes the experimental results of various techniques. Section 5 concludes with a summary and outlines future research directions.
2. Related Work
Table 1 highlights the differences between this study and related works [7-11]. A first example of applying reinforcement learning to a dynamic smart grid environment is [7], in which Q-learning and state–action–reward–state–action (SARSA) are used to learn the charging, discharging, and holding actions of a battery in an environment with fluctuating real-time electricity prices, with the aim of reducing consumer costs. However, this approach only decides whether to charge, discharge, or hold, without determining the exact amounts. In [8], a new power management system framework using Q-learning is proposed to reduce load during peak hours in an environment with both battery energy storage systems and thermal energy storage systems. However, this approach applies a fixed, standard tariff rather than adapting to real-time price changes, limiting its applicability in dynamic environments.
Table 1. Comparison of related works
With advancements in reinforcement learning, deep reinforcement learning (DRL) using deep neural networks has emerged. An example of applying DRL to smart grids is [9], in which a deep Q-network (DQN) agent determines the charging and discharging amounts to reduce costs in a real-time price variation environment. The agent charges during low-price periods and discharges during high-price periods. However, the charging and discharging amounts are restricted to fixed discrete levels, making fine-grained optimization difficult. The authors of [10] compare DQN, DDPG, and TD3 algorithms with the goal of not only minimizing costs but also keeping the battery's state of charge (SoC) within specific limits. The reinforcement learning agents determine the charging and discharging amounts of the ESS according to the given environment. However, a limitation of that experiment is that it did not compare the performance of a stochastic policy. In [11], DDPG is used in a real-time electricity price fluctuation environment to reduce costs while imposing overcharging and overdischarging constraints. However, these constraints are not included in the reward function but are enforced outside the agent's influence, potentially ignoring latent benefits.
Therefore, this study is designed to optimize battery SoC by simultaneously incorporating overcharging, overdischarging, and cost savings in an environment where real-time electricity tariffs fluctuate. Additionally, it aims to learn specific and continuous charging/discharging amounts [12].
3. System Model
3.1 Background
The core reinforcement learning techniques central to this experiment are DDPG, TD3, and SAC. DDPG extends the deterministic policy gradient method and the discrete-action-space algorithm DQN [13]. The Actor in DDPG uses a deterministic policy to select the optimal action a_t given the current state s_t. The Critic (Q-network) evaluates the action a_t selected by the Actor by outputting its Q-value; the Critic is updated from the observed rewards, and the Actor is updated to maximize this value. Additionally, DDPG uses a replay buffer to store past samples experienced by the agent, which reduces the correlation between training samples and stabilizes learning. However, using a single Q-network can result in overestimation bias, where the value of actions is estimated incorrectly.
To address this issue, TD3 was introduced. Like DDPG, its Actor selects the optimal action a_t using a deterministic policy given the state s_t. In TD3, the Actor update is delayed to enhance learning stability, and two Critic networks (Q-networks) are used so that the smaller of the two Q-value estimates is taken, mitigating overestimation.
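To make the overestimation remedy concrete, the following minimal PyTorch sketch shows how a TD3-style target is formed: the target action is perturbed with clipped noise, and the smaller of two target critics' estimates is used. It is illustrative only; the network sizes, hyperparameter values, and placeholder dimensions are common defaults, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

def td3_target(reward, next_state, done, actor_t, critic1_t, critic2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    # Clipped double-Q target: perturb the target action with clipped Gaussian noise,
    # then take the smaller of the two target critics' estimates to curb overestimation.
    with torch.no_grad():
        a_next = actor_t(next_state)
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        sa = torch.cat([next_state, a_next], dim=-1)
        q_min = torch.min(critic1_t(sa), critic2_t(sa))
        return reward + gamma * (1.0 - done) * q_min

# Tiny usage example with random tensors; the dimensions are placeholders.
state_dim, act_dim, batch = 49, 1, 32
actor_t = nn.Sequential(mlp(state_dim, act_dim), nn.Tanh())
critic1_t, critic2_t = mlp(state_dim + act_dim, 1), mlp(state_dim + act_dim, 1)
target_q = td3_target(torch.rand(batch, 1), torch.randn(batch, state_dim),
                      torch.zeros(batch, 1), actor_t, critic1_t, critic2_t)
```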
SAC adds an entropy term to the objective function, increasing exploration randomness and ensuring sample diversity. The Actor in SAC receives the state s_t, uses a stochastic policy to output a probability distribution over actions, and selects the action a_t by sampling from this distribution. The Critic in SAC evaluates the Q-value of the selected action a_t, and a target network is used in this process to stabilize policy updates. According to [6], SAC demonstrates stable learning performance across various experimental environments, outperforming methods such as DDPG, proximal policy optimization (PPO), soft Q-learning (SQL), and TD3.
DDPG, TD3, and SAC are off-policy and utilize an actor-critic network structure, making them suitable for continuous action spaces. However, a key difference is that DDPG and TD3 use a deterministic policy, while SAC uses a stochastic policy. A deterministic policy outputs a single action for a given state, with the policy network directly producing the specific action. In contrast, a stochastic policy outputs a probability distribution over actions for a given state, with actions sampled from this distribution. Additionally, the Actors and Critics in DDPG and TD3 have target networks, whereas in SAC, only the Critic network has a target network.
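The policy difference shows up directly in the shape of the actor network's output. The sketch below contrasts a DDPG/TD3-style deterministic head, which maps a state to one concrete action, with a SAC-style Gaussian head, which outputs a distribution and samples from it. It is a simplified illustration (the tanh log-probability correction used in full SAC implementations is omitted), not the networks used in this study.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """DDPG/TD3-style head: one concrete action per state."""
    def __init__(self, state_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.act_limit = act_limit

    def forward(self, s):
        return self.act_limit * torch.tanh(self.net(s))

class GaussianActor(nn.Module):
    """SAC-style head: outputs a distribution over actions and samples from it."""
    def __init__(self, state_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Linear(64, act_dim)
        self.act_limit = act_limit

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a = dist.rsample()                   # reparameterized sample
        log_prob = dist.log_prob(a).sum(-1)  # feeds SAC's entropy term (tanh correction omitted)
        return self.act_limit * torch.tanh(a), log_prob
```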
In the context of power systems, which exhibit continuous properties, using methods designed for discrete action spaces like Q-learning or DQN can result in information loss. Therefore, for complete optimization, DDPG, TD3, and SAC, which operate in continuous action spaces, are more suitable. DDPG and TD3 are stable since they take the same action for the same state, but they may lack sufficient exploration. On the other hand, SAC, with its added exploration, is better suited for high-variability power environments but can be more complex to train due to the entropy calculations.
3.2 Markov Decision Process
The state is composed of the load, the price, and the current SoC of the ESS at each time step. To reflect the potential time-series characteristics of load and price in the next action selection, the state includes the sets of past loads [TeX:] $$\begin{equation} \left\{l_{t-24}, \ldots, l_{t-1}\right\} \end{equation}$$ and prices [TeX:] $$\begin{equation} \left\{p_{t-24}, \ldots, p_{t-1}\right\} \end{equation}$$ up to the current time t. This is summarized in Eq. (1). The new SoC, soc_{t+1}, is computed as shown in Eq. (2) by adding the ESS charging/discharging amount determined by the action, ESS_{a_t}, to the current SoC.
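As a concrete reading of this description, the following minimal sketch assembles the state from the past 24 loads and prices plus the current SoC and applies the additive SoC update. The array layout and variable names are illustrative assumptions, since Eqs. (1) and (2) are only paraphrased here.

```python
import numpy as np

def build_state(loads, prices, soc_t, t, window=24):
    # Past `window` loads and prices up to (but excluding) time t, plus the current SoC (cf. Eq. (1)).
    return np.concatenate([loads[t - window:t], prices[t - window:t], [soc_t]])

def update_soc(soc_t, ess_a_t):
    # New SoC = current SoC plus the charge/discharge amount chosen by the action (cf. Eq. (2)).
    return soc_t + ess_a_t
```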
The action a_t that the agent can select is a real number in [-δ, δ], where δ is a value between 0 and 1. The actual charging/discharging amount ESS_{a_t} is determined by multiplying a_t by the ESS's capacity, ESS_{cap}. When a_t is negative, power is discharged from the ESS for the consumer to use or sell. When a_t is positive, the ESS is charged by drawing electricity from the source. When a_t is zero, the agent holds and takes no action. Eq. (3) shows the calculation of ESS_{a_t}.
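A minimal sketch of this mapping, assuming the δ = 0.15 value given in Section 4.1; the explicit clipping is added only for illustration.

```python
def charge_discharge_amount(a_t, ess_cap, delta=0.15):
    # Keep the action inside [-delta, delta], then scale by the ESS capacity (cf. Eq. (3)).
    a_t = max(-delta, min(delta, a_t))
    return a_t * ess_cap   # > 0: charge from the source, < 0: discharge, 0: hold
```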
Eq. (4) represents min-max scaling, which normalizes values between 0 and 1 so that the different factors have comparable influence and learning proceeds smoothly. The agent must minimize electricity costs while keeping the ESS's SoC between the target minimum threshold ESS_{min} and the target maximum threshold ESS_{max}. Therefore, the reward function consists of r_{soc}, which imposes a penalty when the target range is exceeded, and r_{cost}, which represents the current electricity cost. In Eq. (5), r_{cost} is defined as the current price multiplied by the min-max-scaled charge/discharge amount |ESS_{a_t}|. If ESS_{a_t} is positive, the ESS is charging, so a minus sign is applied to represent the expense of the amount charged; if ESS_{a_t} is negative, the ESS is discharging, which yields a profit, so the value remains positive. Eq. (6) calculates r_{soc}: when the SoC falls outside the target range, the deviation is scaled using min-max scaling. Eq. (7) combines these terms into the final reward r_t. The reward r_{cost} typically lies within the range of -1 to 1, with profits expressed as positive values and losses as negative, while the penalty r_{soc} is subtracted to account for deviations from the thresholds.
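The reward described above can be sketched as follows. Because Eqs. (4)-(7) are only paraphrased in the text, the normalization bounds, in particular whether and how the price itself is normalized, are assumptions made for illustration.

```python
import numpy as np

def minmax(x, lo, hi):
    # Min-max scaling to [0, 1] (cf. Eq. (4)); the small epsilon avoids division by zero.
    return (x - lo) / (hi - lo + 1e-8)

def step_reward(price, ess_a_t, soc, ess_min, ess_max, ess_cap, price_min, price_max):
    scaled_amount = minmax(abs(ess_a_t), 0.0, ess_cap)
    # Charging (ess_a_t > 0) is an expense, discharging (< 0) a profit (cf. Eq. (5)).
    r_cost = -np.sign(ess_a_t) * minmax(price, price_min, price_max) * scaled_amount
    # Penalty grows with how far the SoC strays outside [ess_min, ess_max] (cf. Eq. (6)).
    if soc > ess_max:
        r_soc = minmax(soc - ess_max, 0.0, ess_cap)
    elif soc < ess_min:
        r_soc = minmax(ess_min - soc, 0.0, ess_cap)
    else:
        r_soc = 0.0
    return r_cost - r_soc   # final reward r_t (cf. Eq. (7))
```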
4. Performance Analysis
4.1 Environment
The smart grid structure considered in the experiment comprises a consumer, an agent, and a source, as shown in Fig. 1. The consumer drawing electricity from the ESS through the agent is referred to as "discharging." The consumer first consumes the electricity discharged from the ESS; if the ESS discharges more electricity than the consumer's load, the excess is sold to the electricity source, and if the discharge is insufficient to meet the load, the shortfall is purchased from the source. The agent can also purchase electricity from the source to store in the ESS, which is known as "charging." By charging when real-time electricity prices are low and discharging when prices are high or during peak load times, the agent can help reduce costs.
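For illustration, the per-step cash flow implied by this structure can be written as a single expression, assuming purchases and sales settle at the same real-time price; the sign convention (positive action = charging) follows Section 3.2, and the variable names are hypothetical.

```python
def grid_cost(load, ess_a_t, price):
    # Energy drawn from the source this step: the consumer's load, plus ESS charging
    # (ess_a_t > 0) or minus ESS discharging (ess_a_t < 0).
    grid_energy = load + ess_a_t
    # Positive: payment to the source; negative: revenue from selling the surplus back.
    return grid_energy * price
```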
For the experiment, the ROBOD dataset [14], collected from buildings at the National University of Singapore, was used. This dataset includes lighting, plug, and HVAC (heating, ventilation, and air conditioning) load data collected at 5-minute intervals from administrative offices and library buildings used by students. The dataset consists of weekday data from September to December 2021, with some missing days; the Office data spans 29 days and the Library data spans 47 days. Real-time electricity prices were sourced from the Singapore National Electricity Market, provided by the Energy Market Company (https://www.home.emcsg.com) at 30-minute intervals, and upsampled to 5-minute intervals for this study using linear interpolation, which connects consecutive original data points with a straight line and samples new values along it, giving a smooth representation of the price series. The agent treats 24 data points, i.e., 2 hours of 5-minute interval data, as one episode for training.
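A brief sketch of this preprocessing step with pandas, using made-up half-hourly prices rather than the actual EMC data:

```python
import pandas as pd

# Hypothetical half-hourly prices indexed by timestamp (placeholder values).
prices_30min = pd.Series(
    [120.5, 118.0, 131.2],
    index=pd.date_range("2021-09-01 00:00", periods=3, freq="30min"),
)

# Upsample to 5-minute resolution and fill the new points by linear interpolation,
# i.e., by connecting consecutive original prices with a straight line.
prices_5min = prices_30min.resample("5min").interpolate(method="linear")
print(prices_5min)
```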
Table 2 presents the detailed ESS environment settings for the Office and Library datasets. ESS_{cap} denotes the total capacity of the ESS, obtained by rounding up the highest value of the per-period average load profile in each dataset. ESS_{init} represents the initial capacity of the ESS, set at 40% of the total capacity. Additionally, to extend battery lifespan, the target range that the ESS must maintain, from ESS_{min} to ESS_{max}, is set to 20% and 80% of the total capacity, respectively. The action range parameter δ is set to 0.15.
Table 2. ESS environment configuration values
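A hypothetical helper that reproduces the percentage-based settings described above; the actual ESS_{cap} values are dataset-specific and come from Table 2.

```python
def ess_config(ess_cap):
    # Build the ESS settings from a given capacity (placeholder argument).
    return {
        "ESS_cap": ess_cap,          # total capacity (dataset-specific, from Table 2)
        "ESS_init": 0.4 * ess_cap,   # initial SoC: 40% of capacity
        "ESS_min": 0.2 * ess_cap,    # lower SoC target: 20% of capacity
        "ESS_max": 0.8 * ess_cap,    # upper SoC target: 80% of capacity
        "delta": 0.15,               # action range [-0.15, 0.15]
    }
```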
4.2 Experimental Results
Evaluation simulations utilized real-time load and price data from December 21–23, 2021. Fig. 2 illustrates the learning rewards for different techniques in the Office and Library environments. Firstly, focusing on the results for the Office, both DDPG and TD3 ultimately achieve rewards close to -2. However, DDPG exhibits oscillations in rewards ranging from -4 to -12 during early stages of learning, whereas TD3 explores within a narrower range of approximately -2 to -8 and shows comparatively smaller fluctuations in later rewards. SAC exhibits initial fluctuations similar to TD3 but converges to a final reward around 0, higher than that of DDPG and TD3. It also occasionally dips to very low rewards between learning stages, likely due to SAC's exploration nature with entropy. In the Library dataset, all techniques eventually converge to a reward of 0. However, DDPG initially shows very low rewards and reaches 0 later compared to other techniques. Similar to the Office dataset, TD3 demonstrates stable learning with minimal oscillations compared to DDPG. SAC also shows exploration behavior with occasional low rewards between learning phases.
Fig. 3 shows the simulation results of each technique, illustrating how the reinforcement learning agents adjust the ESS's SoC over 3 days based on real-time load and price data. Dashed lines indicate the target maximum and minimum SoC of the ESS. In the Office dataset, DDPG keeps the SoC fluctuating tightly at all times, indicating frequent switching between charging and discharging. In contrast, TD3 maintains a nearly constant SoC during nighttime periods with minimal load, while SAC continues to buy and sell even at night. In the Library dataset, all three techniques behave similarly. However, TD3 discharges power during periods of very high prices, driving the SoC below the target minimum; this decision suggests prioritizing economic gain over the overdischarging penalty. Overall, discharging generally coincides with high real-time electricity prices, and most techniques keep the SoC within the target range, with the exception of TD3 in the Library dataset.
Fig. 4 shows the cumulative cost savings over 3 days for each technique. Positive values indicate cost savings, while negative values indicate additional costs. In the Office dataset, DDPG shows a decreasing trend in cost savings, whereas TD3 initially increases and ultimately saves approximately $0.8. SAC also shows an increasing trend and ultimately saves about twice as much as TD3, totaling $1.6. In the Library dataset, DDPG still incurs losses but shows a very gradual improvement, TD3 saves around $2.5, and SAC saves about $2. In all cases, a sharp increase is observed between 18:00 and 21:00 on the first day, reflecting a period in the data when selling at high electricity prices yielded a profit.
Fig. 2. The reward convergence graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
Fig. 3. The simulation graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
In summary, from a reward perspective, TD3 demonstrated the most stable learning and convergence, while DDPG and SAC exhibited large oscillations. This stability in TD3 can be attributed to its dual Q-network structure, which provides robust value estimation. Additionally, due to its deterministic policy, TD3 tends to take relatively stable actions in the same state. SAC reached high rewards quickly in the early stages of learning, likely due to its entropy promoting exploration and increasing the probability of achieving high rewards. However, SAC also frequently showed very low rewards in the later stages of learning, likely due to entropy effects.
Fig. 4. The cumulative cost savings graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
In terms of simulation comparison, actions were generally chosen to maintain target ranges and capitalize on increasing real-time electricity prices. Both TD3 and SAC demonstrated cost savings, with SAC saving more costs in the Office environment with less data, while TD3 saved more costs in the Library environment with more data. TD3 requires more data for stability, whereas SAC can learn quickly with less data. Therefore, TD3 appears suitable for scenarios with relatively stable price fluctuations or consistent consumption patterns, while SAC seems more appropriate for dynamic or highly fluctuating power markets such as renewable energy management.
5. Conclusion
In this study, various reinforcement learning algorithms were compared to efficiently and economically operate ESS in a smart grid environment where real-time electricity prices change every 5 minutes. An MDP was designed to determine the charge/discharge amount of ESS aimed at cost savings while extending its lifespan by defining SoC levels. In terms of reward convergence, SAC showed the fastest convergence, followed by TD3 and DDPG. However, due to SAC's randomness, TD3 exhibited the best performance in terms of stability. According to simulation results, DDPG, TD3, and SAC all adjusted their charge/discharge actions based on real-time prices within the target range. From a cost-saving perspective, SAC demonstrated significant cost savings even with limited data for training. TD3 showed slower learning speed with less data but achieved the best performance when trained with more data to gain stable action experience.
Through such research, consumers can learn their energy usage patterns within a smart grid and develop real-time energy management strategies. Particularly, with the integration of personal ESS and existing power wholesale market infrastructure, consumers can assume the role of producers. In this context, DRL agents can predict real-time demand and electricity prices, allowing consumers to profit by selling excess electricity or to mitigate peak loads. This offers economic benefits while also minimizing wasted energy, thus providing environmental advantages. However, DRL requires precise modeling tailored to dynamic environments. Demand patterns and electricity prices vary depending on weather conditions, geographical factors, and cultural differences, making comprehensive modeling a significant challenge. Additionally, large-scale smart grids may require robust infrastructure for data collection and processing. Moreover, measures must be taken to manage DRL, as incorrect learning could lead to issues like power supply interruptions.
Future research could consider scenarios that include unstable renewable energy infrastructure such as solar, thermal, and wind energy. For instance, in the case of solar energy, cloudy or rainy days result in low energy production, requiring additional power from the source. Conversely, if more power is generated than needed, surplus electricity occurs. In such highly dynamic scenarios, DRL algorithms are well-suited for energy management using ESS. However, accounting for weather factors, which are composed of complex variables that significantly impact renewable energy production, presents a significant challenge. Additionally, since weather indirectly affects demand patterns and electricity prices, complex environmental settings must be taken into account.
Conflict of Interest
The authors declare that they have no competing interests.
Funding
This work was supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation)-ICAN (ICT Challenge and Advanced Network of HRD) grant funded by the Korea government (Ministry of Science and ICT) (No. IITP-2025-RS-2022-00156299).