1. Introduction
Today, with advances in information technology, smart grids have emerged in the electricity market. In a smart grid, intelligent technologies monitor electricity consumption and supply in real time, enabling energy and cost savings through optimization algorithms. Energy storage systems (ESS) are widely used in smart grids, allowing electricity to be stored in batteries and used when needed. In particular, in a market where real-time electricity prices fluctuate, electricity can be purchased when prices are low and sold or consumed when prices rise, reducing the cost burden on consumers. Moreover, since electricity prices are determined by supply and demand, optimizing for cost reduction naturally shifts load away from peak hours, benefiting both electricity suppliers and consumers.
Many studies have addressed this optimization problem [1-3]. However, traditional algorithms struggle to incorporate complex environmental information and, compared to reinforcement learning, adapt poorly to dynamic environments. Reinforcement learning has therefore been introduced to the ESS optimization problem in smart grids, where dynamic factors such as real-time load and prices must be reflected, allowing the system to learn adaptive behavior. In this study, we apply a Markov decision process (MDP) model to determine the charging/discharging amounts of an ESS. Based on this model, we compare various reinforcement learning techniques.
The contributions of this study are as follows:
· We design an MDP to extend battery life and reduce costs.
· We build two data scenarios with different consumption patterns to demonstrate that the agent operates according to the goals.
· We compare the behavior and performance of various reinforcement learning agents.
This paper is structured as follows. Section 2 introduces research applying reinforcement learning in smart grid environments. Section 3 explains the foundational techniques of deep deterministic policy gradient (DDPG) [4], twin delayed deep deterministic policy gradient (TD3) [5], and soft actor-critic (SAC) [6] for the experiments, covering learning objectives, environment setup, and MDP design. Section 4 compares and analyzes the experimental results of various techniques. Section 5 concludes with a summary and outlines future research directions.
2. Related Work
Table 1 highlights the differences between this study and related works [7-11]. A first example of applying reinforcement learning to a dynamic smart grid environment is [7], in which Q-learning and state–action–reward–state–action (SARSA) are used to learn the charging, discharging, and holding actions of a battery in an environment with fluctuating real-time electricity prices, with the aim of reducing consumer costs. However, this approach only decides whether to charge, discharge, or hold, without determining the exact amounts. In [8], a new power management system framework using Q-learning is proposed to reduce load during peak hours in an environment with both battery energy storage systems and thermal energy storage systems. However, this approach applies a fixed, standard tariff rather than adapting to real-time price changes, limiting its applicability in dynamic environments.
Table 1. Comparison of related works
With advancements in reinforcement learning, deep reinforcement learning (DRL) using deep neural networks has emerged. An example of applying DRL to smart grids is [9], in which a deep Q-network (DQN) agent determines the charging and discharging amounts to reduce costs in a real-time price variation environment. The agent charges during low-price periods and discharges during high-price periods. However, the charging and discharging amounts are restricted to fixed discrete levels, making fine-grained optimization difficult. The authors of [10] compare DQN, DDPG, and TD3 algorithms with the goal of not only minimizing costs but also keeping the battery's state of charge (SoC) within specific limits. The reinforcement learning agents determine the charging and discharging amounts of the ESS according to the given environment. However, a limitation of that experiment is that it did not compare the performance of a stochastic policy. In [11], DDPG is used in a real-time electricity price fluctuation environment to reduce costs while imposing overcharging and overdischarging constraints. However, these constraints are not included in the reward function but are enforced outside the agent's influence, potentially ignoring latent benefits.
Therefore, this study is designed to optimize battery SoC by simultaneously incorporating overcharging, overdischarging, and cost savings in an environment where real-time electricity tariffs fluctuate. Additionally, it aims to learn specific and continuous charging/discharging amounts [12].
3. System Model
3.1 Background
The core reinforcement learning techniques central to this experiment are DDPG, TD3, and SAC. DDPG extends the deterministic policy gradient method and the discrete-action-space algorithm DQN [13]. The Actor in DDPG uses a deterministic policy to select the optimal action a_t given the current state s_t. The Critic (Q-network) evaluates the action a_t selected by the Actor by outputting its Q-value; the Critic is updated from the observed rewards, and the Actor is updated to maximize this value. Additionally, DDPG uses a replay buffer to store past samples experienced by the agent, which reduces the correlation between training samples and stabilizes learning. However, using a single Q-network can result in overestimation bias, where the value of actions is estimated incorrectly.
To address this issue, TD3 was introduced. Like DDPG, its Actor selects the optimal action a_t using a deterministic policy given the state s_t. In TD3, the Actor update is delayed to enhance learning stability, and two Critic networks (Q-networks) are used so that the smaller of the two Q-value estimates is taken, mitigating overestimation.
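To make the overestimation remedy concrete, the following minimal PyTorch sketch shows how a TD3-style target is formed: the target action is perturbed with clipped noise, and the smaller of two target critics' estimates is used. It is illustrative only; the network sizes, hyperparameter values, and placeholder dimensions are common defaults, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

def td3_target(reward, next_state, done, actor_t, critic1_t, critic2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    # Clipped double-Q target: perturb the target action with clipped Gaussian noise,
    # then take the smaller of the two target critics' estimates to curb overestimation.
    with torch.no_grad():
        a_next = actor_t(next_state)
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        sa = torch.cat([next_state, a_next], dim=-1)
        q_min = torch.min(critic1_t(sa), critic2_t(sa))
        return reward + gamma * (1.0 - done) * q_min

# Tiny usage example with random tensors; the dimensions are placeholders.
state_dim, act_dim, batch = 49, 1, 32
actor_t = nn.Sequential(mlp(state_dim, act_dim), nn.Tanh())
critic1_t, critic2_t = mlp(state_dim + act_dim, 1), mlp(state_dim + act_dim, 1)
target_q = td3_target(torch.rand(batch, 1), torch.randn(batch, state_dim),
                      torch.zeros(batch, 1), actor_t, critic1_t, critic2_t)
```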
SAC adds an entropy term to the objective function, increasing exploration randomness and ensuring sample diversity. The Actor in SAC receives the state s_t, uses a stochastic policy to output a probability distribution over actions, and selects the action a_t by sampling from this distribution. The Critic in SAC evaluates the Q-value of the selected action a_t, and a target network is used in this process to stabilize policy updates. According to [6], SAC demonstrates stable learning performance across various experimental environments, outperforming methods such as DDPG, proximal policy optimization (PPO), soft Q-learning (SQL), and TD3.
DDPG, TD3, and SAC are off-policy and utilize an actor-critic network structure, making them suitable for continuous action spaces. However, a key difference is that DDPG and TD3 use a deterministic policy, while SAC uses a stochastic policy. A deterministic policy outputs a single action for a given state, with the policy network directly producing the specific action. In contrast, a stochastic policy outputs a probability distribution over actions for a given state, with actions sampled from this distribution. Additionally, the Actors and Critics in DDPG and TD3 have target networks, whereas in SAC, only the Critic network has a target network.
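The policy difference shows up directly in the shape of the actor network's output. The sketch below contrasts a DDPG/TD3-style deterministic head, which maps a state to one concrete action, with a SAC-style Gaussian head, which outputs a distribution and samples from it. It is a simplified illustration (the tanh log-probability correction used in full SAC implementations is omitted), not the networks used in this study.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """DDPG/TD3-style head: one concrete action per state."""
    def __init__(self, state_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.act_limit = act_limit

    def forward(self, s):
        return self.act_limit * torch.tanh(self.net(s))

class GaussianActor(nn.Module):
    """SAC-style head: outputs a distribution over actions and samples from it."""
    def __init__(self, state_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Linear(64, act_dim)
        self.act_limit = act_limit

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a = dist.rsample()                   # reparameterized sample
        log_prob = dist.log_prob(a).sum(-1)  # feeds SAC's entropy term (tanh correction omitted)
        return self.act_limit * torch.tanh(a), log_prob
```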
In the context of power systems, which exhibit continuous properties, using methods designed for discrete action spaces like Q-learning or DQN can result in information loss. Therefore, for complete optimization, DDPG, TD3, and SAC, which operate in continuous action spaces, are more suitable. DDPG and TD3 are stable since they take the same action for the same state, but they may lack sufficient exploration. On the other hand, SAC, with its added exploration, is better suited for high-variability power environments but can be more complex to train due to the entropy calculations.
3.2 Markov Decision Process
The state is composed of the load, the price, and the current SoC of the ESS at each time step. To reflect the potential time-series characteristics of load and price in the next action selection, the state includes the sets of past loads [TeX:] $$\begin{equation} \left\{l_{t-24}, \ldots, l_{t-1}\right\} \end{equation}$$ and prices [TeX:] $$\begin{equation} \left\{p_{t-24}, \ldots, p_{t-1}\right\} \end{equation}$$ up to the current time t. This is summarized in Eq. (1). The new SoC, soc_{t+1}, is computed as shown in Eq. (2) by adding the ESS charging/discharging amount determined by the action, ESS_{a_t}, to the current SoC.
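As a concrete reading of this description, the following minimal sketch assembles the state from the past 24 loads and prices plus the current SoC and applies the additive SoC update. The array layout and variable names are illustrative assumptions, since Eqs. (1) and (2) are only paraphrased here.

```python
import numpy as np

def build_state(loads, prices, soc_t, t, window=24):
    # Past `window` loads and prices up to (but excluding) time t, plus the current SoC (cf. Eq. (1)).
    return np.concatenate([loads[t - window:t], prices[t - window:t], [soc_t]])

def update_soc(soc_t, ess_a_t):
    # New SoC = current SoC plus the charge/discharge amount chosen by the action (cf. Eq. (2)).
    return soc_t + ess_a_t
```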
The action a_t that the agent can select is a real number in [-δ, δ], where δ is a value between 0 and 1. The actual charging/discharging amount ESS_{a_t} is determined by multiplying a_t by the ESS's capacity, ESS_{cap}. When a_t is negative, power is discharged from the ESS for the consumer to use or sell. When a_t is positive, the ESS is charged by drawing electricity from the source. When a_t is zero, the agent holds and takes no action. Eq. (3) shows the calculation of ESS_{a_t}.
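A minimal sketch of this mapping, assuming the δ = 0.15 value given in Section 4.1; the explicit clipping is added only for illustration.

```python
def charge_discharge_amount(a_t, ess_cap, delta=0.15):
    # Keep the action inside [-delta, delta], then scale by the ESS capacity (cf. Eq. (3)).
    a_t = max(-delta, min(delta, a_t))
    return a_t * ess_cap   # > 0: charge from the source, < 0: discharge, 0: hold
```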
Eq. (4) represents min-max scaling, which normalizes values between 0 and 1 so that the different factors have comparable influence and learning proceeds smoothly. The agent must minimize electricity costs while keeping the ESS's SoC between the target minimum threshold ESS_{min} and the target maximum threshold ESS_{max}. Therefore, the reward function consists of r_{soc}, which imposes a penalty when the target range is exceeded, and r_{cost}, which represents the current electricity cost. In Eq. (5), r_{cost} is defined as the current price multiplied by the min-max-scaled charge/discharge amount |ESS_{a_t}|. If ESS_{a_t} is positive, the ESS is charging, so a minus sign is applied to represent the expense of the amount charged; if ESS_{a_t} is negative, the ESS is discharging, which yields a profit, so the value remains positive. Eq. (6) calculates r_{soc}: when the SoC falls outside the target range, the deviation is scaled using min-max scaling. Eq. (7) combines these terms into the final reward r_t. The reward r_{cost} typically lies within the range of -1 to 1, with profits expressed as positive values and losses as negative, while the penalty r_{soc} is subtracted to account for deviations from the thresholds.
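The reward described above can be sketched as follows. Because Eqs. (4)-(7) are only paraphrased in the text, the normalization bounds, in particular whether and how the price itself is normalized, are assumptions made for illustration.

```python
import numpy as np

def minmax(x, lo, hi):
    # Min-max scaling to [0, 1] (cf. Eq. (4)); the small epsilon avoids division by zero.
    return (x - lo) / (hi - lo + 1e-8)

def step_reward(price, ess_a_t, soc, ess_min, ess_max, ess_cap, price_min, price_max):
    scaled_amount = minmax(abs(ess_a_t), 0.0, ess_cap)
    # Charging (ess_a_t > 0) is an expense, discharging (< 0) a profit (cf. Eq. (5)).
    r_cost = -np.sign(ess_a_t) * minmax(price, price_min, price_max) * scaled_amount
    # Penalty grows with how far the SoC strays outside [ess_min, ess_max] (cf. Eq. (6)).
    if soc > ess_max:
        r_soc = minmax(soc - ess_max, 0.0, ess_cap)
    elif soc < ess_min:
        r_soc = minmax(ess_min - soc, 0.0, ess_cap)
    else:
        r_soc = 0.0
    return r_cost - r_soc   # final reward r_t (cf. Eq. (7))
```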
4. Performance Analysis
4.1 Environment
The smart grid structure considered in the experiment comprises a consumer, an agent, and a source, as shown in Fig. 1. The consumer drawing electricity from the ESS through the agent is referred to as "discharging." The consumer first consumes the electricity discharged from the ESS; if the ESS discharges more electricity than the consumer's load, the excess is sold to the electricity source, and if the discharge is insufficient to meet the load, the shortfall is purchased from the source. The agent can also purchase electricity from the source to store in the ESS, which is known as "charging." By charging when real-time electricity prices are low and discharging when prices are high or during peak load times, the agent can help reduce costs.
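For illustration, the per-step cash flow implied by this structure can be written as a single expression, assuming purchases and sales settle at the same real-time price; the sign convention (positive action = charging) follows Section 3.2, and the variable names are hypothetical.

```python
def grid_cost(load, ess_a_t, price):
    # Energy drawn from the source this step: the consumer's load, plus ESS charging
    # (ess_a_t > 0) or minus ESS discharging (ess_a_t < 0).
    grid_energy = load + ess_a_t
    # Positive: payment to the source; negative: revenue from selling the surplus back.
    return grid_energy * price
```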
For the experiment, the ROBOD dataset [14], collected from buildings at the National University of Singapore, was used. This dataset includes lighting, plug, and HVAC (heating, ventilation, and air conditioning) load data collected at 5-minute intervals from administrative offices and library buildings used by students. The dataset consists of weekday data from September to December 2021, with some missing days; the Office data spans 29 days and the Library data spans 47 days. Real-time electricity prices were sourced from the Singapore National Electricity Market, provided by the Energy Market Company (https://www.home.emcsg.com) at 30-minute intervals, and upsampled to 5-minute intervals for this study using linear interpolation, which connects consecutive original data points with a straight line and samples new values along it, giving a smooth representation of the price series. The agent treats 24 data points, i.e., 2 hours of 5-minute interval data, as one episode for training.
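A brief sketch of this preprocessing step with pandas, using made-up half-hourly prices rather than the actual EMC data:

```python
import pandas as pd

# Hypothetical half-hourly prices indexed by timestamp (placeholder values).
prices_30min = pd.Series(
    [120.5, 118.0, 131.2],
    index=pd.date_range("2021-09-01 00:00", periods=3, freq="30min"),
)

# Upsample to 5-minute resolution and fill the new points by linear interpolation,
# i.e., by connecting consecutive original prices with a straight line.
prices_5min = prices_30min.resample("5min").interpolate(method="linear")
print(prices_5min)
```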
Table 2 presents the detailed ESS environment settings for the Office and Library datasets. ESS_{cap} denotes the total capacity of the ESS, obtained by rounding up the highest value of the per-period average load profile in each dataset. ESS_{init} represents the initial capacity of the ESS, set at 40% of the total capacity. Additionally, to extend battery lifespan, the target range that the ESS must maintain, from ESS_{min} to ESS_{max}, is set to 20% and 80% of the total capacity, respectively. The action range parameter δ is set to 0.15.
Table 2. ESS environment configuration values
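A hypothetical helper that reproduces the percentage-based settings described above; the actual ESS_{cap} values are dataset-specific and come from Table 2.

```python
def ess_config(ess_cap):
    # Build the ESS settings from a given capacity (placeholder argument).
    return {
        "ESS_cap": ess_cap,          # total capacity (dataset-specific, from Table 2)
        "ESS_init": 0.4 * ess_cap,   # initial SoC: 40% of capacity
        "ESS_min": 0.2 * ess_cap,    # lower SoC target: 20% of capacity
        "ESS_max": 0.8 * ess_cap,    # upper SoC target: 80% of capacity
        "delta": 0.15,               # action range [-0.15, 0.15]
    }
```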
4.2 Experimental Results
Evaluation simulations utilized real-time load and price data from December 21–23, 2021. Fig. 2 illustrates the learning rewards for different techniques in the Office and Library environments. Firstly, focusing on the results for the Office, both DDPG and TD3 ultimately achieve rewards close to -2. However, DDPG exhibits oscillations in rewards ranging from -4 to -12 during early stages of learning, whereas TD3 explores within a narrower range of approximately -2 to -8 and shows comparatively smaller fluctuations in later rewards. SAC exhibits initial fluctuations similar to TD3 but converges to a final reward around 0, higher than that of DDPG and TD3. It also occasionally dips to very low rewards between learning stages, likely due to SAC's exploration nature with entropy. In the Library dataset, all techniques eventually converge to a reward of 0. However, DDPG initially shows very low rewards and reaches 0 later compared to other techniques. Similar to the Office dataset, TD3 demonstrates stable learning with minimal oscillations compared to DDPG. SAC also shows exploration behavior with occasional low rewards between learning phases.
Fig. 3 shows the simulation results of each technique, illustrating how the reinforcement learning agents adjust the ESS's SoC over 3 days based on real-time load and price data. Dashed lines indicate the target maximum and minimum SoC of the ESS. In the Office dataset, DDPG keeps the SoC fluctuating tightly at all times, indicating frequent switching between charging and discharging. In contrast, TD3 maintains a nearly constant SoC during nighttime periods with minimal load, while SAC continues to buy and sell even at night. In the Library dataset, all three techniques behave similarly. However, TD3 discharges power during periods of very high prices, driving the SoC below the target minimum; this decision suggests prioritizing economic gain over the overdischarging penalty. Overall, discharging generally coincides with high real-time electricity prices, and most techniques keep the SoC within the target range, with the exception of TD3 in the Library dataset.
Fig. 4 shows the cumulative cost savings over 3 days for each technique. Positive values indicate cost savings, while negative values indicate additional costs. In the Office dataset, DDPG shows a decreasing trend in cost savings, whereas TD3 initially increases and ultimately saves approximately $0.8. SAC also shows an increasing trend and ultimately saves about twice as much as TD3, totaling $1.6. In the Library dataset, DDPG still incurs losses but shows a very gradual improvement, TD3 saves around $2.5, and SAC saves about $2. In all cases, a sharp increase is observed between 18:00 and 21:00 on the first day, reflecting a period in the data when selling at high electricity prices yielded a profit.
Fig. 2. The reward convergence graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
Fig. 3. The simulation graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
In summary, from a reward perspective, TD3 demonstrated the most stable learning and convergence, while DDPG and SAC exhibited large oscillations. This stability in TD3 can be attributed to its dual Q-network structure, which provides robust value estimation. Additionally, due to its deterministic policy, TD3 tends to take relatively stable actions in the same state. SAC reached high rewards quickly in the early stages of learning, likely due to its entropy promoting exploration and increasing the probability of achieving high rewards. However, SAC also frequently showed very low rewards in the later stages of learning, likely due to entropy effects.
Fig. 4. The cumulative cost savings graphs for DDPG, TD3, and SAC: (a, c, e) Office dataset and (b, d, f) Library dataset.
In terms of simulation comparison, actions were generally chosen to maintain target ranges and capitalize on increasing real-time electricity prices. Both TD3 and SAC demonstrated cost savings, with SAC saving more costs in the Office environment with less data, while TD3 saved more costs in the Library environment with more data. TD3 requires more data for stability, whereas SAC can learn quickly with less data. Therefore, TD3 appears suitable for scenarios with relatively stable price fluctuations or consistent consumption patterns, while SAC seems more appropriate for dynamic or highly fluctuating power markets such as renewable energy management.
5. Conclusion
In this study, various reinforcement learning algorithms were compared to efficiently and economically operate ESS in a smart grid environment where real-time electricity prices change every 5 minutes. An MDP was designed to determine the charge/discharge amount of ESS aimed at cost savings while extending its lifespan by defining SoC levels. In terms of reward convergence, SAC showed the fastest convergence, followed by TD3 and DDPG. However, due to SAC's randomness, TD3 exhibited the best performance in terms of stability. According to simulation results, DDPG, TD3, and SAC all adjusted their charge/discharge actions based on real-time prices within the target range. From a cost-saving perspective, SAC demonstrated significant cost savings even with limited data for training. TD3 showed slower learning speed with less data but achieved the best performance when trained with more data to gain stable action experience.
Through such research, consumers can learn their energy usage patterns within a smart grid and develop real-time energy management strategies. Particularly, with the integration of personal ESS and existing power wholesale market infrastructure, consumers can assume the role of producers. In this context, DRL agents can predict real-time demand and electricity prices, allowing consumers to profit by selling excess electricity or to mitigate peak loads. This offers economic benefits while also minimizing wasted energy, thus providing environmental advantages. However, DRL requires precise modeling tailored to dynamic environments. Demand patterns and electricity prices vary depending on weather conditions, geographical factors, and cultural differences, making comprehensive modeling a significant challenge. Additionally, large-scale smart grids may require robust infrastructure for data collection and processing. Moreover, measures must be taken to manage DRL, as incorrect learning could lead to issues like power supply interruptions.
Future research could consider scenarios that include unstable renewable energy infrastructure such as solar, thermal, and wind energy. For instance, in the case of solar energy, cloudy or rainy days result in low energy production, requiring additional power from the source. Conversely, if more power is generated than needed, surplus electricity occurs. In such highly dynamic scenarios, DRL algorithms are well-suited for energy management using ESS. However, accounting for weather factors, which are composed of complex variables that significantly impact renewable energy production, presents a significant challenge. Additionally, since weather indirectly affects demand patterns and electricity prices, complex environmental settings must be taken into account.
Conflict of Interest
The authors declare that they have no competing interests.
Funding
This work was supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation)-ICAN (ICT Challenge and Advanced Network of HRD) grant funded by the Korea government (Ministry of Science and ICT) (No. IITP-2025-RS-2022-00156299).