Article Information
Corresponding Author: Xinjian Zhao, zhxj198708@126.com
Xinjian Zhao, Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co. Ltd., Nanjing, China, zhxj198708@126.com
Weiwei Miao, Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co. Ltd., Nanjing, China, mww196801@sgcc.cn
Song Zhang, Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co. Ltd., Nanjing, China, zhs199608@sgcc.cn
Youjun Hu, Nari Information & Communication Technology Co. Ltd., Nanjing, China, hyj198111@sgepri.sgcc.cn
Shi Chen, Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co. Ltd., Nanjing, China, chs199509@sgcc.cn
Received: June 11, 2024
Revision received: October 14, 2024
Accepted: December 27, 2024
Published (Print): June 30, 2025
Published (Electronic): June 30, 2025
1. Introduction
As the construction of new power systems steadily advances, an increasing number of user-side electrical energy storage devices and distributed renewable energy sources have been integrated into power systems through third-party aggregation platforms [1]. These multisource, heterogeneous, distributed resources generate massive amounts of power load information [2]. Efficient anomaly detection for power load interaction information is essential for providing reliable data support for the scheduling and control of new power systems.
In recent years, researchers have proposed various solutions to the problem of anomaly detection for power load data. In terms of implementation, the research results can be categorized into two types: traditional machine-learning-based and deep-learning-based anomaly detection. Deep-learning-based anomaly detection algorithms have stronger detection ability, but their performance often depends on the size of the dataset, and they incur large computational overhead [3–5]. In practice, power loads exhibit regional characteristics and imbalances, so the available datasets often behave like small-volume datasets. Moreover, anomaly detection usually needs to be simple and fast enough to run at the network edge. By comparison, traditional machine-learning-based anomaly detection methods can detect anomalous data on a limited dataset with lower overhead, making them more suitable for the existing power grid environment. For power data, the main machine-learning-based methods include isolation forest [6], one-class support vector machine (SVM) [7], local outlier factor (LOF) [8], density-based spatial clustering of applications with noise (DBSCAN) [9], local correlation integral (LOCI) [10], connectivity-based outlier factor (COF) [11], and histogram-based outlier score (HBOS) [12]. Existing machine-learning-based anomaly detection methods typically combine algorithms in different ways to obtain anomaly detection results by classifying or clustering the original power load dataset. However, the actually accessible power load data often carry uncertainties in data integrity and synchronization, and these uncertainties significantly affect the detection results.
In this study, an anomaly detection scheme for power loads was developed based on robust principal component analysis (PCA) and clustering algorithms. First, user power load data are collected and analyzed to extract features. Then, a robust PCA algorithm is used for preliminary classification, categorizing the power load data into suspected abnormal and suspected normal groups. Based on this classification, an improved clustering algorithm is used to refine further and extract the results.
The main contributions of this study are as follows.
· A phased anomaly detection scheme for power load data was developed that realizes anomaly detection using a robust PCA algorithm and an improved K-means algorithm.
· The proposed scheme was tested with public datasets, and the results show that the scheme has a better detection performance than similar mainstream schemes.
2. Phased Anomaly Detection Method for Power Load Data
Fig. 1. Framework of the proposed scheme.
A phased anomaly detection method for power load data was developed, and its framework is shown in Fig. 1. First, intelligent terminals obtain the user power load data and extract features from these data, including Kullback–Leibler (KL) score, flat points, Canberra distance, and crossover points. Second, the robustness-enhanced PCA algorithm is used to analyze the features of power load data, preliminarily dividing them into abnormal and normal groups. Finally, the improved K-means algorithm is used to remove outliers from the classification results obtained from the preliminary classification in the previous stage and to obtain a classification of the power load data with clear boundaries.
2.1 Feature Extraction
In this study, the monthly power load was used to represent the electricity usage of each user, defined as the user's power load over each 30-day period. Feature extraction is the process of extracting the intrinsic features of a dataset, which reduces the dimensionality of the data and saves computational resources. The extracted features include the average load (the average power load of each platform during the period), variance (the variance of the power load of each platform during the period), horizontal displacement difference (the maximum difference in average load between days), variance difference (the maximum difference in variances between months), KL score (the maximum difference in KL divergence between consecutive months), flat point (the length of the largest flat interval within each month), and Canberra distance (a numerical measure of the distance between points in a vector space).
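As an illustration of how the KL-score feature might be computed, the following hypothetical sketch discretizes two months of load readings into shared histogram bins and takes the KL divergence between the resulting distributions. The bin count and the smoothing constant are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def kl_score(month_a, month_b, bins=20, eps=1e-9):
    """KL divergence between the empirical load distributions of two
    months, estimated from a shared histogram. `bins` and `eps`
    (Laplace-style smoothing to avoid log(0)) are assumed parameters."""
    lo = min(month_a.min(), month_b.min())
    hi = max(month_a.max(), month_b.max())
    p, _ = np.histogram(month_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(month_b, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()   # smoothed, normalized distribution
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

The monthly KL-score feature would then be the maximum of this quantity over consecutive month pairs.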
Two of the most essential features are the Canberra distance and flat point. The Canberra distance is considered a weighted version of the Manhattan distance and is calculated using Eq. (1):
where [TeX:] $$x_i \text { and } y_i$$ are different data points in the real value.
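As a concrete reference, the Canberra distance of Eq. (1) can be sketched in a few lines of Python. The zero-denominator convention below (a term is zero when both coordinates are zero) is the usual one, matching `scipy.spatial.distance.canberra`:

```python
import numpy as np

def canberra(x, y):
    """Canberra distance: a weighted Manhattan distance in which each
    coordinate's absolute difference is scaled by the sum of the two
    coordinates' magnitudes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(den > 0, num / den, 0.0)  # 0/0 -> 0 by convention
    return float(terms.sum())
```

Because each term is bounded by 1, the distance is robust to a single coordinate with a very large magnitude, which suits load data with occasional spikes.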
2.2 Preliminary Classification based on Robust Principal Component Analysis
A robust PCA algorithm [13] is used to preliminarily classify the users reflected by the power load information into suspected abnormal and normal groups with the following workflow.
Step 1: The feature matrix of the power load data, [TeX:] $$Y_{n \times p},$$ is input, where n is the number of users, and p is the number of features for each user. The algorithm first normalizes the given matrix and then transforms it into an affine space based on singular value decomposition.
Step 2: The anomaly index of each data point [TeX:] $$y_i(i=1,2, \ldots, n)$$ is calculated, as shown in Eq. (2):
where B contains all nonzero vectors, and [TeX:] $$a_{M C D} \text { and } s_{M C D}$$ are the mean and standard deviation calculated using the minimum covariance determinant (MCD) method, respectively.
Step 3: The covariance matrix [TeX:] $$S_0=P_0 L_0 P_0^T$$ is calculated, where [TeX:] $$L_0=\operatorname{diag}\left(\tilde{l}_1, \ldots, \tilde{l}_r\right), r\lt r_1$$ is the feature value matrix, and [TeX:] $$P_0$$ is an orthogonal matrix of [TeX:] $$r_1$$ rows and r columns. The data points are projected on the subspace spanned by the first [TeX:] $$k_0$$ eigenvectors of [TeX:] $$S_0$$, as shown in Eq. (3):
where [TeX:] $$p_{r_1 \times k_0}$$ consists of the first [TeX:] $$k_0$$ columns of [TeX:] $$P_0.$$
Step 4: The MCD estimator is used to robustly estimate the scatter matrix of the data points in [TeX:] $$Y_{n \times k_0}^*.$$ The robust covariance matrix is [TeX:] $$S=P_{p \times k} L_{k \times k} P_{p \times k}^T,$$ so the robust principal component matrix can be rewritten as [TeX:] $$T_{n \times k}=\left(Y_{n \times p}-1_n \hat{\mu}^T\right) P_{p \times k} .$$
Step 5: Orthogonal distance is defined as the distance between each observation and its projection onto the new subspace, as shown in Eq. (4):
where [TeX:] $$y_i$$ is the i-th data point, and [TeX:] $$\hat{y}_i$$ is its projection onto the k-dimensional subspace. The score distance is calculated using Eq. (5):
where [TeX:] $$l_j$$ is the set of eigenvalues, and k is the number of principal components.
Step 6: Two threshold values, one for each of the two distances, are calculated to separate normal observations from abnormal observations. They are calculated using Eqs. (6) and (7), respectively:
where [TeX:] $$\hat{\mu}$$ is the estimated mean and [TeX:] $$\hat{\sigma}$$ the estimated standard deviation for the given data, 0.975 denotes the 97.5% quantile of the Gaussian distribution, and [TeX:] $$\chi_k^2$$ is the chi-square distribution with k degrees of freedom, which the squared score (Mahalanobis) distance approximately follows. Based on the threshold values, the power load data are preliminarily classified into two major groups.
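The distance computations of Steps 5 and 6 can be sketched as follows. Note that this sketch substitutes classical SVD-based PCA for the MCD-based robust estimation of Steps 2–4, so it illustrates only the score/orthogonal distances and the two cutoffs, not the full robust pipeline:

```python
import numpy as np
from scipy.stats import chi2, norm

def pca_distances(Y, k):
    """Score distance (Eq. (5)) and orthogonal distance (Eq. (4)) for a
    k-component PCA model. Classical PCA stands in here for the
    MCD-based robust estimates used in the paper."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    P = Vt[:k].T                        # loadings, p x k
    T = Yc @ P                          # scores, n x k
    l = (s[:k] ** 2) / (len(Y) - 1)     # leading eigenvalues of covariance
    SD = np.sqrt(((T ** 2) / l).sum(axis=1))   # score distance
    OD = np.linalg.norm(Yc - T @ P.T, axis=1)  # orthogonal distance
    return SD, OD

def cutoffs(SD, OD, k):
    """Thresholds in the spirit of Eqs. (6)-(7): a chi-square quantile
    for SD, and a Wilson-Hilferty-style cutoff for OD (OD^(2/3) is
    approximately Gaussian). The exact forms are assumptions."""
    c_sd = np.sqrt(chi2.ppf(0.975, k))
    od23 = OD ** (2.0 / 3.0)
    c_od = (od23.mean() + od23.std() * norm.ppf(0.975)) ** 1.5
    return c_sd, c_od
```

Observations exceeding either cutoff would be placed in the suspected-abnormal group of the preliminary classification.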
2.3 Anomaly Detection based on the Improved K-Means Algorithm
Because the traditional K-means algorithm initializes cluster centers randomly [14], the results of the algorithm depend highly on the initial selection of cluster centers, and the final result is likely to have a high error. The method of selecting clustering centers based on the relative distance and density between data points makes a better separation of outliers possible. Based on this, an improved K-means clustering algorithm was developed to detect anomalies and generate distinct load types with apparent boundaries to address the slight overlap between the two categories of load data identified in the previous stage. The specific flow of the algorithm is shown in Algorithm 1.
First, the relative distances and densities of the data points are calculated. The [TeX:] $$D\left(x_i\right)$$ values are ranked in descending order, and the data point [TeX:] $$c_i$$ with the maximum density is selected as the initial clustering center. The thresholds are then calculated separately for different cases based on the range of the center-node density values, and the clustering process is completed by iteration to obtain the optimal number of clusters K. In the second stage, an outlier factor [TeX:] $$o_i$$ is computed for each observation in the K clusters, which depends on its distance from the clustering center. The value of [TeX:] $$o_i$$ ranges from 0 to 1, and μ is set to 0.95 in this algorithm, so data with [TeX:] $$o_i \geq 0.95$$ are considered abnormal.
This algorithm is applied separately to each of the two groups of users derived in Section 2.2: the data exceeding the threshold among both the suspected normal and the suspected abnormal users are finally categorized as abnormal, and all the rest are finally categorized as normal.
Algorithm 1. Anomaly detection based on the improved K-means clustering algorithm
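Under stated assumptions (a plain Lloyd's K-means in place of the paper's density-based center seeding, and per-cluster max-distance normalization of the outlier factor), the outlier-factor stage of the algorithm might look like:

```python
import numpy as np

def kmeans_outlier_factors(X, K, mu=0.95, iters=100, init=None):
    """Cluster X into K groups and compute an outlier factor o_i for
    each point: its distance to its cluster center, normalized by the
    largest such distance in that cluster, so o_i lies in [0, 1].
    Points with o_i >= mu (mu = 0.95, as in the paper) are flagged."""
    idx = np.asarray(init) if init is not None else np.arange(K)
    centers = X[idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    dist = np.linalg.norm(X - centers[labels], axis=1)
    o = np.zeros(len(X))
    for j in range(K):
        m = labels == j
        dmax = dist[m].max(initial=0.0)
        if dmax > 0:
            o[m] = dist[m] / dmax
    return labels, o, o >= mu
```

One caveat of this per-cluster normalization: the farthest point of every cluster always receives o_i = 1, which is why the density-based refinements described in the paper matter in practice.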
3. Experiment and Result Analysis
The experiments were conducted on Ubuntu using Python, with an Intel Xeon Silver 4208R CPU and 16 GB of RAM.
3.1 Experimental Dataset
This experiment used power consumption load data from the Los Alamos Public Utility Department in New Mexico, USA [15]. The data were collected using Landis+Gyr smart meter devices from 1,757 households in North Mesa, Los Alamos, NM, USA. The sampling rate was one observation every 15 minutes. For most customers, the data span approximately 6 years, from July 30, 2013 to December 30, 2019.
Affected by uncertainties such as extreme weather, variations in production and daily life, and equipment failures, the power load is characterized by randomness, volatility, and sudden changes. Power load data therefore often show abnormal jumps, manifesting as fluctuation anomalies and extreme value anomalies. Compared with the normal fluctuation pattern, a fluctuation-anomaly load curve shows numerous burrs and frequent, significant jumps within a short period, as shown in Fig. 2(a). An extreme-value-anomaly load curve exhibits abnormal extremes over a certain period (usually lasting minutes): load spikes, valleys, or significant peak-valley differences that destroy the similarity and periodicity of the curve, as shown in Fig. 2(b).
Fig. 2. Schematic diagram of anomalies of power load data: (a) fluctuation anomalies and (b) extreme value anomalies.
3.2 Evaluation Metrics
Anomaly detection for power load interaction information is a binary classification problem. For binary classification problems, the performance of a classification model can be evaluated using a confusion matrix, as shown in Table 1. The rows of the table represent the predicted classes, whereas the columns represent the actual classes. From the confusion matrix, one can obtain evaluation metrics including accuracy (ACC), recall rate (RR), false positive rate (FPR), false negative rate (FNR), precision, F1-score, and Bayesian detection rate (BDR).
Table 1. Confusion matrix in anomaly detection
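The metrics listed above can all be derived from the four confusion-matrix counts. The BDR formula below is the common Bayes-theorem definition P(anomaly | alarm) with a prior anomaly (fraud) probability; whether the paper uses exactly this form is an assumption, and the 0.16 default mirrors the fraud probability reported in Section 3.3:

```python
def metrics(tp, fp, fn, tn, prior=0.16):
    """Evaluation metrics from confusion-matrix counts.
    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives; prior: assumed base rate of anomalies."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)                 # RR, true positive rate
    fpr = fp / (fp + tn)                    # false positive rate
    fnr = fn / (fn + tp)                    # false negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Bayes: P(anomaly | alarm) given the prior anomaly probability
    bdr = (recall * prior) / (recall * prior + fpr * (1 - prior))
    return dict(ACC=acc, RR=recall, FPR=fpr, FNR=fnr,
                Precision=precision, F1=f1, BDR=bdr)
```

Unlike precision computed on a test set, BDR makes the dependence on the anomaly base rate explicit, which is why it penalizes false positives so heavily when anomalies are rare.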
3.3 Results and Discussion
Experiments were conducted using the proposed approach, and the results were analyzed and compared with some commonly used anomaly detection methods. Of the performance metric scores shown in Table 2, the first six are derived directly from the confusion matrix, and the last, BDR, is calculated using prior knowledge of the fraud probability. The proposed approach achieved an accuracy of 91% and a recall rate of 81%, indicating that it can distinguish the user types and detect most of the actual abnormalities.
ACC and recall are commonly used metrics in almost all abnormal electricity usage detection systems. However, these two metrics alone cannot serve as decisive indicators of scheme performance. The drawback of ACC is that, when the class proportions are highly unbalanced, the majority class dominates the score. For example, if defective samples account for 99% of the total, a classifier that predicts every sample as defective achieves 99% accuracy. Therefore, it is necessary to further verify the performance of the scheme by calculating and comparing the FNR and FPR values. Under the same conditions, the FPR of the proposed scheme is the lowest, and its FNR is only slightly higher than that of the one-class SVM. This indicates that the proposed scheme misclassifies only a small portion of normal electricity consumption as abnormal. Another critical metric for the qualitative analysis of anomaly detection methods is the F1-score, which balances precision and recall, two metrics that are often in tension; the higher its value, the better the predictive ability of the model. The F1-score obtained by the proposed method is 75%, the highest among the compared techniques. Because a reliable anomaly detection model should have a high BDR value, the BDR score is also used for evaluation. In this study, the fraud probability was 16%, and the BDR of the proposed scheme was 63%, much higher than those of the other anomaly detection algorithms. Owing to the limited dataset volume, the proposed method still holds an advantage on most metrics even compared with the deep-learning-based detection model [5].
Table 2. Performance metrics of the proposed approach compared with commonly used anomaly detection algorithms
Fig. 3. ROC curves for different schemes.
Fig. 3 shows the receiver operating characteristic (ROC) curves of the proposed scheme, isolation forest, and one-class SVM. The ROC curve of the proposed scheme lies closest to the upper left corner. The area under the curve (AUC) scores of the three schemes, provided to give a more comprehensive view of performance, are 0.81, 0.75, and 0.72, respectively. This implies that the detection performance of the proposed scheme is better than that of the isolation forest and one-class SVM.
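For reference, an AUC score can be computed without tracing the ROC curve at all, via the rank-based (Mann-Whitney) formulation: it equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal sample. A minimal sketch:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via pairwise rank comparisons; ties count as half.
    scores: anomaly scores; labels: 1/True for anomalous samples."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This O(n_pos x n_neg) form is fine for datasets of the size used here; sorted-rank implementations (as in scikit-learn's `roc_auc_score`) scale better.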
4. Conclusion
A phased anomaly detection method was developed to improve the efficiency of anomaly detection in power load data. First, the most important features of the power load were extracted from the data. Then, the power load was preliminarily classified into normal and abnormal groups using the robustness-enhanced PCA algorithm. Based on this preliminary classification, an improved K-means clustering algorithm was used to obtain the final classification results. The experimental results show that the proposed method can, to some extent, mitigate the shortcomings of a single machine-learning method in anomaly detection performance and provide more reliable detection results.
However, limited by the K-means algorithm itself, the clustering process cannot reflect the temporal characteristics of the data, which can affect the detection performance. Therefore, in future work, more attention should be paid to the temporal characteristics of power load data to improve the anomaly detection performance.
Conflict of Interest
The authors declare that they have no competing interests.
Funding
This work was supported by the Science and Technology Project of State Grid Jiangsu Electric Power Company Ltd. (Grant No. J2023124).