## Hongyi Cao* , Qiaomu Ren* , Xiuguo Zou* , Shuaitang Zhang* and Yan Qian*## |

No. | Cross-sectional_area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | Voltage_class (kV) |
---|---|---|---|

1 | 300 | 20.788 | 110 |

2 | 185 | 4.663 | 35 |

3 | 185 | 2.434 | 35 |

4 | 400 | 0.638 | 220 |

5 | 240 | 2.434 | 35 |

6 | 300 | 1.778 | 110 |

7 | 300 | 1.778 | 110 |

8 | 720 | 104.6 | 500 |

9 | 185 | 4.663 | 35 |

10 | 300 | 4.42 | 110 |

... | ... | ||

10,174 | 240 | 3.439 | 110 |

As each voltage level has different requirements for the material of line, the lines are classified according to the voltage level. We have obtained 2,828 pieces of data from 35 kV line model, 4,447 pieces from 110 kV line model, 2,706 pieces from 220 kV line model, 183 pieces from 500 kV line model and 10 pieces from 1,000 kV line model. Some raw data of classified lines are presented in Table 2.

The big data based data processing method is not suitable to process the data from 500 kV and 1,000 kV line models due to its limited data size (183 pieces of data from 500 kV models; 10 pieces of data from 1,000 kV line models). Therefore, this paper only analyzes the line models under the three voltage levels of 35 kV, 110 kV, and 220 kV.

Table 2.

Classification | No. | Cross-sectional_area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) |
---|---|---|---|

Data of 35 kV voltage class | 1 | 300 | 13.33 |

2 | 240 | 7.06 | |

3 | 150 | 5 | |

4 | 150 | 3.08 | |

5 | 240 | 8 | |

6 | 240 | 10 | |

7 | 240 | 6.88 | |

... | ... | ||

2,828 | 300 | 7.2 | |

Data of 110 kV voltage class | 1 | 240 | 26.27 |

2 | 300 | 4.56 | |

3 | 300 | 4.56 | |

4 | 300 | 4.56 | |

5 | 300 | 4.56 | |

6 | 300 | 2.54 | |

7 | 300 | 1.50 | |

... | ... | ||

4,447 | 240 | 15.49 | |

Data of 220 kV voltage class | 1 | 400 | 1.09 |

2 | 400 | 1.09 | |

3 | 400 | 0.90 | |

4 | 400 | 0.87 | |

5 | 300 | 0.60 | |

6 | 400 | 0.39 | |

7 | 300 | 0.54 | |

... | ... | ||

2,706 | 240 | 0.44 |

The monitoring and feedback data of the smart power system have some drawbacks such as huge data size, numerous data types, ineffective information, etc. [2]. To obtain the reasonable data fit for this research, the raw data is processed for the power monitoring and feedback which includes conductor length, cross-sectional area, resistance, circuit region, circuit model, etc. Considering the data extraction should conform to the practice of the extraction of cross-sectional area, in the case of one length of conductor with multiple cross-sectional area, the processing method is to divide one length of the conductor into multiple parts and to mark each part by corresponding cross-sectional area.

The missing data and nulls in the raw data were all weeded out in the data preprocessing to avoid influence on the experimental result. Besides, the general data were normalized with the consideration of the density level of cluster data.

Boxplot is a graph used to display the dispersion of a set of data. It can show the maximum, minimum, median, and upper and lower quartiles of a set of data.

A boxplot is drawn according to the following steps:

1) Determine the median, upper quartile (Q3) and lower quartile (Q1) of the line loss data to be processed;

2) Determine the interquartile range of the line loss data from the line model IQR=Q3–Q1;

3) Two line segments are drawn at Q3+1.5*IQR and Q1–1.5*IQR, respectively, which are defined as the inner limits of the line loss data; two line segments are drawn at Q3+3*IQR and Q1–3*IQR, respectively, which are defined as the outer limits of line loss data. The data represented by the points outside the inner limit are all abnormal values, where those between the inner limit and the outer limit are mild ones, and those outside the outer limit are the extreme ones.

[TeX:] $$$$ DBSCAN is an algorithm depicting the compact degree of sample distribution based on a group of “neighborhood” parameters. For a given dataset [TeX:] $$\mathrm{D}=\left\{x_{1}, x_{2}, \ldots, x_{m}\right\},$$ there are following concepts [24]:

1) [TeX:] $$\mathcal{E}$$-neighborhood: For [TeX:] $$x_{j} \in \mathrm{D},$$ its ε neighborhood includes the samples which keep distances from [TeX:] $$x_{\mathrm{j}}$$ no longer than ε in the sample set D, as shown by Eq. (1):

where [TeX:] $$\mathcal{E}$$ is the minimum radius, [TeX:] $$\operatorname{dist}\left(x_{i}, x_{j}\right)$$ denotes the distance from [TeX:] $$x_{i} \text { to } x_{j}, i, j=1,2, \ldots, \mathrm{m}.$$

2) Core object: If [TeX:] $$x_{j}$$ includes at least MinPts samples, i.e., [TeX:] $$\left|\mathrm{N}_{\varepsilon}\left(x_{j}\right)\right| \geq M i n P t s,$$ it is a core object.

3) Density-directly-reachable: If [TeX:] $$x_{j}$$ is located in the ε-neighborhood of [TeX:] $$x_{i}, \text { and } x_{j}$$ is the core object,[TeX:] $$x_{j}$$ is called density-directly-reachable by [TeX:] $$x_{i}$$.

4) Density-reachable: For [TeX:] $$x_{i} \text { and } x_{j},$$ if there exists a sample sequence [TeX:] $$p_{1}, p_{2}, \ldots, p_{n}, \text { wherein, } p_{1}=x_{i}, p_{n}=x_{j}$$ and [TeX:] $$P_{i+1}$$ is density-directly-reachable by [TeX:] $$p_{i}, x_{j}$$ is called density-reachable by [TeX:] $$x_{i}$$

5) Density-linked: For [TeX:] $$x_{i} \text { and } x_{j}, \text { if } x_{k}$$ exists to make [TeX:] $$x_{i} \text { and } x_{j}$$ density-reachable by [TeX:] $$x_{k}, x_{i}$$ is called densitylinked with [TeX:] $$x_{\mathrm{j}}.$$

To research the relation between two neighborhood parameters, clustering analysis is carried out on the power line model data for an East China province, which the results are shown in Table 3. Let us denote the minimum radius as ε and the number of points within the minimum radius as MinPts.

Table 3.

MinPts)Sample set (kV) | Parameter [TeX:] $$$(\varepsilon,$ | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | ||||

110 | (30, 40) | 4 | 51 | 67.6 | 0.02 | 425 | 120 |

110 | (30, 80) | 3 | 145 | 67.6 | 0.02 | 425 | 185 |

110 | (30, 120) | 3 | 245 | 67.6 | 0.02 | 425 | 210 |

110 | (30, 160) | 2 | 397 | 67.6 | 0.028 | 340 | 210 |

110 | (30, 200) | 2 | 397 | 67.6 | 0.028 | 340 | 210 |

In Table 3, L_max and L_min represent the maximum and minimum of line length, respectively; S_max and S_min represent the maximum and minimum of cross-sectional area, respectively.

Observation of the above table finds that the influence of the two density parameters on the clustering result may have some intrinsic correlation. Based on the above observation and conjecture of data, the following experiments were conducted.

**Experiment 1:**: Firstly, a dataset of operation model dataset with adequately large data size was selected, and each piece of data respectively included two data dimensions of conductor length and cross-sectional area of the conductor. The original data was put in the data model and density parameters were adjusted to obtain different clustering results. Then, the data size in the original dataset was doubled, and the data was put in the data model again. In the meanwhile, the parameter value of MinPts was doubled correspondingly, with the experimental result obtained as shown in Table 4.

Table 4.

MinPts)Sample set (kV) | Parameter [TeX:] $$$(\varepsilon,$ | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | ||||

110 | (30, 40) | 5 | 72 | 67.6 | 0.02 | 425 | 70 |

110 | (30, 80) | 4 | 102 | 67.6 | 0.02 | 425 | 120 |

110 | (30, 120) | 4 | 104 | 67.6 | 0.02 | 425 | 120 |

110 | (30, 160) | 3 | 258 | 67.6 | 0.02 | 425 | 185 |

110 | (30, 240) | 3 | 490 | 67.6 | 0.02 | 425 | 210 |

110 | (30, 320) | 2 | 794 | 67.6 | 0.028 | 340 | 210 |

110 | (30, 400) | 2 | 794 | 67.6 | 0.028 | 340 | 210 |

**Experiment 2:** Firstly, a data set of an operation model dataset with an adequately large data size was selected, and each piece of data respectively included two data dimensions of conductor length and crosssectional area. The original data was put in the data model and density parameters were adjusted to obtain different clustering results. Then, half of the data were randomly extracted from the raw data and put in the data model for operation. In the meanwhile, the MinPts has been reduced to half of the original value accordingly, with the result as shown in Table 5.

Table 5.

MinPts)Sample set (kV) | Parameter [TeX:] $$$(\varepsilon,$ | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | ||||

110 | (30, 20) | 6 | 12 | 67.6 | 0.0225 | 425 | 70 |

110 | (30, 40) | 4 | 49 | 67.6 | 0.0225 | 425 | 185 |

110 | (30, 60) | 4 | 50 | 67.6 | 0.0225 | 425 | 185 |

110 | (30, 80) | 3 | 122 | 67.6 | 0.0225 | 340 | 185 |

110 | (30, 100) | 2 | 204 | 67.6 | 0.0225 | 340 | 220 |

Comparison of the above two experiments can result in the following conclusions:

1) The clustering results for the dataset with different sizes under the same neighborhood parameters are different.

2) When the dataset is completely duplicated into twice that of the original dataset, the MinPts in the density parameters is also increased to twice that of the original value, and the clustering results are identical.

3) Increase the MinPts index to twice of the original value when the data size of the new dataset is twice that of the original dataset. Though the obtained clustering result is not identical to the result of the original dataset, however, they are roughly in compliance with each other.

In summary, the results obtained using DBSCAN clustering are related to the type of clustering parameters and the size of clustering data.

Based on the above conclusion, we attempt to find out the optimum density parameters under the dataset with a certain amount of data. When the new data is provided, just compare the new data size with that of the original clustering data, thereby adjust the parameter index of MinPts in the clustering algorithm according to the size of the new data. Then, the practical significance and functions of the two neighborhood parameters can be differentiated, where ε is the radius parameter, which determines the index of clustering density according to the concrete data variables. The MinPts is the data parameter, which determines the index of clustering density according to the data size of the clustering target.

Coupled with the development of the economy and productive forces like technologies, the quality of electric power has played an important role in people’s life [25]. Currently, the power system mainly uses two models: the main grid line model and the main grid transformer model. The former mainly analyzes various attribute properties of the conductor with different voltage classes, and is mainly influenced by the length and cross-sectional area of conductors. The latter is a commonly used model judging the performance of the transformer and is mainly affected by line loss and capacity. For the above two common power data models, the paper analyzes the optimization method for related data and compares the optimization effect of the algorithms.

As a mature and convenient means in traditional data screening and data preprocessing, the boxplot method has always been favored by a great many data researchers. For the traditional data screening of power related big data, the data processing method of boxplot targeting each variable is a reliable means.

The data for the main grid line model for an East China province is screened through the boxplot method and probability weight analysis, with the raw data is as shown in Table 6.

Table 6.

Voltage_class (kV) | L_max (km) | L_min (km) | L_mean (km) | S_max [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | S_min [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | S_mean [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ |
---|---|---|---|---|---|---|

35 | 25,555 | 0.01 | 22.25 | 500 | 50 | 203.3761 |

110 | 500 | 0.02 | 9.00 | 630 | 50 | 280.5564 |

220 | 116.69 | 0.032 | 9.90 | 675 | 70 | 383.9208 |

500 | 246.17 | 0.77 | 52.81 | 800 | 300 | 561.4583 |

1,000 | 275.28 | 118.2 | 180.47 | 630 | 630 | 630 |

Different boxplots are established to process the data of main grid line model with different voltage classes. The processed result is shown in Fig. 1.

For the length data in the data of main grid line, most of the frequencies by which the lengths corresponding to the voltage classes of 35 kV, 110 kV, 220 kV, 500 kV appear once, which makes it impossible to take scope using probability. The probability is uniform relatively, so the algorithms such as boxplot or something can be considered for data analysis. According to the characteristics of data structure and processing result, the values between the Q3+1.5*IQR and Q1–0.25*IQR are selected to constitute a box, while the rest ones are called abnormal values. Wherein, the interquartile range IQR = Q3 – Q1. For the data of main grid line model with the five voltage classes, the data respective partition is carried out using boxplot method, with the statistical result of the maximum and minimum as shown in Table 7.

Table 7.

Voltage_class (kV) | L_max (km) | L_min (km) |
---|---|---|

35 | 24.3985 | 1.5048 |

110 | 22.7045 | 1.6643 |

220 | 32.9459 | 0 |

500 | 114.115 | 22.2375 |

1000 | 307.81 | 99.635 |

As the data is distributed discretely, it is advisable to process the cross-sectional area data in the line data of the main grid using boxplot method. Thus the probability weight analysis method is adopted.

The “big value” appearing for the first time within the minimal scope is taken as the lower limit of cross-sectional area data, and the “big value” appearing for the first time within the maximal scope is taken as the upper limit of sectional area data. The output interval is defined according to the two values and the reasonable sectional area data is screened for.

The data statistical result of the cross-sectional area in the experiment is as shown in Table 8.

It is found by dint of the related knowledge of probability statistics that the probability weight and frequency corresponding to the area of 70 are significantly higher than those corresponding to the areas of 50 and 60. Such is the situation that the data of 70 as the lower limit of statistics. Through the above methods, the maximum and minimum for the line model area with 5 voltage classes are obtained, as shown in Table 9.

As the length and the cross-sectional area are two reference parameters which simultaneously act on the result of data screening, it is more reasonable to consider simultaneously these two variables for screening than to screen individual variables respectively, so the DBSCAN-based clustering algorithm is adopted for data analysis.

As raw data differ a lot in order of length and cross-sectional, while the DBSCAN algorithm requires the relatively average order of variables, the raw data is first subject to standardization by using the Eq. (2).

Python is used as a platform for big data processing. For the power grid data with the voltage of 35 kV, 110 kV, 220 kV, the clustering results are shown in Tables 10–12.

Table 8.

Cross-sectional_area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Frequency | Probability |
---|---|---|

50.00 | 1 | 0.0468 |

65.00 | 1 | 0.0468 |

70.00 | 56 | 2.6180 |

82.50 | 3 | 0.1403 |

85.00 | 4 | 0.1870 |

90.00 | 2 | 0.0935 |

92.50 | 7 | 0.3273 |

95.00 | 59 | 2.7583 |

101.67 | 1 | 0.0468 |

105.00 | 6 | 0.2805 |

107.50 | 6 | 0.2805 |

109.00 | 3 | 0.1403 |

110.00 | 3 | 0.1403 |

113.33 | 1 | 0.0468 |

117.50 | 1 | 0.0468 |

120.00 | 118 | 5.5166 |

121.67 | 1 | 0.0468 |

122.50 | 3 | 0.1403 |

123.75 | 3 | 0.1403 |

125.00 | 2 | 0.0935 |

126.67 | 2 | 0.0935 |

127.50 | 13 | 0.6078 |

133.33 | 4 | 0.1870 |

Table 9.

Voltage_class (kV) | S_max [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | S_min [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ |
---|---|---|

35 | 400 | 70 |

110 | 500 | 150 |

220 | 630 | 185 |

500 | 800 | 300 |

1000 | 630 | 630 |

Table 10.

)Parameter ([TeX:] $$\mathcal{E}$$, MinPts) | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | |||

(50, 80) | 2 | 67 | 46.4 | 0.01 | 300 | 25 |

(30, 60) | 4 | 67 | 46.4 | 0.01 | 300 | 25 |

(90, 70 | 1 | 67 | 46.4 | 0.01 | 300 | 25 |

(50, 250) | 2 | 67 | 46.4 | 0.01 | 300 | 25 |

(30, 40) | 5 | 15 | 46.4 | 0.01 | 400 | 25 |

(50, 50) | 3 | 15 | 46.4 | 0.01 | 400 | 25 |

(100, 40) | 2 | 7 | 100 | 0.01 | 400 | 25 |

(10, 30) | 8 | 53 | 36.5 | 0.01 | 400 | 70 |

Table 11.

Parameter ([TeX:] $$\mathcal{E}$$, MinPts) | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | |||

(60, 20) | 3 | 23 | 100 | 0.02 | 465 | 35 |

(60, 50) | 3 | 25 | 100 | 0.02 | 465 | 50 |

(40, 30) | 3 | 43 | 79.088 | 0.02 | 426 | 95 |

(50, 50) | 3 | 40 | 79.088 | 0.02 | 465 | 95 |

(80, 30) | 1 | 9 | 100 | 0.02 | 500 | 35 |

(80, 50) | 1 | 9 | 100 | 0.02 | 500 | 35 |

(100, 110) | 1 | 9 | 100 | 0.02 | 500 | 35 |

(30, 40) | 4 | 51 | 67.6 | 0.02 | 425 | 120 |

Table 12.

Parameter ([TeX:] $$\mathcal{E}$$, MinPts) | Number of clusters | Outlier | Cross-sectional area [TeX:] $$\left(\mathrm{mm}^{2}\right)$$ | Length (km) | ||
---|---|---|---|---|---|---|

S_max | S_min | L_max | L_min | |||

(40, 30) | 2 | 0 | 81.545 | 0.032 | 630 | 120 |

(80, 30) | 2 | 0 | 115.126 | 0.032 | 675 | 70 |

(100, 110) | 2 | 0 | 115.126 | 0.032 | 675 | 70 |

(100, 40) | 2 | 0 | 115.126 | 0.032 | 675 | 70 |

(80, 50) | 4 | 2 | 115.126 | 0.032 | 675 | 120 |

(60, 20) | 2 | 2 | 115.126 | 0.032 | 675 | 120 |

(70, 50) | 4 | 6 | 115.126 | 0.032 | 675 | 120 |

(30, 40) | 5 | 60 | 81.545 | 0.032 | 630 | 220 |

The data volume is so limited, with only 183 and 10 pieces of data for voltage levels of 500 kV and 1,000 kV, respectively, that it is not suitable to cluster them using DBSCAN method. Therefore, the clustering will not be performed for the data of 500 kV and 1,000 kV.

The data results under each parameter are empirically judged by the electrical engineers from the State Grid Corporation of China. Tables 10–12 displays the line loss parameters under each voltage level. Table 13 presents the results obtained by the DBSCAN clustering method and the boxplot probability weighting algorithm.

Table 13.

Voltage_class (kV) | [TeX:] $$\mathbf{S}_{-} \max \left(\mathbf{m} \mathbf{m}^{2}\right)$$ | [TeX:] $$\mathbf{s}_{-} \min \left(\mathbf{m} \mathbf{m}^{2}\right)$$ | L_max (km) | L_min (km) |
---|---|---|---|---|

35 | 400 | 25 | 46.4 | 0.01 |

110 | 465 | 95 | 79.088 | 0.02 |

220 | 630 | 220 | 81.545 | 0.032 |

The results obtained using DBSCAN clustering method and boxline probability weighing algorithm are shown in Table 14.

Table 14.

Method | Voltage_class (kV) | [TeX:] $$\mathbf{S}_{-} \max \left(\mathbf{m} \mathbf{m}^{2}\right)$$ | [TeX:] $$\mathbf{S}_{-} \min \left(\mathbf{m} \mathbf{m}^{2}\right)$$ | L_max (km) | L_min (km) |
---|---|---|---|---|---|

Boxplot and probability weight analysis | 35 | 400 | 70 | 24.3985 | 1.50475 |

110 | 500 | 150 | 22.7045 | 1.66425 | |

220 | 630 | 185 | 32.9459 | 0 | |

500 | 800 | 300 | 114.115 | 22.2375 | |

DBSCAN | 35 | 400 | 25 | 46.4 | 0.01 |

110 | 465 | 95 | 79.088 | 0.02 | |

220 | 630 | 220 | 81.545 | 0.032 | |

500 | 630 | 400 | 124.8 | 0.77 |

The comparison of the above two processing methods can reach the following conclusions:

1) In processing the data containing multiple variables, the DBSCAN algorithm can process and screen the multivariable data simultaneously, and the processing result is better than the result of only processing the data with a single variable. By contrast, in the boxplot and probability weighting method, a single variable takes a certain data close to the maximum and minimum values as the potentially suitable range of this parameter. The results obtained using the above methods, for not taking into account the connection of parameters, may cause inaccuracy. For example, for the 220 kV line model, the minimum line length obtained by the boxplot method is 0, which is impossible in actual situations.

2) DBSCAN has a higher data resolution, hence it can provide a more precise result for some practical problems without being influenced by some abnormal values. For example, for the power grid data in the East China province, some numerical values increased abnormally due to the misuse of the unit in the real process. If the traditional probability weight method had been used for analysis, the derived result would have had huge deviations.

To demonstrate the advantages and disadvantages of the above two algorithms, the line data of a grid transformer is used to experimentally examine the above conclusions.

In contrast, the data processing results obtained by the boxplot method are shown in Table 7.

The main grid’s transformer model is a data analysis model which is established based on the influence of the two variables of load and capacity on the result for the transformers with different windings under different voltage classes. Based on the above comparison between the clustering analysis and analysis via other data screening methods, the DBSCAN clustering algorithm is utilized to carry out a clustering analysis on the transformer model data, with clustering result shown in Table 15.

Wherein, the clustering for a certain parameter (e.g., [TeX:] $$\varepsilon=0.1$$0.1, MinPts = 20) obtains the result as shown in Fig. 2. Its ordinate represents the base 10 logarithm of the capacity.

It is observed from the Fig. 2 that the DBSCAN clustering algorithm obviously weeds out some unreasonable data with appropriate data size (take the above black data as example, the practical survey found that it is the waste data produced by users’ incorrect filling of unit) and reserves the reasonable data conforming to the characteristics of power grid.

The thresholds of 110_10 kV transformer for boxplot method are presented in Table 16.

Table 15.

Parameter[TeX:] $$(\varepsilon, \text { MinPts })$$ | Number of clusters | Outlier | Load | Capacity | ||
---|---|---|---|---|---|---|

Load_max (kW) | Load_min (kW) | C_max (kVA) | C_min (kVA) | |||

(0.05, 30) | 2 | 115 | 44.72 | 17.12 | 50,000 | 40,000 |

(0.05, 40) | 2 | 117 | 44.72 | 17.12 | 50,000 | 40,000 |

(0.05, 50) | 2 | 120 | 44.72 | 18.167 | 50,000 | 40,000 |

(0.05, 20) | 4 | 62 | 44.72 | 17.12 | 50,000 | 31,500 |

(0.1, 20) | 2 | 36 | 59.88 | 15 | 63,000 | 31,500 |

(0.1, 40) | 2 | 56 | 52 | 15 | 63,000 | 31,500 |

(0.2, 30) | 1 | 14 | 60 | 10 | 80,000 | 20,000 |

(0.2, 40) | 1 | 14 | 60 | 10 | 80,000 | 20,000 |

(0.3, 30) | 1 | 9 | 96.4 | 10 | 100,000 | 20,000 |

(0.3, 20) | 1 | 9 | 96.4 | 10 | 100,000 | 20,000 |

By repeatedly searching and comparing the raw data of the transformer, it is found that a considerable part of the data presenting the power capacity is incorrectly counted in terms of the unit of their magnitude, resulting in the great offset of the overall data. For the data processing method based on the boxplot, the range of values of parameters can only be obtained by the extremum and median of the raw data, and the threshold offset caused by the magnitude error cannot be ignored. For example, the threshold range C_max=100,000 given in Table 16 is an obvious error caused by magnitude error. For the DBSCAN method, as long as the parameters are appropriately adjusted, the interference of the magnitude error can be eliminated by clustering, as indicated by the results listed in Table 15.

Finally, through the repeated confirmation with the engineers from the State Grid, the parameters of 110_10 kV transformer are set to (0.05, 50), and the thresholds are presented in Table 17.

Currently, the proper data processing means are still absent in the statistics for smart grid data in the domain of power system. Although the traditional boxplot method and the probability weight method can screen the data, there still lacks the correlation between multiple dimensions of data, and the processing result is not perfect. The experimental verification indicates that the DBSCAN-based clustering method can effectively screen the monitoring and feedback data for the line transformer of the smart grid, and result in more satisfactory clustering effect in comparison with other methods. This method can significantly improve the accuracy of big data of the smart grid, thereby guaranteeing the reliability of the whole grid system.

It has been verified that through the above experiments the DBSCAN has a significantly better effect on obtaining the thresholds of multi-dimensional data than the boxplot method which can only analyze the one-dimensional data. Besides, the DBSCAN also solves the problem of dependence on the engineer’s experience to completely determine the line parameters. The DBSCAN can determine the threshold ranges of the related parameters, e.g. line loss, inline model. Through further confirmation of the engineers from the State Grid, the threshold range of the relevant parameters of the line model is determined in this paper.

This paper is supported by the Fundamental Research Funds for the Central Universities of China (No. KYTZ201661), China Postdoctoral Science Foundation (No. 2015M571782), and Jiangsu Agricultural Machinery Foundation (No. GXZ14002), and University Student Entrepreneurship Training Program of Jiangsu Province (No. 201810307030T).

He received the doctorate degree in Nanjing Agricultural University (China) in 2013. He currently works at Nanjing Agricultural University as an associate professor. His interests and research are focused on image processing and pattern recognition, and big data processing based on machine learning. He has authored over 10 technical journals in the area of the big data processing.

She received the doctorate degree in Nanjing Agricultural University (China) in 2014. She currently works at Nanjing Agricultural University as an associate professor. Her interests and research are focused on machine learning and deep learning. She has authored over 10 technical journals in the area of machine learning and deep learning.

- 1 N. Boumkheld, M. Ghogho, M. El Koutbi, "Energy consumption scheduling in a smart grid including renewable energy,"
*Journal of Information Processing Systems*, vol. 11, no. 1, pp. 116-124, 2015.custom:[[[-]]] - 2 X. Fang, S. Misra, G. Xue, D. Yang, "Smart grid: the new and improved power grid: a survey,"
*IEEE Communications Surveys & Tutorials*, vol. 14, no. 4, pp. 944-980, 2011.custom:[[[-]]] - 3 Y. Song, G. Zhou, Y. Zhu, "Present status and challenges of big data processing in smart grid,"
*Power System Technology*, vol. 37, no. 4, pp. 927-935, 2013.custom:[[[-]]] - 4 H. Liu, "Safety problems and maintenance technology analysis of power line overhaul,"
*Science & Technology*, vol. 2018, no. 4, pp. 69-70, 2018.custom:[[[-]]] - 5 Y. Li, D. L. Shi, "Design and research of cloud platform of computer experiment center under OpenStack,"
*China Hi-tech Industrial Development Zone*, vol. 2018, no. 8, pp. 231-231, 2018.custom:[[[-]]] - 6 G. J. Li, "The scientific value of big data research,"
*Chinese Computer Society Newsletter*, vol. 8, no. 9, pp. 8-15, 2012.custom:[[[-]]] - 7 D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning,"
*Cell*, vol. 172, no. 5, pp. 1122-1131, 2018.doi:[[[10.1016/j.cell.2018.02.010]]] - 8
*State Grid Corporation of China, (Online). Available*, http://www.cet.sgcc.com.cn/html/sgit/col1180000040/2012-07/16/201207161649280564272529_1.html - 9 H. Wu, Z. Ma, Y. Jiang, L. Wang, H. Yang, Y. Li, P. Zuo, H. Jia, W. Liu, "Direct observation of the carrier transport process in InGaN quantum wells with a pn-junction,"
*Chinese Physics B*, vol. 25, no. 11, 2016.doi:[[[10.1088/1674-1056/25/11/117803]]] - 10 D. Jiang, X. Bu, B. Sun, G. Lin, H. Zhao, Y. Cai, L. Fang, "Experimental study on ceramic membrane technology for onboard oxygen generation,"
*Chinese Journal of Aeronautics*, vol. 29, no. 4, pp. 863-873, 2016.doi:[[[10.1016/j.cja.2016.06.003]]] - 11 D. Yang, W. Kong, B. Li, X. Lian, "Intelligent vehicle electrical power supply system with central coordinated protection,"
*Chinese Journal of Mechanical Engineering*, vol. 29, no. 4, pp. 781-791, 2016.doi:[[[10.3901/CJME.2016.0401.044]]] - 12 Y. Zhang, "Design and development of online calculation data interface for regional power grid theoretical line loss,"
*Guangxi University*, 2012.custom:[[[-]]] - 13 Y. Li, L. Liu, B. Li, J. Yi, Z. Wang, S. Tian, "Calculation of line loss rate in transformer district based on improved k-means clustering algorithm and BP neural network," in
*Proceedings of the Chinese Society for Electrical Engineering*, 2016;vol. 36, no. 17, pp. 4543-4551. custom:[[[-]]] - 14 D. Z. Chen, Z. Z. Guo, "Distribution system theoretical line loss calculation based on load obtaining and matching power flow,"
*Power System Technology*, vol. 2015, no. 1, pp. 80-84, 2005.custom:[[[-]]] - 15 Q. Zhu, J. Dang, J. Chen, Y. Xu, Y. Li, X. Duan, "Power system transient stability assessment method based on deep belief network," in
*Proceedings of the Chinese Society for Electrical Engineering*, 2018;vol. 38, no. 3, pp. 735-743. custom:[[[-]]] - 16 H. Xu, G. Jiang, M. Yu, T. Luo, Z. Peng, F. Shao, H. Jiang, "3D visual discomfort predictor based on subjective perceived-constraint sparse representation in 3D display system,"
*Future Generation Computer Systems*, vol. 83, pp. 85-94, 2018.doi:[[[10.1016/j.future.2018.01.021]]] - 17 Q. M. Wu, "Research and implementation of data fault-tolerance technology in big data environment,"
*School of Engineering Management and Information TechnologyUniversity of Chinese Academy of Sciences*, 2016.custom:[[[-]]] - 18 H. X. Huang, "Research on data fault-tolerance technology in big data cloud storage,"
*Fuzhou University*, 2016.custom:[[[-]]] - 19 L. X. Zheng, "Innovative thinking of power data statistics based on data mining mode,"
*Hong Kong and Macao Economy*, vol. 2015, no. 29, pp. 118-118, 2015.custom:[[[-]]] - 20 L. Y. Min, Y. F. Cheng, H. Cheng, Z. L. Yang, Y. L. Wu, L. Z. Ge, "Measure the volume ratio of upper and lower canopy layers of fruit trees based on M-K cluster,"
*Journal of Chinese Agricultural Machinery*, vol. 2018, no. 6, pp. 1-9, 2018.custom:[[[-]]] - 21 H. Ding, "Automatic and rapid screening of data in ship noise database,"
*Ship Science and Technology*, vol. 40, no. 4, pp. 37-39, 2018.custom:[[[-]]] - 22 L. Bingzhan, L. Ziping, L. Danbing, "Intelligent BIM operation and maintenance management system based on machine learning and maintenance application of BIM+MR: taking hospital construction as an example,"
*Civil Engineering Information Technology*, vol. 9, no. 6, pp. 22-27, 2017.custom:[[[-]]] - 23 S. Ye, X. Wang, Z. Liu, Q. Qian, "Power system transient stability assessment based on support vector machine incremental learning method,"
*Dianli Xitong Zidonghua (Automation of Electric Power Systems)*, vol. 35, no. 11, pp. 15-19, 2011.custom:[[[-]]] - 24 Z. H. Zhou,
*Machine Learning*, Beijing: Tsinghua University Press, 2015.custom:[[[-]]] - 25 X. Z. Gao, "Research on control strategy and power quality improvement of microgrid,"
*Tianjin University*, 2012.custom:[[[-]]]