## Xiaodan Lv

Table 1.

Dataset | Attributes | Clusters | Instances | Data source
---|---|---|---|---
Lines | 2 | 4 | 400 | Synthetic
Luoxuan | 2 | 2 | 252 | Synthetic
TwoMoon | 2 | 2 | 600 | Synthetic
BreastCancer | 10 | 2 | 683 | UCI
Transfusion | 4 | 2 | 748 | UCI
Ecoli | 7 | 2 | 272 | UCI

For comparison, six additional clustering algorithms were implemented to validate the IASC algorithm: K-means [1], FCM [1], TSC [16], EIGENGAP [19], DBSCAN [11], and DP [12]. To ensure a single-variable comparison, the TSC and IASC algorithms used identical parameters. The test environment [28] of this experiment was as follows: the central processing unit (CPU) was an Intel Core i5-6200U @ 2.30 GHz; the memory was 4 GB; the programming environment was MATLAB, and the programs were written in the M language.

Table 2. Number of clusters obtained by each algorithm on the six datasets

Dataset | K-means | FCM | TSC | EIGENGAP | DBSCAN | DP | IASC
---|---|---|---|---|---|---|---
Lines | - | - | - | 5 | 4 | 4 | 2
Luoxuan | - | - | - | 3 | 1 | 2 | 2
TwoMoon | - | - | - | 2 | 2 | 3 | 2
BreastCancer | - | - | - | - | 3 | 1 | 2
Transfusion | - | - | - | 4 | 17 | 1 | 2
Ecoli | - | - | - | 5 | 1 | 2 | 2

First, we conducted a cluster-number selection experiment. In Table 2, a dash (-) indicates that the algorithm could not obtain the number of clusters automatically. The K-means, FCM, and TSC algorithms failed on all datasets because they require the number of clusters as a manual input. The EIGENGAP algorithm adopts the eigengap concept: a larger eigengap indicates a more stable subspace constructed from the selected k eigenvectors, and the position of the first maximum of the eigengap sequence is taken as the number of clusters. Because the eigenvalues of the matrix may be real or complex, the effect of this approach is not ideal. The DBSCAN algorithm classifies samples by the tightness of their distribution, but it performs well only on the Lines and TwoMoon datasets because its clustering result is sensitive to the parameters. The DP algorithm manually selects points with a high $$\delta_i$$ value and a relatively high $$\rho_i$$ value as cluster centers based on the decision graph; because this choice is influenced by factors such as human experience and the shape of the data distribution, the DP algorithm obtained the correct number of clusters on only three datasets. In contrast, IASC obtains a reasonable number of clusters on most datasets: it iterates over candidate k values, calculates the corresponding evaluation factor for each, and outputs the k with the maximum evaluation factor as the final number of clusters, thereby achieving automatic clustering. This demonstrates the competitive advantage of the IASC algorithm.
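The selection loop described above can be sketched as follows. The paper's evaluation factor is not defined in this excerpt, so a mean silhouette coefficient stands in for it, and a basic k-means (rather than the full spectral pipeline) produces the candidate partitions; both are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Basic Lloyd's k-means; a stand-in for the spectral clustering pipeline."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def evaluation_factor(X, labels):
    """Stand-in evaluation factor: mean silhouette coefficient (singletons score 0)."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = []
    for i in range(n):
        same = labels == labels[i]
        if same.sum() < 2:
            scores.append(0.0)
            continue
        a = D[i, same & (np.arange(n) != i)].mean()      # mean intra-cluster distance
        b = min(D[i, labels == c].mean()                 # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(scores))

def select_k(X, k_range):
    """Iterate over candidate k and return the one maximizing the evaluation factor."""
    return max(k_range, key=lambda k: evaluation_factor(X, kmeans(X, k)))
```

On two well-separated blobs, the score peaks at k = 2 and the loop returns that value automatically, which is the behavior the IASC selection step relies on.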

Second, a clustering accuracy experiment was conducted on the six datasets. Because EIGENGAP and DBSCAN are sensitive to their parameters, we compared the IASC algorithm only with the K-means, FCM, TSC, and DP algorithms. Table 1 contains three two-dimensional synthetic datasets, so their clustering results are shown graphically in Figs. 3–5 for ease of reading. The three UCI datasets in Table 1 have high-dimensional attributes, so their results are reported in Table 3.

Table 3. Clustering accuracy of the five algorithms on the UCI datasets

Dataset | K-means | FCM | TSC | DP | IASC
---|---|---|---|---|---
BreastCancer | 0.3499 | 0.6032 | 0.3441 | 0.3438 | 0.6428
Transfusion | 0.2607 | 0.2928 | 0.5267 | 0.2396 | 0.6845
Ecoli | 0.0221 | 0.2022 | 0.1875 | 0.7045 | 0.5110

The results in Figs. 3–5 indicate that the IASC algorithm achieves a better clustering effect on the three non-convex synthetic datasets than K-means, FCM, TSC, and DP. The clustering accuracy of the five algorithms on the UCI datasets is presented in Table 3; IASC is more accurate than K-means, FCM, TSC, and DP on most datasets. For example, on the Transfusion dataset, the clustering accuracies of K-means, FCM, TSC, and DP were 0.2607, 0.2928, 0.5267, and 0.2396, respectively, whereas IASC reached 0.6845.
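The excerpt does not define how clustering accuracy is computed; a common convention, sketched below as an assumption, matches predicted cluster ids to ground-truth classes under the best permutation and reports the fraction of correctly assigned points.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(true_labels, pred_labels):
    """Best-permutation agreement between predicted clusters and true classes.

    Factorial search over label mappings: fine for the small cluster
    counts in Tables 1-3, not for many clusters.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    best = 0.0
    # try every assignment of predicted cluster ids to true class ids
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        remapped = np.array([mapping[p] for p in pred_labels])
        best = max(best, float(np.mean(remapped == true_labels)))
    return best
```

Under this metric a perfect partition scores 1.0 even if the cluster ids are swapped relative to the ground truth, which is why accuracies are comparable across algorithms.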

In K-means, the Euclidean distance is used to estimate the similarity between points and centers: each point adopts the category of the nearest center. FCM uses the Euclidean distance to build a cost function; when the cost function reaches its minimum, the algorithm converges and outputs the results. In TSC, the Euclidean distance is used to calculate the similarity of sample points; the smaller the distance, the higher the similarity. The DP algorithm uses the Euclidean distance to build a decision graph, selects the cluster centers from it, and assigns sample points to the different categories. However, the Euclidean distance only reflects the local consistency of the spatial distribution of the data, not its global consistency; therefore, it is difficult for the above algorithms to achieve good clustering accuracy on non-convex datasets. The IASC algorithm instead uses a density-sensitive distance to estimate the similarity between sample points, which reflects the spatial distribution of the data: points lying in the same high-density region receive high similarity. In addition, the last step of the IASC algorithm uses the cosine-angle method instead of K-means to classify the feature vectors, because the cosine angle is normalized and better suited to measuring the similarity between high-dimensional vectors.
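The density-sensitive distance can be illustrated with the form used in density-sensitive spectral clustering [23,25]: each pairwise edge length is stretched exponentially to $\rho^{d(x,y)} - 1$, and the final distance is the shortest path over the resulting graph, computable with Floyd's algorithm [26]. The flexing factor value below is an illustrative choice, not the paper's setting.

```python
import numpy as np

def density_sensitive_distances(X, rho=2.0):
    """Shortest-path distances over density-adjusted edge lengths.

    rho > 1 is the flexing factor: long Euclidean jumps are stretched
    exponentially, so paths threading through dense regions are shorter
    than direct jumps across sparse gaps (global consistency).
    """
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Euclidean distances
    L = rho ** D - 1.0                                           # adjusted edge lengths
    # Floyd-Warshall all-pairs shortest path [26]
    for k in range(len(X)):
        L = np.minimum(L, L[:, k:k + 1] + L[k:k + 1, :])
    return L
```

With rho = 2 and three collinear points at x = 0, 1, 2, the direct 0-to-2 edge costs 2^2 - 1 = 3, while the path through the middle point costs (2^1 - 1) + (2^1 - 1) = 2, so the route through the denser region is preferred.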

In summary, compared with the other algorithms, the IASC algorithm substantially improves clustering accuracy through its density-sensitive similarity measure and cosine-angle classification method, which demonstrates its superiority.

In this study, we propose the IASC algorithm for data analysis. To achieve automated clustering, the IASC algorithm introduces an evaluation factor into the spectral clustering. The corresponding evaluation factor value was calculated by iteratively varying the k values, and the k value corresponding to the maximum evaluation factor was selected as the final number of clusters.

The IASC algorithm then uses a density-sensitive distance to measure the similarity between samples, which makes the data distributed in a high-density area have a higher similarity. Furthermore, to improve cluster accuracy, the IASC algorithm adopts the cosine-angle method to classify the feature vectors.

It is concluded that the IASC algorithm can automatically obtain the correct number of clusters and achieves better clustering accuracy than the other algorithms on most datasets. Therefore, IASC is more effective than the TSC algorithm.

- 1 L. Bai, X. Zhao, Y. Kong, Z. Zhang, J. Shao, and Y. Qian, "Survey of spectral clustering algorithms," *Computer Engineering and Applications*, vol. 57, no. 14, pp. 15-26, 2021. https://doi.org/10.3778/j.issn.1002-8331.2103-0547
- 2 Z. Xia, X. Wang, L. Zhang, Z. Qin, X. Sun, and K. Ren, "A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing," *IEEE Transactions on Information Forensics and Security*, vol. 11, no. 11, pp. 2594-2608, 2016. https://doi.org/10.1109/TIFS.2016.2590944
- 3 K. Xia, X. Gu, and Y. Zhang, "Oriented grouping-constrained spectral clustering for medical imaging segmentation," *Multimedia Systems*, vol. 26, pp. 27-36, 2020. https://doi.org/10.1007/s00530-019-00626-8
- 4 Z. Yu, H. Chen, J. You, J. Liu, H. S. Wong, G. Han, and L. Li, "Adaptive fuzzy consensus clustering framework for clustering analysis of cancer data," *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 12, no. 4, pp. 887-901, 2015. https://doi.org/10.1109/TCBB.2014.2359433
- 5 X. Jiang, M. Chen, W. Song, and G. N. Lin, "Label propagation-based semi-supervised feature selection on decoding clinical phenotypes with RNA-seq data," *BMC Medical Genomics*, vol. 14(Suppl 1), article no. 141, 2021. https://doi.org/10.1186/s12920-021-00985-0
- 6 U. Agrawal, D. Soria, C. Wagner, J. Garibaldi, I. O. Ellis, J. M. S. Bartlett, D. Cameron, E. A. Rakha, and A. R. Green, "Combining clustering and classification ensembles: a novel pipeline to identify breast cancer profiles," *Artificial Intelligence in Medicine*, vol. 97, pp. 27-37, 2019. https://doi.org/10.1016/j.artmed.2019.05.002
- 7 D. Xu, C. Li, T. Chen, and F. Lang, "A novel low rank spectral clustering method for face identification," *Recent Patents on Engineering*, vol. 13, no. 4, pp. 387-394, 2019. https://doi.org/10.2174/1872212112666180828124211
- 8 S. Wazarkar and B. N. Keshavamurthy, "A survey on image data analysis through clustering techniques for real world applications," *Journal of Visual Communication and Image Representation*, vol. 55, pp. 596-626, 2018. https://doi.org/10.1016/j.jvcir.2018.07.009
- 9 Z. Ding, J. Li, H. Hao, and Z. R. Lu, "Structural damage identification with uncertain modelling error and measurement noise by clustering based tree seeds algorithm," *Engineering Structures*, vol. 185, pp. 301-314, 2019. https://doi.org/10.1016/j.engstruct.2019.01.118
- 10 Q. Wu, "Research and implementation of Chinese text clustering algorithm," Ph.D. dissertation, Xidian University, Xi'an, China, 2010.
- 11 M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD)*, Portland, OR, USA, 1996, pp. 226-231.
- 12 A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," *Science*, vol. 344, no. 6191, pp. 1492-1496, 2014. https://doi.org/10.1126/science.1242072
- 13 A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," *Advances in Neural Information Processing Systems*, vol. 14, pp. 849-856, 2001.
- 14 M. Fiedler, "Algebraic connectivity of graphs," *Czechoslovak Mathematical Journal*, vol. 23, no. 2, pp. 298-305, 1973. https://doi.org/10.21136/CMJ.1973.101168
- 15 J. Shi and J. Malik, "Normalized cuts and image segmentation," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 22, no. 8, pp. 888-905, 2000. https://doi.org/10.1109/34.868688
- 16 H. Liu, J. Chen, J. Li, L. Shao, L. Ren, and L. Zhu, "Transformer fault warning based on spectral clustering and decision tree," *Electronics*, vol. 12, no. 2, article no. 265, 2023. https://doi.org/10.3390/electronics12020265
- 17 L. Hagen and A. B. Kahng, "New spectral methods for ratio cut partitioning and clustering," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 11, no. 9, pp. 1074-1085, 1992. https://doi.org/10.1109/43.159993
- 18 W. Z. Kong, Z. H. Sun, C. Yang, G. J. Dai, and C. Sun, "Automatic spectral clustering based on eigengap and orthogonal eigenvector," *Acta Electronica Sinica*, vol. 38, no. 8, pp. 1880-1885, 1891, 2010.
- 19 Z. Hu and J. Weng, "Adaptive spectral clustering algorithm based on artificial bee colony algorithm," *Journal of Chongqing University of Technology (Natural Science Edition)*, vol. 34, no. 3, pp. 137-144, 2020. https://doi.org/10.3969/j.issn.1674-8425(z).2020.03.020
- 20 R. Porter and N. Canagarajah, "A robust automatic clustering scheme for image segmentation using wavelets," *IEEE Transactions on Image Processing*, vol. 5, no. 4, pp. 662-665, 1996. https://doi.org/10.1109/83.491343
- 21 C. Gao and X. Wu, "An automatic technique to determine cluster number for complex biologic datasets," *China Journal of Bioinformatics*, vol. 8, no. 4, pp. 295-298, 2010. https://doi.org/10.3969/j.issn.16725565.2010.04.003
- 22 H. Chen, X. Shen, J. Long, and Y. Lu, "Fuzzy clustering algorithm for automatic identification of clusters," *Acta Electronica Sinica*, vol. 45, no. 3, pp. 687-694, 2017.
- 23 L. Wang, L. Bo, and L. Jiao, "Density-sensitive spectral clustering," *Acta Electronica Sinica*, vol. 35, no. 8, pp. 1577-1581, 2007.
- 24 O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in *Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics*, Bridgetown, Barbados, 2005, pp. 57-64.
- 25 P. Yang, Q. Zhu, and B. Huang, "Spectral clustering with density sensitive similarity function," *Knowledge-Based Systems*, vol. 24, no. 5, pp. 621-628, 2011. https://doi.org/10.1016/j.knosys.2011.01.009
- 26 R. W. Floyd, "Algorithm 97: shortest path," *Communications of the ACM*, vol. 5, no. 6, p. 345, 1962. https://doi.org/10.1145/367766.368168
- 27 UCI Machine Learning Repository, "Machine learning datasets," c2023 (Online). Available: https://archive.ics.uci.edu/.
- 28 X. Xu, S. Ding, L. Wang, and Y. Wang, "A robust density peaks clustering algorithm with density-sensitive similarity," *Knowledge-Based Systems*, vol. 200, article no. 106028, 2020. https://doi.org/10.1016/j.knosys.2020.106028