1. Introduction
In the age of big data, facing a new technological revolution and an explosion of data generation, clustering techniques have become increasingly important in a wide range of applications, such as text mining, search-engine result classification, and image segmentation [1-4]. Over time, many sophisticated clustering techniques have been developed. However, when dealing with complex datasets, these algorithms often encounter problems, including sensitivity to initial values, difficulty in handling high-dimensional data, and the need for manual intervention to determine the number of clusters.
To address the aforementioned limitations, Rodriguez and Laio [5] introduced a renowned clustering algorithm called DPC (clustering by fast search and find of density peaks). This algorithm not only rapidly identifies density peak points and assigns samples but also handles data of arbitrary shapes and large scales. Its simplicity and efficiency have garnered significant attention since its inception. In the DPC approach, when calculating local density and relative distance, the similarity measurement relies solely on the geometric distance between objects. Consequently, when working on multi-center and high-dimensional datasets, selecting an appropriate similarity measurement based on the shape or size of the dataset is crucial, and numerous enhanced schemes for local density calculation have emerged based on DPC [6]. Du et al. [7] incorporated the concept of k-nearest neighbor (KNN) to enhance robustness and standardized the local density calculation by considering the local distribution of objects; their algorithm mitigated the impact of the cutoff distance dc on clustering results. Liu et al. [8] adopted the nearest-neighbor concept and proposed a new clustering algorithm that redefines the calculation of local density and relative distance and designs a two-step non-center allocation scheme; it showed advantages on multi-center and irregular datasets. Ding et al. [9] proposed a density peak clustering algorithm based on dissimilarity-measurement optimization, which considers the distribution around nodes and re-measures node similarity with probability blocks, thereby moving away from the exclusive use of geometric distance in calculating local density.
The preceding algorithms adopted a "whole-to-local" approach when calculating the local density, which narrows the selection range while incorporating richer information and thus increases complexity. Additionally, they introduced the KNN concept to improve clustering accuracy, yet ultimately still relied on Euclidean distance. Evidently, the selection of the similarity measurement is crucial. However, there are few criteria for selecting measurement methods, which leads to subjectivity and may compromise clustering accuracy. To avoid this subjective error, Slonim and Tishby [10] introduced the information bottleneck (IB) theory into agglomerative clustering and proposed the AIB algorithm. Later, to expedite clustering, Slonim et al. [11] further proposed an IB-based sequential clustering algorithm named SIB. The IB takes the information loss incurred in grouping objects as the basis of similarity, omits the selection of similarity measures, and therefore avoids the subjectivity mentioned above. Liu and Wan [12], inspired by the IB theory, proposed an IB-based incremental fuzzy clustering method that can handle the clustering of high-dimensional and large-scale data. Hu et al. [13] designed an IB-theory-based clustering algorithm that shows good clustering performance for high-dimensional data. These algorithms measure the similarity between objects through mutual information loss, and their experimental results showed that IB-based clustering can not only achieve higher clustering accuracy than traditional clustering algorithms [14] but is also suitable for processing high-dimensional objects.
To eliminate the arbitrariness in selecting similarity measures and enhance clustering performance on datasets with multiple centers and higher dimensions, we employ the IB theory as our similarity measure. Based on this idea, we propose an IB-based density peaks clustering algorithm, named IBDPC. In summary, our work mainly includes the following aspects:
· IBDPC inherits the simplicity and efficiency of DPC.
· The incorporation of IB theory mitigates the subjectivity of similarity measure, thereby potentially enhancing clustering accuracy for high-dimensional and multi-center datasets.
· The experiments on ten datasets verify the effectiveness of IBDPC.
The rest of this paper is arranged as follows. In Section 2, we introduce the background of DPC and IB theory. In Section 3, our algorithm IBDPC is described in detail. Section 4 presents our experiments. In Section 5, a brief summary of the paper's work is given.
2. Related Work
2.1 DPC
The DPC algorithm operates under two fundamental presumptions: (1) objects with high local density are candidates for cluster centers, and their nearest neighbors have relatively low local density; and (2) the separation between cluster centers is considerable. Under these assumptions, the DPC algorithm mainly performs clustering through the following steps:
1) Constructing a decision diagram and finding the cluster centers, where the abscissa is the local density [TeX:] $$(\rho)$$ of each sample and the ordinate is the minimum distance [TeX:] $$(\delta)$$ from that sample to any sample with greater density;
2) Assigning each of the remaining points to the cluster of its nearest neighbor with higher density.
In DPC, the following formula is used to get the local density [TeX:] $$\rho_i$$ for each sample point [TeX:] $$x_i$$:

[TeX:] $$\rho_i=\sum_{j \neq i} \chi\left(d_{ij}-d_c\right), \quad \chi(x)=\begin{cases}1, & x<0 \\ 0, & x \geq 0\end{cases}$$
where [TeX:] $$d_{ij}$$ is a variable measuring the distance between the object [TeX:] $$x_j \text{ and } x_i \text{, and } d_c$$ is a distance threshold. This calculation formula obtains the number of objects whose distance to the object [TeX:] $$x_i$$ is less than the threshold [TeX:] $$d_c$$, that is, the local density value of the object [TeX:] $$x_i$$. The density value increases as there are more objects surrounding the object [TeX:] $$x_i$$.
For small datasets, local density [TeX:] $$\rho_i$$ can be calculated by employing the Gaussian kernel as follows:

[TeX:] $$\rho_i=\sum_{j \neq i} \exp \left(-\frac{d_{ij}^2}{d_c^2}\right)$$
The [TeX:] $$\delta_i$$ value of object [TeX:] $$x_i$$ is generally calculated as the minimum distance from this object to all other objects whose densities are greater, that is:

[TeX:] $$\delta_i=\min _{j: \rho_j>\rho_i} d_{ij}$$
For the object with the highest density in the dataset, [TeX:] $$\delta_i$$ is defined as the maximum distance from this object to any other object in the dataset:

[TeX:] $$\delta_i=\max _j d_{ij}$$
After calculating the values of [TeX:] $$\rho_i \text{ and } \delta_i$$ and constructing the decision diagram, the objects with both large density and large relative distance are chosen as cluster centers; they are usually located in the upper-right corner of the decision diagram. To assign cluster labels to the remaining objects, we sort them in descending order of local density and assign each one to the cluster of its nearest neighbor with higher density, which yields the final clustering results.
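As a compact illustration of these quantities (a sketch, not the authors' implementation), the following Python code computes [TeX:] $$\rho$$, [TeX:] $$\delta$$, and the nearest higher-density neighbor for a small dataset X with a user-chosen cutoff distance d_c, and then performs the one-pass assignment described above:

```python
# Illustrative sketch of the DPC quantities described above (not the authors' code).
import numpy as np

def dpc_rho_delta(X, d_c, gaussian=False):
    """Return local density rho, relative distance delta, and nearest higher-density neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean distances
    if gaussian:                                                  # Gaussian kernel (small datasets)
        rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    else:                                                         # cutoff kernel: neighbors within d_c
        rho = (d < d_c).sum(axis=1) - 1.0
    n = len(X)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)                                      # indices by decreasing density
    for rank, i in enumerate(order):
        if rank == 0:                                             # densest point: maximum distance
            delta[i] = d[i].max()
        else:
            higher = order[:rank]                                 # all points denser than i
            nearest_higher[i] = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, nearest_higher[i]]
    return rho, delta, nearest_higher

def dpc_assign(rho, nearest_higher, centers):
    """Assign each point to the cluster of its nearest higher-density neighbor."""
    labels = np.full(len(rho), -1)
    for k, c in enumerate(centers):                               # centers: indices picked from the decision graph
        labels[c] = k
    for i in np.argsort(-rho):                                    # descending density
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels
```

In practice, the list of centers passed to dpc_assign is read off the decision graph (objects with both large [TeX:] $$\rho$$ and large [TeX:] $$\delta$$), and the densest point is assumed to be among them.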
2.2 Information Bottleneck
IB theory, a technique rooted in information theory, originates from Shannon's rate-distortion theory. Introduced by Tishby et al. [14] in 1999, it examines the relevant information about Y contained in a source X and defines the distortion through the mutual information between them: the greater the mutual information, the smaller the distortion. The IB theory can therefore be regarded as a particular application of rate-distortion theory.
The integration of IB theory into clustering algorithms has gained traction in recent years. Within this framework, the clustering result can be regarded as a compressed representation of the dataset: the clustering process aims to compress the original data as much as possible while retaining as much of its relevant feature information as possible. Clustering is therefore regarded as the bottleneck between the data and its feature information. By invoking the principles of IB, an iterative process is formed that gradually balances the compressed representation of the data against the preservation of relevant information.
Compared with traditional clustering algorithms, the strength of IB lies in its ability to circumvent the definition of a distance function by leveraging information theory within the clustering process. Consequently, it avoids the subjective error that stems from an arbitrary choice of measurement method. In IB theory, based on the joint probability distribution between the objects and their feature information, the information loss generated in the clustering process is used as the measurement standard for clustering, so that the objects and their related feature information are considered together.
3. Information Bottleneck based Density Peak Clustering
Based on IB theory, we propose the IBDPC approach. When calculating the local density and relative distance, IBDPC introduces the IB theory. Owing to the good performance of IB theory in high-dimensional data clustering, this helps the IBDPC algorithm achieve higher accuracy when tackling complex datasets.
The IB addresses measurement distortion by introducing a related variable Y that characterizes the nature of X. Given a random variable X and a correlated variable Y following the joint probability distribution P(X, Y), we aim to make the representation of X as concise as possible while preserving the maximum amount of information about Y. The more strongly X is compressed into a representation T, the lower the mutual information I(T, Y) becomes. To keep the loss of mutual information small during compression, the amount of compression therefore acts as a "bottleneck".
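For reference, the standard IB formulation of Tishby et al. [14] expresses this trade-off as the minimization of the functional [TeX:] $$\mathcal{L}[p(t \mid x)]=I(X ; T)-\beta I(T ; Y),$$ where T denotes the compressed representation of X and the positive parameter [TeX:] $$\beta$$ balances compression against the preservation of information about Y.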
Considering the sample space and feature space (represented by X and Y, respectively) for classification, the core principle of the IBDPC algorithm is to identify a partition that incurs the minimal loss of mutual information between samples and features. Mutual information measures the amount of information that one random variable contains about another. Assuming that X and Y represent the input and output random variables of the information processing system, respectively, let [TeX:] $$p(x, y)$$ denote their joint probability distribution and [TeX:] $$p(x) \text { and } p(y)$$ the corresponding marginal probabilities; the mutual information [10] is then measured as:

[TeX:] $$I(X ; Y)=\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
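As a small, self-contained illustration of this formula (not part of the IBDPC implementation), mutual information can be computed directly from a joint probability table:

```python
# Illustrative computation of I(X; Y) from a joint probability table p(x, y).
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_xy p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = p_xy > 0                          # treat 0 * log 0 as 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Example: a weakly dependent pair of binary variables
print(mutual_information([[0.4, 0.1],
                          [0.1, 0.4]]))      # about 0.193 nats
```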
In IBDPC, local density [TeX:] $$\rho_i$$ is defined as:

[TeX:] $$\rho_i=\sum_{j \neq i} \chi\left(I_{ij}-d_c\right)$$
Let [TeX:] $$c_x \text{ and } c_y$$ represent two clusters, and let [TeX:] $$I_{ij}$$ stand for the loss of information incurred when [TeX:] $$x_i \text{ and } x_j$$ are merged. The value of [TeX:] $$I_{ij}$$ can be calculated based on information theory using Eq. (7),
where [TeX:] $$t=\frac{\left|x_i\right| * x_i}{\left|x_i \cup x_j\right|}+\frac{\left|x_j\right| * x_j}{\left|x_i \cup x_j\right|} \text {, and }|\cdot|$$ is a function for calculating the number of points in the cluster.
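For intuition, the sketch below shows the size-weighted KL-to-mixture (Jensen-Shannon type) information loss used in agglomerative IB [10], which is consistent with the mixture t defined above; the exact normalization used in IBDPC may differ, and the inputs p_i and p_j are assumed to be the conditional feature distributions of the two clusters:

```python
# Hedged sketch of an agglomerative-IB style information loss between two clusters.
# p_i, p_j: conditional feature distributions p(y|cluster); n_i, n_j: cluster sizes.
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def ib_merge_loss(p_i, p_j, n_i, n_j):
    """Information loss of merging two clusters: weighted KL to their mixture t."""
    w_i, w_j = n_i / (n_i + n_j), n_j / (n_i + n_j)
    t = w_i * np.asarray(p_i, float) + w_j * np.asarray(p_j, float)   # the mixture t above
    return n_i * kl(p_i, t) + n_j * kl(p_j, t)                        # assumed size-weighted form
```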
The value of [TeX:] $$\delta_i$$ can then be determined by the following formula:

[TeX:] $$\delta_i=\min _{j: \rho_j>\rho_i} I_{ij}$$
For the object with the highest density, the value is defined as follows:

[TeX:] $$\delta_i=\max _j I_{ij}$$
Our IBDPC utilizes the aforementioned information to construct a decision diagram, effectively segregating objects and yielding the final clustering results.
The steps of our IBDPC algorithm are as follows:
Step 1: Compute the mutual information loss [TeX:] $$I_{ij}$$ between each pair of objects; [TeX:] $$I_{ij}$$ serves as the pairwise distance measure.
Step 2: Based on the predefined cutoff distance, determine the local density [TeX:] $$\rho_i$$ of item [TeX:] $$x_i$$.
Step 3: For object [TeX:] $$x_i$$, calculate the value of [TeX:] $$\delta_i$$.
Step 4: Construct the decision graph with the values of [TeX:] $$\rho_i \text{ and } \delta_i$$.
Step 5: Identify the objects located in the upper-right area of the decision diagram (with both large [TeX:] $$\rho_i \text{ and } \delta_i$$) as cluster centers.
Step 6: Allocate the remaining points. Arrange them in descending order of density, and then assign each object to the cluster of its nearest neighbor with higher density.
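Putting the steps together, the following self-contained sketch (illustrative only, not the authors' code) runs the pipeline on objects represented by feature distributions P. The equal-weight mixture for single objects, the cutoff-style density, and the automatic selection of centers by the largest [TeX:] $$\rho \cdot \delta$$ are simplifying assumptions; the paper selects centers from the decision graph.

```python
# Simplified, illustrative IBDPC pipeline: pairwise IB loss -> rho, delta -> centers -> assignment.
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def ib_loss_matrix(P):
    """Pairwise information loss; each row of P is an object's feature distribution p(y|x)."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    loss = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            t = 0.5 * (P[i] + P[j])                       # equal-weight mixture for single objects
            loss[i, j] = loss[j, i] = 0.5 * kl(P[i], t) + 0.5 * kl(P[j], t)
    return loss

def ibdpc(P, d_c, n_centers):
    D = ib_loss_matrix(P)                                 # Step 1: IB-based dissimilarity
    n = len(D)
    rho = (D < d_c).sum(axis=1) - 1.0                     # Step 2: cutoff-style local density
    order = np.argsort(-rho)                              # points in descending density
    delta, nh = np.zeros(n), np.full(n, -1)
    for rank, i in enumerate(order):                      # Step 3: relative distance
        if rank == 0:
            delta[i] = D[i].max()
        else:
            higher = order[:rank]
            nh[i] = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, nh[i]]
    centers = np.argsort(-(rho * delta))[:n_centers]      # Steps 4-5: largest rho*delta as centers
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                       # Step 6: inherit label of nearest denser neighbor
        if labels[i] == -1:
            labels[i] = labels[nh[i]] if nh[i] >= 0 else 0
    return labels
```

For real datasets, the rows of P could be obtained, for example, by normalizing non-negative feature vectors (such as term frequencies) into probability distributions.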
4. Experiments
To assess the clustering effectiveness of IBDPC, we selected ten datasets for empirical evaluation, with detailed information presented in Tables 1 and 2. In our experiments, we compared IBDPC with DPC, DPC-KNN, and SNN-DPC (shared-nearest-neighbor-based clustering by fast search and find of density peaks) in terms of clustering accuracy. Specifically, the DPC-KNN algorithm enhances the local density computation by integrating the KNN concept into DPC, while the SNN-DPC algorithm builds the fast search for density peaks on shared nearest neighbors. Both algorithms use the KNN idea to modify the local density calculation of the DPC algorithm; however, they are easily affected by the value of k in KNN.
In our study, we employed three evaluation metrics to gauge the quality of the clustering results: adjusted mutual information (AMI), adjusted Rand index (ARI), and Fowlkes-Mallows index (FMI). Notably, the maximum value for all three metrics is 1, indicating optimal clustering performance. AMI measures the similarity between the predicted cluster assignments and the ground-truth labels based on their mutual information. ARI, which assesses the consistency between two partitions of the data, takes values in [-1, 1]. The FMI metric measures how well the clustering solution matches the dataset's actual class labels.
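All three metrics are available in common libraries; for instance, with scikit-learn (toy labels used purely for illustration):

```python
# Computing AMI, ARI, and FMI for a clustering result with scikit-learn (illustrative labels).
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, fowlkes_mallows_score

y_true = [0, 0, 1, 1, 2, 2]     # ground-truth class labels
y_pred = [0, 0, 1, 2, 2, 2]     # labels produced by a clustering algorithm

print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("FMI:", fowlkes_mallows_score(y_true, y_pred))
```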
4.1 Results from Experiments using Synthetic Datasets
The first part of the experiments was conducted on five synthetic datasets, including two typical multi-center datasets, 2circles and Fourlines. The results on these five two-dimensional datasets are presented visually in Figs. 1–5, where different colors represent different clusters.
The clustering results of the four algorithms on the Flame dataset are shown in Fig. 1. These results demonstrate that all algorithms can accurately cluster this dataset.
The clustering outcomes for the Aggregation dataset are displayed in Fig. 2. Each algorithm precisely pinpoints the cluster centers and allocates the remaining sample points reasonably, showing that all of them perform well on this dataset.
The clustering performance of these four techniques on the R15 dataset is shown in Fig. 3. The clustering results of every algorithm are precise, with the cluster center being correctly identified and the remaining sample points allocated successfully.
The clustering results on the 2circles dataset are illustrated in Fig. 4. This is a typical multi-center dataset, featuring multiple density peak points within a class, each of which is eligible as a center. The results show that the cluster centers obtained by the algorithms differ, with the centers distributed over the two different rings. On the whole, the distribution of cluster centers is correct, but only the IBDPC and SNN-DPC algorithms achieve perfect clustering results. For the DPC and DPC-KNN algorithms, although the generated centers are correct, both calculate local density based on the Euclidean distance for each object and then adopt the DPC sample allocation strategy; once a high-density sample point is assigned to the wrong cluster, nearby objects may be misallocated as well.
Fig. 1. Clustering results on Flame dataset: (a) IBDPC, (b) DPC, (c) DPC-KNN, and (d) SNN-DPC.
Fig. 2. Clustering results on Aggregation dataset: (a) IBDPC, (b) DPC, (c) DPC-KNN, and (d) SNN-DPC.
Fig. 3. Clustering results on R15 dataset: (a) IBDPC, (b) DPC, (c) DPC-KNN, and (d) SNN-DPC.
Fig. 4. Clustering results on 2circles dataset: (a) IBDPC, (b) DPC, (c) DPC-KNN, and (d) SNN-DPC.
Fig. 5. Clustering results on Fourlines dataset: (a) IBDPC, (b) DPC, (c) DPC-KNN, and (d) SNN-DPC.
Fig. 5 shows the clustering results on the multi-center Fourlines dataset. In this group of results, only the IBDPC and SNN-DPC algorithms yield accurate clusters. Unlike on the multi-center 2circles dataset, the DPC algorithm fails to generate the correct cluster centers, leading to subsequent errors in sample point allocation. The DPC-KNN algorithm obtains four correct cluster centers; however, because of its similarity measurement, a high-density sample point may be wrongly assigned to a cluster center, which propagates errors to its nearest-neighbor sample points. Therefore, it also fails to produce correct results.
In conclusion, all algorithms in our experiments achieve relatively satisfactory results on the three datasets with uniform density distribution: Flame, Aggregation, and R15. However, on the two multi-center datasets, 2circles and Fourlines, only the results generated by the IBDPC and SNN-DPC algorithms are satisfactory, while the outcomes of the other two algorithms are less than ideal. Further analysis shows that, although DPC-KNN introduces the KNN idea to unify the calculation of local density rather than relying on the parameter [TeX:] $$d_c,$$ it still uses an elementary geometric distance to compute local density, without considering the surrounding data distribution. The IBDPC algorithm retains the core idea of DPC but measures dissimilarity using the IB theory, making it effective for processing multi-center datasets.
4.2 Results from Experiments using Real-World Data
In the second part of the experiments, we compare the accuracy of the four algorithms on five real-world datasets, including high-dimensional datasets and traditional small-sample datasets. The outcomes are shown in Tables 3–5.
Table 3. Comparison in terms of AMI
Table 4. Comparison in terms of ARI
Table 5. Comparison in terms of FMI
As shown in Tables 3–5, on the two classical datasets Iris and Wine, the IBDPC algorithm achieved higher clustering accuracy than the DPC-KNN and SNN-DPC algorithms, which both use the KNN idea to calculate local density; this shows that IBDPC can achieve better clustering results after replacing the traditional Euclidean distance with the IB measure. On the Glass dataset, while the AMI and ARI values of IBDPC are marginally lower than those of the competing algorithms, its FMI value is comparatively strong, indicating that the clustering performances of the algorithms are relatively similar. The IBDPC algorithm performs best on all three clustering indexes on the higher-dimensional datasets 20NewsGroups and Webdata. Especially on the Webdata dataset, although its dimensionality is the highest in our experiments, the clustering accuracy improves significantly: the AMI is 0.7902, the ARI is 0.6153, and the FMI is 0.6747, which are 5.5%, 18.99%, and 10.57% higher than those of the DPC algorithm, respectively.
On the classic real-world datasets, the performance of the algorithms is relatively close. On the two datasets with relatively high dimensions, 20NewsGroups and Webdata, the clustering accuracy of the IBDPC algorithm is greatly improved. In summary, the IBDPC algorithm can handle datasets with diverse internal structures, and its overall performance is commendable.
Based on the clustering results of the above two parts of the experiments, both the original DPC algorithm and the DPC-KNN algorithm successfully find correct cluster centers on multi-center datasets such as 2circles. By analyzing the clustering strategies of these two algorithms, we know that:
1) While DPC employs the cutoff-distance approach to obtain the local density of sample points, the DPC-KNN algorithm optimizes this by using the KNN idea to estimate local density from each point's neighborhood.
2) The two algorithms are consistent when allocating the subsequent sample points: processing points in descending order of local density, each point is assigned to the cluster of its nearest neighbor with higher density.
3) These two approaches are comparable in terms of how local density and relative distance are calculated.
While the DPC-KNN algorithm calculates local density and relative distance within the local range of each sample point, and the DPC algorithm does so over the whole dataset, both are ultimately determined by Euclidean distance. For example, on the 2circles dataset, even though DPC and DPC-KNN find the correct cluster centers through the decision graph, they still cannot generate satisfactory clustering results in the subsequent allocation. To address this problem, we calculate the local density information of the sample points using the IB theory and use the mutual information loss to determine how similar the sample points are. We thus unify the measurement method; the experiments verify the feasibility of our algorithm, which also performs well on high-dimensional real-world datasets. In light of this analysis, it is evident that the IBDPC algorithm can process datasets with different internal structures, delivering relatively satisfactory results overall.
4.3 T-Test Analysis
For the t-test analysis, we focused on the clustering results obtained by IBDPC on two datasets, Flame and Iris. On the Iris dataset, which has three clusters, the clustering result is examined by a t-test, as shown in Table 6. The F-statistic for this test is 18.474, with a corresponding P-value of 0.000. Since there is a clear difference between the two population variances, we direct our attention to the second row of the t-test results. The T-statistic is -30.003, with a corresponding two-tailed P-value of 0.000, below the 0.05 threshold, indicating a significant difference between Group 0 and Group 1. Likewise, Group 0 and Group 2, as well as Group 1 and Group 2, exhibit substantial differences.
Table 7 shows the t-test results on the Flame dataset, which comprises two clusters. In this instance, the F-statistic is 7.325, with a corresponding P-value of 0.007; since the significance of the variance test is below 0.05, the assumption of homogeneity of variance is not met. In the final row of the t-test results in Table 7, the two-tailed P-value is 0.000, below the 0.05 threshold, and the T-statistic is -6.905. These findings underscore a considerable difference between Group 0 and Group 1.
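The testing procedure described above can be reproduced with standard statistical tools. The sketch below uses SciPy with placeholder group vectors; the variance-equality check is done here with Levene's test, a common choice that statistical packages report as an F-statistic alongside the t-test (the paper does not state which variance test was used):

```python
# Illustrative two-sample test workflow: check variance equality, then run the appropriate t-test.
import numpy as np
from scipy import stats

group0 = np.random.default_rng(0).normal(0.0, 1.0, 50)   # placeholder samples for one cluster
group1 = np.random.default_rng(1).normal(1.0, 1.5, 50)   # placeholder samples for another cluster

f_stat, f_p = stats.levene(group0, group1)                # variance-equality test (reported as F)
equal_var = f_p >= 0.05                                   # homogeneity-of-variance assumption
t_stat, t_p = stats.ttest_ind(group0, group1, equal_var=equal_var)  # Welch's t-test if variances differ
print(f_stat, f_p, t_stat, t_p)
```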
Based on this t-test analysis, we can confirm the superior clustering quality of IBDPC.
5. Conclusion
In this research, we have refined the DPC algorithm, leading to the introduction of IBDPC, a density peak clustering algorithm that leverages IB theory. IBDPC retains the advantages of the DPC algorithm, such as quickly finding density peak points and efficiently allocating samples. Meanwhile, it introduces IB theory to calculate the similarity between samples, redefining the local density calculation and unifying the similarity measurement method. IBDPC improves clustering accuracy on complex datasets, especially high-dimensional and multi-center ones. To evaluate the clustering quality of IBDPC, ten datasets were selected for our comparative experiments. The experimental results showed that the IBDPC algorithm outperforms the competitors and achieves satisfactory clustering results.