Article Information
Corresponding Author: Yongli Liu* (yongli.buaa@gmail.com)
Yongli Liu*, School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China, yongli.buaa@gmail.com
Hengda Wang, School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China, chananwhd@126.com
Tianyi Duan, School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China, dtyhpu@sina.com
Jingli Chen, School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China, jinglichen_hpu@163.com
Hao Chao, School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China, chaohao@hpu.edu.cn
Received: August 4, 2017
Revision received: September 29, 2017
Revision received: November 15, 2017
Accepted: December 12, 2017
Published (Print): April 30, 2019
Published (Electronic): April 30, 2019
1. Introduction
Modern information technology is developing at a remarkable pace. As a result, data abundance and information overload have replaced information scarcity as the new problem puzzling users, for whom it has become more difficult to find valuable information. People have gradually realized that the structure and knowledge behind data are far more important, so the massive data must be organized effectively. Data mining, the process of discovering knowledge from huge amounts of data, has therefore become a hot topic in many fields. Among the many techniques of data mining, the well-known clustering technique aims at grouping objects into clusters so that objects in the same cluster are relatively similar, while objects in different clusters are relatively dissimilar.
So far, many clustering algorithms have been proposed [1]. These algorithms fall mainly into two types, hard clustering algorithms and soft clustering algorithms, whose representatives are K-means [2] and fuzzy C-means (FCM) [3], respectively. The former assigns each object to exactly one cluster, while the latter allows each object to belong to more than one cluster.
However, both K-means and FCM devote their effort to minimizing the trace of the within-cluster scatter matrix, which can be interpreted as a compactness measure based on within-cluster variation [4,5]. That is, these algorithms only try to make objects in the same cluster as similar as possible. As we know, within-cluster similarity is only one aspect of clustering. On the other hand, a clustering algorithm should also ensure that objects in different clusters are as dissimilar as possible. In other words, apart from minimizing the trace of the within-cluster scatter matrix, a clustering algorithm should maximize the trace of the between-cluster scatter matrix, which can be interpreted as a separation measure based on between-cluster variation [5].
Based on a fuzzy scatter matrix, Wu et al. [5] proposed the fuzzy compactness and separation (FCS) clustering algorithm. In FCS, compactness is measured by a fuzzy within-cluster variation, and separation by a fuzzy between-cluster variation, so FCS can consider within-cluster compactness and between-cluster separation simultaneously. Their experimental results showed that the algorithm was efficient and robust.
With the development of information technology, data is growing exponentially. When processing large-scale data, FCS can become infeasible and inefficient. In this case, an incremental clustering algorithm becomes essential. Incremental methods process data elements one at a time and typically use much less space than is needed to store the whole dataset. Many methods [6-17] have been designed to solve large-scale data clustering problems, but it is still a challenge to apply fuzzy clustering algorithms to obtain well-separated clusters in a computationally economical way. Hore et al. [18] proposed two novel incremental clustering approaches, namely single-pass fuzzy C-means (SPFCM) and online fuzzy C-means (OFCM) [19], which treat large-scale datasets as streaming data. Their performance is very close to what can be obtained by clustering all the data at once.
Motivated by the above analysis, we propose two incremental FCS algorithms in this paper, namely SPFCS and OFCS. Built on FCS, these two algorithms simultaneously consider within-cluster compactness and between-cluster separation, and are thus more robust and efficient. At the same time, thanks to the ‘single-pass’ and ‘online’ incremental strategies, they can easily process large-scale data.
In summary, the characteristics of our SPFCS and OFCS include: (i) they consider not only within-cluster compactness but also between-cluster separation, (ii) they can process large-scale data by employing ‘single-pass’ and ‘online’ incremental strategies, and (iii) they are more robust to noise and to the value of the fuzzy parameter.
The remainder of this paper is organized as follows. In Section 2, we provide a literature review of the FCS algorithm and incremental FCM-type clustering algorithms. Section 3 introduces our algorithms in detail. Section 4 presents our experiments and discusses the experimental results. Finally, we conclude our work.
2. Related Work
2.1 FCS Algorithm
Before introducing FCS clustering, we list the mathematical notations used in this paper in Table 1.
Table 1. List of mathematical notations
To describe within-cluster compactness and between-cluster separation at the same time, Wu et al. [5] proposed the fuzzy total scatter matrix SFT, the fuzzy within-cluster scatter matrix SFW, and the fuzzy between-cluster scatter matrix SFB. Consistent with the definitions in [5], these three matrices are defined as:
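[TeX:] $$S_{F T}=\sum_{c=1}^{C} \sum_{i=1}^{N} u_{c i}^{m}\left(x_{i}-\overline{x}\right)\left(x_{i}-\overline{x}\right)^{T}$$ (1)

[TeX:] $$S_{F W}=\sum_{c=1}^{C} \sum_{i=1}^{N} u_{c i}^{m}\left(x_{i}-v_{c}\right)\left(x_{i}-v_{c}\right)^{T}$$ (2)

[TeX:] $$S_{F B}=\sum_{c=1}^{C} \sum_{i=1}^{N} u_{c i}^{m}\left(v_{c}-\overline{x}\right)\left(v_{c}-\overline{x}\right)^{T}$$ (3)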
where m is the weighting exponent (m>1), [TeX:] $$\overline{x}=\sum_{i=1}^{N} x_{i} / N$$, and we restrict [TeX:] $$\sum_{c=1}^{C} u_{c i}=1, u_{c i} \in[0,1]$$. Furthermore, the relation [TeX:] $$S_{F T}=S_{F W}+S_{F B}$$ holds among [TeX:] $$S_{F T}, S_{F W} \text { and } S_{F B}$$. To minimize the within-cluster compactness measure and simultaneously maximize the between-cluster separation measure, the objective function of FCS is defined, following [5], as:
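[TeX:] $$J_{F C S}=\sum_{c=1}^{C} \sum_{i=1}^{N} u_{c i}^{m}\left\|x_{i}-v_{c}\right\|^{2}-\sum_{c=1}^{C} \eta_{c} \sum_{i=1}^{N} u_{c i}^{m}\left\|v_{c}-\overline{x}\right\|^{2}$$ (4)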
where [TeX:] $$\left\|x_{i}-v_{c}\right\|^{2}$$ is the squared Euclidean distance between the ith data point and the centroid of the cth cluster. To guarantee that no two of the cluster kernels overlap, ηc is chosen as in Eq. (5), so that this parameter controls the influence of the between-cluster separation:
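[TeX:] $$\eta_{c}=\frac{\beta}{4} \cdot \frac{\min _{c^{\prime} \neq c}\left\|v_{c^{\prime}}-v_{c}\right\|^{2}}{\max _{k}\left\|v_{k}-\overline{x}\right\|^{2}}$$ (5)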
where 0≤β≤1.0, k=1,...,C.
To minimize the objective function in Eq. (4), some restrictions are imposed on the memberships.
For a given data point xi, if [TeX:] $$\left\|x_{i}-v_{c}\right\|^{2} \leq \eta_{c}\left\|v_{c}-\overline{x}\right\|^{2}$$, then [TeX:] $$u_{c i}=1 \text { and } u_{c^{\prime} i}=0$$ for all c′≠c. That is, each cluster in FCS has a crisp boundary: all data points inside this boundary have a crisp membership value [TeX:] $$u_{c i} \in\{0,1\}$$, while data points outside this boundary have fuzzy membership values [TeX:] $$u_{c i} \in[0,1]$$, as shown in Fig. 1.
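For data points outside all crisp boundaries, minimizing Eq. (4) under the membership constraints leads to update formulas of the form reported in [5]:

[TeX:] $$u_{c i}=\frac{\left(\left\|x_{i}-v_{c}\right\|^{2}-\eta_{c}\left\|v_{c}-\overline{x}\right\|^{2}\right)^{-1 /(m-1)}}{\sum_{k=1}^{C}\left(\left\|x_{i}-v_{k}\right\|^{2}-\eta_{k}\left\|v_{k}-\overline{x}\right\|^{2}\right)^{-1 /(m-1)}}$$ (6)

[TeX:] $$v_{c}=\frac{\sum_{i=1}^{N} u_{c i}^{m}\left(x_{i}-\eta_{c} \overline{x}\right)}{\left(1-\eta_{c}\right) \sum_{i=1}^{N} u_{c i}^{m}}$$ (7)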
Fig. 1. A sample cluster obtained by FCS.
2.2 Incremental Fuzzy Clustering Algorithm Based on FCM
In this section, we introduce two incremental fuzzy clustering algorithms based on FCM, namely SPFCM and OFCM. These two algorithms combine the single-pass and the online clustering strategies, respectively, with weighted fuzzy C-means (WFCM) as the underlying clustering algorithm.
2.2.1 WFCM algorithm
The WFCM algorithm is a modification of FCM that attaches a weight to each data point, so that points with higher weights have more influence on the centroids than those with lower weights. For a given dataset [TeX:] $$X=\left[x_{1}, \ldots, x_{n}\right]$$ that needs to be clustered into C groups, the goal of WFCM is to minimize the objective function JWFCM, which in its standard weighted form is
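[TeX:] $$J_{W F C M}=\sum_{c=1}^{C} \sum_{i=1}^{n} w_{i} u_{c i}^{m}\left\|x_{i}-v_{c}\right\|^{2}$$ (8)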
where wi is the weight of the i-th data point.
The following restrictions are needed so that Eq. (8) can achieve its minimum value:
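[TeX:] $$\sum_{c=1}^{C} u_{c i}=1, \quad u_{c i} \in[0,1], \quad i=1, \ldots, n$$ (9)

Minimizing Eq. (8) under Eq. (9) yields the familiar weighted updates

[TeX:] $$u_{c i}=\left(\sum_{k=1}^{C}\left(\frac{\left\|x_{i}-v_{c}\right\|^{2}}{\left\|x_{i}-v_{k}\right\|^{2}}\right)^{1 /(m-1)}\right)^{-1}, \quad v_{c}=\frac{\sum_{i=1}^{n} w_{i} u_{c i}^{m} x_{i}}{\sum_{i=1}^{n} w_{i} u_{c i}^{m}}$$ (10)

in which the weights wi affect the centroids but cancel in the membership update.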
2.2.2 Single-pass and online incremental clustering algorithm
Traditional clustering algorithms process the entire dataset directly, but they become unavailable when the memory of a single machine is insufficient to store the dataset. To solve this problem, Hore et al. [18] proposed two incremental algorithms, SPFCM and OFCM.
In SPFCM, the WFCM algorithm is used for clustering. Initially, the weight of each data point is set to 1, [TeX:] $$w_{d a t a}=[1,1, \dots, 1]^{T}$$. The dataset is then divided into several chunks, [TeX:] $$X=\left[X^{1}, \ldots, X^{h}\right]$$. When the first chunk X1 arrives (t=1), its centroids [TeX:] $$\Delta^{1}=\left[v_{1}, \ldots, v_{C}\right]$$ are obtained, and SPFCM calculates the weight of each centroid as [TeX:] $$w_{c}=\sum_{i=1}^{n} u_{c i} w_{d a t a}$$, where n is the number of data points in the first chunk. After the first chunk has been processed (t>1), a new chunk is generated by merging the previous centroids into the next chunk, [TeX:] $$X^{t^{\prime}}=\left[\Delta^{t-1}, X^{t}\right]$$, and the weight of each new centroid is updated as [TeX:] $$w_{c}^{t}=\sum_{i=1}^{n}\left(u_{c i}\right)\left[w_{c}^{t-1}, w_{d a t a}\right]$$.
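To make the bookkeeping concrete, the following minimal Python sketch outlines this single-pass loop. The routine wfcm(X, w, C, m), assumed to return the centroids and a C×n membership matrix, is a hypothetical stand-in for any WFCM implementation.

```python
import numpy as np

def spfcm(chunks, C, wfcm, m=2.0):
    """Single-pass sketch: carry C weighted centroids from chunk to chunk."""
    V_prev, w_prev = None, None
    for X in chunks:                           # each chunk: (n_t, d) array
        w_data = np.ones(len(X))               # raw points start with weight 1
        if V_prev is None:                     # t = 1: cluster the first chunk alone
            Xt, wt = X, w_data
        else:                                  # t > 1: merge previous centroids in
            Xt = np.vstack([V_prev, X])
            wt = np.concatenate([w_prev, w_data])
        V_prev, U = wfcm(Xt, wt, C, m)         # weighted clustering; U has shape (C, len(Xt))
        w_prev = U @ wt                        # new centroid weights: sum of weighted memberships
    return V_prev, w_prev
```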
Unlike SPFCM, the OFCM algorithm clusters each chunk of data individually using FCM; the centroids of all the chunks are then collected and grouped by performing clustering once more. In this final step, the weights of the collected centroids are given by [TeX:] $$w=\left[w_{1}^{1}, \ldots, w_{C}^{1}, w_{1}^{2}, \ldots, w_{C}^{2}, \ldots, w_{1}^{h}, \ldots, w_{C}^{h}\right]$$.
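Continuing the previous sketch, and additionally assuming a hypothetical plain-FCM routine fcm(X, C, m), the online strategy might look like:

```python
def ofcm(chunks, C, fcm, wfcm, m=2.0):
    """Online sketch: cluster each chunk alone, then cluster all weighted centroids."""
    all_V, all_w = [], []
    for X in chunks:
        V, U = fcm(X, C, m)                    # plain FCM on this chunk only
        all_V.append(V)
        all_w.append(U.sum(axis=1))            # chunk centroid weights: summed memberships
    # one final weighted clustering of the h*C collected centroids into C groups
    V_final, _ = wfcm(np.vstack(all_V), np.concatenate(all_w), C, m)
    return V_final
```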
3. Incremental Fuzzy Clustering Algorithms Based on a Fuzzy Scatter Matrix
In this section, we introduce two incremental fuzzy clustering algorithms based on a fuzzy scatter matrix, called SPFCS and OFCS, respectively.
To apply the single-pass and online clustering strategies to FCS, the data points in FCS must be weighted. First, by weighting each data point in Eqs. (2) and (3) in the same way as in WFCM, the weighted within-cluster and between-cluster scatter matrices can be defined as
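[TeX:] $$S_{F W}^{w}=\sum_{c=1}^{C} \sum_{i=1}^{N} w_{i} u_{c i}^{m}\left(x_{i}-v_{c}\right)\left(x_{i}-v_{c}\right)^{T}$$ (11)

[TeX:] $$S_{F B}^{w}=\sum_{c=1}^{C} \sum_{i=1}^{N} w_{i} u_{c i}^{m}\left(v_{c}-\overline{x}\right)\left(v_{c}-\overline{x}\right)^{T}$$ (12)

(Here the superscript w simply marks the weighted versions of Eqs. (2) and (3).)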
Based on the above analysis, the objective function of the incremental FCS (IFCS) algorithm is designed, analogously to Eq. (4), as:
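[TeX:] $$J_{I F C S}=\sum_{c=1}^{C} \sum_{i=1}^{N} w_{i} u_{c i}^{m}\left\|x_{i}-v_{c}\right\|^{2}-\sum_{c=1}^{C} \eta_{c} \sum_{i=1}^{N} w_{i} u_{c i}^{m}\left\|v_{c}-\overline{x}\right\|^{2}$$ (13)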
restricted by the following constraints:
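[TeX:] $$\sum_{c=1}^{C} u_{c i}=1, \quad u_{c i} \in[0,1], \quad i=1, \ldots, N$$ (14)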
where the parameter ηc is chosen as in Eq. (5).
The constrained optimization of IFCS in Eq. (13), subject to the constraints in Eq. (14), can be solved by applying the Lagrange multiplier method, yielding, with multipliers λi, the new objective function
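[TeX:] $$J(U, V)=\sum_{c=1}^{C} \sum_{i=1}^{N} w_{i} u_{c i}^{m}\left(\left\|x_{i}-v_{c}\right\|^{2}-\eta_{c}\left\|v_{c}-\overline{x}\right\|^{2}\right)+\sum_{i=1}^{N} \lambda_{i}\left(1-\sum_{c=1}^{C} u_{c i}\right)$$ (16)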
Taking the partial derivative of J(U,V) in Eq. (16) with respect to U and setting the gradient to zero, we have
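[TeX:] $$m w_{i} u_{c i}^{m-1}\left(\left\|x_{i}-v_{c}\right\|^{2}-\eta_{c}\left\|v_{c}-\overline{x}\right\|^{2}\right)-\lambda_{i}=0$$ (17)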
Solving Eq. (17) subject to the constraints in Eq. (14) yields the formula for uci as
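[TeX:] $$u_{c i}=\frac{\left(\left\|x_{i}-v_{c}\right\|^{2}-\eta_{c}\left\|v_{c}-\overline{x}\right\|^{2}\right)^{-1 /(m-1)}}{\sum_{k=1}^{C}\left(\left\|x_{i}-v_{k}\right\|^{2}-\eta_{k}\left\|v_{k}-\overline{x}\right\|^{2}\right)^{-1 /(m-1)}}$$ (18)

Note that the weight wi cancels in the normalization, so the membership update has the same form as in the original FCS.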
Likewise, taking the partial derivative of J(U,V) in Eq. (16) with respect to V, setting the gradient to zero, and solving the resulting equation, we get vc as
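[TeX:] $$v_{c}=\frac{\sum_{i=1}^{N} w_{i} u_{c i}^{m}\left(x_{i}-\eta_{c} \overline{x}\right)}{\left(1-\eta_{c}\right) \sum_{i=1}^{N} w_{i} u_{c i}^{m}}$$ (19)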
The incremental procedures of SPFCS and OFCS are the same as those of SPFCM and OFCM, respectively, except that IFCS replaces WFCM as the underlying clustering method.
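In terms of the sketches in Section 2.2.2, this amounts to swapping in a hypothetical weighted-FCS routine, say weighted_fcs(X, w, C, m), implementing Eqs. (13)-(19):

```python
# Hypothetical usage: the incremental drivers stay the same; only the
# underlying clustering routine changes from WFCM to weighted FCS (IFCS).
V_sp, w_sp = spfcm(chunks, C=3, wfcm=weighted_fcs)                     # SPFCS
V_on = ofcm(chunks, C=3,
            fcm=lambda X, C, m: weighted_fcs(X, np.ones(len(X)), C, m),
            wfcm=weighted_fcs)                                         # OFCS
```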
4. Experiments
4.1 Datasets
We conduct experiments on six datasets, including two artificial datasets and four real datasets, whose information is shown in Table 2. The two artificial datasets are constructed following the work of Wu et al. [5], and the four real datasets come from the UCI repository. The four UCI datasets contain 150, 2310, 3498, and 20000 sample objects, respectively, which helps to show the clustering performance of our algorithms at different data scales.
The Unequal Sample Size (USS) dataset contains 100 data points with two attributes. The samples can be divided into two groups, one with 97 samples and the other with 3 samples.
The Noise dataset is a two-dimensional dataset with sample size n=400. The data points can be divided into two clusters, one of which contains a noisy point.
The Iris dataset is a four-dimensional dataset with 150 data points, 50 from each of three clusters. One cluster is linearly separable from the other two, which are not linearly separable from each other.
The Statlog Segmentation (SS) dataset was drawn randomly from a database of seven outdoor images. The images were hand-segmented to create a classification for every pixel. It is a 19-dimensional dataset with 2310 data points, divided into 7 clusters of 330 points each.
The Pen-Based Recognition of Handwritten Digits (PD) dataset was created by collecting 250 samples from each of 44 writers. We randomly select 3498 samples for the experiment. These 16-dimensional samples can be divided into 10 clusters.
The Letter Recognition dataset is used to identify 26 letters. It is a 16-dimensional dataset with 20,000 samples, which can be divided into 26 clusters.
4.2 Evaluation Criteria
The clustering results of the algorithms are evaluated by F-measure [20] and entropy.
4.2.1 F-measure
F-measure, also known as F-score, is the weighted harmonic mean of precision and recall. The value of F-measure lies in [0,1]; the greater the value, the better the clustering performance. The measure is defined as
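[TeX:] $$F=\frac{2 P R}{P+R}$$ (20)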
where P denotes the precision, which is the probability that a (randomly selected) retrieved document is relevant, and R denotes the recall, which is the probability that a (randomly selected) relevant document is retrieved in a search.
4.2.2 Entropy
The concept of entropy was originally proposed by the German physicist Clausius [21] in 1865 to represent the degree of internal disorder of a system. The entropy is defined, in its standard form, as
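[TeX:] $$E=-\sum_{i} p_{i} \log p_{i}$$ (21)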
where pi denotes the occurrence probability of the ith class. The smaller the entropy value, the better the clustering result.
4.3 Experimental Results
For the USS dataset, we run the SPFCM, SPFCS, OFCM, and OFCS algorithms separately with m set to 2 and 6 and β set to 0.1 [10]. The sample size of each chunk is set to 75 points. This group of experimental results is illustrated in Fig. 2. In Fig. 2, the blue circles denote objects of one cluster, the red crosses denote objects of the other cluster, and the dark squares are the centroids of the corresponding clusters. The two attributes of this dataset correspond to abscissa values from 20 to 100 and ordinate values from 20 to 70, respectively. Fig. 2(a) and (b) display the clustering results of SPFCM with m=2 and m=6, respectively. Intuitively there should be two clusters; however, some objects that should belong to one cluster are incorrectly assigned to the other. In Fig. 2(a) and (b), the objects in the right cluster are sparse and behave more like noisy data in clustering. From this perspective, SPFCM is easily affected by noisy data. Furthermore, comparing Fig. 2(a) and (b), SPFCM with different values of m generates markedly different results, as the centroid of the second cluster is visibly displaced. Fig. 2(c) and (d) illustrate the clustering results of SPFCS with m=2 and m=6, respectively. These results are obviously better than those of SPFCM: the objects are grouped into two clusters in accordance with human intuition. In other words, the so-called noisy data has very little influence on SPFCS. At the same time, Fig. 2(c) is very similar to Fig. 2(d), which shows that different values of m also have very little influence on SPFCS.
Fig. 2(e) and (f) show the clustering results of OFCM with m=2 and m=6, respectively. We observe that the clustering performance of OFCM is very similar to that of SPFCM, so a similar conclusion can be drawn: OFCM is sensitive to noisy data and to the value of m. Fig. 2(g) and (h) show the clustering results of OFCS with m=2 and m=6, respectively. In terms of clustering performance in this group of experiments, OFCS is very similar to SPFCS. Unlike SPFCM and OFCM, SPFCS and OFCS are more robust to noise and insensitive to the fuzzy index m.
In the Noise dataset, objects have two attributes, corresponding to abscissa values from -3 to 7 and ordinate values from -2 to 2, respectively. These objects are divided into two clusters, separated at 2 on the horizontal axis, as shown in Fig. 3. To this dataset we add a noisy point with coordinates (100, 0). Fig. 3(a), (c), (e), and (g) illustrate the experimental results of SPFCM, SPFCS, OFCM, and OFCS with m=2, respectively, and Fig. 3(b), (d), (f), and (h) display the clustering results of SPFCM, SPFCS, OFCM, and OFCS with m=6, respectively. The experimental results show that when m=2, the SPFCM and OFCM algorithms obtain accurate clustering results, while when m=6, one data point is placed in the wrong cluster because of the noisy point. However, SPFCS and OFCS both obtain an accurate separation when m is equal to either 2 or 6. This shows that the SPFCS and OFCS algorithms are more robust to noise and less sensitive to the value of m than SPFCM and OFCM.
Fig. 2. SPFCM, SPFCS, OFCM, and OFCS clustering results on the USS dataset.
Fig. 3. SPFCM, SPFCS, OFCM, and OFCS clustering results on the Noise dataset.
On the four real datasets, we implemented six algorithms: SPFCM, single-pass hyperspherical fuzzy C-means (SPHFCM) [10], SPFCS, OFCM, online hyperspherical fuzzy C-means (OHFCM) [10], and OFCS. On the Iris dataset, m takes the values 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, and 6, and the sample size of each chunk is set to 30 points. We choose F-measure as the criterion to evaluate the performance of the clustering algorithms. As shown in Fig. 4, as the value of m becomes larger, the performance of SPFCS and OFCS remains steady, while the F-measure of the other algorithms drops. The extreme deviation values of the F-measure of SPFCS, SPFCM, and SPHFCM are 0.047, 0.086, and 0.148, respectively, and their standard deviation values are 0.018, 0.035, and 0.047, respectively. In the OFCS, OFCM, and OHFCM group, the extreme deviation values are 0.011, 0.173, and 0.068, respectively, and the standard deviation values are 0.004, 0.066, and 0.023, respectively. Clearly, in both experimental groups, the fluctuation of SPFCS and OFCS across different values of m is the lowest. Hence, SPFCS and OFCS are more robust to the value of m than the other algorithms.
Fig. 4. Influence of m on the SPFCM, SPHFCM, SPFCS, OFCM, OHFCM, and OFCS algorithms on the Iris dataset: (a) single-pass algorithms and (b) online algorithms.
Tables 3 and 4 show the performance of SPFCM, SPHFCM, SPFCS, OFCM, OHFCM, and OFCS on the four real datasets in terms of F-measure and entropy, respectively. It can be seen from these two tables that SPFCS and OFCS perform better than the other four algorithms in most cases: the F-measure values of SPFCS and OFCS are mostly larger than those of SPFCM and SPHFCM, and of OFCM and OHFCM, respectively, and their entropy values are relatively lower. Especially on the PD dataset, the average F-measure value of SPFCS is larger than those of SPFCM and SPHFCM by 70.21% and 118.70%, respectively, and the average entropy value of SPFCS is smaller than those of SPFCM and SPHFCM by 81.38% and 61.14%, respectively. In the OFCS, OFCM, and OHFCM group, the corresponding average deviations are 47.04% and 54.99% for F-measure, and 48.08% and 42.48% for entropy. We therefore conclude that our SPFCS and OFCS algorithms achieve better clustering accuracy than the other four algorithms on the four real datasets.
Table 3. Clustering performance in terms of F-measure on real datasets
Table 4. Clustering performance in terms of entropy on real datasets
SPFCM, SPHFCM, OFCM, and OHFCM are FCM-type clustering algorithms. Although they are all incremental clustering algorithms able to process large-scale data, they only try to minimize the within-cluster scatter matrix trace, like FCM. In contrast, both SPFCS and OFCS try to minimize the within-cluster scatter matrix trace and maximize the between-cluster scatter matrix trace, addressing within-cluster compactness and between-cluster separation simultaneously. Therefore, these two algorithms more easily achieve higher clustering accuracy.
In the last part of our experiments, taking the SS dataset as an example, we investigate the influence of noise on SPFCS and OFCS. By adding 5%, 10%, 20%, and 40% noisy points to the dataset, we obtain the results shown in Fig. 5. Note that as the proportion of added noise increases, the clustering performance of SPFCS, OFCS, SPHFCM, and OHFCM remains steady, while that of SPFCM and OFCM declines. The standard deviation of the F-measure values illustrates the fluctuation of these algorithms under different proportions of noise. In Fig. 5(a), the standard deviation values of SPFCS, SPFCM, and SPHFCM are 0.018, 0.067, and 0.015, respectively. In Fig. 5(b), the values of OFCS, OFCM, and OHFCM are 0.015, 0.045, and 0.026, respectively. These results show that, compared with SPFCM and OFCM, SPFCS and OFCS are more robust to noise.
Fig. 5. The influence of noise on clustering: (a) single-pass algorithms and (b) online algorithms.
5. Conclusion
To reveal the latent knowledge hidden behind large-scale data, clustering techniques have gained wide attention. In the clustering process, the FCS algorithm considers within-cluster compactness and between-cluster separation synthetically, and therefore tends to produce more accurate clustering results. However, this algorithm has difficulty processing large-scale data.
Based on a fuzzy scatter matrix, we extend FCS and propose two incremental fuzzy clustering algorithms, SPFCS and OFCS. First, we weight the centroids obtained from each chunk processed by the FCS algorithm so that the weighted algorithm can be combined with the ‘single-pass’ and ‘online’ strategies. Then, we conduct experiments on several artificial and real datasets. The experimental results show that, compared with SPFCM, OFCM, SPHFCM, and OHFCM, the SPFCS and OFCS algorithms achieve better clustering performance. Moreover, SPFCS and OFCS are more robust to the fuzzy index m and to noise than the SPFCM and OFCM algorithms.
In the clustering processes of SPFCS and OFCS, the number of clusters needs to be specified in advance. This encourages us to design better K-value prediction algorithms in future studies.
Acknowledgement
The authors would like to acknowledge the support of the Foundation for University Key Teachers of Henan Province (No. 2015GGJS-068), the Fundamental Research Funds for the Universities of Henan Province (No. NSFRF1616), and the Scientific and Technological Project Foundation of Henan Province (No. 172102210279).