1. Introduction
The extraction of frequent itemsets is one of the most significant tasks in data mining and a core topic of data mining research. In a conventional database the data are deterministic: each item is present with probability 100%. In many application fields, however, such as sensor data transmission and moving-object data management, the data are uncertain. Uncertain data is now widely available in the real world, and its study cannot be ignored [1]. How to mine frequent patterns in uncertain data has therefore become a hot research subject.
To mine frequent itemsets, several algorithms have been proposed and widely used. Among them, the Apriori algorithm and the FP-growth algorithm [2-5] are the two most classical. These algorithms, however, cannot mine frequent itemsets from uncertain data, because the existence and the precision of such data are not exact. To mine frequent itemsets from uncertain data, several algorithms have been proposed [6]. The most important are the U-Apriori algorithm, based on Apriori, and the UF-growth algorithm, based on FP-growth. The U-Apriori algorithm, proposed by Chui et al., replaces the count-based support of deterministic data with the expected support [7]. However, U-Apriori still has to scan the database more than once, which limits its efficiency. The UF-growth algorithm merges items into a tree node only when they have the same probability, which leads to a large number of redundant nodes [8]. To overcome the shortcomings of U-Apriori, this paper proposes an uncertain data mining algorithm based on the concept lattice, referred to as the UD-Apriori algorithm, which greatly improves efficiency.
For deterministic data, association rule mining based on the concept lattice has been studied in depth by many scholars and remains a hot topic [9]. For uncertain data, however, few scholars have combined the concept lattice with association rule mining. Potential high average-utility sequential pattern mining (PHAUSPM) is a new approach to sequential pattern mining that overcomes the limitations of the earlier potential high-utility sequential pattern mining framework by taking the sequence size into account, providing a fairer measure of the patterns than previous works [10]. A list-based pattern mining method over uncertain data that considers an importance condition has also been proposed [11]. Raman et al. [12] propose a new framework for dealing with sequences in uncertain databases that considers both weight and support constraints. Lee and Yun [13] provide a precise and efficient method for mining uncertain frequent patterns based on unique data structures and mining approaches, which also guarantees the correctness of the mining results without any false positives. Lee et al. [14] describe and analyze current state-of-the-art approaches based on tree structures and offer a novel uncertain frequent pattern mining strategy.
Although some scholars have introduced concept lattice theory, they treat every concept intension as equally important, without taking the differences among intensions into account, which results in low efficiency and long running times [15]. Aiming at the differences in importance among concept intensions, this paper puts forward a new notion, the connotation difference degree. When constructing the concept lattice for mining frequent itemsets from uncertain data, attributes with a low difference degree do not need to participate in the construction, which greatly reduces the size of the lattice and, to a certain extent, alleviates the combinatorial explosion. On the other hand, the construction does not need to scan the database once per layer, which saves construction time. This paper first gives the necessary definitions, then describes the difference-degree-based construction of the concept lattice, and finally presents the uncertain data mining algorithm based on the connotation difference degree. Experiments confirm that the algorithm is effective.
The contributions of this paper are as follows:
· We propose a new method of uncertain data mining based on the concept lattice of connotation difference degree (c_diff), which reduces the number of database scans.
· Experimental results under varying system parameters demonstrate a substantial performance improvement of our method over competing methods. For example, the UD-Apriori algorithm consumes less running time than the U-Apriori algorithm under different threshold values.
The rest of this paper is organized as follows. Section 2 discusses uncertain data. Section 3 proposes a new construction method of the concept lattice based on the difference degree. Mining of uncertain data based on the difference-degree concept lattice is described in Section 4. The experimental results are presented in Section 5. Finally, Section 6 concludes the paper.
2. Uncertain Data
In the research of uncertain data, the most common model is the possible worlds model. The model expands an uncertain database into many possible world instances (each a deterministic database instance), and the sum of the probabilities of these instances is 1 [1]. For example, suppose a disease can be caused by lack of exercise or by an irregular diet, but the probability of each cause differs between patients. For patient A, the probability that the disease is caused by lack of exercise is 90% and by irregular diet is 10%; for patient B, the probabilities are 70% and 30%, respectively.
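To make the model concrete, the following sketch enumerates the possible worlds of a tiny uncertain database and checks that their probabilities sum to 1. The two-transaction database and its probabilities are illustrative (not taken from the paper's tables), and independence of item existence is assumed.

```python
from itertools import product

# A tiny uncertain database: each transaction maps item -> existence probability.
# These example items and probabilities are illustrative only.
uncertain_db = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.7, "B": 0.3},
]

def possible_worlds(db):
    """Enumerate every possible world (a deterministic database) with its probability."""
    choices_per_txn = []
    for txn in db:
        items = list(txn)
        txn_choices = []
        # Each uncertain item is either present or absent in a given world.
        for mask in product([True, False], repeat=len(items)):
            present = {it for it, keep in zip(items, mask) if keep}
            p = 1.0
            for it, keep in zip(items, mask):
                p *= txn[it] if keep else (1.0 - txn[it])
            txn_choices.append((present, p))
        choices_per_txn.append(txn_choices)
    worlds = []
    for combo in product(*choices_per_txn):
        world = [present for present, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((world, prob))
    return worlds

worlds = possible_worlds(uncertain_db)
print(len(worlds))                           # 16 worlds for 2 transactions x 2 items
print(round(sum(p for _, p in worlds), 10))  # 1.0
```

Even this toy example shows why possible-world enumeration is infeasible in practice (the number of worlds is exponential), motivating the expected-support formulation used below.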
In a deterministic database, if itemset X is a frequent itemset and Support(X) denotes its support, then Support(X) must be larger than the minimum threshold given by the user. In an uncertain database, every possible world exists with some probability, so the exact count of an itemset cannot be computed, as shown in Table 1. Table 2 is the mining background corresponding to Table 1. Therefore, in an uncertain database, the expected support exp(X) is used instead of Support(X) [16].
Uncertain database instance
Mining background of the uncertain database instance
Definition 1 (Support probability): In uncertain data, if [TeX:] $$P\left(x, T_i\right)$$ denotes the probability that item x appears in transaction [TeX:] $$T_i$$, then the expected support of itemset X, exp(X), is deduced and proved in [17] to be: [TeX:] $$\exp (X)=\sum_{i=1}^{n} \prod_{x \in X} P\left(x, T_i\right)$$
Definition 2 (Probability frequent itemset): In uncertain data, if itemset X is a frequent itemset, then exp(X) must be greater than the minimum threshold (min_thre) given by the user; otherwise, X is not a frequent itemset.
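Definitions 1 and 2 can be sketched as follows, assuming attribute independence within each transaction (the standard model for expected support). The database, item names, and threshold below are illustrative, not the paper's Table 1.

```python
# Expected support: exp(X) = sum over transactions of the product of the
# probabilities of the items of X in that transaction (Definition 1).

def expected_support(db, itemset):
    """Compute exp(X) for an uncertain database of {item: probability} rows."""
    total = 0.0
    for txn in db:
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)  # an absent item contributes probability 0
        total += p
    return total

def is_probability_frequent(db, itemset, min_thre):
    """Definition 2: X is a probability frequent itemset iff exp(X) > min_thre."""
    return expected_support(db, itemset) > min_thre

# Illustrative uncertain database.
uncertain_db = [
    {"A": 0.9, "B": 0.8},
    {"A": 0.5, "B": 0.4},
    {"B": 1.0},
]

print(round(expected_support(uncertain_db, {"A"}), 6))       # 1.4
print(round(expected_support(uncertain_db, {"A", "B"}), 6))  # 0.92
print(is_probability_frequent(uncertain_db, {"A", "B"}, 0.5))  # True
```

Note that, unlike a deterministic support count, exp(X) is a real number, so the comparison against min_thre replaces the exact counting that Apriori-style algorithms perform.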
Definition 3. The formal context of data mining is a triple K = (G, M, I), where G is a set of objects, M is a set of attributes, and I is a binary relation between G and M. If object g ∈ G has attribute m ∈ M, the relation between them is written gIm.
Given a formal context K = (G, M, I), an object set [TeX:] $$A \subseteq G$$, and an attribute set [TeX:] $$B \subseteq M$$, define the mappings: [TeX:] $$f(A)=\{m \in M \mid \forall g \in A, g I m\}, \quad g(B)=\{g \in G \mid \forall m \in B, g I m\}$$
Definition 4. The pair C(A, B) is a concept of the formal context K = (G, M, I) if it satisfies f(A) = B and g(B) = A. A is called the extension of the concept and B the connotation (intension) of the concept.
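The derivation mappings f and g of Definitions 3 and 4 can be illustrated with a small hypothetical formal context (the object and attribute names below are invented for illustration, not the paper's Table 2):

```python
# Illustrative formal context: object -> set of attributes it has (the relation I).
context = {
    "g1": {"a", "b"},
    "g2": {"a", "c"},
    "g3": {"a", "b", "c"},
}
M = {"a", "b", "c"}  # the attribute set

def f(A):
    """f(A): the attributes shared by every object in A."""
    if not A:
        return set(M)
    attrs = set(M)
    for obj in A:
        attrs &= context[obj]
    return attrs

def g(B):
    """g(B): the objects possessing every attribute in B."""
    return {obj for obj, attrs in context.items() if B <= attrs}

A = {"g1", "g3"}
B = f(A)
print(sorted(B))      # ['a', 'b']
print(g(B) == A)      # True: (A, B) is a formal concept, since g(f(A)) == A
```

A pair (A, B) with f(A) = B and g(B) = A is exactly a formal concept in the sense of Definition 4; checking g(f(A)) == A is a direct way to test concept-hood.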
Property 1. Let the formal context K = (G, M, I) contain A, A1, [TeX:] $$A2 \subseteq G$$ and B, B1, [TeX:] $$B2 \subseteq M$$. Then:
(1) [TeX:] $$A 1 \subseteq A 2 \Rightarrow f(A 1) \supseteq f(A 2), B 1 \subseteq B 2 \Rightarrow g(B 1) \supseteq g(B 2).$$
(2) [TeX:] $$A \subseteq g(f(A)), B \subseteq f(g(B)).$$
Definition 5 (Super-concept and sub-concept). Given a formal context K = (G, M, I) and formal concepts C1(A1, B1) and C2(A2, B2), if [TeX:] $$A 1 \subseteq A 2 \text { or } B 2 \subseteq B 1,$$ then concept C1 is a sub-concept of concept C2, and concept C2(A2, B2) is the parent concept of concept C1(A1, B1), written (A1, B1) ≤ (A2, B2). The set of all concepts of the formal context, ordered by this partial relation, forms a lattice called the concept lattice [18], denoted L(K). If no concept C3(A3, B3) other than C1 and C2 satisfies (A1, B1) ≤ (A3, B3) ≤ (A2, B2), then C1 is a direct sub-concept of C2, and concept C2(A2, B2) is the direct parent concept of concept C1(A1, B1).
Definition 6 (Connotation difference degree). In an uncertain database, the probabilities of the attributes are not the same; that is, their importance differs, especially in association rule mining. Because the criticality of these attributes differs, the difference among the attributes is called the connotation difference degree (or connotation attribute difference), denoted c_diff.
Property 2. If an attribute appears too many times, its connotation difference degree will be small.
For example, suppose A, B, C, D, E, and F are attributes in the uncertain database and the rules A, C ⇒ E and A, D, C ⇒ E hold; then the appearance of attributes A and C determines the appearance of attribute E. If A, B ⇒ F and B, D ⇒ F also hold, then attribute F appears as long as attribute B appears. Because F appears whenever B appears, the attribute F does not play a decisive role, and it will not be of much significance as the antecedent of a rule.
Property 3. If the expected support of an attribute is small, its connotation difference degree will also be small.
3. Construction Method of Concept Lattice based on Difference Degree
There is a close relationship between the attribute difference degree and the expected support: in an uncertain database, when the expected support is low, the attribute difference degree is also low. Attributes with a low difference degree would cause overfitting in the construction of the concept lattice, so they are removed and do not take part in the construction. This not only reduces the time complexity of constructing the concept lattice but also reduces the complexity of mining association rules from uncertain data.
In the process of constructing the concept lattice, the following two rules apply, based on the above definitions:
1) In the first layer of the concept lattice, the sum of each attribute's probabilities is used as its initial difference degree c_diff. If c_diff is less than the minimum support threshold [TeX:] $$\varphi$$, the attribute is removed from the concept lattice.
2) For the second and subsequent layers of the concept lattice, the expected support exp(X) is calculated by Formula 1. If exp(X) is not less than the minimum support threshold [TeX:] $$\varphi$$, then c_diff = exp(X); otherwise, the node is eliminated from the concept lattice.
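The two construction rules above can be sketched as follows, using the standard expected-support formula for both layers. The database and the threshold φ = 0.5 are illustrative, and the layer-by-layer candidate generation shown here is a simplification of the paper's lattice construction, not the exact UD-Apriori procedure.

```python
import math
from itertools import combinations

def expected_support(db, itemset):
    """exp(X): sum over transactions of the product of the items' probabilities."""
    return sum(math.prod(txn.get(it, 0.0) for it in itemset) for txn in db)

def build_lattice_layers(db, phi):
    """Layer-by-layer construction with c_diff pruning (rules 1 and 2)."""
    # Rule 1: the first layer keeps attributes whose summed probability >= phi.
    attrs = sorted({it for txn in db for it in txn})
    first = {(a,): expected_support(db, (a,)) for a in attrs}
    layers = [{k: v for k, v in first.items() if v >= phi}]
    # Rule 2: later layers keep itemsets with exp(X) >= phi; c_diff = exp(X).
    k = 2
    while layers[-1]:
        survivors = sorted({a for key in layers[-1] for a in key})
        layer = {}
        for combo in combinations(survivors, k):
            e = expected_support(db, combo)
            if e >= phi:
                layer[combo] = e  # this node's c_diff
        if not layer:
            break
        layers.append(layer)
        k += 1
    return layers

# Illustrative uncertain database (not the paper's Table 1).
db = [{"A": 0.9, "B": 0.8, "C": 0.2},
      {"A": 0.6, "B": 0.7},
      {"A": 0.8, "C": 0.1}]
layers = build_lattice_layers(db, phi=0.5)
print(layers[0])  # first-layer c_diff values above the threshold (C is pruned)
```

In this example, attribute C has c_diff = 0.3 < 0.5, so rule 1 removes it before any second-layer candidate containing C is ever generated; this is the pruning that reduces the combinatorial explosion mentioned above.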
For example, Table 1 is an uncertain database and Table 2 is the corresponding mining background. If the threshold [TeX:] $$\varphi=0.5$$, the difference-degree concept lattice constructed from Table 2 is as shown in Fig. 1.
Fig. 1 shows the specific process of constructing the concept lattice. Consider the mode AB in the second layer of Fig. 1:
The concept lattice of uncertain data mining.
Therefore, this pattern is skipped without scanning the database.
Mode CD:
Therefore, the pattern CD is removed from the concept lattice.
4. Mining of Uncertain Data based on Difference Concept Lattice
Association rules are rules of the form X ⇒ Y, where [TeX:] $$X=m_{i 1} \cup m_{i 2} \cup \ldots \cup m_{i j},$$ [TeX:] $$Y=m_{j 1} \cup m_{j 2} \cup \ldots \cup m_{j k},$$ each [TeX:] $$m \in M$$ is an attribute, and [TeX:] $$X \cap Y=\emptyset.$$
Suppose the concept lattice L(K) contains concepts C1(A1, B1) ≤ C2(A2, B2), and let the expected supports of concepts C1 and C2 be [TeX:] $$\alpha \text { and } \beta(\alpha \geq \varphi) \text {. }$$ Since concept C2 is the parent concept of concept C1, it can be concluded that [TeX:] $$\beta \geq \alpha \geq \varphi \text {. }$$ If B1-B2 denotes the set of attributes in B1 but not in B2, then the rule B2 ⇒ B1-B2 holds, and its support is α. For a given threshold [TeX:] $$\varphi$$, meaningful association rules are thus easily obtained from the concept lattice.
Given the above definitions of the concept lattice, mining uncertain data is straightforward and is based on the following two rules:
1) If the relationship (A1, B1) ≤ (A2, B2) holds between concepts C1(A1, B1) and C2(A2, B2) of the concept lattice L(K), the association rule B2 ⇒ B1-B2 can be concluded. Otherwise, if (A1, B1) ≤ (A2, B2) does not hold but the two concepts have a non-empty common sub-concept C(A, B), the association rules B1 ⇒ B-B1 and B2 ⇒ B-B2 can be obtained.
2) If the relationship (A1, B1) ≤ (A2, B2) does not hold between concepts C1(A1, B1) and C2(A2, B2) of L(K), and their common sub-concept C(A, B) is empty, the association rules B1 ⇒ B2-B and B2 ⇒ B1-B can be concluded; otherwise, the rules B2-B ⇒ B1 and B1-B ⇒ B2 can be concluded.
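The first rule-generation case (a parent/child concept pair yielding B2 ⇒ B1-B2 with the child's support, as derived above) can be sketched as follows. The concept list, extents, and support values are illustrative placeholders, not the lattice of Fig. 1.

```python
# Each concept is (extent, intent, expected support); extents and intents are
# frozensets so they can be compared with subset operators.

def rules_from_lattice(concepts, phi):
    """Emit B2 => B1-B2 for every pair C1(A1,B1) <= C2(A2,B2) with exp >= phi."""
    rules = []
    for a1, b1, supp1 in concepts:
        for a2, b2, _ in concepts:
            # C1 is a sub-concept of C2 when A1 is strictly contained in A2
            # (equivalently, B2 is strictly contained in B1).
            if a1 < a2 and b2 < b1 and supp1 >= phi:
                rules.append((b2, b1 - b2, supp1))  # rule B2 => B1-B2, support alpha
    return rules

# Illustrative two-concept lattice fragment: a child and its parent.
concepts = [
    (frozenset({1, 2}), frozenset({"A", "B"}), 1.14),
    (frozenset({1, 2, 3}), frozenset({"A"}), 2.3),
]
for antecedent, consequent, supp in rules_from_lattice(concepts, phi=0.5):
    print(set(antecedent), "=>", set(consequent), "support", supp)
# {'A'} => {'B'} support 1.14
```

Because every parent/child edge of the lattice yields a rule directly, rule generation requires no further database scans once the lattice has been built, which is the efficiency argument behind the proposed method.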
Mining uncertain data rules with the concept lattice can extract rules from different levels of the lattice, the purpose being to obtain association rules of any length to meet the needs of users at different levels. However, some problems remain in the generation of the concept lattice based on the attribute difference degree. For example, attributes with low expected support are deleted and do not participate in the construction of the lattice, but in the process of lattice construction, attributes with very high expected support may also fail to participate. To ensure the completeness of association rule mining, attributes with very high expected support must not be deleted: they belong to the frequent itemsets and should participate in rule construction. The flow chart of the algorithm is shown in Fig. 2. The proposed algorithm is presented in detail as pseudo-code in Algorithm 1.
Mining of uncertain data based on the concept lattice of connotation difference degree (c_diff)
5. Experimental Results
To ensure the correctness of the experiments, the data set and the U-Apriori implementation used to run the programs were downloaded from http://dbgroup.cs.tsinghua.edu.cn/liyan/u_mining.tar.gz. The experiments used a deterministic data set that was converted into an uncertain data set with evenly distributed uncertainty. The experimental platform was a computer with a Pentium 2.10 GHz CPU, 1 GB of memory, and the Windows 7 OS; the UD-Apriori and U-Apriori algorithms were implemented in C++ to solve the frequent itemset mining problem. The experiments compared running times under different minimum support thresholds.
Experiment 1: The minimum support thresholds were 0.1, 0.2, 0.3, 0.5, 0.8, and 0.9, and the uncertain data set used for testing contained 300,000 transactions in total. The running time changes with the threshold (Support Threshold) selected by the user, as shown in Fig. 3.
Fig. 3 compares the running times of the UD-Apriori and U-Apriori algorithms. The experimental results show that the UD-Apriori algorithm is more efficient than the U-Apriori algorithm.
Experiment 2: The minimum support threshold values were 0.2, 0.3, 0.4, 0.5, 0.6, and 0.9. Using a data set with a total of 100K transactions, the experiment tested how the running time of mining uncertain frequent itemsets changes with these threshold values. With the selection of different thresholds (Support Threshold), the running time changes, as shown in Fig. 4. Fig. 4 shows the running times of the UD-Apriori and U-Apriori algorithms. The experimental results show that the UD-Apriori algorithm consumes less running time than the U-Apriori algorithm under the different threshold values.
The results of Experiments 1 and 2 show that the running time of the algorithms grows with the amount of data. The two experiments show that the UD-Apriori algorithm has high operating efficiency; in particular, the larger the amount of data, the greater the advantage of the UD-Apriori algorithm over the U-Apriori algorithm.
Experiment 3: To further verify the effectiveness of the UD-Apriori algorithm, the mining accuracy of the two algorithms was compared. In terms of data mining accuracy, the UD-Apriori algorithm is more accurate than the U-Apriori algorithm, as shown in Fig. 5.
Comparison of accuracy between the two algorithms.
6. Conclusion
At present, the concept lattice has been used by many researchers to study association rules, and significant progress has been achieved. However, there is little research on association rule analysis in uncertain data mining. Previous approaches do not directly map frequent itemsets in uncertain data to the connotations of the concept lattice, and they scan the database once per layer, wasting time. The method proposed in this paper, constructing the concept lattice based on the connotation difference degree, can effectively solve these existing problems.