Contribution to Improve Database Classification Algorithms for Multi-Database Mining

Salim Miloudi*, Sid Ahmed Rahal* and Salim Khiat*

Abstract: Database classification is an important preprocessing step for multi-database mining (MDM). In fact, when a multi-branch company needs to explore its distributed data for decision making, it is imperative to classify these multiple databases into similar clusters before analyzing the data. To search for the best classification of a set of n databases, existing algorithms generate from 1 to (n2–n)/2 candidate classifications. Although each candidate classification is included in the next one (i.e., clusters in the current classification are subsets of clusters in the next classification), existing algorithms generate each classification independently, that is, without reusing the clusters from the previous classification. Consequently, existing algorithms are time consuming, especially when the number of candidate classifications increases. To overcome the latter problem, we propose in this paper an efficient approach that represents the problem of classifying the multiple databases as the problem of identifying the connected components of an undirected weighted graph. Theoretical analysis and experiments on public databases confirm the efficiency of our algorithm against existing works and show that it overcomes the problem of increasing execution time.

Keywords: Connected Components, Database Classification, Graph-Based Algorithm, Multi-Database Mining

1. Introduction

Many large organizations have different types of business distributed through their branches, and each branch may have its own database that collects transactions continuously from customers. For business decision making, traditional data mining techniques, called mono-database mining, integrate all the data from these databases to amass a huge dataset for knowledge discovery. However, this leads to an expensive search cost for centralized processing and might disguise some important patterns (i.e., frequent itemsets and association rules). To overcome the latter problems, these multiple databases have to be classified into clusters of similar databases according to their data correlation. For example, if an inter-state company has 10 branches, including 5 branches for household appliances, 3 branches for clothing and 2 branches for food, we cannot apply existing data mining techniques over the union of the 10 branch databases. We might need to classify them into 3 clusters according to their type of business, and then analyze each database cluster individually. Consequently, the multi-database mining (MDM) [1-4] task becomes more manageable.

In order to classify transactional multiple databases, few algorithms have been proposed in the literature [5-8]. The most efficient algorithm in terms of accuracy and time complexity is due to Adhikari and Rao [7], which finds the best classification of a set of n databases in O(m×n2) time, such that m (1 ≤ m ≤ (n2–n)/2) is the number of all candidate classifications. When m = (n2–n)/2 and n is very large, the time complexity increases considerably. Hence, to reduce the time complexity of the existing algorithms, we present in this paper an improved algorithm that classifies a set of n databases D={D1,D2,…,Dn} by analyzing and determining the connected components of an undirected weighted graph G=(D,E), such that D is the vertex set and E the edge set (definitions are given in Section 3.1).
The rest of the paper is organized as follows: Section 2 discusses the existing works and points out their limitations. In Section 3, we describe the proposed approach for classifying the multiple databases. In Section 4, we perform some experiments to analyze and compare the proposed algorithm with the existing works in terms of accuracy and running time. Finally, Section 5 concludes this paper and highlights future work.

2. Related Works

The traditional MDM process aims to integrate all the data from the multiple databases for knowledge discovery. This approach may not be a good strategy because of the large amount of irrelevant data involved during the process. To overcome the issue of data irrelevancy, Liu et al. [9,10] proposed a preprocessing technique called database selection to identify which databases are most likely relevant to a user query (Q). A relevance measure named RF is designed to identify relevant databases containing reliable information relative to the user application. Database selection is a good strategy to reduce the amount of explored data while improving the quality of the patterns mined from the multiple databases. However, this approach depends on the user application, and when the multiple databases are mined without specifying any application, database selection cannot be applied.

The previous work motivated Wu et al. [5] to propose a new application-independent approach for classifying multiple databases. A similarity measure sim was designed to group similar databases within the same cluster. The similarity between two databases Di and Dj, denoted sim(Di,Dj), is calculated based on the items shared between Di and Dj. Such a similarity measure could be useful to estimate the correlation between large databases, since extracting more information such as frequent itemsets could be costly in terms of CPU time. However, using items may produce low accuracy in finding the correct similarity between two databases; in fact, two databases are similar if they share many transactions. Based on the previous similarity measure, an algorithm named BestClassification [5] was proposed to search for the best classification of a set of n databases. This algorithm calls a procedure GreedyClass to generate a classification for each similarity threshold δ defined initially by the user. Hence, two databases Di and Dj are similar if the similarity value sim(Di,Dj) is above δ. The time complexity of the proposed algorithm is O(h×n4), such that n is the number of databases and h is the number of classifications generated before obtaining the best classification. Although the experimental results shown in [5] indicate that correct classifications are obtained for certain similarity thresholds, the time complexity of the algorithm remains high and becomes severe when the number of databases increases. Moreover, BestClassification fails to find the best classification when an incorrect similarity step size is chosen, as shown in [6]. In certain cases, the while-loop of the algorithm does not terminate and may run indefinitely without finding any classification. The latter problem is due to the strong dependence of the algorithm on the similarity step size, which is a user input. Inspired by the same concepts presented in [5], Li et al. [6] modified the algorithm BestClassification [5] in order to optimize its time complexity and obtain the correct best classification of the multiple databases.
In order to avoid missing the best classification (i.e., in the case in which an incorrect step size has been selected), the distinct similarity values between the n databases are used as similarity thresholds to generate classifications. Thus, for each distinct similarity value, sorted in increasing order, a classification is produced. The time complexity of their improved algorithm is O(h×n3). Although the experimental results show that the proposed algorithm is efficient and always produces the best complete classification, the similarity measure still needs to be improved since it still depends on the common items shared between databases. Moreover, the time complexity of the algorithm needs to be optimized.

In 2007, Adhikari and Rao [7] proposed an approach for multi-database clustering that uses a more accurate similarity measure based on the support of the frequent itemsets shared between databases. The time complexity of the proposed algorithm is O(m×n2), such that n is the number of databases and m (1 ≤ m ≤ (n2–n)/2) is the number of all candidate classifications. The experimental results obtained in [7] show that the proposed algorithm is effective and finds correct and optimal classifications only at a few similarity levels. However, its time complexity needs to be improved, especially when m = (n2–n)/2, which may occur frequently when the similarity values between databases are distinct. In 2013, Liu et al. [8] proposed an approach for completely classifying databases based on the same similarity measure proposed in [7]. The proposed algorithm is simple to implement, but the clustering is not efficient when the number of databases increases. In fact, the time complexity of the algorithm is O(t×2n×n), such that n is the number of databases and t is the number of distinct similarity values between the n databases.

According to this study, existing multi-database classification algorithms [5-8] produce hierarchical classifications that verify the following property. Let class(D,sim,δi) and class(D,sim,δj) be two candidate classifications of D={D1,D2,…,Dn} generated at two consecutive similarity levels δi and δj. Then, for each cluster gx in class(D,sim,δi), there is certainly a cluster gy in class(D,sim,δj) such that gx⊆gy. Despite this property, the previous algorithms produce each classification independently; that is, instead of using the earlier clusters (i.e., those generated in the previous classifications) to build the clusters of the next classification, they generate each classification starting from the initial state where each database forms one cluster {D1},{D2},…,{Dn}. Consequently, the existing algorithms are time consuming and do unnecessary work. The above observations motivated us to design an improved database classification algorithm that overcomes the latter problems. To achieve our goal, we need to generate each classification from the earlier classifications generated so far. The proposed approach and algorithms are presented in the following section.

3. Proposed Approach and Algorithms

In this section, we describe the proposed approach for classifying the multiple databases. We assume that each database has been mined using algorithms for frequent itemset discovery [11-13]. Let D={D1,D2,…,Dn} be the set of n database objects to classify. Then the problem of generating a classification class(D,sim,δ) at a similarity level δ can be described in terms of determining the connected components of an undirected weighted graph G=(D,E).
We consider the database set D={D1,D2,…,Dn} as the vertex set of G and E as the edge set. In the following, the terms database and vertex are used interchangeably. The weight of an edge (Di,Dj) ∈ E is the similarity value between the corresponding databases, sim(Di,Dj). At a certain similarity level δ, there is an edge connecting two vertices Di and Dj if and only if sim(Di,Dj) ≥ δ. Initially, the graph G=(D,E) is empty with n disconnected vertices, i.e., E=∅. Then, similarities are calculated between each vertex pair (Di,Dj), for i,j=1 to n, using an appropriate similarity measure. After that, edges are added to E starting with the vertex pairs having the highest similarity, as follows. Let δ1,δ2,…,δm be the m distinct similarity values between the n databases sorted in decreasing order (i.e., δ1>δ2>…>δm), and let Eδl={(Di,Dj) ∈ E, sim(Di,Dj)=δl, i,j=1 to n, i≠j} be the list of edges with weight value equal to δl (l=1 to m). For each distinct similarity value δl, a candidate classification class(D,sim,δl) is generated by adding the edge list Eδl to the edge set E such that E=Eδ1∪Eδ2∪…∪Eδl-1 (i.e., there are l–1 classifications generated earlier). Then, it remains to identify the connected components of G in order to discover the database clusters in class(D,sim,δl). At a given similarity level δl, if each connected component of G, denoted gδlk, forms a clique (i.e., a subset of vertices, each pair of which is connected by an edge in E), then the corresponding classification class(D,sim,δl)={gδl1,gδl2,…,gδlk} is called a complete classification. In this paper, we are interested in finding such classifications. For classification assessment, the authors in [7] proposed a measure, goodness, based on the intra-cluster similarity, the inter-cluster distance and the number of clusters. The complete classification with the maximum goodness value is selected as the best classification of the database set. Our classification approach is depicted in Fig. 1. In the following subsection, we give some definitions required to understand the proposed approach.

3.1 The Relevant Concepts

Referring to the related concepts proposed by Adhikari and Rao [7], we have the following definitions.

DEFINITION 1. The similarity measure sim between two databases Di and Dj is calculated as follows.
(1)
$$\operatorname{sim}(D_i, D_j, \alpha) = \frac{\sum_{I \in (\operatorname{FIS}(D_i,\alpha) \,\cap\, \operatorname{FIS}(D_j,\alpha))} \min\{\operatorname{supp}(I, D_i), \operatorname{supp}(I, D_j)\}}{\sum_{I \in (\operatorname{FIS}(D_i,\alpha) \,\cup\, \operatorname{FIS}(D_j,\alpha))} \max\{\operatorname{supp}(I, D_i), \operatorname{supp}(I, D_j)\}}$$

where α is the minimum support threshold, I ∈ (FIS(Di,α) ∩ FIS(Dj,α)) denotes that I is a frequent itemset in both Di and Dj, and supp(I,Di) is the support of I in database Di. The similarity measure sim has the following properties.

PROPERTY 1. sim(Di,Dj,α) satisfies: (1) 0 ≤ sim(Di,Dj,α) ≤ 1; (2) sim(Di,Dj,α) = sim(Dj,Di,α); (3) sim(Di,Di,α) = 1, for i,j=1…n.

DEFINITION 2. The distance measure dist between databases Di and Dj is defined as follows.
(2)
$$\operatorname{dist}(D_i, D_j, \alpha) = 1 - \operatorname{sim}(D_i, D_j, \alpha)$$

The distance function dist has the following properties.

PROPERTY 2. dist(Di,Dj,α) satisfies: (1) 0 ≤ dist(Di,Dj,α) ≤ 1; (2) dist(Di,Dj,α) = dist(Dj,Di,α); (3) dist(Di,Di,α) = 0, for i,j=1…n.

DEFINITION 3. Let class(D,sim,δ)={gδ1, gδ2,…,gδk} be a classification of k clusters of the database set D={D1,D2,…,Dn} at the similarity level δ. The intra-cluster similarity of class(D,sim,δ) is defined as follows.
(3)
$$\text{intra-sim}(\operatorname{class}(D,\delta)) = \sum_{l=1}^{k} \; \sum_{\substack{D_i, D_j \in g_l^{\delta} \\ i \neq j}} \operatorname{sim}(D_i, D_j, \alpha)$$

DEFINITION 4. The inter-cluster distance of class(D,sim,δ) at the similarity level δ is defined as follows.
(4)
$$\text{inter-dist}(\operatorname{class}(D,\delta)) = \sum_{\substack{g_p^{\delta}, g_q^{\delta} \\ p \neq q}} \; \sum_{\substack{D_i \in g_p^{\delta},\, D_j \in g_q^{\delta}}} \operatorname{dist}(D_i, D_j, \alpha)$$

The best classification is selected based on maximizing both the intra-cluster similarity and the inter-cluster distance. Hence, we have the following definition of the goodness measure.

DEFINITION 5. The goodness of class(D,sim,δ) is defined as follows.
(5)
$$\operatorname{goodness}(\operatorname{class}(D,\delta)) = \left| \text{intra-sim}(\operatorname{class}(D,\delta)) + \text{inter-dist}(\operatorname{class}(D,\delta)) - k \right|$$

The classification that gets the maximum value of goodness is selected as the best classification at the similarity level δ.

DEFINITION 6. Let G=(D,E) be the graph representing the database set D. The vertex set D={D1,D2,…,Dn} is the set of nodes in the graph G, and E ⊆ D×D = {(Di,Dj), 1≤i,j≤n} is the set of pairs (Di,Dj) representing the links between vertices in G.

DEFINITION 7. At a similarity level δ, a classification class(D,sim,δ) is complete if the following equation holds:
(6)
$$\sum_{i=1}^{k} \frac{\left| g_i^{\delta} \right|^2 - \left| g_i^{\delta} \right|}{2} = |E|$$

where |gδi| denotes the number of vertices in the connected component gδi and |E| is the number of edges in the graph G. In other words, if each cluster gδi (i=1 to k) in class(D,sim,δ) is a clique in G, then class(D,sim,δ) is a complete classification.

The proposed approach is now illustrated with the help of an example.

EXAMPLE. Let D={D1, D2,…, D6} be the database set to classify. For illustration purposes, the similarity values between the six databases are given in Table 1.

Table 1. Similarity values between the six databases
From the upper triangle of the similarity table, there are five distinct similarity values. Consequently, there are five lists of edges with the same weight. Starting with the highest similarity, the five similarity values are sorted in decreasing order as follows: (0.79, 0.56, 0.50, 0.45, 0.40). Let G=(D,E). Initially, |D|=6 and E=∅. Therefore, a trivial complete classification is obtained, class(D)={{D1},{D2},{D3},{D4},{D5},{D6}}, with goodness(class(D))=2.20. For each similarity value δ listed in decreasing order, a candidate classification class(D,sim,δ) is generated and checked to see whether it is a complete classification, as follows.

For δ=0.79, we have E0.79={(D2,D4)} and E=E ∪ E0.79, which gives class(D,sim,0.79)={{D2,D4},{D1},{D3},{D5},{D6}}. This is a complete classification because

$$\sum_{i=1}^{k} \frac{\left| g_i^{0.79} \right|^2 - \left| g_i^{0.79} \right|}{2} = \frac{|\{D_2,D_4\}|^2 - |\{D_2,D_4\}|}{2} + \frac{|\{D_1\}|^2 - |\{D_1\}|}{2} + \frac{|\{D_3\}|^2 - |\{D_3\}|}{2} + \frac{|\{D_5\}|^2 - |\{D_5\}|}{2} + \frac{|\{D_6\}|^2 - |\{D_6\}|}{2} = 1 = |E|,$$

with goodness(class(D,0.79))=3.78. The remaining classifications are summarized in Table 2. There is only one classification that is not complete, at δ=0.50. This is because |E|=3 and

$$\sum_{i=1}^{k} \frac{\left| g_i^{0.50} \right|^2 - \left| g_i^{0.50} \right|}{2} = 4 \neq |E|.$$

According to the results in Table 2, the best complete classification of the 6 databases, i.e., the one that has the maximum value of goodness, is class(D,sim,0.45)={{D2,D4,D6},{D1,D3,D5}}. In the following section, we describe the data structures and algorithms used in the proposed classification approach.

Table 2. Candidate classifications of the six databases
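To make the computation of Eqs. (1) and (2) concrete, the following minimal Java sketch evaluates sim and dist from the frequent itemsets of two databases, assuming each set FIS(Di,α) is given as a map from itemset to support. The class and method names are illustrative and are not part of the original algorithms.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the similarity measure of Eq. (1). A database is assumed to be
// represented by the map FIS(Di, alpha): frequent itemset -> support value.
public class SimilarityMeasure {

    // sim(Di, Dj, alpha) = (sum of min supports over shared frequent itemsets)
    //                      / (sum of max supports over the union of frequent itemsets).
    public static double sim(Map<String, Double> fisDi, Map<String, Double> fisDj) {
        Set<String> union = new HashSet<>(fisDi.keySet());
        union.addAll(fisDj.keySet());

        double numerator = 0.0;   // contribution of itemsets frequent in both databases
        double denominator = 0.0; // contribution of itemsets frequent in at least one database
        for (String itemset : union) {
            double si = fisDi.getOrDefault(itemset, 0.0);
            double sj = fisDj.getOrDefault(itemset, 0.0);
            if (si > 0.0 && sj > 0.0) {
                numerator += Math.min(si, sj);
            }
            denominator += Math.max(si, sj);
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // dist(Di, Dj, alpha) = 1 - sim(Di, Dj, alpha), as in Eq. (2).
    public static double dist(Map<String, Double> fisDi, Map<String, Double> fisDj) {
        return 1.0 - sim(fisDi, fisDj);
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = new HashMap<>();
        d1.put("{a}", 0.6); d1.put("{a,b}", 0.4);
        Map<String, Double> d2 = new HashMap<>();
        d2.put("{a}", 0.5); d2.put("{c}", 0.3);
        // sim = min(0.6, 0.5) / (max(0.6, 0.5) + 0.4 + 0.3) = 0.5 / 1.3
        System.out.println(sim(d1, d2));
    }
}
```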
3.2 Data Structures and Algorithms

The problem of generating a classification class(D,sim,δ) at a given similarity level δ depends mainly on the following sub-problems:

1) Find the list of edges with weight value equal to δ, denoted Eδ, and add it to the graph G.
2) Determine the connected components of G and check whether each of them forms a clique.

To solve the first problem, we need a dictionary data structure to efficiently search for duplicate weights. In our approach, we need to list the m edge lists Eδ in decreasing order of the weight δ. Therefore, we use a balanced binary search tree (BST), such as a red-black tree [14], to implement such an ordered dictionary structure. In the following subsection, we present a procedure to build the m-node BST where each node represents one edge list Eδ. Instead of using pointers to link edges with the same weight value, we use integers to reference the edges and link them into a contiguous table of size (n2–n)/2.

3.2.1 Finding the edge lists Eδ

Let T be the balanced BST object and T.root the root node. Each node in T contains two fields, weight and index, which represent the weight value δ and the index of the first edge in Eδ, respectively. Edges with the same weight value δ are linked into a contiguous table V[1..(n2–n)/2]. For example, let x be a node object in T; then V[x.index] contains the index of the second edge whose weight value is x.weight, V[V[x.index]] contains the index of the third one, and so on, until we find a negative value (–1), which represents the end of the list Eδ. In the following procedure, we present the algorithm used to construct the ordered edge lists Eδ.

DISCUSSION. Initially (at line 2), the tree T is empty. Then, for each database pair (Di,Dj) (at line 4), a binary search is performed at line 8 to see whether δ exists in T. If there is a node x in T such that x.weight = δ, then the current edge (i.e., the k-th edge) is inserted at the beginning of Eδ at lines 10–11. Otherwise, a new node, say y, is allocated and inserted into T at lines 13–16, such that y.weight is set to δ and y.index is set to k. The auxiliary procedure Tree_Rebalance is called to re-balance the tree after the insertion operation in order to keep the height proportional to the logarithm of the number of nodes in T. The implementation of Tree_Rebalance depends on the balance information and the algorithm used to re-balance the tree when certain properties are violated [14]. Let m be the number of nodes in T, such that m ∈ [1,(n2–n)/2]. Then, the search, insert and rebalance operations take O(log2(m)) time. There are (n2–n)/2 similarity values between the n databases. For each similarity value δ, a new node is inserted into T if there is no node containing the weight δ. Hence, the time complexity to build T is O(n2log2(m)).

The tree T and the index array V corresponding to the example above are shown in Fig. 2. Each node of T contains mainly 4 fields (the similarity value δ, the index of the first edge in the list Eδ, and the left and right child pointers). The table to the right represents the array V[1..(n2–n)/2] in which edges with the same weight are linked. Only the rightmost column (the index column) actually resides in main memory; the other columns are shown for illustration purposes only.

PROPOSITION. Let (Di,Dj) be the vertex pair corresponding to the k-th edge, for i=1 to n–1, j=i+1 to n and k=1 to (n2–n)/2. Then, the row index i and the column index j of the k-th edge are calculated as follows:
(7)
$$i = n - h, \quad \text{with} \quad h = \left\lceil \frac{\sqrt{8 \times (C_n^2 - k + 1) + 1} - 1}{2} \right\rceil$$

(8)
$$j = (i + 1) + \frac{h(h+1)}{2} - (C_n^2 - k + 1)$$

where ⌈val⌉ is the least integer greater than or equal to val and $C_n^2 = (n^2 - n)/2$.

Proof. Let M be the triangular similarity table of size n×n. Then, there are Cn2 similarity values in M, which are associated with Cn2 edges indexed from 1 to Cn2. The 1st row of M is associated with n–1 edges (D1,D2),(D1,D3),…,(D1,Dn), the 2nd row with n–2 edges (D2,D3),(D2,D4),…,(D2,Dn), and so on, until the last row, the (n–1)-th row, which is associated with one edge (Dn-1,Dn). Thus, the last h rows of M are associated with h(h+1)/2 edges. For a given index k, there are Cn2–k+1 edges counting from the k-th edge to the last edge, the Cn2-th edge. To find the row index i of the k-th edge, we have to subtract h from n, where h is the smallest integer that satisfies h(h+1)/2 ≥ Cn2–k+1. Solving this inequality gives (h+1/2)2 ≥ (8(Cn2–k+1)+1)/4, that is, h ≥ (√(8×(Cn2–k+1)+1)–1)/2. Therefore, i = n–h, with h = ⌈(√(8×(Cn2–k+1)+1)–1)/2⌉, which is Eq. (7). The first column in row i has index i+1. To find the column index j of the k-th edge, we add the offset h(h+1)/2–(Cn2–k+1) to the column index i+1, which gives Eq. (8).

3.2.2 The database classification algorithm

To solve the second problem, which is determining and maintaining the connected components of G, we use the disjoint-set forest data structure [14]. This data structure keeps a collection of n disjoint updatable trees, where each tree represents one connected component in the graph G=(D,E). The root node of each tree is used as a representative node to identify one connected component. Each node in the disjoint-set forest, say xi (i=1 to n), contains two main fields, size and parent, which represent, respectively, the size of the sub-tree rooted at xi and a pointer to the parent node. In this paper, instead of using pointers, we use integers to link child nodes to their parent nodes. For example, if xj is the parent of xi, then we set xi.parent=j. The data structure has two main operations, described as follows.

union(xi,xj): combines the trees identified by the roots xi and xj. That is, it links the root of the smaller tree to the root of the larger tree. After a union operation, the size of the resulting tree is set equal to the sum of the sizes of the input trees.

findset(xi): returns the root of xi's tree; that is, it follows parent nodes starting from xi's parent until reaching the root node. For each edge incident on the i-th vertex, findset is called on the node xi to find the connected component to which the i-th vertex belongs.

In this paper, we have modified the implementation of findset in order to identify cliques in G=(D,E). If xi or xi's parent (say xj) is the root, then findset(xi) returns the root node. Otherwise, xi is linked to xj's parent and findset(xi) outputs xj's parent, even if it is not the root.
Thus, for each edge incident on the i-th vertex, findset(xi) compresses the path between xi and the root when neither xi nor xi's parent is the root. Therefore, if xi's tree represents a connected component of k vertices in G and findset(xi)'s output is not the root, then it is clear that there is at least one missing edge between the i-th vertex and the remaining (k–1) vertices of the same component. Consequently, the connected component that contains the i-th vertex is not a clique in G. In the following, we present our classification algorithm, which takes as input the objects returned by the procedure BuildEdgeTree. The findset, union and makeset procedures are given too.

DISCUSSION. Initially, the graph G=(D,E) has n disjoint components. Therefore, we get a trivial classification class(D,sim)={{D1},{D2},…,{Dn}}. Lines 1–7 initialize the variables used in our algorithm. The inter-cluster distance (dist_inter) between the n components is calculated during the call of the procedure BuildEdgeTree. The intra-cluster similarity (sim_intra) has a zero value since there is one database in each cluster. We use the variable goodness_max to maintain the maximum value of goodness found so far. At line 8, the procedure makeset initializes the data structure by creating n rooted trees xi (i=1 to n), each of which contains one node (xi.size=1). To access the n nodes, we use an array A[1..n] such that A[i] holds the node object xi representing the i-th vertex in G. The time complexity of line 8 is O(n). At line 9, the for-loop examines each node e of the edge tree (T), taken in decreasing order of the field weight. There are m nodes in T, such that 1 ≤ m ≤ (n2–n)/2, and each node represents one edge list Eδ. Browsing all the m nodes in sorted order takes O(m) time using an in-order traversal of T. At line 12, the while-loop examines the current edge list Eδ such that δ = e.weight. Edges are listed starting from the first edge, whose index is e.index, until reaching a negative value (–1), which represents the end of the edge list. At line 13, we use Eqs. (7) and (8) to find the two end vertices (Di,Dj) corresponding to the k-th edge (i.e., the current edge). At lines 17–18, the findset procedure is called once for each end vertex of the k-th edge (Di,Dj) to find the connected components to which Di and Dj belong. The time complexity of lines 17–18 is O(1). Let xi and xj be the nodes returned by findset. At line 19, if xi and xj are two distinct roots, then the union procedure (at line 21) merges the corresponding trees by making the smaller tree a sub-tree of the larger tree. Otherwise, either the current classification is not complete or the two vertices Di and Dj already belong to the same component. At line 20, we update nb_clique_edges, which is the number of edges required so that each connected component of G=(D,E) becomes a clique. After each union operation, we decrement the number of components nb_component. At line 24, we select the next edge in the list Eδ from the index array V[1..(n2–n)/2]. At line 26, we test whether the current classification is complete. If the number of edges examined so far, nb_edges, is equal to nb_clique_edges, then the current classification is complete and the algorithm continues at line 29. Otherwise, the current classification is discarded. At lines 29–31, the current complete classification class(D,sim,δ) is assessed using the goodness measure proposed in [7].
If goodness(class(D,sim,δ)) ≥ goodness_max, then class(D,sim,δ) becomes the best classification generated so far and goodness_max is updated. During the course of our algorithm, we need to maintain the best classification generated so far as the disjoint-set forest is updated. To implement such a persistent data structure, additional fields (version and parent_temp) are associated with each node in the disjoint-set forest. These fields are required to maintain a past version of the dynamic set at the time the best classification was generated. Let x be a node in the disjoint-set forest. The field x.parent_temp is used to keep track of x's parent node corresponding to the best classification generated so far. The field x.version is used to determine the classification to which x's parent refers. In our algorithm, each classification class(D,sim,δ) is identified by a version number class_version. Let best_class_version be the version number of the best classification. Before each updating operation in the union/findset procedures, each node x is checked to see whether its current parent is part of the best classification produced so far. That is, if x.version ≤ best_class_version, then x's parent index is saved in the field x.parent_temp before the updating operation. At lines 36–41, we display the best classification of the n databases from the disjoint-set forest. Hence, we output the parent index of each node xi (1≤i≤n) at the time the best classification was generated. That is, if xi.version ≤ best_class_version, then xi.parent is returned as the cluster containing the i-th database. Otherwise, xi.parent_temp is returned instead.

The algorithm terminates once all the m edge lists Eδ have been examined and the m candidate classifications have been generated. The average size of each edge list is n2/m. Therefore, exploring all the m edge lists takes O(n2) time, where n is the number of databases. Therefore, the proposed algorithm finds the best complete classification of the set of n databases in O(n2) time. Our algorithm is optimal compared with BestDatabasePartition [7], which takes O(m×n2) time to find the best partition of the multiple databases.

The space complexity of our algorithm is estimated as follows. To store the m nodes of T, we need m×(2 pointers + 1 real + 1 integer) units. We have to add the space required to store the index array V, that is, (n2–n)/2 integer units. Also, the disjoint-set forest data structure needs n×(4 integers) units. Therefore, the memory space required by our algorithm is (2n2+14n+28m) bytes when using a typical compiler that represents a pointer, an integer and a real number using 8 bytes, 4 bytes and 8 bytes, respectively. In parallel, BestDatabasePartition needs n2 real units to store the similarity relation table. Assuming that BestDatabasePartition uses a balanced binary search tree to sort all the distinct similarity values, then m×(2 pointers + 1 real) units are required to store the nodes of the tree. Each candidate classification is stored separately. Hence, m×n integer units are required to store all the m candidate classifications. Therefore, the memory space required by BestDatabasePartition is (8n2+24m+4mn) bytes using the same compiler. For large values of n and m, we notice that the proposed algorithm consumes less memory space compared with BestDatabasePartition.
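The listings of BuildEdgeTree and of the classification algorithm referenced by line numbers in the discussions above are not reproduced here. As a rough illustration only, the following Java sketch approximates the construction of the ordered edge lists Eδ described in Section 3.2.1, using Java's TreeMap (itself a red-black tree) in place of a hand-written balanced BST; the class and field names are illustrative, not the authors' original identifiers.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Sketch of the ordered edge-list construction of Section 3.2.1 (procedure BuildEdgeTree).
// For each distinct similarity value delta, the map stores the index of the first edge of
// E_delta; the array V chains the remaining edges of equal weight, -1 marking the list end.
public class EdgeTreeBuilder {

    public final TreeMap<Double, Integer> tree =       // weight -> index of first edge of E_delta
            new TreeMap<>(Comparator.reverseOrder());   // kept in decreasing order of weight
    public final int[] V;                                // V[k] = next edge with the same weight

    // sim is assumed 1-based: sim[i][j] holds sim(Di, Dj) for 1 <= i < j <= n.
    public EdgeTreeBuilder(double[][] sim, int n) {
        V = new int[n * (n - 1) / 2 + 1];                // 1-based edge indices
        int k = 0;                                       // index of the edge (Di, Dj)
        for (int i = 1; i <= n - 1; i++) {
            for (int j = i + 1; j <= n; j++) {
                k++;
                double delta = sim[i][j];
                Integer first = tree.get(delta);         // binary search on the weight
                if (first == null) {
                    V[k] = -1;                           // new list: k is its only edge so far
                } else {
                    V[k] = first;                        // insert k at the head of E_delta
                }
                tree.put(delta, k);
            }
        }
    }
}
```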
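Similarly, the following sketch follows the discussion of Section 3.2.2: a disjoint-set forest with union by size, the modified findset, the edge-index mapping of Eqs. (7) and (8), and the completeness test of Eq. (6) via edge counting. The goodness computation and the version/parent_temp bookkeeping used to restore the best classification are omitted, so this is an assumption-laden sketch rather than the authors' exact algorithm; it consumes the EdgeTreeBuilder object from the previous sketch.

```java
import java.util.Map;

// Sketch of the classification procedure of Section 3.2.2 over the ordered edge lists.
public class GraphClassifier {

    private final int n;
    private final int[] parent;   // parent[i] = index of i's parent (parent[i] == i for a root)
    private final int[] size;     // size[r]   = number of vertices in the tree rooted at r

    public GraphClassifier(int n) {
        this.n = n;
        parent = new int[n + 1];
        size = new int[n + 1];
        for (int i = 1; i <= n; i++) { parent[i] = i; size[i] = 1; }   // makeset
    }

    // Modified findset: returns i's parent if i or its parent is the root; otherwise links i
    // to its grandparent and returns that node, even when it is not the root.
    private int findset(int i) {
        int p = parent[i];
        if (parent[p] == p) return p;       // i is the root, or i's parent is the root
        parent[i] = parent[p];              // compress one level
        return parent[i];                   // may be a non-root node (signals a missing edge)
    }

    // Union by size: link the root of the smaller tree to the root of the larger tree.
    private void union(int ri, int rj) {
        if (size[ri] < size[rj]) { int t = ri; ri = rj; rj = t; }
        parent[rj] = ri;
        size[ri] += size[rj];
    }

    // Row index i of the k-th edge, Eq. (7): i = n - h.
    private int edgeRow(int k) {
        int c = n * (n - 1) / 2;
        int h = (int) Math.ceil((Math.sqrt(8.0 * (c - k + 1) + 1) - 1) / 2);
        return n - h;
    }

    // Column index j of the k-th edge, Eq. (8).
    private int edgeCol(int k, int i) {
        int c = n * (n - 1) / 2;
        int h = n - i;
        return (i + 1) + h * (h + 1) / 2 - (c - k + 1);
    }

    // Processes the ordered edge lists and reports, for each similarity level, whether the
    // resulting classification is complete (Eq. (6)).
    public void classify(EdgeTreeBuilder edges) {
        long nbEdges = 0, nbCliqueEdges = 0;
        for (Map.Entry<Double, Integer> e : edges.tree.entrySet()) {   // decreasing weight
            double delta = e.getKey();
            for (int k = e.getValue(); k != -1; k = edges.V[k]) {      // walk the list E_delta
                nbEdges++;
                int i = edgeRow(k), j = edgeCol(k, i);
                int ri = findset(i), rj = findset(j);
                if (ri != rj && parent[ri] == ri && parent[rj] == rj) {
                    // additional edges a clique on the merged component would require
                    nbCliqueEdges += (long) size[ri] * size[rj];
                    union(ri, rj);
                }
            }
            boolean complete = (nbEdges == nbCliqueEdges);             // Eq. (6)
            System.out.println("delta = " + delta + ", complete = " + complete);
            // A complete classification would then be scored with the goodness measure, Eq. (5).
        }
    }
}
```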
4. Experiments

To assess the performance of our proposed approach, we conducted some experiments to analyze and compare the execution time of the existing algorithm BestDatabasePartition [7] and our proposed graph-based classification algorithm. All the experiments were run on a 2.70-GHz Pentium processor with 2 GB of memory, using Java Edition 6. We carried out the experiments using two synthetic datasets, T10I4D100K and T40I10D100K, available at http://fimi.ua.ac.be/data/. Some characteristics of these datasets are presented in Table 3. We denote by NT, ALT, AFI, NI and DB the number of transactions, the average length of a transaction, the average frequency of an item, the number of items and the database name, respectively. For multi-database mining, each of the datasets is divided horizontally into 10 and 20 databases. The multiple databases obtained from T10I4D100K and T40I10D100K are referred to as T1,1, T1,2, …, T1,n and T4,1, T4,2, …, T4,n respectively, such that n=10, 20.

Table 3. Characteristics of the datasets
By varying the value of the minimum support threshold (α), we get different sets of frequent itemsets using the FP-Growth algorithm [13], as shown in Table 4. We denote by {FIS(Ti,1,α), FIS(Ti,2,α),…, FIS(Ti,n,α)} the sets of frequent itemsets reported from the n multiple databases Ti,1, Ti,2, …, Ti,n under α, such that i=1,4. Once the data preparation is done, we classify the multiple databases using our algorithm and BestDatabasePartition [7]. Since it is difficult to measure the performance of an algorithm executed by the Java Virtual Machine, due to the optimization that occurs during execution, each algorithm is executed 1000 times on the same multiple databases to get the average running time.

In the first part of our experiments, we analyzed the impact of the minimum support threshold (α) on the classification execution time. Therefore, we set the number of databases n=10 and varied the value of α to get different sets of frequent itemsets. Then, we executed our algorithm and BestDatabasePartition on the same frequent itemsets. The classification results and the execution times are presented in Table 5 and Fig. 3, respectively. From the results, we can notice that the execution time is relatively constant for both algorithms when the value of α increases. The reason is that n and the number of candidate classifications (m=45) did not change during the experiments, as shown in Table 5. Moreover, we did not take into account the overhead of calculating the similarities between the n databases, since the same similarity measure proposed in [7] was used for the two algorithms. However, we observe that the execution time of our algorithm is shorter than that of BestDatabasePartition for α=0.03 to 0.06.

Table 4. Frequent itemsets reported from the multiple databases under different values of α
Table 5. Classification results for n=10 under different values of α
Table 6. Experimental results of the two algorithms for α=0.03 when n varies from 3 to 20
In the second part of our experiments, we studied how the number of databases n and the number of candidate classifications m influence the time complexity of the two algorithms. Hence, we set the value of α=0.03 and varied n from 3 to 20 databases. The experimental results obtained by the two algorithms are presented in Table 6 and Fig. 4. From the results, we can see that the time complexity and the execution time tend to increase as the value of n increases. However, we notice that the execution time increases rapidly for BestDatabasePartition. The reason is that m becomes larger as the value of n increases, as shown in Table 6. Since the time complexity of BestDatabasePartition depends strongly on m, we note a rapid increase in BestDatabasePartition's execution time for n=3 to 20.

From the experimental results above, we observe that our algorithm and BestDatabasePartition [7] find the same best classification of the n multiple databases. The reason is that the same similarity and goodness measures proposed in [7] were used for both algorithms. However, the execution time of our algorithm is the shortest. This is because, unlike the existing classification algorithms [5-8], the time complexity of the classification generation procedure in our algorithm depends only on the number of databases (n). In the worst case, (n2–n)/2 edges are processed to find the best classification of a set of n multiple databases.

5. Conclusion

In this paper, we have proposed an improved database classification algorithm for multi-database mining. Different from the existing works, our algorithm searches for the best classification of a database set D by analyzing and determining the connected components of an undirected weighted graph G=(D,E). The experiments have confirmed the accuracy and efficiency of our approach, and the execution time of our algorithm is the shortest. Future work will be directed toward improving the accuracy of the similarity measure between the databases. Also, we will assess our algorithm using real-world datasets and validate the tests on a real multi-database system.

Biography

Salim Khiat
https://orcid.org/0000-0002-8075-4536
He received a Ph.D. degree in computer science from the University of Science and Technology–Mohamed Boudiaf Oran (USTOMB), Algeria, in 2015. He is a member of the Signal, Systems and Data Laboratory (LSSD). His research interests include databases, multi-database mining, ontology, grid and cloud computing.

References