3. Proposed Approach and Algorithms
In this section, we describe the proposed approach for classifying multiple databases. We assume that each database has already been mined using algorithms for frequent itemset discovery [11-13].
Let D={D1,D2,…,Dn} be the set of n database objects to classify. The problem of generating a classification class(D,sim,δ) at a similarity level δ can then be described in terms of determining the connected components of an undirected weighted graph G=(D,E).
We consider the database set D={D1,D2,…,Dn} as the vertex set of G and E as the edge set. In the following, the terms database and vertex are used interchangeably. The weight of an edge (Di,Dj) ∈ E is the similarity value sim(Di,Dj) between the corresponding databases. At a given similarity level δ, there is an edge connecting two vertices Di and Dj if and only if sim(Di,Dj) ≥ δ.
Initially, the graph G=(D,E) is empty with n disconnected vertices, i.e., E=∅. Then, the similarity is calculated between each vertex pair (Di,Dj), for i,j=1 to n, using an appropriate similarity measure. After that, edges are added to E starting with the vertex pairs having the highest similarity, as follows.
Fig. 1. Proposed approach for classifying the n multiple databases.
Let δ1,δ2,…,δm be the m distinct similarity values between the n databases, sorted in decreasing order (i.e., δ1>δ2>…>δm), and let Eδl={(Di,Dj) ∈ E : sim(Di,Dj)=δl, i,j=1 to n, i≠j} be the list of edges with weight equal to δl (l=1 to m).
For each distinct similarity value δl, a candidate classification class(D,sim,δl) is generated by adding the edge list Eδl to the edge set E=Eδ1∪Eδ2∪…∪Eδl-1 (i.e., l–1 classifications have been generated earlier). Then, it remains to identify the connected components of G in order to discover the database clusters in class(D,sim,δl).
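As an illustrative Python sketch of this generation scheme (the names candidate_classifications and connected_components are ours, and sim is assumed to be a 0-based symmetric matrix of similarity values):

from collections import defaultdict

def connected_components(n, edges):
    """Return the clusters of an n-vertex graph as lists of vertex indices."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    comps = defaultdict(list)
    for v in range(n):
        comps[find(v)].append(v)
    return list(comps.values())

def candidate_classifications(n, sim):
    """Yield (delta_l, class(D,sim,delta_l)) for each distinct similarity value."""
    groups = defaultdict(list)               # delta -> edge list E_delta
    for i in range(n):
        for j in range(i + 1, n):
            groups[sim[i][j]].append((i, j))
    E = []
    for delta in sorted(groups, reverse=True):   # delta_1 > delta_2 > ... > delta_m
        E.extend(groups[delta])                  # E = E_delta1 ∪ ... ∪ E_deltal
        yield delta, connected_components(n, E)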
At a given similarity level δl, if each connected component of G, denoted gδlk, forms a clique (i.e., a subset of vertices, each pair of which is connected by an edge in E), then the corresponding classification class(D,sim,δl)={gδl1,gδl2,…,gδlk} is called a complete classification. In this paper, we are interested in finding such classifications.
For classification assessment, the authors in [7] have proposed a goodness measure based on the intra-cluster similarity, the inter-cluster distance, and the number of clusters. The complete classification with the maximum goodness value is selected as the best classification of the database set. Our classification approach is depicted in Fig. 1.
In the following section we give some definitions required to understand the proposed approach.
3.1 The Relevant Concepts
Referring to the relevant concepts proposed by Adhikari and Rao [7], we give the following definitions.
DEFINITION 1. The similarity measure sim between two databases Di and Dj is calculated as follows:
[TeX:] $$sim ( D _ { i } , D _ { j } , \alpha ) = \frac { \sum _ { I \in FIS ( D _ { i } , \alpha ) \cap FIS ( D _ { j } , \alpha ) } \min \{ supp ( I , D _ { i } ) , supp ( I , D _ { j } ) \} } { \sum _ { I \in FIS ( D _ { i } , \alpha ) \cup FIS ( D _ { j } , \alpha ) } \max \{ supp ( I , D _ { i } ) , supp ( I , D _ { j } ) \} }$$
where α is the minimum support threshold, I ∈ (FIS(Di,α) ∩ FIS(Dj,α)) denotes that I is a frequent itemset in both Di and Dj, and supp(I,Di) is the support of I in database Di. The similarity measure sim has the following properties.
PROPERTY 1. sim(Di, Dj, α) satisfies: (1) 0 ≤ sim(Di, Dj, α) ≤ 1; (2) sim(Di, Dj, α) = sim(Dj, Di, α); (3) sim(Di, Di, α) = 1, for i,j=1…n.
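A minimal Python sketch of Definition 1, assuming each FIS(D,α) is materialized as a dict mapping a frequent itemset (a frozenset) to its support (this representation is ours, chosen for illustration):

def sim(fis_i, fis_j):
    """Similarity between two databases summarized by their frequent itemsets.

    fis_i, fis_j: dict mapping itemset -> supp(I, D), i.e. FIS(D, alpha).
    """
    common = fis_i.keys() & fis_j.keys()
    either = fis_i.keys() | fis_j.keys()
    num = sum(min(fis_i[I], fis_j[I]) for I in common)
    den = sum(max(fis_i.get(I, 0.0), fis_j.get(I, 0.0)) for I in either)
    return num / den if den else 0.0    # no frequent itemsets at all -> similarity 0

For example, sim({frozenset('a'): 0.4}, {frozenset('a'): 0.5}) evaluates to 0.4/0.5 = 0.8.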
DEFINITION 2. The distance measure dist between databases Di and Dj is defined as follows:
[TeX:] $$dist ( D _ { i } , D _ { j } , \alpha ) = 1 - sim ( D _ { i } , D _ { j } , \alpha )$$
The distance function dist has the following properties.
PROPERTY 2. dist(Di, Dj, α) satisfies: (1) 0 ≤ dist(Di, Dj, α) ≤ 1; (2) dist(Di, Dj, α) = dist(Dj, Di, α); (3) dist(Di, Di, α) = 0, for i,j=1…n.
DEFINITION 3. Let class(D,sim,δ)={gδ1, gδ2,…,gδk} be a classification of the database set D={D1,D2,...,Dn} into k clusters at the similarity level δ. The intra-cluster similarity of class(D,sim,δ) is defined as follows:
[TeX:] $$intra\text{-}sim ( class ( D , sim , \delta ) ) = \sum _ { i = 1 } ^ { k } \sum _ { D _ { p } , D _ { q } \in g _ { i } ^ { \delta } , p < q } sim ( D _ { p } , D _ { q } , \alpha )$$
DEFINITION 4. The inter-cluster distance of class(D,sim,δ) at the similarity level δ is defined as follows:
[TeX:] $$inter\text{-}dist ( class ( D , sim , \delta ) ) = \sum _ { i < j } \sum _ { D _ { p } \in g _ { i } ^ { \delta } , D _ { q } \in g _ { j } ^ { \delta } } dist ( D _ { p } , D _ { q } , \alpha )$$
The best classification is selected by maximizing both the intra-cluster similarity and the inter-cluster distance. Hence, we have the following definition of the goodness measure.
DEFINITION 5. The goodness of class(D,sim,δ) is defined as follows:
[TeX:] $$goodness ( class ( D , sim , \delta ) ) = intra\text{-}sim ( class ( D , sim , \delta ) ) + inter\text{-}dist ( class ( D , sim , \delta ) ) - k$$
where k is the number of clusters in class(D,sim,δ).
The complete classification that attains the maximum value of goodness over all similarity levels is selected as the best classification of the database set.
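A minimal Python sketch of the assessment under Definitions 3–5 as given above (partition is a list of clusters of vertex indices, S a symmetric similarity matrix; the function name is ours):

def goodness(partition, S):
    """goodness = intra-cluster similarity + inter-cluster distance - #clusters."""
    intra = sum(S[i][j]                      # pairs inside the same cluster
                for g in partition
                for a, i in enumerate(g)
                for j in g[a + 1:])
    inter = sum(1.0 - S[i][j]                # dist = 1 - sim across clusters
                for p, gp in enumerate(partition)
                for gq in partition[p + 1:]
                for i in gp for j in gq)
    return intra + inter - len(partition)

For the trivial classification into n singleton clusters, intra-sim is 0, so goodness reduces to the total pairwise distance minus n.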
DEFINITION 6. Let G=(D,E) be the graph representing the database set D. The vertex set D={D1,D2,…,Dn} is the set of nodes of G and E ⊆ {(Di, Dj) : 1≤i<j≤n} is the set of pairs (Di, Dj) representing the links between vertices in G.
DEFINITION 7. At a similarity level δ, a classification class(D,sim,δ)={gδ1, gδ2,…,gδk} is complete if the following equation is verified:
[TeX:] $$\sum _ { i = 1 } ^ { k } \frac { \left| g _ { i } ^ { \delta } \right| ^ { 2 } - \left| g _ { i } ^ { \delta } \right| } { 2 } = | E |$$
where |gδi| denotes the number of vertices in the connected component gδi and |E| is the number of edges in the graph G. In other words, if each class gδi (i=1 to k) in class(D,sim,δ) is a clique in G, then class(D,sim,δ) is a complete classification.
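Since a cluster g is a clique exactly when it contributes |g|(|g|–1)/2 edges, the completeness test of Definition 7 amounts to a single comparison; a minimal sketch:

def is_complete(partition, num_edges):
    """True iff the clusters' clique-edge counts sum to |E|."""
    return sum(len(g) * (len(g) - 1) // 2 for g in partition) == num_edges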
The proposed approach is now illustrated step by step with the help of an example.
EXAMPLE. Let D={D1, D2,…, D6} be the database set to classify. For illustration purposes, the similarity values between the six databases are given in Table 1.
Table 1. The similarity relation table for the six databases of the example
From the upper triangle of the similarity table, there are five distinct similarity values. Consequently, there are five lists of edges with the same weight. Sorted in decreasing order, starting with the highest similarity, these five values are (0.79, 0.56, 0.50, 0.45, 0.40).
Let G=(D,E). Initially, |D|=6 and E=∅. Therefore, a trivial complete classification is obtained, class(D)={{D1},{D2},{D3},{D4},{D5},{D6}}, with goodness(class(D))=2.20.
For each similarity value δ, taken in decreasing order, a candidate classification class(D,sim,δ) is generated and checked to see whether it is a complete classification, as follows.
For δ=0.79, we have E0.79={(D2, D4)} and E=E ∪ E0.79.
class(D,sim,0.79)={{D2,D4},{D1},{D3},{D5},{D6}}, which is a complete classification because:
[TeX:] $$\sum _ { i = 1 } ^ { k } \frac { \left| g _ { i } ^ { 0.79 } \right| ^ { 2 } - \left| g _ { i } ^ { 0.79 } \right| } { 2 } = \frac { | \{ \mathrm { D } 2 , \mathrm { D } 4 \} | ^ { 2 } - | \{ \mathrm { D } 2 , \mathrm { D } 4 \} | } { 2 } + \frac { | \{ \mathrm { D } 1 \} | ^ { 2 } - | \{ \mathrm { D } 1 \} | } { 2 } + \frac { | \{ \mathrm { D } 3 \} | ^ { 2 } - | \{ \mathrm { D } 3 \} | } { 2 } + \frac { | \{ \mathrm { D } 5 \} | ^ { 2 } - | \{ \mathrm { D } 5 \} | } { 2 } + \frac { | \{ \mathrm { D } 6 \} | ^ { 2 } - | \{ \mathrm { D } 6 \} | } { 2 } = 1 = | E |$$ with goodness(class(D,sim,0.79)) = 3.78.
The remaining classifications are summarized in Table 2.
The only candidate classification that is not complete is the one at δ=0.50, because |E|=3 while [TeX:] $$\sum _ { i = 1 } ^ { k } \frac { \left| g _ { i } ^ { 0.50 } \right| ^ { 2 } - \left| g _ { i } ^ { 0.50 } \right| } { 2 } = 4 \neq | E |$$
According to the results in Table 2, the best complete classification of the six databases, i.e., the one with the maximum value of goodness, is class(D,sim,0.45)={{D2, D4, D6},{D1, D3, D5}}.
In the following section, we describe the data structures and algorithms used in the proposed classification approach.
Table 2. Candidate classifications generated from the similarity matrix in Table 1
3.2 Data Structures and Algorithms
The problem of generating a classification class(D,sim,δ) at a given similarity level δ depends mainly on the following sub-problems:
1) Find the list of edges with weight value equal to δ, denoted Eδ, and add it to the graph G.
2) Determine the connected components of G and check whether each of them forms a clique.
To solve the first sub-problem, we need a dictionary data structure to efficiently detect duplicate weights. In our approach, we must enumerate the m edge lists Eδ in decreasing order of the weight δ. Therefore, we use a balanced binary search tree (BST), such as a red-black tree [14], to implement such an ordered dictionary.
In the following subsection, we present a procedure to build the m-node BST in which each node represents one edge list Eδ. Instead of using pointers to link edges with the same weight, we use integers to reference the edges and link them into a contiguous table of size (n²–n)/2.
3.2.1 Finding the edge lists Eδ
Let T be the balanced BST object and T.root its root node. Each node in T contains two fields, weight and index, which represent the weight value δ and the index of the first edge in Eδ, respectively. Edges with the same weight value δ are linked into a contiguous table V[1..(n²–n)/2]. For example, let x be a node object in T; then V[x.index] contains the index of the second edge whose weight value is x.weight, V[V[x.index]] contains the index of the third one, and so on, until we find a negative value (–1), which marks the end of the list Eδ.
In the following procedure, we present the algorithm used to construct the ordered edge lists Eδ.
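As a compact stand-in for that pseudocode listing (the line numbers in the discussion below refer to the original listing, not to this sketch), a Python version in which a plain dict plays the role of the red-black tree:

def build_edge_tree(n, sim):
    """Sketch of BuildEdgeTree: group the (n^2-n)/2 edges by weight.

    Returns (heads, V): heads maps each distinct weight delta to the
    index of the first edge in E_delta, and V[k] holds the index of the
    next edge with the same weight as edge k, or -1 at the end of E_delta.
    sim is assumed 1-based, i.e. sim[i][j] for 1 <= i < j <= n.
    """
    V = [-1] * ((n * n - n) // 2 + 1)    # 1-based edge indices, as in the paper
    heads = {}                           # ordered dictionary: delta -> first edge index
    k = 0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            k += 1                       # the k-th edge is (D_i, D_j)
            delta = sim[i][j]
            if delta in heads:           # link edge k at the head of E_delta
                V[k] = heads[delta]
            heads[delta] = k
    return heads, V

Iterating for delta in sorted(heads, reverse=True) then plays the role of the in-order traversal of T.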
DISCUSSION. Initially (at line 2), the tree T is empty. Then, for each database pair (Di,Dj) (at line 4), a binary search is performed at line 8 to see whether δ exists in T. If there is a node x in T such that x.weight = δ, then the current edge (i.e., the k-th edge) is inserted at the beginning of Eδ at lines 10–11. Otherwise, a new node, say y, is allocated and inserted into T at lines 13–16, such that y.weight is set to δ and y.index is set to k. The auxiliary procedure Tree_Rebalance is called to re-balance the tree after the insertion operation, in order to keep the height proportional to the logarithm of the number of nodes in T. The implementation of Tree_Rebalance depends on the balance information and the algorithm used to re-balance the tree when certain properties are violated [14].
Let m be the number of nodes in T, with m ∈ [1,(n²–n)/2]. The search, insert, and rebalance operations each take O(log2(m)) time. There are (n²–n)/2 similarity values between the n databases, and for each similarity value δ a new node is inserted into T only if no node already contains the weight δ. Hence, the time complexity of building T is O(n²log2(m)).
The tree T and the index array V corresponding to the example above are shown in Fig. 2.
Fig. 2. The balanced binary search tree T and the index array V (the index column) of the example.
Each node of T contains four fields (the similarity value δ, the index of the first edge in the list Eδ, and the left and right child pointers). The table on the right represents the array V[1..(n²–n)/2], in which edges with the same weight are linked. Only the rightmost column (the index column) actually resides in main memory; the other columns are shown for illustration purposes only.
PROPOSITION. Let (Di,Dj) be the vertex pair corresponding to the k-th edge, for i=1 to n–1, j=i+1 to n and k=1 to (n²–n)/2. Then, the row index i and the column index j of the k-th edge are calculated as follows:
[TeX:] $$i = n - h , \text { where } h = \left\lceil \left( \sqrt { 8 \left( C _ { n } ^ { 2 } - k + 1 \right) + 1 } - 1 \right) / 2 \right\rceil \quad (7)$$
[TeX:] $$j = ( i + 1 ) + h ( h + 1 ) / 2 - \left( C _ { n } ^ { 2 } - k + 1 \right) \quad (8)$$
where ⌈val⌉ denotes the least integer greater than or equal to val and [TeX:] $$C _ { n } ^ { 2 } = \left( n ^ { 2 } - n \right) / 2$$.
Proof. Let M be the triangular similarity table of size n×n. There are Cₙ² similarity values in M, associated with Cₙ² edges indexed from 1 to Cₙ². The 1st row of M is associated with n–1 edges (D1,D2),(D1,D3),…,(D1,Dn), the 2nd row with n–2 edges (D2,D3),(D2,D4),…,(D2,Dn), and so on until the last row, the (n–1)-th, which is associated with the single edge (Dn–1,Dn). Thus, the last h rows of M are associated with h(h+1)/2 edges. For a given index k, there are Cₙ²–k+1 edges counting from the k-th edge to the last one, the Cₙ²-th edge. To find the row index i of the k-th edge, we subtract from n the smallest integer h that verifies [TeX:] $$h ( h + 1 ) / 2 \geq C _ { n } ^ { 2 } - k + 1$$. Solving the inequality gives [TeX:] $$( h + 1 / 2 ) ^ { 2 } \geq \left( 8 \left( C _ { n } ^ { 2 } - k + 1 \right) + 1 \right) / 4 , \quad h \geq \left( \sqrt { 8 \left( C _ { n } ^ { 2 } - k + 1 \right) + 1 } - 1 \right) / 2$$. Therefore, i=n–h with [TeX:] $$h = \left\lceil \left( \sqrt { 8 \left( C _ { n } ^ { 2 } - k + 1 \right) + 1 } - 1 \right) / 2 \right\rceil$$. The first column of row i has index i+1. To find the column index j of the k-th edge, we add the offset [TeX:] $$h ( h + 1 ) / 2 - \left( C _ { n } ^ { 2 } - k + 1 \right)$$ to the column index (i+1).
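A direct Python transcription of Eqs. (7) and (8), together with a brute-force cross-check of the mapping against the row-major enumeration of the upper triangle (the function name is ours):

import math

def edge_endpoints(k, n):
    """Endpoints (i, j) of the k-th edge, k = 1 .. (n^2 - n)/2."""
    c = (n * n - n) // 2                 # C(n, 2)
    r = c - k + 1                        # edges from the k-th to the last one
    h = math.ceil((math.sqrt(8 * r + 1) - 1) / 2)
    i = n - h
    j = (i + 1) + (h * (h + 1) // 2 - r)
    return i, j

# Cross-check for n = 6 (the running example).
n = 6
expected = [(i, j) for i in range(1, n) for j in range(i + 1, n + 1)]
assert [edge_endpoints(k, n) for k in range(1, len(expected) + 1)] == expected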
3.2.2 The database classification algorithm
To solve the second sub-problem, namely determining and maintaining the connected components of G, we use the disjoint-set forest data structure [14]. This data structure keeps a collection of n disjoint updatable trees, where each tree represents one connected component of the graph G=(D,E). The root node of each tree serves as the representative node identifying that connected component. Each node in the disjoint-set forest, say xi (i=1 to n), contains two main fields, size and parent, which represent, respectively, the size of the sub-tree rooted at xi and a pointer to the parent node. In this paper, instead of using pointers, we use integers to link child nodes to their parent nodes. For example, if xj is the parent of xi, then we set xi.parent=j. The data structure has two main operations, described as follows.
union(xi, xj): combines the trees identified by the roots xi and xj; that is, it links the root of the smaller tree to the root of the larger tree. After a union operation, the size of the resulting tree is set to the sum of the sizes of the two input trees.
findset(xi): returns the root of xi's tree; that is, it follows parent links starting from xi's parent until reaching the root node. For each edge incident on the i-th vertex, findset is called on the node xi to find the connected component to which the i-th vertex belongs.
In this paper, we have modified findset's implementation in order to identify cliques in G=(D,E). If xi or xi's parent (say xj) is the root, then findset(xi) returns the root node. Otherwise, xi is linked to xj's parent and findset(xi) outputs xj's parent even if it is not the root. Thus, for each edge incident on the i-th vertex, findset(xi) compresses the path between xi and the root whenever neither xi nor xi's parent is the root. Therefore, if xi's tree represents a connected component of k vertices in G and findset(xi)'s output is not the root, then there is at least one missing edge between the i-th vertex and the remaining (k–1) vertices of the same component. Consequently, the connected component containing the i-th vertex is not a clique in G.
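A Python sketch of the forest with the modified findset as described above (integer parent links, size maintained at the roots; the Node layout is our rendering of the fields named in the text):

class Node:
    __slots__ = ("parent", "size")
    def __init__(self, i):
        self.parent = i                  # a root is its own parent
        self.size = 1

def makeset(n):
    """Create n singleton trees; A[i] represents the i-th vertex (1-based)."""
    return {i: Node(i) for i in range(1, n + 1)}

def findset(A, i):
    """Modified findset: follow at most two parent links.

    Returns the root if x_i or its parent is the root; otherwise links
    x_i to its grandparent and returns it, even if it is not the root.
    """
    p = A[i].parent
    if p == i or A[p].parent == p:
        return p
    A[i].parent = A[p].parent            # compress one step toward the root
    return A[i].parent

def union(A, ri, rj):
    """Link the root of the smaller tree under the root of the larger tree."""
    if A[ri].size < A[rj].size:
        ri, rj = rj, ri
    A[rj].parent = ri
    A[ri].size += A[rj].size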
In the following, we present our classification algorithm, which takes as input the objects returned by the procedure BuildEdgeTree. The findset, union, and makeset procedures are given as well.
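The listing itself is discussed line by line below; as a compact stand-in, the following Python sketch of the main loop reuses build_edge_tree, edge_endpoints, makeset, findset, union, and goodness from the sketches above (with a symmetric, 1-based sim matrix). It is deliberately simplified: goodness is recomputed from scratch for each complete candidate and the best partition is snapshotted by copying, whereas the algorithm described below updates both incrementally (see the version mechanism at the end of this section):

def classify(n, sim):
    """Return (best goodness, best complete classification)."""
    heads, V = build_edge_tree(n, sim)
    A = makeset(n)
    nb_edges = 0
    nb_clique_edges = 0                  # edges needed so every component is a clique
    trivial = [[i] for i in range(1, n + 1)]
    best = (goodness(trivial, sim), trivial)
    for delta in sorted(heads, reverse=True):
        k = heads[delta]
        while k != -1:                   # walk the edge list E_delta through V
            i, j = edge_endpoints(k, n)
            nb_edges += 1
            ri, rj = findset(A, i), findset(A, j)
            if ri != rj and A[ri].parent == ri and A[rj].parent == rj:
                # merging components of sizes s and t raises the number of
                # edges a clique would require by s * t
                nb_clique_edges += A[ri].size * A[rj].size
                union(A, ri, rj)
            k = V[k]
        if nb_edges == nb_clique_edges:  # complete classification
            partition = components(A, n)
            score = goodness(partition, sim)
            if score >= best[0]:
                best = (score, partition)
    return best

def components(A, n):
    """Group the vertices 1..n by the root of their tree."""
    groups = {}
    for i in range(1, n + 1):
        r = i
        while A[r].parent != r:          # walk up to the root
            r = A[r].parent
        groups.setdefault(r, []).append(i)
    return list(groups.values())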
Discussion. Initially, the graph G=(D,E) has n disjoint components. Therefore, we get the trivial classification class(D,sim)={{D1},{D2},…,{Dn}}. Lines 1–7 initialize the variables used in our algorithm. The inter-cluster distance (dist_inter) between the n components is calculated during the call to the procedure BuildEdgeTree. The intra-cluster similarity (sim_intra) has a zero value, since each cluster contains a single database. We use the variable goodness_max to maintain the maximum goodness value found so far.
At line 8, the procedure makeset initializes the data structure by creating n rooted trees xi (i=1 to n), each of which contains one node (xi.size=1). To access the n nodes, we use an array A[1..n] such that A[i] holds the node object xi representing the i-th vertex in G. The time complexity of line 8 is O(n).
At line 9, the for-loop examines each node e of the edge tree T, taken in decreasing order of the weight field. There are m nodes in T, with 1 ≤ m ≤ (n²–n)/2, and each node represents one edge list Eδ. Visiting all m nodes in sorted order takes O(m) time using an in-order traversal of T.
At line 12, the while-loop examines the current edge list Eδ, where δ = e.weight. Edges are listed starting from the first edge, whose index is e.index, until a negative value (–1) is reached, which marks the end of the edge list. At line 13, we use Eqs. (7) and (8) to find the two end vertices (Di,Dj) of the k-th edge (i.e., the current edge).
At lines 17–18, the findset procedure is called once for each end vertex of the k-th edge (Di,Dj) to find the connected components to which Di and Dj belong. Since the modified findset follows at most two parent links, the time complexity of lines 17–18 is O(1).
Let xi and xj be the nodes returned by findset. At line 19, if xi and xj are two distinct roots, then the union procedure (at line 21) will merge the corresponding trees by making the smaller tree a sub-tree of the larger tree. Otherwise, either the current classification is not complete or the two vertices Di and Dj already belong to the same component.
At line 20, we update nb_clique_edges, the number of edges required for each connected component of G=(D,E) to form a clique. After each union operation, we decrement the number of components nb_component.
At line 24, we select the next edge of the list Eδ from the index array V[1..(n²–n)/2].
At line 26, we test whether the current classification is complete. If the number of edges examined so far, nb_edges, is equal to nb_clique_edges, then the current classification is complete and the algorithm continues at line 29. Otherwise, the current classification is discarded.
At lines 29–31, the current complete classification class(D,sim,δ) is assessed using the goodness measure proposed in [7]. If goodness(class,δ) ≥ goodness_max, then class(D,sim,δ) becomes the best classification generated so far and goodness_max is updated.
During the course of our algorithm, we need to maintain the best classification generated so far while the disjoint-set forest is updated. To implement such a persistent data structure, two additional fields (version and parent_temp) are associated with each node of the disjoint-set forest. These fields preserve the past version of the dynamic set at the time the best classification was generated. Let x be a node in the disjoint-set forest. The field x.parent_temp keeps track of x's parent node in the best classification generated so far. The field x.version determines the classification to which x's parent refers.
In our algorithm, each classification class(D,sim,δ) is identified by a version number class_version. Let best_class_version be the version number of the best classification. Before each update operation in the union/findset procedures, each node x is checked to see whether its current parent is part of the best classification produced so far; that is, if x.version ≤ best_class_version, then x's parent index is saved in the field x.parent_temp before the update.
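A sketch of that version check, as a guard executed before any parent update in union/findset (VNode is our rendering of the node extended with the two extra fields; class_version and best_class_version are maintained by the main loop):

class VNode:
    """Disjoint-set node extended with the persistence fields described above."""
    __slots__ = ("parent", "size", "version", "parent_temp")
    def __init__(self, i):
        self.parent = i
        self.size = 1
        self.version = 0                 # classification that last set .parent
        self.parent_temp = i             # parent in the best classification so far

def set_parent(x, new_parent, class_version, best_class_version):
    """Overwrite x.parent, preserving the parent it had in the best classification."""
    if x.version <= best_class_version:
        x.parent_temp = x.parent         # this parent belongs to the best classification
    x.parent = new_parent
    x.version = class_version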
At lines 36–41, we output the best classification of the n databases from the disjoint-set forest. For each node xi (1≤i≤n), we output the parent index it had at the time the best classification was generated: if xi.version ≤ best_class_version, then xi.parent is returned as the cluster containing the i-th database; otherwise, xi.parent_temp is returned instead.
The algorithm terminates once all m edge lists Eδ have been examined and the m candidate classifications have been generated. The average size of an edge list is about n²/m edges. Therefore, exploring all m edge lists takes O(n²) time, where n is the number of databases, and the proposed algorithm finds the best complete classification of the n databases in O(n²) time. Our algorithm is thus asymptotically faster than BestDatabasePartition [7], which takes O(m×n²) time to find the best partition of the multiple databases.
The space complexity of our algorithm is estimated as follows. Storing the m nodes of T requires m×(2 pointers + 1 real + 1 integer) units. To this we add the space required for the index array V, that is, (n²–n)/2 integer units. The disjoint-set forest data structure needs a further n×(4 integers) units. Therefore, the memory space required by our algorithm is (2n²+14n+28m) bytes on a typical compiler that represents a pointer, an integer, and a real number using 8 bytes, 4 bytes, and 8 bytes, respectively.
In comparison, BestDatabasePartition needs n² real units to store the similarity relation table. Assuming that BestDatabasePartition uses a balanced binary search tree to sort the distinct similarity values, m×(2 pointers + 1 real) units are required to store the nodes of the tree. Each candidate classification is stored separately, so m×n integer units are required to store all m candidate classifications. Therefore, the memory space required by BestDatabasePartition is (8n²+24m+4mn) bytes using the same compiler. For large values of n and m, the proposed algorithm clearly consumes less memory than BestDatabasePartition.