## Salim Miloudi, Sid Ahmed Rahal and Salim Khiat

| sim | D_{1} | D_{2} | D_{3} | D_{4} | D_{5} | D_{6} |
|---|---|---|---|---|---|---|
| D_{1} | 1 | 0.40 | 0.56 | 0.40 | 0.50 | 0.40 |
| D_{2} | - | 1 | 0.40 | 0.79 | 0.40 | 0.45 |
| D_{3} | - | - | 1 | 0.40 | 0.45 | 0.40 |
| D_{4} | - | - | - | 1 | 0.40 | 0.45 |
| D_{5} | - | - | - | - | 1 | 0.40 |
| D_{6} | - | - | - | - | - | 1 |

From the upper triangle of the similarity table, there are five distinct similarity values, and hence five lists of edges sharing the same weight. Sorted in decreasing order, these values are (0.79, 0.56, 0.50, 0.45, 0.40).
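As a small illustration of this first step, the sketch below (in Java, the language used for the paper's experiments; the class name `DistinctWeights` is ours) collects the distinct weights of the upper triangle and returns them in decreasing order:

```java
import java.util.Collections;
import java.util.TreeSet;

public class DistinctWeights {
    // Collect the distinct similarity values of the upper triangle of
    // the n-by-n similarity table and return them sorted in decreasing
    // order (the diagonal is excluded, since sim(Di, Di) = 1).
    public static Double[] sortedWeights(double[][] sim) {
        TreeSet<Double> distinct = new TreeSet<>(Collections.reverseOrder());
        int n = sim.length;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                distinct.add(sim[i][j]);
        return distinct.toArray(new Double[0]);
    }
}
```

Applied to the similarity table above, this yields the five values (0.79, 0.56, 0.50, 0.45, 0.40).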

Let *G=(D,E)*. Initially, *|D|=6 and E=∅*, so we obtain the trivial complete classification *class(D)={{D_{1}},{D_{2}},{D_{3}},{D_{4}},{D_{5}},{D_{6}}}*, in which each database forms its own cluster.

For each similarity value δ listed in the decreasing order, a candidate classification *class(D,sim,δ)* is generated and checked to see whether it is a complete classification as follows.

For δ=0.79, we have E_{0.79}={(D_{2}, D_{4})} and E=E ∪ E_{0.79}.

*class(D,sim,0.79)={{D_{2},D_{4}},{D_{1}},{D_{3}},{D_{5}},{D_{6}}}*, which is a complete classification because:

[TeX:] $$\sum _ { i = 1 } ^ { k } \frac { \left| g _ { i } ^ { 0.79 } \right| ^ { 2 } - \left| g _ { i } ^ { 0.79 } \right| } { 2 } = \frac { | \{ D _ 2 , D _ 4 \} | ^ { 2 } - | \{ D _ 2 , D _ 4 \} | } { 2 } + \frac { | \{ D _ 1 \} | ^ { 2 } - | \{ D _ 1 \} | } { 2 } + \frac { | \{ D _ 3 \} | ^ { 2 } - | \{ D _ 3 \} | } { 2 } + \frac { | \{ D _ 5 \} | ^ { 2 } - | \{ D _ 5 \} | } { 2 } + \frac { | \{ D _ 6 \} | ^ { 2 } - | \{ D _ 6 \} | } { 2 } = 1 = |E|$$ with *goodness(class(D,0.79))* = 3.78

The remaining classifications are summarized in Table 2.

The only candidate classification that is not complete occurs at δ=0.50. This is because *|E|*=3 while [TeX:] $$\sum _ { i = 1 } ^ { k } \frac { \left| g _ { i } ^ { 0.50 } \right| ^ { 2 } - \left| g _ { i } ^ { 0.50 } \right| } { 2 } = 4 \neq | E |$$
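The completeness test used above can be sketched as follows; `isComplete` is a hypothetical helper (ours, not the paper's code) that compares the number of edges added so far with the number of edges every component would need in order to be a clique:

```java
public class CompletenessCheck {
    // A candidate classification is complete when the number of edges
    // added so far equals the number of edges needed for every
    // connected component g_i to be a clique, i.e.
    // sum_i (|g_i|^2 - |g_i|) / 2 == |E|.
    public static boolean isComplete(int[] groupSizes, int edgeCount) {
        int cliqueEdges = 0;
        for (int s : groupSizes)
            cliqueEdges += (s * s - s) / 2;   // C(s, 2) edges per clique
        return cliqueEdges == edgeCount;
    }
}
```

For δ=0.79 the group sizes are {2,1,1,1,1} with |E|=1, so the test succeeds; for δ=0.50 the group sizes are {2,3,1} with |E|=3, and the test fails since 4 ≠ 3.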

According to the results in Table 2, the best complete classification of the six databases, i.e., the one with the maximum goodness value, is *class(D,sim,0.45)={{D_{2},D_{4},D_{6}},{D_{1},D_{3},D_{5}}}*.

In the following section, we describe the data structures and algorithms used in the proposed classification approach.

Table 2. Candidate classifications generated at the remaining similarity levels

| Similarity level (δ) | Edge set (E_{δ}) | class(D,sim,δ) | goodness(class(D,δ)) |
|---|---|---|---|
| 0.56 | (D_{1},D_{3}) | {D_{2},D_{4}},{D_{1},D_{3}},{D_{5}},{D_{6}} | 4.90 |
| 0.50 | (D_{1},D_{5}) | {D_{2},D_{4}},{D_{1},D_{3},D_{5}},{D_{6}} | 5.90 |
| 0.45 | (D_{2},D_{6}),(D_{3},D_{5}),(D_{4},D_{6}) | {D_{2},D_{4},D_{6}},{D_{1},D_{3},D_{5}} | 6.60 |
| 0.40 | (D_{1},D_{2}),(D_{1},D_{4}),(D_{1},D_{6}),(D_{2},D_{3}),(D_{2},D_{5}),(D_{3},D_{4}),(D_{3},D_{6}),(D_{4},D_{5}),(D_{5},D_{6}) | {D_{1},D_{2},D_{3},D_{4},D_{5},D_{6}} | 5.80 |

The problem of generating a classification *class(D,sim,δ)* at a given similarity level δ depends mainly on the following sub-problems:

1) Find the list of edges whose weight equals δ, denoted *E_{δ}*, and add it to the graph.

2) Determine the connected components of *G* and check whether each of them forms a clique.

To solve the first problem, we use a dictionary data structure to search efficiently for duplicate weights. Our approach must list the m edge lists E_{δ} in decreasing order of the weight δ; we therefore implement this ordered dictionary with a balanced binary search tree (BST), such as a red-black tree [14].
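For illustration, Java's `TreeMap` (which is backed by a red-black tree) can serve as such an ordered dictionary. This is only a sketch of the idea, not the paper's integer-linked implementation, and it uses modern Java rather than the Java SE 6 of the experiments; the class and method names are ours:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class EdgeDictionary {
    // Ordered dictionary mapping each weight delta to the list of edge
    // indices sharing that weight. TreeMap is a red-black tree, so
    // search and insert are O(log m), and iterating the m keys with a
    // reverse-order comparator lists the weights in decreasing order.
    private final TreeMap<Double, List<Integer>> byWeight =
            new TreeMap<>(Collections.reverseOrder());

    public void addEdge(double weight, int edgeIndex) {
        byWeight.computeIfAbsent(weight, w -> new ArrayList<>()).add(edgeIndex);
    }

    public Iterable<Double> weightsDescending() {
        return byWeight.keySet();   // already sorted in decreasing order
    }

    public List<Integer> edgesAt(double weight) {
        return byWeight.get(weight);
    }
}
```

The paper's structure achieves the same ordering while storing only the head index of each edge list in the tree node, as described next.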

In the following section, we present a procedure to build the m-node BST in which each node represents one edge list *E_{δ}*. Instead of using pointers to link edges with the same weight value, we use integers to reference the edges and link them in a contiguous table of size (n^{2}–n)/2.

Let T be the balanced BST object and T.root its root node. Each node in T contains two fields, *weight* and *index*, which represent the weight value δ and the index of the first edge in *E_{δ}*, respectively. Edges with the same weight value δ are linked in a contiguous table V[1..(n^{2}–n)/2].

In the following procedure, we present the algorithm used to construct the ordered edge lists *E _{δ}*.

**DISCUSSION.** Initially (at line 2), the tree T is empty. Then, for each database pair *(D_{i},D_{j})* (at line 4), a binary search is performed at line 8 to see whether δ already exists in T. If there is a node with weight δ, the current edge is linked to the head of the corresponding list *E_{δ}*; otherwise, a new node is inserted into T.

Let *m* be the number of nodes in T, such that *m ∈ [1,(n^{2}–n)/2]*. Then the search, insert and rebalance operations each take *O*(log *m*) time.

The tree T and the index array V corresponding to the example above are shown in Fig. 2.

Each node of T mainly contains four fields: the similarity value δ, the index of the first edge in the list E_{δ}, and the left and right child pointers. The table on the right represents the array *V[1..(n^{2}-n)/2]*, in which edges with the same weight are linked. Only the rightmost column (the index column) actually resides in main memory; the other columns are shown for illustration purposes only.
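A minimal sketch of this integer-linked representation follows (class and method names are ours; we use 0-based Java arrays with one slot per edge index and −1 as the end-of-list sentinel, whereas the paper indexes V from 1):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeList {
    // Prepend edge k to the list whose current head index is 'head' and
    // return the new head; V[k] then points to the previous head, and
    // -1 marks the end of a list. The tree node stores only the head.
    public static int link(int k, int head, int[] V) {
        V[k] = head;
        return k;
    }

    // Follow the integer links in V from the head of E_delta until the
    // -1 sentinel, collecting the edge indices of the list.
    public static List<Integer> collect(int head, int[] V) {
        List<Integer> edges = new ArrayList<>();
        for (int k = head; k != -1; k = V[k]) edges.add(k);
        return edges;
    }
}
```

Because insertion is at the head, each `link` call is O(1), and no per-edge pointer storage is needed beyond the single integer array V.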

**PROPOSITION.** Let *(D_{i},D_{j})* be the vertex pair corresponding to the *k*-th edge in V. Then *i* and *j* are given by Eqs. (7) and (8),

where *⌈val⌉* is the least integer greater than or equal to *val* and [TeX:] $$C _ { n } ^ { 2 } = \left( n ^ { 2 } - n \right) / 2$$.

**Proof.** Let M be the triangular similarity table of size n×n. Then there are C_{n}^{2} similarity values in M, which are associated with C_{n}^{2} edges indexed from 1 to C_{n}^{2}. The 1st row of M is associated with the *n*–1 edges *(D_{1},D_{2}),…,(D_{1},D_{n})*, the 2nd row with the *n*–2 edges *(D_{2},D_{3}),…,(D_{2},D_{n})*, and so on.

To solve the second problem, which is determining and maintaining the connected components of *G*, we use the *disjoint-set forest* data structure [14]. This data structure keeps a collection of *n* disjoint updatable trees, where each tree represents one connected component of the graph *G*=(*D*,*E*). The root node of each tree serves as the representative node identifying that connected component. Each node in the disjoint-set forest, say *x*_{i} (*i*=1 to *n*), contains two main fields, *size* and *parent*, which represent respectively the size of the sub-tree rooted at *x*_{i} and a pointer to the parent node. In this paper, instead of using pointers, we use integers to link child nodes to their parent nodes. For example, if *x*_{j} is the parent of *x*_{i}, then we set *x*_{i}.*parent*=*j*. The data structure has two main operations, described as follows.

*union*(*x*_{i}, *x*_{j}): combines the trees identified by the roots *x*_{i} and *x*_{j}. That is, it links the root of the smaller tree to the root of the larger tree. After a *union* operation, the size of the resulting tree is set equal to the sum of the sizes of the input trees.

*findset*(*x*_{i}): returns the root of *x*_{i}'s tree, that is, it follows parent nodes starting from *x*_{i}'s parent until reaching the root node. For each edge incident on the *i*-th vertex, *findset* is called on the node *x*_{i} to find the connected component to which the *i*-th vertex belongs.

In this paper, we have modified *findset*'s implementation in order to identify cliques in *G*=(*D*,*E*). If *x*_{i} or *x*_{i}'s parent (say *x*_{j}) is the root, then *findset*(*x*_{i}) returns the root node. Otherwise, *x*_{i} is linked to *x*_{j}'s parent and *findset*(*x*_{i}) outputs *x*_{j}'s parent even if it is not the root. Thus, for each edge incident on the *i*-th vertex, *findset*(*x*_{i}) compresses the path between *x*_{i} and the root when neither *x*_{i} nor *x*_{i}'s parent is the root. Therefore, if *x*_{i}'s tree represents a connected component of *k* vertices in *G* and *findset*(*x*_{i})'s output is not the root, then there is at least one missing edge between the *i*-th vertex and the remaining (*k*–1) vertices of the same component. Consequently, the connected component containing the *i*-th vertex is not a clique in *G*.
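The disjoint-set operations described above can be sketched as follows (a sketch with 0-based vertex indices rather than the paper's 1-based ones; `findset` follows the modified one-step compression rule, so a non-root return value flags a component that is not yet a clique):

```java
public class DisjointSetForest {
    final int[] parent;  // parent[i] == i marks a root
    final int[] size;    // size of the sub-tree rooted at i

    // makeset: n singleton trees, one per vertex
    public DisjointSetForest(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    // Modified findset: returns the root when x or x's parent is the
    // root; otherwise it links x to its grandparent and returns the
    // grandparent, which may not be the root. A non-root return value
    // signals a missing edge, i.e. x's component is not a clique.
    public int findset(int x) {
        if (parent[x] == x) return x;          // x is the root
        int p = parent[x];
        if (parent[p] == p) return p;          // x's parent is the root
        parent[x] = parent[p];                 // compress one level
        return parent[p];
    }

    // union by size: the smaller tree is attached under the larger root
    public void union(int rootA, int rootB) {
        if (size[rootA] < size[rootB]) { int t = rootA; rootA = rootB; rootB = t; }
        parent[rootB] = rootA;
        size[rootA] += size[rootB];
    }
}
```

Each `findset` call does constant work, which is why lines 17–18 of the classification algorithm run in O(1).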

In the following, we present our classification algorithm, which takes as input the objects returned by the procedure *BuildEdgeTree*. The *findset*, *union* and *makeset* procedures are also given.

**Discussion.** Initially, the graph *G*=(*D*,*E*) has *n* disjoint components. Therefore, we get the trivial classification *class*(*D,sim*)={{*D*_{1}},{*D*_{2}},…,{*D*_{n}}}. Lines 1-7 initialize the variables used in our algorithm. The inter-cluster distance (*dist_inter*) between the *n* components is calculated during the call of the procedure *BuildEdgeTree*. The intra-cluster similarity (*sim_intra*) has a zero value, since there is one database in each cluster. We use the variable *goodness_max* to maintain the maximum value of *goodness* found so far.

At line 8, the procedure *makeset* initializes the data structure by creating *n* rooted trees *x*_{i} (*i*=1 to *n*), each of which contains one node (*x*_{i}.*size*=1). To access the *n* nodes, we use an array A[1..*n*] such that A[*i*] holds the node object *x*_{i} representing the *i*-th vertex in *G*. The time complexity of line 8 is *O*(*n*).

At line 9, the for-loop examines each node *e* of the edge tree (T) in decreasing order of the field *weight*. There are *m* nodes in T such that 1 ≤ *m* ≤ (*n*^{2}–*n*)/2, and each node represents one edge list *E*_{δ}. Browsing all the *m* nodes in sorted order takes *O*(*m*) time using an in-order traversal of T.

At line 12, the while-loop examines the current edge list *E*_{δ} such that *δ* = e.*weight*. Edges are listed starting from the first edge, whose index is e.*index*, until reaching a negative value (–1), which marks the end of the edge list. At line 13, we use Eqs. (7) and (8) to find the two end vertices (*D*_{i},*D*_{j}) corresponding to the *k*-th edge (i.e., the current edge).

At lines 17–18, the *findset* procedure is called once for each end-vertex of the *k*-th edge (*D*_{i}, *D*_{j}) to find the connected components to which *D*_{i} and *D*_{j} belong. The time complexity of lines 17–18 is *O*(1).

Let *x*_{i} and *x*_{j} be the nodes returned by *findset*. At line 19, if *x*_{i} and *x*_{j} are two distinct roots, then the *union* procedure (at line 21) merges the corresponding trees by making the smaller tree a sub-tree of the larger one. Otherwise, either the current classification is not complete or the two vertices *D*_{i} and *D*_{j} already belong to the same component.

At line 20, we update *nb_clique_edges*, which is the number of edges required so that each connected component of *G=*(*D,E*) becomes a clique. After each union operation, we decrement the number of components *nb_component*.

At line 24, we select the next edge in the list *E*_{δ} from the index array V[1..(*n*^{2}–*n*)/2].

At line 26, we test whether the current classification is complete. If the number of edges examined so far, *nb_edges*, is equal to *nb_clique_edges*, then the current classification is complete and the algorithm continues at line 29. Otherwise, the current classification is discarded.

At lines 29–31, the current complete classification *class*(*D,sim,δ*) is assessed using the goodness measure proposed in [7]. If *goodness*(*class, δ*) ≥ *goodness_max*, then *class*(*D,sim,δ*) becomes the best classification generated so far and *goodness_max* is updated.

During the course of our algorithm, we need to maintain the best classification generated so far as the disjoint-set forest is updated. To implement such a persistent data structure, additional fields (*version* and *parent_temp*) are associated with each node in the disjoint-set forest. These fields are required to preserve a past version of the dynamic set at the time the best classification was generated. Let *x* be a node in the disjoint-set forest. The field *x*.*parent_temp* keeps track of *x*'s parent node corresponding to the best classification generated so far. The field *x*.*version* determines the classification to which *x*'s parent refers.

In our algorithm, each classification *class*(*D*,*sim*,*δ*) is identified by a version number *class_version*. Let *best_class_version* be the version number of the best classification. Before each updating operation in the *union*/*findset* procedures, each node *x* is checked to see whether its current parent is part of the best classification produced so far. That is, if *x*.*version*≤*best_class_version*, then *x*'s parent index is saved in the field *x*.*parent_temp* before the update.
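A minimal sketch of this version bookkeeping for a single node (field and method names are ours) is:

```java
public class VersionedNode {
    int parent;      // current parent index
    int parentTemp;  // parent at the time of the best classification
    int version;     // classification version of the last parent update

    // Before overwriting 'parent' for classification 'classVersion',
    // preserve the old parent if it belongs to the best classification
    // found so far (identified by bestClassVersion).
    void setParent(int newParent, int classVersion, int bestClassVersion) {
        if (version <= bestClassVersion) parentTemp = parent;
        parent = newParent;
        version = classVersion;
    }

    // Read the parent as it was when the best classification was made:
    // if the node has been updated since then, fall back to parentTemp.
    int bestParent(int bestClassVersion) {
        return (version <= bestClassVersion) ? parent : parentTemp;
    }
}
```

This avoids copying the whole forest for every candidate classification: only nodes that are actually updated after the best classification carry a saved parent.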

At lines 36–41, we display the best classification of the *n* databases from the disjoint-set forest. Hence, we output the parent index of each node *x*_{i} (1≤*i*≤*n*) as it was at the time the best classification was generated. That is, if *x*_{i}.*version*≤*best_class_version*, then *x*_{i}.*parent* is returned as the cluster containing the *i*-th database; otherwise, *x*_{i}.*parent_temp* is returned instead.

The algorithm terminates once all the *m* edge lists *E*_{δ} are examined and the *m* candidate classifications are generated. The average size of each edge list is *O*(*n*^{2}/*m*). Therefore, exploring all the *m* edge lists takes *O*(*n*^{2}) time, where *n* is the number of databases, and the proposed algorithm finds the best complete classification of the set of *n* databases in *O*(*n*^{2}) time. Our algorithm compares favorably with *BestDatabasePartition* [7], which takes *O*(*m*×*n*^{2}) time to find the best partition of the multiple databases.

The space complexity of our algorithm is estimated as follows. To store the *m* nodes of T, we need *m*×(2 pointers + 1 real + 1 integer) units. We must add the space required to store the index array V, that is, (*n*^{2}–*n*)/2 integer units. The disjoint-set forest data structure also needs *n*×(4 integers) units. Therefore, the memory space required by our algorithm is (2*n*^{2}+14*n*+28*m*) bytes when using a typical compiler that represents a pointer, an integer and a real number with 8 bytes, 4 bytes and 8 bytes, respectively.
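Plugging in concrete numbers, e.g. n=20 and m=190 from Table 6, the estimate evaluates to 2·400 + 14·20 + 28·190 = 6,400 bytes. The helper below (ours, for checking the arithmetic) reproduces the count from the stated field sizes:

```java
public class MemoryEstimate {
    // Proposed algorithm: m tree nodes (2 pointers + 1 real + 1 integer),
    // the index array V of (n^2 - n)/2 integers, and n forest nodes of
    // 4 integers each, assuming 8-byte pointers and reals and 4-byte
    // integers, as in the text.
    public static long proposedBytes(long n, long m) {
        long tree = m * (2 * 8 + 8 + 4);           // 28m
        long indexArray = 4 * (n * n - n) / 2;     // 2n^2 - 2n
        long forest = n * 4 * 4;                   // 16n
        return tree + indexArray + forest;         // 2n^2 + 14n + 28m
    }
}
```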

In comparison, *BestDatabasePartition* needs *n*^{2} real units to store the similarity relation table. Assuming that *BestDatabasePartition* uses a balanced binary search tree to sort all the distinct similarity values, *m*×(2 pointers + 1 real) units are required to store the nodes of the tree. Each candidate classification is stored separately, so *m*×*n* integer units are required to store all the *m* candidate classifications. Therefore, the memory space required by *BestDatabasePartition* is (8*n*^{2}+24*m*+4*mn*) bytes using the same compiler. For large values of *n* and *m*, the proposed algorithm clearly consumes less memory than *BestDatabasePartition*.

To assess the performance of our proposed approach, we conducted experiments to analyze and compare the execution times of the existing algorithm *BestDatabasePartition* [7] and our proposed graph-based classification algorithm. All the experiments were run on a 2.70 GHz Pentium processor with 2 GB of memory, using Java SE 6.

We carried out the experiments using two synthetic datasets, T10I4D100K and T40I10D100K, available at http://fimi.ua.ac.be/data/. Some characteristics of these datasets are presented in Table 3, where DB, NT, ALT, NI and AFI denote the database name, the number of transactions, the average length of a transaction, the number of items and the average frequency of an item, respectively.

For multi-database mining, each of the datasets is divided horizontally into 10 and 20 databases. The multiple databases obtained from T10I4D100K and T40I10D100K are referred to as T_{1,1}, T_{1,2}, …, T_{1,n} and T_{4,1}, T_{4,2}, …, T_{4,n} respectively, such that n=10, 20.

Table 3. Characteristics of the two synthetic datasets

| DB | NT | ALT | NI | AFI |
|---|---|---|---|---|
| T10I4D100K | 100,000 | 11.102280 | 870 | 1276.124138 |
| T40I10D100K | 100,000 | 40.605070 | 942 | 4310.516985 |

By varying the value of the minimum support threshold (α), we obtain different sets of frequent itemsets using the FP-Growth algorithm [13], as shown in Table 4. We denote by {FIS(T_{i,1}, α), FIS(T_{i,2}, α),…, FIS(T_{i,n}, α)} the sets of frequent itemsets extracted from the *n* multiple databases T_{i,1}, T_{i,2}, …, T_{i,n} under α, for *i*=1,4.

Once the data preparation is done, we classify the multiple databases using our algorithm and *BestDatabasePartition* [7]. Since it is difficult to measure the performance of an algorithm executed by the Java Virtual Machine, due to the optimizations that occur during execution, each algorithm is executed 1,000 times on the same multiple databases to obtain the average running time.
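A simple way to obtain such an averaged timing (a sketch of the measurement idea only; the paper's actual harness is not shown, and the class name is ours) is:

```java
public class Benchmark {
    // Run the task 'runs' times and return the mean wall-clock time per
    // run in seconds, amortizing JIT warm-up effects over many runs.
    public static double averageSeconds(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) task.run();
        return (System.nanoTime() - start) / (runs * 1e9);
    }
}
```

Averaging over many executions smooths out just-in-time compilation and garbage-collection noise, which otherwise dominates single-run measurements on the JVM.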

In the first part of our experiments, we analyzed the impact of minsupp(α) on the classification execution time. Therefore, we set the number of databases *n*=10 and varied the value of α to obtain different sets of frequent itemsets. Then, we executed our algorithm and *BestDatabasePartition* on the same frequent itemsets. The classification results and the execution times are presented in Table 5 and Fig. 3, respectively. From the results, we can see that the execution time is relatively constant for both algorithms as the value of α increases. The reason is that *n* and the number of candidate classifications (*m*=45) did not change during the experiments, as shown in Table 5. Moreover, we did not take into account the overhead of calculating the similarities between the *n* databases, since the same similarity measure proposed in [7] was used for both algorithms. However, we observe that the execution time of our algorithm is shorter than that of *BestDatabasePartition* for α=0.03 to 0.06.

Table 4. Sets of frequent itemsets extracted under different minimum support thresholds

| Dataset (i=1,4) | # of multiple databases (n) | # of transactions in each database | Algorithm | minsupp (α) | Sets of frequent itemsets |
|---|---|---|---|---|---|
| T_{i} | 10 | 10,000 (×10) | FP-Growth | 0.03 | FIS(T_{i,1}, 0.03), …, FIS(T_{i,10}, 0.03) |
| | | | | 0.04 | FIS(T_{i,1}, 0.04), …, FIS(T_{i,10}, 0.04) |
| | | | | 0.05 | FIS(T_{i,1}, 0.05), …, FIS(T_{i,10}, 0.05) |
| | | | | 0.06 | FIS(T_{i,1}, 0.06), …, FIS(T_{i,10}, 0.06) |
| | 20 | 4,500 (×20) | | 0.03 | FIS(T_{i,1}, 0.03), …, FIS(T_{i,20}, 0.03) |

Table 5. Classification results and execution times for n=10 databases

| Multiple databases | # of candidate classifications (m) | α | Best classification | goodness | Execution time (s): BestDatabasePartition | Execution time (s): Proposed algorithm |
|---|---|---|---|---|---|---|
| T_{1,1},…,T_{1,10} | 45 | 0.03 | {T_{1,5}, T_{1,9}, T_{1,10}} {T_{1,8}} {T_{1,4}, T_{1,7}} {T_{1,6}} {T_{1,3}} {T_{1,2}} {T_{1,1}} | 4.75 | 0.00784 | 0.00227 |
| | | 0.04 | {T_{1,10}} {T_{1,5}, T_{1,9}} {T_{1,8}} {T_{1,7}} {T_{1,6}} {T_{1,4}} {T_{1,3}} {T_{1,2}} {T_{1,1}} | 0.096 | 0.00804 | 0.00240 |
| | | 0.05 | {T_{1,10}} {T_{1,9}} {T_{1,8}} {T_{1,3}, T_{1,5}, T_{1,7}} {T_{1,6}} {T_{1,4}} {T_{1,2}} {T_{1,1}} | 3.99 | 0.00799 | 0.00233 |
| | | 0.06 | {T_{1,2}, T_{1,3}, T_{1,4}, T_{1,5}, T_{1,6}, T_{1,7}, T_{1,8}, T_{1,9}, T_{1,10}} {T_{1,1}} | 34.71 | 0.00777 | 0.00225 |
| T_{4,1},…,T_{4,10} | 45 | 0.03 | {T_{4,4}, T_{4,10}} {T_{4,9}} {T_{4,8}} {T_{4,7}} {T_{4,6}} {T_{4,5}} {T_{4,3}} {T_{4,2}} {T_{4,1}} | 2.88 | 0.00780 | 0.00228 |
| | | 0.04 | {T_{4,10}} {T_{4,5}, T_{4,9}} {T_{4,8}} {T_{4,7}} {T_{4,6}} {T_{4,4}} {T_{4,3}} {T_{4,2}} {T_{4,1}} | 4.44 | 0.00769 | 0.00217 |
| | | 0.05 | {T_{4,10}} {T_{4,3}, T_{4,9}} {T_{4,8}} {T_{4,7}} {T_{4,6}} {T_{4,5}} {T_{4,4}} {T_{4,2}} {T_{4,1}} | 4.82 | 0.00766 | 0.00216 |
| | | 0.06 | {T_{4,10}} {T_{4,9}} {T_{4,8}} {T_{4,7}} {T_{4,6}} {T_{4,3}, T_{4,5}} {T_{4,4}} {T_{4,2}} {T_{4,1}} | 4.62 | 0.00773 | 0.00218 |

Table 6. Number of candidate classifications and time complexity for n=3 to 20 databases

| # of databases (n) | m (T_{1,1},…,T_{1,n}) | m (T_{4,1},…,T_{4,n}) | Execution time (T_{1,1},…,T_{1,n}) | Execution time (T_{4,1},…,T_{4,n}) | Proposed algorithm O(n^{2}): n^{2} | BestDatabasePartition O(m×n^{2}): m×n^{2} (T_{1}) | m×n^{2} (T_{4}) |
|---|---|---|---|---|---|---|---|
| 3 | 3 | 3 | 0.67 | 0.80 | 9 | 27 | 27 |
| 4 | 6 | 6 | 0.77 | 1.30 | 16 | 96 | 96 |
| 5 | 10 | 10 | 0.59 | 1.63 | 25 | 250 | 250 |
| 6 | 15 | 15 | 1.46 | 1.80 | 36 | 540 | 540 |
| 7 | 21 | 21 | 2.11 | 1.78 | 49 | 1029 | 1029 |
| 8 | 28 | 28 | 3.09 | 1.60 | 64 | 1792 | 1792 |
| 9 | 36 | 36 | 5.91 | 1.32 | 81 | 2916 | 2916 |
| 10 | 45 | 45 | 5.88 | 2.47 | 100 | 4500 | 4500 |
| 11 | 55 | 55 | 7.73 | 1.45 | 121 | 6655 | 6655 |
| 12 | 66 | 66 | 8.03 | 2.3 | 144 | 9504 | 9504 |
| 13 | 78 | 78 | 10.28 | 3.35 | 169 | 13182 | 13182 |
| 14 | 91 | 91 | 12.54 | 4.58 | 196 | 17836 | 17836 |
| 15 | 105 | 105 | 15.57 | 6.01 | 225 | 23625 | 23625 |
| 16 | 120 | 119 | 18.29 | 7.43 | 256 | 30720 | 30464 |
| 17 | 136 | 135 | 21.31 | 9.05 | 289 | 39304 | 39015 |
| 18 | 153 | 152 | 24.37 | 9.10 | 324 | 49572 | 49248 |
| 19 | 171 | 170 | 28.27 | 11.13 | 361 | 61731 | 61370 |
| 20 | 190 | 189 | 32.17 | 14.90 | 400 | 76000 | 75600 |

In the second part of our experiments, we studied how the number of databases *n* and the number of candidate classifications *m* influence the time complexity of the two algorithms. Hence, we set α=0.03 and varied *n* from 3 to 20 databases. The results obtained by the two algorithms are presented in Table 6 and Fig. 4. From the results, we can see that the time complexity and the execution time tend to increase as *n* grows. However, the execution time increases much more rapidly for *BestDatabasePartition*. The reason is that *m* becomes larger as *n* increases, as shown in Table 6, and since the time complexity of *BestDatabasePartition* depends strongly on *m*, we observe a rapid increase of *BestDatabasePartition*'s execution time for *n*=3 to 20.

From the experiment results above, we observe that our algorithm and *BestDatabasePartition* [7] find the same best classification of the *n* multiple databases. The reason is that the same similarity and goodness measures proposed in [7] were used for both algorithms. However, the execution time of our algorithm is the shortest because, unlike the existing classification algorithms [5-8], the time complexity of the classification-generation procedure in our algorithm depends only on the number of databases (*n*). In the worst case, *n*^{2} edges are processed to find the best classification of a set of *n* multiple databases.

In this paper, we have proposed an improved database classification algorithm for multi-database mining. Unlike existing works, our algorithm searches for the best classification of a database set D by analyzing and determining the connected components of an undirected weighted graph G=(D,E). The experiments demonstrate the accuracy and efficiency of our approach; in particular, the execution time of our algorithm is the shortest.

Future work will be directed toward improving the accuracy of the similarity measure between the databases. We will also assess our algorithm using real-world datasets and validate the tests on a real multi-database system.

He received a Ph.D. degree in computer science from the University of Science and Technology of Oran–Mohamed Boudiaf (USTO-MB), Algeria, in 2015. He is a member of the Signal, Systems and Data Laboratory (LSSD). His research interests include databases, multi-database mining, ontology, and grid and cloud computing.

- 1 S. Zhang, M. J. Zaki, "Mining multiple data sources: local pattern analysis," *Data Mining and Knowledge Discovery*, vol. 12, no. 2-3, pp. 121-125, 2006. doi: 10.1007/s10618-006-0041-y
- 2 S. Zhang, X. Wu, C. Zhang, "Multi-database mining," *IEEE Computational Intelligence Bulletin*, vol. 2, no. 1, pp. 5-13, 2003.
- 3 S. Zhang, C. Zhang, X. Wu, Knowledge Discovery in Multiple Databases. New York, NY: Springer, 2004.
- 4 A. Adhikari, P. Ramachandrarao, W. Pedrycz, Developing Multi-database Mining Applications. London: Springer, 2010.
- 5 X. Wu, C. Zhang, S. Zhang, "Database classification for multi-database mining," *Information Systems*, vol. 30, no. 1, pp. 71-88, 2005. doi: 10.1016/j.is.2003.10.001
- 6 H. Li, X. Hu, Y. Zhang, "An improved database classification algorithm for multi-database mining," in Frontiers in Algorithmics. Heidelberg: Springer, 2009, pp. 346-357. doi: 10.1007/978-3-642-02270-8_35
- 7 A. Adhikari, P. R. Rao, "Efficient clustering of databases induced by local patterns," *Decision Support Systems*, vol. 44, no. 4, pp. 925-943, 2008. doi: 10.1016/j.dss.2007.11.001
- 8 Y. Liu, D. Yuan, Y. Cuan, "Completely clustering for multi-databases mining," *Journal of Computational Information Systems*, vol. 9, no. 16, pp. 6595-6602, 2013. doi: 10.12733/jcisP0759
- 9 H. Liu, H. Lu, J. Yao, "Identifying relevant databases for multidatabase mining," in Research and Development in Knowledge Discovery and Data Mining. Heidelberg: Springer, 1998, pp. 15-18. doi: 10.1007/3-540-64383-4_18
- 10 H. Liu, H. Lu, J. Yao, "Toward multidatabase mining: identifying relevant databases," *IEEE Transactions on Knowledge and Data Engineering*, vol. 13, no. 4, pp. 541-553, 2001. doi: 10.1109/69.940731
- 11 R. Agrawal, J. C. Shafer, "Parallel mining of association rules," *IEEE Transactions on Knowledge and Data Engineering*, vol. 8, no. 6, pp. 962-969, 1996. doi: 10.1109/69.553164
- 12 R. Agrawal, R. Srikant, "Fast algorithms for mining association rules in large databases," in *Proceedings of the 20th International Conference on Very Large Data Bases*, Santiago de Chile, Chile, 1994, pp. 487-499.
- 13 J. Han, J. Pei, Y. Yin, R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," *Data Mining and Knowledge Discovery*, vol. 8, no. 1, pp. 53-87, 2004. doi: 10.1023/B:DAMI.0000005258.31418.83
- 14 T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1990.
*Data Mining and Knowledge Discovery, 2004*, vol. 8, no. 1, pp. 53-87. doi:[[[10.1023/B:DAMI.0000005258.31418.83]]] - 14 C. E. Leiserson, R. L. Rivest, T . H. Cormen, Introduction to Algorithms. CambridgeMA: MIT Press, 1990.custom:[[[-]]]