Yu-Jeong Yang* and Ki Yong Lee*

Effective and Efficient Similarity Measures for Purchase Histories Considering Product Taxonomy

Abstract: In an online shopping site or offline store, products purchased by each customer over time form the purchase history of the customer. Also, in most retailers, products have a product taxonomy, which represents a hierarchical classification of products. Considering the product taxonomy, the lower the level of the category to which two products both belong, the more similar the two products. However, there has been little work on similarity measures for sequences considering a hierarchical classification of elements. In this paper, we propose new similarity measures for purchase histories considering not only the purchase order of products but also the hierarchical classification of products. Unlike the existing methods, where the similarity between two elements in sequences is only 0 or 1 depending on whether the two elements are the same or not, the proposed method can assign any real number between 0 and 1 considering the hierarchical classification of elements. We apply this idea to extend three existing representative similarity measures for sequences. We also propose an efficient computation method for the proposed similarity measures. Through various experiments, we show that the proposed method can measure the similarity between purchase histories very effectively and efficiently.

Keywords: Hierarchical Classification, Purchase History, Sequence Similarity, Similarity Measure

1. Introduction

In an online shopping site or offline store, products purchased by each customer over time form the purchase history of the customer. Like a purchase history, where products purchased by a customer are listed in the order of purchase, data in which items are listed in some order is called a sequence. Because purchase histories, which are a kind of sequence, contain the behavioral characteristics and consumption patterns of customers, many companies analyze these purchase histories to develop marketing strategies or increase their sales. In particular, companies can find customers with similar purchase histories for product recommendation, or group customers with similar purchase histories to run targeted marketing [1-4].

Meanwhile, in most retailers, products have a product taxonomy, which represents a hierarchical classification of products, as shown in Fig. 1. Given a product taxonomy, the lower the level of the category to which two products both belong, the more similar the two products. For example, in Fig. 1, "Coke" and "Sprite" both belong to the same group "Beverage", while "Coke" and "Candy" belong to different groups but to the same division "Food". In this case, "Coke" and "Sprite" are more similar than "Coke" and "Candy" because "Coke" and "Sprite" belong to a common category at a lower level than "Coke" and "Candy" do. In other words, "Coke" and "Sprite" are closer in the hierarchical classification tree than "Coke" and "Candy".

So far, many studies have been conducted on measuring the similarity between sequences. However, most of them do not consider a hierarchical classification of the elements in sequences. For example, in Fig. 1, most existing measures do not distinguish the difference between "Coke" and "Sprite" from the difference between "Coke" and "Candy". As a result, they may not be able to accurately distinguish between similar and dissimilar purchase histories.
For example, consider three purchase histories $s_1$ = <"Coke", "Sandals">, $s_2$ = <"Sprite", "Sneakers">, and $s_3$ = <"Candy", "Camera">. In this case, although $s_1$, $s_2$, and $s_3$ consist of completely different products, we can see that $s_1$ and $s_2$ are more similar than $s_1$ and $s_3$, because "Coke" and "Sprite" are more similar than "Coke" and "Candy", and "Sandals" and "Sneakers" are more similar than "Sandals" and "Camera", considering the hierarchical classification of products. However, most existing similarity measures cannot distinguish the difference between $s_1$ and $s_2$ from the difference between $s_1$ and $s_3$, because they only consider whether the products composing the purchase histories match or not. As a result, most existing similarity measures produce the same similarity value for ($s_1$, $s_2$) and ($s_1$, $s_3$).

Therefore, in this paper, we propose new similarity measures for purchase histories considering not only the purchase order of products but also the hierarchical classification of products. Even if two purchase histories consist of completely different products, the proposed method considers the two purchase histories to be similar if their products are close in the hierarchical classification tree. Unlike the existing methods, where the similarity between two elements in sequences is only 0 or 1 depending on whether the two elements are the same or not, the proposed method can assign any real number between 0 and 1 considering the hierarchical classification of elements. We apply this idea to extend three existing representative similarity measures for sequences, i.e., the Levenshtein distance [5], the Needleman-Wunsch similarity [6], and the dynamic time warping (DTW) distance [7]. Besides these three measures, the proposed method can be generally applied to other similarity measures where the similarity between two elements is measured only as 0 or 1. We also propose an efficient computation method for the proposed similarity measures. The proposed computation method uses a segment tree to compute the similarity between two products in a hierarchical classification very efficiently. This allows the proposed measures to be used very efficiently even when the number of purchase histories to be compared is very large. Through various experiments, we show that the proposed measures can measure the similarity between purchase histories very effectively and efficiently when we are given a hierarchical classification of products.

Our contributions are summarized as follows:

- For the first time, we propose new similarity measures for purchase histories that consider not only the purchase order of products but also the hierarchical classification of products.
- We extend existing representative similarity measures for sequences to make them consider the hierarchical classification of products.
- We also propose an efficient computation method for the proposed similarity measures, which allows the proposed measures to be used for a large number of purchase histories.

The rest of the paper is organized as follows. Section 2 reviews the related studies, and Section 3 describes the proposed measures in detail. Section 4 presents the performance evaluation of the proposed measures. Finally, Section 5 summarizes and concludes the paper.

2. Related Work

2.1 Sequence Similarity Measures

A sequence is an ordered list of elements.
Examples of sequences include DNA sequences, protein sequences, web surfing histories, user activities, and product purchase histories [8-10]. To measure the similarity between sequences, we need to consider the order of elements in sequences [11]. Existing similarity measures for sequences can be largely divided into three categories, according to how the similarity between sequences is calculated:

Edit-based similarity measures

These measures compute the similarity between sequences by counting the number of edit operations required to make one sequence the same as the other sequence. The Levenshtein distance [5], also known as the edit distance, is a representative similarity measure for strings, which can be seen as sequences of characters. The Levenshtein distance is defined as the minimum number of edit operations required to transform one string into the other. The edit operations used to change a string are insertions, deletions, and substitutions, each performed on one character. Each time a character is added, deleted, or substituted in a string, 1 is added to the distance. Therefore, these measures do not consider the hierarchical classification of elements in sequences.

Alignment-based similarity measures

These measures compute the similarity between sequences as the cost of aligning or shifting the sequences to match the same parts in both sequences. These measures are mainly used in bioinformatics to compare protein sequences or nucleic acid sequences [12]. They are divided into local and global alignment methods depending on the scope of finding similar parts between two sequences. The Needleman-Wunsch similarity [6] is a representative global alignment method. It inserts gaps to align two sequences so that the similarity between them is maximized, and then computes the similarity by summing the gap costs and the match scores between aligned elements. However, these methods also do not consider the hierarchical classification of elements in sequences.

Matching-based similarity measures

These measures compute the similarity between sequences as the cost of matching each element in one sequence to one or more elements in the other sequence. The DTW distance [7] is a representative matching-based similarity measure. The DTW distance was originally proposed to measure the similarity between two time series sampled at different rates. Fig. 2 shows an example of computing the similarity between two sequences <$a_1, a_2, \ldots, a_7$> and <$b_1, b_2, \ldots, b_7$> using the Euclidean distance and the DTW distance, respectively. In the Euclidean distance, each element $a_i$ is matched only to $b_i$, the element in the same position in the other sequence. On the contrary, the DTW distance can match $a_i$ to any number of elements at any position in the other sequence. The DTW distance measures the distance between two sequences by computing the minimum cost of matching the elements in one sequence to the elements in the other. Therefore, the DTW distance is known to be very effective for measuring the distance between sequences of different lengths [13,14]. However, the DTW distance does not consider the hierarchical classification of elements in sequences when matching them.
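To make the 0/1 element cost shared by these classic measures concrete, the following is a minimal Python sketch of the Levenshtein recurrence (our own illustration, not code from the paper). Note that it charges the same substitution cost of 1 for any pair of different products, no matter how close they are in a taxonomy.

def levenshtein(s1, s2):
    # M[i][j] = minimum number of edits to turn s1[:i] into s2[:j]
    n, m = len(s1), len(s2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = i                                     # i deletions
    for j in range(m + 1):
        M[0][j] = j                                     # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1   # 0/1 cost only
            M[i][j] = min(M[i][j - 1] + 1,              # insertion
                          M[i - 1][j] + 1,              # deletion
                          M[i - 1][j - 1] + cost)       # substitution
    return M[n][m]

# Both pairs get distance 2, although the first pair is taxonomically closer.
print(levenshtein(["Coke", "Sandals"], ["Sprite", "Sneakers"]))   # 2
print(levenshtein(["Coke", "Sandals"], ["Candy", "Camera"]))      # 2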
2.2 Segment Tree

Unlike the previous measures, which do not consider the similarity between elements in sequences, the proposed measures consider the similarity between products in purchase histories based on the hierarchical classification of products. In order to quickly compute the similarity between products, the proposed method uses a segment tree [15], an example of which is shown in Fig. 3. In a segment tree, each node maintains some information about an interval or segment (e.g., sum, minimum, maximum), and the interval of a parent node is the union of the intervals of its child nodes. For example, in Fig. 3, each node in the segment tree maintains a sum value over an interval, which corresponds to the sum of the sum values of its child nodes. The segment tree is known to be efficient in finding the minimum value over any given interval, because each node stores information about a sub-interval computed in advance [16,17]. In this paper, we construct a segment tree that stores information about the hierarchical classification of products and use it to quickly compute the distance between products.
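As a reference point for Section 3.4, the following is a minimal range-minimum segment tree sketch in Python (the class name and layout are our own assumptions, not the paper's implementation). Building it takes O(n) time, and each interval query takes O(log n), because every node stores the minimum of a precomputed sub-interval.

class SegmentTree:
    def __init__(self, values):
        # Bottom-up segment tree: leaves at tree[n..2n-1], parents above them.
        self.n = len(values)
        self.tree = [0] * (2 * self.n)
        self.tree[self.n:] = values
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        # Minimum over the inclusive interval [lo, hi].
        lo += self.n
        hi += self.n + 1
        best = float("inf")
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo]); lo += 1
            if hi & 1:
                hi -= 1; best = min(best, self.tree[hi])
            lo //= 2; hi //= 2
        return best

st = SegmentTree([5, 2, 7, 1, 9, 3])
print(st.query(1, 3))   # 1, the minimum over the interval [2, 7, 1]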
3. Proposed Method

3.1 Overview

In this paper, we propose new similarity measures for purchase histories considering the hierarchical classification of products (i.e., a product taxonomy). We also propose an efficient computation method for the proposed similarity measures. A purchase history is a sequence of products listed in order of purchase. Unlike the previous measures for sequences, the proposed measures consider both the order of products and the similarity between products to compute the similarity between purchase histories. For this purpose, we extend the existing representative similarity measures for sequences (i.e., the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance) to consider the hierarchical classification of products. In the next subsections, we first present the basic idea of the proposed measures and then describe how to extend the existing measures to consider a product taxonomy.

3.2 Basic Idea

As mentioned in Section 2, the previous similarity measures for sequences assign the distance between two products as 0 if they are identical and 1 if they are different. Thus, they do not consider the hierarchical classification of products. On the other hand, by considering the hierarchical classification of products, the proposed measures assign a value between 0 and 1 as the distance between two products. Let dist(x, y) be the distance between two products x and y. Then, in the previous measures, dist(x, y) is defined as follows:

(1)
$$\operatorname{dist}(x, y)= \begin{cases} 0, & \text{if } x = y \\ 1, & \text{if } x \neq y \end{cases}$$

Let T be a tree representing a given hierarchical classification, an example of which is shown in Fig. 1. We call T a hierarchical classification tree. Given T, we define dist(x, y) as follows:
(2)
$$\operatorname{dist}(x, y)=\frac{\operatorname{shortestPathLen}(x, y, T)}{\operatorname{longestPathLen}(T)}$$

Here, shortestPathLen(x, y, T) represents the length of the shortest path between products x and y in T, and longestPathLen(T) represents the length of the path between the two leaf nodes that are farthest apart in T. dist(x, y) has a value between 0 and 1. dist(x, y) has a smaller value as the two products x and y are closer in T, i.e., the lower the level of the category to which both products belong. On the other hand, dist(x, y) has a larger value as the two products x and y are farther apart in T, i.e., the higher the level of the category to which both products belong. For example, in Fig. 1, dist("Coke", "Sprite") = 2/7 because shortestPathLen("Coke", "Sprite", T) = 2 and longestPathLen(T) = 7. Also, dist("Coke", "Candy") = 4/7 because shortestPathLen("Coke", "Candy", T) = 4 and longestPathLen(T) = 7. Therefore, we can see that dist("Coke", "Sprite") < dist("Coke", "Candy"), as we expected.
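The following Python sketch illustrates Eq. (2) on the fragment of Fig. 1 discussed above. The parent links and the category name "Snack" for "Candy" are our own assumptions for illustration, and the constant 7 is longestPathLen(T) for the full tree of Fig. 1, not for this fragment.

# Child-to-parent links for a fragment of the tree T of Fig. 1.
parent = {"Sprite": "Beverage", "Coke": "Beverage", "Candy": "Snack",
          "Beverage": "Food", "Snack": "Food", "Food": "Root"}

def shortest_path_len(x, y):
    # Record the distance from x to each of its ancestors.
    up, node, d = {}, x, 0
    while True:
        up[node] = d
        if node not in parent:
            break
        node, d = parent[node], d + 1
    # Walk upward from y until we reach an ancestor of x (their LCA).
    node, d = y, 0
    while node not in up:
        node, d = parent[node], d + 1
    return up[node] + d

LONGEST_PATH_LEN = 7          # longestPathLen(T) for the full tree of Fig. 1

def dist(x, y):
    return shortest_path_len(x, y) / LONGEST_PATH_LEN

print(dist("Coke", "Sprite"))   # 2/7 ≈ 0.286
print(dist("Coke", "Candy"))    # 4/7 ≈ 0.571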
Based on this idea, we can extend the existing similarity measures for sequences, which assign only 0 or 1 as the distance between two products, to reflect the hierarchical classification of products. In the next subsection, we describe how to extend the three existing representative similarity measures for sequences, i.e., the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance.

3.3 Extensions of Existing Similarity Measures

Suppose we are given two purchase histories $s_1$ = <$x_1, x_2, \ldots, x_n$> and $s_2$ = <$y_1, y_2, \ldots, y_m$>, where $x_i$ and $y_i$ represent the ith purchased product in $s_1$ and $s_2$, respectively. Given $s_1$ and $s_2$, the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance all create a two-dimensional array M of size n×m, as shown in Fig. 4. Initially, all elements M[i][j] are set to specified values depending on the measure. Each measure then updates each element M[i][j] using its three neighboring elements M[i][j-1], M[i-1][j], and M[i-1][j-1], with a different formula per measure, as described in detail below. After obtaining all M[i][j] for i = 1, 2, …, n and j = 1, 2, …, m, M[n][m] is returned as the distance between $s_1$ and $s_2$. Now we describe how to extend each measure to consider the hierarchical classification of products.

Extension of the Levenshtein distance

In the Levenshtein distance, M[i][0] and M[0][j] are initialized to M[i][0] = i and M[0][j] = j, respectively, for i = 1, 2, …, n and j = 1, 2, …, m. M[i][j] is then updated using the following equation:

(3)
$$M[i][j]=\min \left(M[i][j-1]+1, M[i-1][j]+1, M[i-1][j-1]+1_{\left(x_{i} \neq y_{j}\right)}\right),$$

where $1_{(x_i \neq y_j)}$ is the indicator function, equal to 0 when $x_i = y_j$ and equal to 1 otherwise. In other words, the Levenshtein distance sets the distance between $x_i$ and $y_j$ to 0 if $x_i = y_j$, and to 1 if $x_i \neq y_j$. Thus, we extend Eq. (3) to allow the distance between $x_i$ and $y_j$ to have a value between 0 and 1 as follows:
(4)
$$M[i][j]=\min \left(M[i][j-1]+1, M[i-1][j]+1, M[i-1][j-1]+\operatorname{dist}\left(x_{i}, y_{j}\right)\right),$$

where $\operatorname{dist}(x_i, y_j)$ is defined as in Eq. (2) and corresponds to the distance between $x_i$ and $y_j$ in the hierarchical classification tree T. In this way, we can extend the Levenshtein distance to consider the hierarchical classification of products.
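As a concrete illustration, the following Python sketch implements Eq. (4), reusing the taxonomy-based dist(x, y) sketched in Section 3.2; it is our own rendering of the extended recurrence, not the authors' code.

def extended_levenshtein(s1, s2):
    # Same DP as the Levenshtein distance, but the substitution cost is
    # dist(x, y) of Eq. (2) instead of the 0/1 indicator of Eq. (3).
    n, m = len(s1), len(s2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = i
    for j in range(m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(M[i][j - 1] + 1,
                          M[i - 1][j] + 1,
                          M[i - 1][j - 1] + dist(s1[i - 1], s2[j - 1]))
    return M[n][m]

# Substituting "Coke" with "Sprite" now costs only 2/7 instead of 1.
print(extended_levenshtein(["Coke", "Candy"], ["Sprite", "Candy"]))  # ≈ 0.286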
Extension of the Needleman-Wunsch similarity

In the Needleman-Wunsch similarity, M[i][0] and M[0][j] are initialized to M[i][0] = −i and M[0][j] = −j, respectively, for i = 1, 2, …, n and j = 1, 2, …, m. M[i][j] is then updated using the following equation:

(5)
$$M[i][j]=\max \left(M[i][j-1]-1, M[i-1][j]-1, M[i-1][j-1]+s\left(x_{i}, y_{j}\right)\right),$$

where $s(x_i, y_j)$ is a match score function that returns 1 if $x_i = y_j$ and 0 otherwise. In other words, the Needleman-Wunsch similarity assigns a match score of 1 between $x_i$ and $y_j$ if $x_i = y_j$, and 0 if $x_i \neq y_j$. Thus, we extend Eq. (5) to allow the match score between $x_i$ and $y_j$ to have a value between 0 and 1 as follows:
(6)
$$M[i][j]=\max \left(M[i][j-1]-1, M[i-1][j]-1, M[i-1][j-1]+\left(1-\operatorname{dist}\left(x_{i}, y_{j}\right)\right)\right),$$

where $\operatorname{dist}(x_i, y_j)$ is defined as in Eq. (2) and corresponds to the distance between $x_i$ and $y_j$ in the hierarchical classification tree T. Note that Eq. (6) differs from Eq. (5) only in that the 0/1 match score $s(x_i, y_j)$ is replaced by $1 - \operatorname{dist}(x_i, y_j)$. In this way, we can extend the Needleman-Wunsch similarity to consider the hierarchical classification of products.
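Analogously, a hedged Python sketch of Eq. (6), again reusing dist(x, y) from Section 3.2:

def extended_needleman_wunsch(s1, s2):
    # Gap penalty -1; the match score is 1 - dist(x, y) instead of 0/1.
    n, m = len(s1), len(s2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = -i
    for j in range(m + 1):
        M[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i][j - 1] - 1,
                          M[i - 1][j] - 1,
                          M[i - 1][j - 1] + (1 - dist(s1[i - 1], s2[j - 1])))
    return M[n][m]

print(extended_needleman_wunsch(["Coke"], ["Sprite"]))   # 1 - 2/7 ≈ 0.714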
Extension of the DTW distance

In the DTW distance, all elements M[i][j] are initialized to ∞ for i = 1, 2, …, n and j = 1, 2, …, m. M[i][j] is then updated using the following equation:

(7)
$$M[i][j]=d\left(x_{i}, y_{j}\right)+\min \left(M[i][j-1], M[i-1][j], M[i-1][j-1]\right),$$

where $d(x_i, y_j) = |x_i - y_j|$, assuming that $x_i$ and $y_j$ are numbers. Although the DTW distance allows a value other than 0 and 1 as the distance between $x_i$ and $y_j$, this distance is defined only for numbers. Thus, we extend Eq. (7) to allow the distance between two products $x_i$ and $y_j$ as follows:

(8)
$$M[i][j]=\operatorname{dist}\left(x_{i}, y_{j}\right)+\min \left(M[i][j-1], M[i-1][j], M[i-1][j-1]\right),$$

where $\operatorname{dist}(x_i, y_j)$ is defined as in Eq. (2) and corresponds to the distance between $x_i$ and $y_j$ in the hierarchical classification tree T. In this way, we can extend the DTW distance to consider the hierarchical classification of products.
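A corresponding Python sketch of Eq. (8) (our illustration; the boundaries follow the usual DTW convention, with M[0][0] = 0 and ∞ elsewhere on the borders):

def extended_dtw(s1, s2):
    # DTW where the per-match cost is the taxonomy-based dist(x, y).
    n, m = len(s1), len(s2)
    INF = float("inf")
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = dist(s1[i - 1], s2[j - 1]) + \
                      min(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1])
    return M[n][m]

# "Coke" is matched to both "Sprite" occurrences at a cost of 2/7 each.
print(extended_dtw(["Coke"], ["Sprite", "Sprite"]))   # 4/7 ≈ 0.571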
3.4 Efficient Computation Method for the Proposed Similarity Measures

In this subsection, we propose an efficient method to compute the proposed similarity measures presented in Section 3.3. We first describe a naïve method and then present an improved version that eliminates its disadvantage.

3.4.1 Naïve method

In order to compute the proposed similarity measures, we need to compute $\operatorname{dist}(x_i, y_j)$ repeatedly for all pairs of $x_i$ and $y_j$ in $s_1$ and $s_2$. Therefore, as $s_1$ and $s_2$ become longer, the computation cost increases. When we compute $\operatorname{dist}(x_i, y_j)$, since longestPathLen(T) is a fixed value, it is very important to obtain shortestPathLen($x_i$, $y_j$, T) efficiently to reduce the computation cost. A simple way to obtain shortestPathLen($x_i$, $y_j$, T) for given $x_i$ and $y_j$ is as follows. We first find the paths from the root node of T to $x_i$ and to $y_j$, respectively. We then compare the nodes in the two paths one by one to count the number of mismatched nodes. For example, Fig. 5 shows the process of computing shortestPathLen("Sandals", "Vests", T) using this naïve method. The two paths in Fig. 5 represent the paths from the root node of T to "Sandals" and to "Vests", respectively. The naïve method compares the two paths node by node from the root, and this process continues until the two nodes no longer match. Finally, the total number of mismatched nodes in the two paths is the length of the shortest path between the two products. Therefore, in Fig. 5, shortestPathLen("Sandals", "Vests", T) = 2 + 3 = 5, as illustrated in the sketch below.
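The naïve path comparison can be sketched in Python as follows; the intermediate categories below are hypothetical (a fresh fragment, unrelated to the Section 3.2 sketch), chosen only so that the counts match the Fig. 5 example, where shortestPathLen("Sandals", "Vests", T) = 2 + 3 = 5.

# Hypothetical child-to-parent links reproducing the Fig. 5 situation.
parent = {"Sandals": "Shoes", "Shoes": "Clothing",
          "Vests": "Outerwear", "Outerwear": "Tops", "Tops": "Clothing",
          "Clothing": "Root"}

def root_path(node):
    # Path from the root of T down to the given node.
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def naive_shortest_path_len(x, y):
    px, py = root_path(x), root_path(y)
    k = 0                   # length of the matching prefix of the two paths
    while k < min(len(px), len(py)) and px[k] == py[k]:
        k += 1
    # Count the mismatched nodes remaining on each path.
    return (len(px) - k) + (len(py) - k)

print(naive_shortest_path_len("Sandals", "Vests"))   # 2 + 3 = 5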
However, the naïve method has the disadvantage that the cost of computing shortestPathLen($x_i$, $y_j$, T) increases with the size of T, because the length of a path from the root node of T to a leaf node increases. This problem becomes more serious as the lengths of $s_1$ and $s_2$ increase, because we need to compute $\operatorname{dist}(x_i, y_j)$ repeatedly for all pairs of $x_i$ and $y_j$ in $s_1$ and $s_2$. Therefore, we propose an efficient computation method that minimizes this problem.

3.4.2 Proposed computation method

To overcome the drawback of the naïve method, we propose a very efficient method to compute shortestPathLen($x_i$, $y_j$, T). The proposed computation method uses a segment tree, which we described in Section 2.2. Before computing shortestPathLen($x_i$, $y_j$, T), the proposed method traverses the hierarchical classification tree T once to build a segment tree, which is then used for computing shortestPathLen($x_i$, $y_j$, T). Fig. 6 shows an example of a segment tree built by the proposed computation method.

In order to build a segment tree, we first assign a product number to each leaf node of the hierarchical classification tree T in the order 0, 1, … from left to right (e.g., Sprite = 0, Coke = 1, Candy = 2, …). In the segment tree built from T, each node is associated with an interval [i, j] of product numbers and stores depth[LCA(i, j, T)], the depth of the lowest common ancestor (LCA) of products i and j in T. Here, the depth of a node is the number of edges from the root node to that node, and the LCA of two products is the lowest-level node that contains both products. For example, in Fig. 6, the node "Food" in the segment tree is associated with the interval [0, 2], which represents the range of product numbers of the products belonging to the "Food" category. The node "Food" also stores depth[LCA(0, 2, T)] = 1, because the depth of the LCA of product 0 (i.e., Sprite) and product 2 (i.e., Candy) is 1 in T.

In the naïve method described in Section 3.4.1, the final step in computing shortestPathLen($x_i$, $y_j$, T) is to compare the paths from the root node of T to $x_i$ and to $y_j$ to find the last matching node. In fact, the last matching node is exactly the LCA of $x_i$ and $y_j$. Therefore, the proposed method uses the segment tree to quickly find the depth of the LCA of $x_i$ and $y_j$ and uses it to compute shortestPathLen($x_i$, $y_j$, T). More specifically, we compute shortestPathLen($x_i$, $y_j$, T) as follows:
(9)
$$\operatorname{shortestPathLen}\left(x_{i}, y_{j}, T\right)=\operatorname{depth}\left[x_{i}\right]+\operatorname{depth}\left[y_{j}\right]-2 \times \operatorname{depth}\left[\operatorname{LCA}\left(p\left(x_{i}\right), p\left(y_{j}\right), T\right)\right],$$

where $p(x_i)$ and $p(y_j)$ are the product numbers of $x_i$ and $y_j$, respectively. The length of the path from $x_i$ to LCA($p(x_i)$, $p(y_j)$, T) is depth[$x_i$] − depth[LCA($p(x_i)$, $p(y_j)$, T)], and the length of the path from LCA($p(x_i)$, $p(y_j)$, T) to $y_j$ is depth[$y_j$] − depth[LCA($p(x_i)$, $p(y_j)$, T)], so the length of the shortest path between the two products $x_i$ and $y_j$ can be expressed as in Eq. (9).

The proposed method first generates two one-dimensional arrays by traversing the hierarchical classification tree T: the first array stores the nodes in order of visit, and the second stores the depth of each node. In the hierarchical classification tree T, let us assume that the node ID of a parent node is always smaller than those of its child nodes. Then, we can guarantee that the LCA of $x_i$ and $y_j$ in T is the node with the smallest node ID among the nodes on the traversal path from $x_i$ to $y_j$. Therefore, we can build the segment tree from these two arrays, which store the nodes in order of visit and the depth of each node, respectively. When computing shortestPathLen($x_i$, $y_j$, T) as defined in Eq. (9), we can use the segment tree to quickly obtain the value of depth[LCA($p(x_i)$, $p(y_j)$, T)]. Therefore, shortestPathLen($x_i$, $y_j$, T) can be obtained without finding the paths from the root node of T to $x_i$ and $y_j$ and comparing the nodes in the two paths one by one. Consequently, the cost of computing $\operatorname{dist}(x_i, y_j)$ is greatly reduced. We will present the performance evaluation of the proposed computation method in Section 4.2.
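The whole pipeline can be sketched as follows (our own rendering under the paper's assumptions: preorder node IDs so that every parent's ID is smaller than its children's, and the range-minimum SegmentTree sketched in Section 2.2). An Euler tour of T records the nodes in order of visit; the minimum node ID on the tour between the first visits of two products is their LCA, so depth[LCA] and hence Eq. (9) are obtained in O(log n) per query.

# A tiny hypothetical tree T: node 0 is the root, IDs are in preorder.
children = {0: [1, 4], 1: [2, 3], 4: [5, 6]}
depth = {0: 0, 1: 1, 2: 2, 3: 2, 4: 1, 5: 2, 6: 2}

euler, first = [], {}        # visit order and first-visit index of each node
def tour(v):
    first.setdefault(v, len(euler))
    euler.append(v)
    for c in children.get(v, []):
        tour(c)
        euler.append(v)      # record v again when returning from child c
tour(0)

st = SegmentTree(euler)      # range-minimum tree over node IDs on the tour

def shortest_path_len(x, y):
    i, j = sorted((first[x], first[y]))
    lca = st.query(i, j)     # smallest node ID on the tour segment = LCA
    return depth[x] + depth[y] - 2 * depth[lca]   # Eq. (9)

print(shortest_path_len(2, 5))   # leaves in different subtrees -> 4
print(shortest_path_len(2, 3))   # sibling leaves -> 2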
4. Performance Evaluation

In this section, we present the performance evaluation of the proposed measures. First, we evaluated the effectiveness of the proposed similarity measures presented in Section 3.3. We then evaluated the efficiency of the proposed computation method presented in Section 3.4. The proposed method was implemented in Python, and the experiments were conducted on a PC running Windows 10 with an Intel Core i7-5829 3.3 GHz CPU and 8 GB of memory. To construct a hierarchical classification tree for products, we referred to the actual product classification system of Amazon.com, a representative online shopping site.

4.1 Effectiveness Evaluation

First, we evaluated whether the proposed similarity measures, which consider the product taxonomy, are more effective in measuring the similarity between purchase histories than the existing measures. Fig. 7 shows the set of purchase histories used in this experiment. The nine purchase histories (i.e., $s_1, s_2, \ldots, s_9$) are divided into three groups, Group 1, Group 2, and Group 3, each of which consists of similar purchase histories. Group 1 comes from Amazon customers' real purchase histories, while Groups 2 and 3 are synthetic purchase histories. Each product in Group 1 represents an actual product sold on Amazon, as shown in Table 1. Note that the purchase histories in each group contain products with similar categories in a similar order. We measured the distances between all pairs of $s_1, s_2, \ldots, s_9$ using the three existing similarity measures (i.e., the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance) and the proposed extended versions of them, and compared the results.
Fig. 8 shows the distances between $s_1, s_2, \ldots, s_9$ measured by the Levenshtein distance and by the proposed extended version. The original Levenshtein distance computes the similarity between two sequences based only on how many products match exactly. Therefore, even if two sequences have similar products in the same order, it considers the two sequences completely different. On the other hand, the proposed method measures the similarity between two purchase histories more accurately by considering whether they are composed of products with similar categories. For example, the original Levenshtein distance measures the distance between $s_1$ and $s_2$ as 3 because they consist of completely different products. However, "Candy" and "Chocolate", "Computer" and "Camera", and "Sprite" and "Coke" are pairwise similar, because each pair belongs to the same low-level category. If we use the proposed measure, the distance between $s_1$ and $s_2$ is measured as 0.75, which is more accurate. Note that, in Fig. 8(b), the purchase histories belonging to the same group have smaller distances (i.e., those in the dotted boxes) than the other purchase histories.

Fig. 9 shows the similarities between $s_1, s_2, \ldots, s_9$ measured by the Needleman-Wunsch similarity and by the proposed extended version. Like the Levenshtein distance, the Needleman-Wunsch similarity computes the similarity between two sequences based on how many elements are identical. Therefore, also in the case of the Needleman-Wunsch similarity, two sequences consisting of totally different products are considered completely different, even if their products have similar categories. However, the proposed measure computes the similarity between two purchase histories more accurately by considering the hierarchical classification of products. In Fig. 9(b), we can see that the purchase histories belonging to the same group have higher similarities (i.e., those in the dotted boxes) than the other purchase histories.

Fig. 10 shows the distances between $s_1, s_2, \ldots, s_9$ measured by the DTW distance and by the proposed extended version. The original DTW distance computes the distance between two purchase histories by adding 1 for each mismatched pair of products. Thus, the original DTW distance cannot differentiate products with similar categories from those with dissimilar categories. However, the proposed extended version of the DTW distance computes the distance between two purchase histories more effectively by considering the product taxonomy, so it determines that the purchase histories belonging to the same group are more similar than the others. In Fig. 10(b), we can see that the purchase histories belonging to the same group have smaller distances (i.e., those in the dotted boxes) than the other purchase histories.

From the results of these experiments, we can see that the proposed measures compute the distance (or similarity) between purchase histories more accurately than the existing similarity measures by considering the product taxonomy. In particular, the proposed measures assign a smaller distance to two purchase histories, even if they consist of completely different products, as long as they are composed of products with similar categories.
4.2 Efficiency Evaluation

Next, we evaluated how much the proposed computation method presented in Section 3.4.2 improves the efficiency of computing the proposed similarity measures compared to the naïve method presented in Section 3.4.1. In this experiment, we measured the time taken to compute the proposed similarity measures (i.e., the proposed extended versions of the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance) for 100 pairs of purchase histories, using the proposed computation method and the naïve method, respectively. For this experiment, we randomly generated synthetic purchase histories of different lengths and built a segment tree. When the number of products in the hierarchical classification tree was 10,000, the time taken to build the segment tree was 25 seconds on average. Considering that the segment tree is built once and used repeatedly for subsequent similarity evaluations, this time is very short and negligible.

Processing time by varying the average length of sequences

Fig. 11 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively. In this experiment, we varied the average length of purchase histories from 5 to 25, and set the number of products in the hierarchical classification tree to 10,000 and the height of the tree to 12. In Fig. 11, we can see that the time taken to compute the proposed measures using the naïve method increases rapidly as the average length of purchase histories increases. This is because the naïve method needs to find the paths from the root node to leaf nodes in the hierarchical classification tree and compare the nodes in the paths repeatedly for each pair of products in the purchase histories. On the other hand, the proposed computation method reduces the processing time significantly by using the segment tree, built in advance, to quickly obtain the shortest distance between two products in the hierarchical classification tree. Note that the three proposed measures take almost the same time to compute, whether using the proposed computation method or the naïve method, because all three measures use similar computation processes. In particular, the processing times of the proposed computation method for the three extended measures are almost identical, so they are hard to distinguish in Fig. 11; this is because the proposed computation method obtains the shortest distance between two products almost immediately, so it takes very little processing time for all three extended measures.

Processing time by varying the height of the classification tree

Fig. 12 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively, while varying the height of the hierarchical classification tree from 5 to 17. In this experiment, we set the average length of purchase histories to 20 and the number of products in the hierarchical classification tree to 10,000. In Fig. 12, we can observe that the processing time of the naïve method increases slowly as the height of the hierarchical classification tree increases, because the length of the paths from the root node to the leaf nodes increases.
In comparison, the proposed computation method shows almost constant time regardless of the height of the hierarchical classification tree, because the time to obtain the shortest distance between two products using the segment tree is not affected by the height of the tree. Therefore, we can confirm that the proposed computation method is far more efficient than the naïve method for computing the similarity between purchase histories. In Fig. 12, note also that the processing times of the proposed computation method for the three extended measures are almost the same, for the same reason as in Fig. 11.

Processing time by varying the number of products in the classification tree

Fig. 13 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively, while varying the number of products in the hierarchical classification tree from 7,000 to 25,000. In this experiment, we set the average length of purchase histories to 20 and the height of the hierarchical classification tree to 12. In Fig. 13, the processing time of the naïve method increases linearly as the number of products in the hierarchical classification tree increases. This is because, in the naïve method, the time taken to compute the shortest distance between two products increases with the size of the tree. On the contrary, the processing time of the proposed computation method does not increase, because the time to obtain the shortest distance between two products using the segment tree is not affected by the size of the hierarchical classification tree. As a result, in Fig. 13, the processing times of the proposed computation method for the three extended measures are again almost the same and hard to distinguish, because it takes very little processing time for all three extended measures. From the results of this experiment, we can again confirm that the proposed computation method reduces the processing time significantly by using the pre-built segment tree.

Let us close this section by discussing some issues regarding the segment tree. First, the shape and size of the segment tree do not affect the performance of the proposed computation method, because we can obtain the shortest distance between two products almost immediately without traversing the whole segment tree, as we have empirically shown in this section. Second, if the segment tree were too big to fit in memory, it could lead to frequent disk accesses. However, because we need to store only three numbers for each node in the hierarchical classification tree to build a segment tree (i.e., node ID, depth, and visit order), the amount of memory required to store a segment tree even with 1,000,000 nodes is only about 12 MB if we use 4 bytes to store each number. Therefore, we can expect a segment tree to fit in memory in most cases.
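For concreteness, this memory estimate works out as follows:

$$1{,}000{,}000 \text{ nodes} \times 3 \text{ numbers per node} \times 4 \text{ bytes per number} = 12{,}000{,}000 \text{ bytes} \approx 12 \text{ MB}.$$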
5. Conclusion

In this paper, we have proposed new similarity measures for purchase histories that consider a hierarchical classification of products. The proposed measures consider not only the purchase order of products in purchase histories but also the similarity between the products themselves. Unlike the existing similarity measures, which measure the distance between two products only as 0 or 1 depending on whether they match or not, the proposed measures assign a value between 0 and 1 as the distance between two products, based on their distance in the hierarchical classification tree. We have extended three existing representative similarity measures for sequences, i.e., the Levenshtein distance, the Needleman-Wunsch similarity, and the DTW distance, to consider the hierarchical classification of products. Furthermore, we have also proposed an efficient computation method for the proposed measures. The proposed computation method uses a segment tree to quickly obtain the shortest distance between two products in the hierarchical classification tree, which significantly reduces the time to compute the distance between two products. Using the proposed computation method, the proposed measures can be evaluated very efficiently regardless of the size of the hierarchical classification tree.

In the experiments, we have evaluated whether the proposed measures quantify the similarity between purchase histories more effectively than the existing measures. The experimental results confirm that, by considering the hierarchical classification of products, the proposed measures are more effective than the existing measures in measuring the similarity between purchase histories. Also, through the experiments measuring the computation time of the proposed measures, we have shown that the proposed computation method reduces the computation time significantly, allowing the proposed measures to be used for a large number of purchase histories. Therefore, the proposed measures can be effectively used for various marketing strategies, such as customer segmentation, which identifies groups of customers with similar purchase histories; customer search, which finds customers with a similar purchase history; and product recommendation, which recommends products purchased by customers with similar purchase histories.

Biography

Yu-Jeong Yang
https://orcid.org/0000-0002-7155-5367
She received the B.S. degree from the Division of Computer Science, Sookmyung Women's University, Korea, in 2019. She is now an M.S. student in the Department of Computer Science at Sookmyung Women's University, Korea. Her current research interests include databases, data mining, and big data.

Biography

Ki Yong Lee
https://orcid.org/0000-0003-2318-671X
He received his B.S., M.S., and Ph.D. degrees in Computer Science from KAIST, Daejeon, Korea, in 1998, 2000, and 2006, respectively. From 2006 to 2008, he worked for Samsung Electronics, Suwon, Korea, as a senior engineer. From 2008 to 2010, he was a research assistant professor in the Department of Computer Science at KAIST, Daejeon, Korea. He joined the faculty of the Division of Computer Science at Sookmyung Women's University, Seoul, in 2010, where he is currently a professor. His research interests include database systems, data mining, big data, and data streams.

References
|