## Yu-Jeong Yang and Ki Yong Lee

| Product in Group 1 | Actual product |
|---|---|
| Bread | King Arthur Flour, Pumpkin Bread + Muffin Mix, Gluten Free, 12 oz |
| Camera | Canon EOS 4000D DSLR Camera w/Canon EF-S 18–55 mm F/3.5–5.6 III Zoom Lens + Case + 32 GB SD Card (15pc Bundle) |
| Candy | SKITTLES, STARBURST & LIFE SAVERS Valentine's Day Candy Fun Size Variety Mix 22.7-Ounces 80 Pieces |
| Chocolate | Betty Crocker Hershey's Milk Chocolate Frosting, Gluten Free, 16 oz |
| Coke | Diet Coke Soda Soft Drink, 12 fl oz, 12 Pack |
| Computer | HP 21.5-Inch All-in-One Computer, AMD A4-9125, 4 GB RAM, 1 TB Hard Drive, Windows 10 (22-c0010, White) |
| Cookie | Pepperidge Farm, Classic Cookie Favorites, 13.25 oz |
| Sprite | Sprite Lemon Lime Soda Soft Drinks, 12 fl oz, 12 Pack |
| Tea | Ahmad Tea Twelve Teas Variety Gift Box, 60 Foil Enveloped Teabags |
| TV | TCL 32S325 32 Inch 720p Roku Smart LED TV (2019) |
| Water | Nestle Pure Life Purified Water, 8 fl oz. Plastic Bottles (24 Pack) |

Fig. 8 shows the distances between [TeX:] $$S_{1}, S_{2}, \ldots, S_{9}$$ measured by the Levenshtein distance and the proposed extended version. The original Levenshtein distance computes the distance between two sequences based only on how many products match exactly. Therefore, even if two sequences contain similar products in the same order, it considers the two sequences completely different. On the other hand, the proposed method measures the similarity between two purchase histories more accurately by considering whether they are composed of products with similar categories. For example, the original Levenshtein distance measures the distance between [TeX:] $$S_{1} \text { and } S_{2}$$ as 3 because they consist of completely different products. However, “Candy” and “Chocolate”, “Computer” and “Camera”, and “Sprite” and “Coke” are similar to each other, respectively, because they belong to the same low-level category. If we use the proposed measure, the distance between [TeX:] $$S_{1} \text { and } S_{2}$$ is measured as 0.75, which is more accurate. Note that, in Fig. 8(b), the purchase histories belonging to the same group have smaller distances (i.e., those in the dotted boxes) than other purchase histories.
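As an illustration, a taxonomy-aware substitution cost can be plugged into the standard Levenshtein recurrence. The sketch below is ours, not the paper's exact formulation: we assume the cost of substituting one product for another is their tree distance normalized by twice the tree height, so it falls in [0, 1], and we use a toy two-level taxonomy.

```python
def extended_levenshtein(s1, s2, cost):
    """Levenshtein DP where substitution costs cost(a, b) in [0, 1]."""
    m, n = len(s1), len(s2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1,  # insertion
                          d[i - 1][j - 1] + cost(s1[i - 1], s2[j - 1]))  # substitution
    return d[m][n]

# Toy taxonomy: child -> parent links.
parent = {"Candy": "Sweets", "Chocolate": "Sweets",
          "Coke": "Soda", "Sprite": "Soda",
          "Sweets": "Food", "Soda": "Food"}

def path_to_root(x):
    path = [x]
    while x in parent:
        x = parent[x]
        path.append(x)
    return path

def tree_dist(a, b):
    # Shortest path length through the lowest common ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    return min(pa.index(c) + pb.index(c) for c in set(pa) & set(pb))

H = 2  # height of the toy tree (an assumption of this sketch)
cost = lambda a, b: 0.0 if a == b else tree_dist(a, b) / (2 * H)

# The original Levenshtein distance would give 2 here; the extended version
# gives 1.0 because each pair of products shares a low-level category.
print(extended_levenshtein(["Candy", "Coke"], ["Chocolate", "Sprite"], cost))
```

With an exact-match cost (0 or 1), the function reduces to the original Levenshtein distance.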

Fig. 9 shows the similarities between [TeX:] $$S_{1}, S_{2}, \ldots, S_{9}$$ measured by the Needleman-Wunsch similarity and the proposed extended version. Like the Levenshtein distance, the Needleman-Wunsch similarity computes the similarity between two sequences based on how many elements are identical. Therefore, also in the case of the Needleman-Wunsch similarity, if two sequences consist of totally different products, they are considered completely different sequences, even if their products have similar categories. However, the proposed measure computes the similarity between two purchase histories more accurately by considering the hierarchical classification of products. In Fig. 9(b), we can see that the purchase histories belonging to the same group have higher similarities (i.e., those in the dotted boxes) than other purchase histories.
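A similar substitution of the scoring function extends the Needleman-Wunsch recurrence. In this hypothetical sketch (the concrete scores are ours, not the paper's), aligning two products scores 1.0 for an exact match, 0.5 for products in the same low-level category, and -1.0 otherwise, with a gap penalty of -1.0.

```python
def extended_needleman_wunsch(s1, s2, score, gap=-1.0):
    """Global alignment score with a category-aware scoring function."""
    m, n = len(s1), len(s2)
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i * gap
    for j in range(1, n + 1):
        f[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f[i][j] = max(f[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # align pair
                          f[i - 1][j] + gap,                              # gap in s2
                          f[i][j - 1] + gap)                              # gap in s1
    return f[m][n]

# Hypothetical same-category pairs for illustration.
same_cat = {frozenset(("Candy", "Chocolate")), frozenset(("Coke", "Sprite"))}
score = lambda a, b: 1.0 if a == b else (0.5 if frozenset((a, b)) in same_cat else -1.0)

# Identical sequences would score 2.0, totally unrelated ones -2.0;
# same-category products land in between.
print(extended_needleman_wunsch(["Candy", "Coke"], ["Chocolate", "Sprite"], score))
```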

Fig. 10 shows the distances between [TeX:] $$S_{1}, S_{2}, \ldots, S_{9}$$ measured by the DTW distance and the proposed extended version. The original DTW distance computes the distance between two purchase histories by adding 1 for each mismatched pair of products. Thus, the original DTW distance cannot differentiate products with similar categories from those with dissimilar categories. However, the proposed extended version of the DTW distance computes the distance between two purchase histories more effectively by considering the product taxonomy, so it determines that the purchase histories belonging to the same group are more similar than others. In Fig. 10(b), we can see that the purchase histories belonging to the same group have smaller distances (i.e., those in the dotted boxes) than other purchase histories.
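The same per-pair cost idea carries over to DTW. The sketch below accepts any cost function into [0, 1]; with the binary 0/1 cost it reduces to the original DTW distance described above, and the fractional cost shown is a hypothetical stand-in for a normalized tree distance.

```python
import math

def extended_dtw(s1, s2, cost):
    """DTW where each aligned pair contributes cost(a, b) in [0, 1]."""
    m, n = len(s1), len(s2)
    d = [[math.inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = cost(s1[i - 1], s2[j - 1]) + min(d[i - 1][j],      # warp s1
                                                       d[i][j - 1],      # warp s2
                                                       d[i - 1][j - 1])  # advance both
    return d[m][n]

binary_cost = lambda a, b: 0.0 if a == b else 1.0     # original DTW behavior
category_cost = lambda a, b: 0.0 if a == b else 0.25  # hypothetical: similar categories

print(extended_dtw(["Candy", "Coke"], ["Chocolate", "Sprite"], binary_cost))    # 2.0
print(extended_dtw(["Candy", "Coke"], ["Chocolate", "Sprite"], category_cost))  # 0.5
```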

From the results of the experiments, we can see that the proposed measures compute the distance (or similarity) between purchase histories more accurately than the existing similarity measures by considering the product taxonomy. In particular, the proposed measures assign a small distance to two purchase histories that consist of completely different products, as long as those products belong to similar categories.

Next, we evaluated how much the proposed computation method presented in Section 3.4.2 improves the efficiency of computing the proposed similarity measures compared to the naïve method presented in Section 3.4.1. In this experiment, we measured the time taken to compute the proposed similarity measures (i.e., the proposed extended versions of the Levenshtein distance, Needleman-Wunsch similarity, and DTW distance, respectively) for 100 pairs of purchase histories using the proposed computation method and the naïve method, respectively. For this experiment, we randomly generated synthetic purchase histories with different lengths and built a segment tree. When the number of products in the hierarchical classification tree was 10,000, the time taken to build the segment tree was 25 seconds on average. Considering that the segment tree is built only once and then reused for all subsequent similarity evaluations, this one-time cost is negligible.
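The segment-tree preprocessing can be sketched as follows, using the classical reduction of lowest-common-ancestor (LCA) queries to range-minimum queries over an Euler tour: the tree distance between two products is depth(a) + depth(b) - 2·depth(LCA(a, b)), and the LCA is the minimum-depth node on the Euler tour between the first visits of a and b, which a segment tree answers in O(log n) per query. The class and method names here are illustrative, not the paper's.

```python
import math

class TreeDistance:
    """Shortest distance between nodes of a classification tree via
    Euler tour + segment tree (range minimum by depth)."""

    def __init__(self, children, root):
        self.depth, self.first, self.euler = {}, {}, []
        self._euler_tour(children, root)
        # Segment tree storing (depth, node) minima over the Euler tour.
        self.size = 1
        while self.size < len(self.euler):
            self.size *= 2
        self.seg = [(math.inf, "")] * (2 * self.size)
        for i, node in enumerate(self.euler):
            self.seg[self.size + i] = (self.depth[node], node)
        for i in range(self.size - 1, 0, -1):
            self.seg[i] = min(self.seg[2 * i], self.seg[2 * i + 1])

    def _euler_tour(self, children, root):
        # Record every visit of every node (2n - 1 entries in total).
        stack = [(root, 0, iter(children.get(root, [])))]
        while stack:
            node, d, it = stack[-1]
            if node not in self.first:
                self.first[node] = len(self.euler)
                self.depth[node] = d
            self.euler.append(node)
            child = next(it, None)
            if child is None:
                stack.pop()
            else:
                stack.append((child, d + 1, iter(children.get(child, []))))

    def _min_depth(self, a, b):
        # Range-minimum query between the first visits of a and b.
        lo, hi = sorted((self.first[a], self.first[b]))
        lo += self.size
        hi += self.size + 1
        best = (math.inf, "")
        while lo < hi:
            if lo & 1:
                best = min(best, self.seg[lo]); lo += 1
            if hi & 1:
                hi -= 1; best = min(best, self.seg[hi])
            lo //= 2; hi //= 2
        return best  # (depth of LCA, LCA node)

    def dist(self, a, b):
        lca_depth, _ = self._min_depth(a, b)
        return self.depth[a] + self.depth[b] - 2 * lca_depth

# Toy classification tree for illustration.
children = {"Food": ["Sweets", "Soda"], "Sweets": ["Candy", "Chocolate"],
            "Soda": ["Coke", "Sprite"]}
td = TreeDistance(children, "Food")
print(td.dist("Candy", "Chocolate"))  # 2 (LCA is "Sweets")
print(td.dist("Candy", "Sprite"))     # 4 (LCA is "Food")
```

A sparse table over the same Euler tour would even give O(1) queries at the cost of O(n log n) preprocessing; either way, the query time does not depend on the tree's height.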

**Processing time by varying the average length of sequences**

Fig. 11 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively. In this experiment, we varied the average length of purchase histories from 5 to 25, and set the number of products in the hierarchical classification tree to 10,000 and its height to 12. In Fig. 11, we can see that the time taken to compute the proposed measures using the naïve method increases rapidly as the average length of purchase histories increases. This is because the naïve method must find the paths from the root node to leaf nodes in the hierarchical classification tree and compare the nodes on those paths repeatedly for each pair of products in the purchase histories. On the other hand, the proposed computation method reduces the processing time significantly by using a segment tree, built in advance, to quickly obtain the shortest distance between two products in the hierarchical classification tree. Note that the three proposed measures take almost the same time to compute, whether the proposed computation method or the naïve method is used, because all three measures follow similar computation processes. Note also that the processing times of the proposed computation method for the three extended measures are almost identical, so they are hard to distinguish in Fig. 11; this is because the proposed computation method obtains the shortest distance between two products almost immediately, so it takes very little processing time for all three extended measures.

**Processing time by varying the height of the classification tree**

Fig. 12 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively, by varying the height of the hierarchical classification tree from 5 to 17. In this experiment, we set the average length of purchase histories to 20 and the number of products in the hierarchical classification tree to 10,000. In Fig. 12, we can observe that the processing time of the naïve method increases slowly as the height of the hierarchical classification tree increases. This is because the length of the paths from the root node to leaf nodes in the hierarchical tree increases. In comparison, the proposed computation method shows an almost constant processing time regardless of the height of the product classification tree, because the time to obtain the shortest distance between two products using the segment tree is not affected by the height of the hierarchical classification tree. Therefore, we can confirm that the proposed computation method is far more efficient than the naïve method for computing the similarity between purchase histories. In Fig. 12, note also that the processing times of the proposed computation method for the three extended measures are almost the same, for the same reason as in Fig. 11.

**Processing time by varying the number of products in the classification tree**

Fig. 13 shows the time taken to compute the three proposed measures using the proposed computation method and the naïve method, respectively, by varying the number of products in the hierarchical classification tree from 7,000 to 25,000. In this experiment, we set the average length of purchase histories to 20 and the height of the hierarchical classification tree to 12. In Fig. 13, the processing time of the naïve method increases linearly as the number of products in the hierarchical classification tree increases. This is because, in the naïve method, the time taken to calculate the shortest distance between two products increases with the size of the hierarchical tree. In contrast, the processing time of the proposed computation method does not increase, because the time to obtain the shortest distance between two products using the segment tree is not affected by the size of the hierarchical classification tree. As a result, in Fig. 13, the processing times of the proposed computation method for the three extended measures are almost the same and hard to distinguish, because it takes very little processing time for all three extended measures. From the result of this experiment, we can again confirm that the proposed computation method reduces the processing time significantly by using the pre-built segment tree.

Let us close this section by discussing some issues of the segment tree. First, the shape and size of the segment tree do not affect the performance of the proposed computation method because we can obtain the shortest distance between two products almost immediately without traversing the segment tree, as we have empirically shown in this section. Second, if the segment tree is too big to fit in memory, it can lead to frequent disk accesses. However, because we need to store only three numbers for each node in the hierarchical classification tree to build a segment tree (i.e., node ID, depth, and visit order), the amount of memory required to store a segment tree even with 1,000,000 nodes is only about 12 MB, if we use 4 bytes to store a number. Therefore, we can expect a segment tree to fit into memory in most cases.
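The arithmetic behind the 12 MB figure can be checked directly (using decimal megabytes, i.e., 10^6 bytes):

```python
# Back-of-the-envelope check of the memory estimate: three stored numbers
# per node (node ID, depth, visit order), 4 bytes per number, and a
# hierarchical classification tree with 1,000,000 nodes.
nodes, numbers_per_node, bytes_per_number = 1_000_000, 3, 4
total_bytes = nodes * numbers_per_node * bytes_per_number
print(total_bytes / 10**6)  # 12.0 (MB)
```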

In this paper, we have proposed new similarity measures for purchase histories that consider a hierarchical classification of products. The proposed measures consider not only the purchase order of products in purchase histories but also the similarity between the products themselves. Unlike the existing similarity measures, which measure the distance between two products only as 0 or 1 depending on whether they match, the proposed measures assign a value between 0 and 1 as the distance between two products, based on their distance in the hierarchical classification tree. We have extended three representative existing similarity measures for sequences, i.e., the Levenshtein distance, Needleman-Wunsch similarity, and DTW distance, to consider the hierarchical classification of products.

Furthermore, we have also proposed an efficient computation method to compute the proposed measures. The proposed computation method uses a segment tree to quickly obtain the shortest distance between two products in the hierarchical classification tree. As a result, the proposed computation method reduces the time to compute the distance between two products in the hierarchical classification tree significantly. Using the proposed computation method, the proposed measures can be evaluated very efficiently regardless of the size of the hierarchical classification tree.

In the experiments, we have evaluated whether the proposed measures measure the similarity between purchase histories more effectively than the existing measures. As a result of the experiments, we have confirmed that the proposed measures are more effective than the existing measures in measuring the similarity between purchase histories by considering the hierarchical classification of products. Also, through the experiments measuring the time taken to compute the proposed measures, we have shown that the proposed computation method can reduce the computation time significantly, allowing the proposed measures to be used for a large number of purchase histories.

Therefore, the proposed measures can be effectively used for various marketing strategies, such as customer segmentation, which identifies groups of customers with similar purchase histories; customer search, which finds customers with similar purchase histories; and product recommendation, which recommends products purchased by customers with similar purchase histories.

She received the B.S. degree from the Division of Computer Science, Sookmyung Women’s University, Korea, in 2019. She is now an M.S. student in the Department of Computer Science at Sookmyung Women’s University, Korea. Her current research interests include databases, data mining, and big data.

He received his B.S., M.S., and Ph.D. degrees in Computer Science from KAIST, Daejeon, Korea, in 1998, 2000, and 2006, respectively. From 2006 to 2008, he worked for Samsung Electronics, Suwon, Korea, as a senior engineer. From 2008 to 2010, he was a research assistant professor in the Department of Computer Science at KAIST, Daejeon, Korea. He joined the faculty of the Division of Computer Science at Sookmyung Women’s University, Seoul, in 2010, where he is currently a professor. His research interests include database systems, data mining, big data, and data streams.
