A Simple and Effective Combination of User-Based and Item-Based Recommendation Methods

Se-Chang Oh* and Min Choi**

Abstract: User-based and item-based approaches have been developed as solutions to the movie recommendation problem. However, the user-based approach suffers from the problem of sparsity, and the item-based approach fails to reflect users' preferences. To solve these problems, a prior study combined the two methods using the concept of similarity. In practice, however, that method is not free from the problem of sparsity, since it has many parameters to be calculated. In this study, we propose a combining method that simplifies the combination equation of the prior study. This method is relatively free from the problem of sparsity, since it has fewer parameters to be calculated, and it can obtain more accurate results by reflecting users' ratings in calculating those parameters. It also predicts new movie ratings very quickly. In our experiments, the proposed method shows a large initial error, but its performance stabilizes quickly afterward. In addition, it showed an average error rate about 6% lower than the existing method based on similarity.

Keywords: Collaborative Filtering, Electronic Commerce, Recommender System, Sparsity

1. Introduction

Movie recommendation is a typical application of recommender systems. Solutions for recommender systems can be divided into user-based and item-based approaches. Hybrid approaches that combine the two have also been studied.

The item-based approach is also referred to as the content-based method. It recommends new items whose attributes are similar to those of the items the user highly prefers [1]. In this approach, items are classified or clustered by their features in order to find similar items. It has the advantage that it works relatively well even when data are insufficient. It assumes that a user's rating is based on the features representing the essential characteristics of items. However, the measure of similarity between items may differ from the user's actual rating behavior.

The user-based approach is also referred to as collaborative filtering. It recommends new items that are highly preferred by other users similar to the target user [2]. In this case, the taste of each user can be reflected sufficiently by using the users' preferences when calculating the similarity between users [3]. However, this approach has the problem of sparsity: when data are insufficient, the similarity between users cannot be calculated completely, making it difficult to ensure reliability. To alleviate this problem, demographic information has been used to classify users [4].

Mixed approaches attempt to take advantage of both methods by combining the two. One example uses the genre of a movie for the item-based part and the addresses of users for the user-based part [5]. However, the assumption that movies belonging to the same genre will receive similar ratings is weak. Furthermore, a demographic classification can hardly be considered a grouping of people with similar preferences. For these reasons, a more fundamental approach is needed. Such an approach can be found in [6], where the similarity between users and between items is computed from users' ratings.
Then the user-based and item-based filtering methods are combined using this similarity. In this method, however, the time required to make a prediction is long due to the complex formulas, and the problem of sparsity remains because too many parameters must be calculated. In this study, we propose a fast and highly accurate recommendation method that makes the complex and incomplete factors simple and clear in combining user-based and item-based filtering. In addition, this method is relatively free from the problem of sparsity because it minimizes the number of parameters to be calculated.

2. Related Work

This section introduces the movie recommendation method proposed in [6] and analyzes its advantages and disadvantages. The method calculates the similarity between users and between items using the Pearson correlation coefficient [7]. Then the set of similar users for each user and the set of similar items for each item are obtained from this similarity. Finally, the rating estimate is computed from the ratings of the users in the similar-user set and the ratings for the items in the similar-item set. The previously calculated similarities serve as weights when the two estimates are combined into the final result.

2.1 Calculation of Similarity

In this method, the similarity between users and between items is calculated from the ratings of users for items. The similarity Sim(a,b) between user a and user b in the user-based filtering is obtained by the following equation.
(1)
$$\mathit{Sim}(a,b) = \frac{\operatorname{Min}\left(\left|I_{a} \cap I_{b}\right|, \gamma\right)}{\gamma} \cdot \frac{\sum_{i \in I_{a} \cap I_{b}}\left(r_{a,i}-\overline{r}_{a}\right)\left(r_{b,i}-\overline{r}_{b}\right)}{\sqrt{\sum_{i \in I_{a} \cap I_{b}}\left(r_{a,i}-\overline{r}_{a}\right)^{2}} \cdot \sqrt{\sum_{i \in I_{a} \cap I_{b}}\left(r_{b,i}-\overline{r}_{b}\right)^{2}}}$$

In Eq. (1), $I_a$, $r_{a,i}$ and $\overline{r}_a$ denote the set of items that user a has evaluated, the rating of user a for item i, and the average rating of user a, respectively. The constant γ is a threshold for $|I_a \cap I_b|$; it prevents the similarity from being overestimated when the set of co-rated items is small. The equation measures how similarly the two users a and b rate the same items i relative to their own average ratings, using the Pearson correlation coefficient.
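As an illustration, the following Python sketch computes Eq. (1) over a users × items rating matrix in which np.nan marks unrated entries. The matrix layout, the function name, and the zero fallbacks for the degenerate cases are our assumptions, not details of [6].

```python
import numpy as np

def damped_pearson(ratings, a, b, gamma=30):
    """Damped Pearson similarity between rows a and b of `ratings`,
    following Eq. (1); np.nan marks missing ratings."""
    ra, rb = ratings[a], ratings[b]
    common = ~np.isnan(ra) & ~np.isnan(rb)   # I_a ∩ I_b: co-rated columns
    n = int(common.sum())
    if n == 0:
        return 0.0                           # no overlap: treat as no similarity (our fallback)
    da = ra[common] - np.nanmean(ra)         # deviations from a's own mean rating
    db = rb[common] - np.nanmean(rb)         # deviations from b's own mean rating
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum())
    if denom == 0.0:
        return 0.0                           # guard for the zero denominator discussed in Section 2.4
    return min(n, gamma) / gamma * (da * db).sum() / denom
```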
In the same way, the similarity Sim(i,j) between item i and item j in the item-based filtering is obtained by the following equation.

(2)
$$\mathit{Sim}(i,j) = \frac{\operatorname{Min}\left(\left|U_{i} \cap U_{j}\right|, \delta\right)}{\delta} \cdot \frac{\sum_{u \in U_{i} \cap U_{j}}\left(r_{u,i}-\overline{r}_{i}\right)\left(r_{u,j}-\overline{r}_{j}\right)}{\sqrt{\sum_{u \in U_{i} \cap U_{j}}\left(r_{u,i}-\overline{r}_{i}\right)^{2}} \cdot \sqrt{\sum_{u \in U_{i} \cap U_{j}}\left(r_{u,j}-\overline{r}_{j}\right)^{2}}}$$

In Eq. (2), $U_i$ and $\overline{r}_i$ denote the set of users who have evaluated item i and the average rating of item i, respectively. The constant δ is a threshold for $|U_i \cap U_j|$; it prevents the similarity from being overestimated when the set of common raters is small. The equation measures how similarly the two items i and j are rated by the same users u relative to their own average ratings, again using the Pearson correlation coefficient.
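Since Eq. (2) mirrors Eq. (1) with the roles of users and items swapped, the same helper can be reused on the transposed matrix; here ratings, i, j, and delta are assumed to be defined as above.

```python
# Rows of ratings.T are items, the overlap becomes the set of co-rating
# users, and delta plays the role of gamma from Eq. (1).
item_similarity = damped_pearson(ratings.T, i, j, gamma=delta)
```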
2.2 Selection of Similar Neighbors

Based on the similarity obtained above, we can define the set of users similar to a particular user and the set of items similar to a particular item. First, the set of users similar to a user u is defined as follows.

(3)
$$S(u) = \{a \mid \mathit{Sim}(a,u) > \eta,\ a \neq u\}$$

The constant η in Eq. (3) is the threshold on the similarity between users; it is the baseline for selecting the users similar to user u. In the same way, the set of items similar to an item m is defined as follows.

(4)
$$S(m) = \{i \mid \mathit{Sim}(i,m) > \theta,\ i \neq m\}$$
The constant θ in Eq. (4) is the threshold on the similarity between items; it is the baseline for selecting the items similar to item m.
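Continuing the sketch above, the neighbor sets of Eqs. (3) and (4) reduce to simple threshold filters over the similarity values; the strict inequality follows our reconstruction of the two set definitions.

```python
def similar_users(ratings, u, eta, gamma=30):
    """S(u) of Eq. (3): all other users whose similarity to u exceeds eta."""
    return [a for a in range(ratings.shape[0])
            if a != u and damped_pearson(ratings, u, a, gamma) > eta]

def similar_items(ratings, m, theta, delta=30):
    """S(m) of Eq. (4): all other items whose similarity to m exceeds theta."""
    return [j for j in range(ratings.shape[1])
            if j != m and damped_pearson(ratings.T, m, j, delta) > theta]
```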
2.3 Prediction for Rating

Using the sets of similar users and similar items, a user's rating for a new item can then be predicted as follows.

(5)
$$\hat{r}_{u,m}=\lambda \cdot\left(\overline{r}_{u}+\frac{\sum_{a \in S(u)} \mathit{Sim}(a,u) \cdot\left(r_{a,m}-\overline{r}_{a}\right)}{\sum_{a \in S(u)} \mathit{Sim}(a,u)}\right)+(1-\lambda) \cdot\left(\overline{r}_{m}+\frac{\sum_{i \in S(m)} \mathit{Sim}(i,m) \cdot\left(r_{u,i}-\overline{r}_{i}\right)}{\sum_{i \in S(m)} \mathit{Sim}(i,m)}\right)$$

In Eq. (5), the constant λ is the ratio for combining the ratings predicted by user-based filtering and by item-based filtering. In the user-based part, the prediction is the weighted average, over the users a in the similar group of user u, of the difference between user a's rating for item m and user a's average rating, added to $\overline{r}_u$; here Sim(a,u), the similarity between user a and user u, is the weight of the term for user a. In the item-based part, the prediction is the weighted average, over the items i in the similar group of item m, of the difference between user u's rating for item i and the average rating of item i, added to $\overline{r}_m$; here Sim(i,m), the similarity between item i and item m, is the weight of the term for item i.
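The following sketch makes Eq. (5) concrete, continuing the code above. The fallback to $\overline{r}_u$ or $\overline{r}_m$ when a neighbor set is empty reflects the behavior analyzed in Section 2.4 below; the argument layout and the assumption that S_u contains only raters of m (and S_m only items rated by u) are ours.

```python
def predict_eq5(ratings, u, m, S_u, S_m, sim_u, sim_m, lam=0.5):
    """Sketch of Eq. (5). sim_u maps a -> Sim(a,u); sim_m maps i -> Sim(i,m).
    Assumes eta, theta >= 0 so all weights are positive, and that
    ratings[a, m] and ratings[u, i] are defined for the given sets."""
    user_part = np.nanmean(ratings[u])        # \bar{r}_u
    if S_u:                                   # empty S(u) falls back to \bar{r}_u
        w = np.array([sim_u[a] for a in S_u])
        dev = np.array([ratings[a, m] - np.nanmean(ratings[a]) for a in S_u])
        user_part += (w * dev).sum() / w.sum()
    item_part = np.nanmean(ratings[:, m])     # \bar{r}_m
    if S_m:                                   # empty S(m) falls back to \bar{r}_m
        w = np.array([sim_m[i] for i in S_m])
        dev = np.array([ratings[u, i] - np.nanmean(ratings[:, i]) for i in S_m])
        item_part += (w * dev).sum() / w.sum()
    return lam * user_part + (1 - lam) * item_part
```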
2.4 Algorithm Analysis

According to the experiments in [6], this method appears to be more accurate than various relatively accurate movie recommendation methods such as similarity fusion [8], cluster-based smoothing [9], the aspect model [10], personality diagnosis [11], and the user-based Pearson correlation coefficient [12]. The algorithm is described in Fig. 1.

This method has two fundamental problems. First, a large number of parameters must be calculated each time: U² parameters for Sim(a,u), M² parameters for Sim(i,m), U parameters for $\overline{r}_a$ and M parameters for $\overline{r}_i$, where U is the number of users and M is the number of items. This leads to the problem of sparsity. There is also a problem in the equations themselves: the denominators of Eqs. (1) and (2) can be zero, in which case the similarity cannot be calculated. In fact, the denominators can be zero not only in the early stage, when there are not enough data, but also later on, whenever a new user or item is added. Moreover, even when the denominators are not zero, the sets S(u) and S(m) in Eqs. (3) and (4) frequently become empty because of low similarity. In these cases, only $\overline{r}_u$ and $\overline{r}_m$ remain as the basis for predicting the rating in Eq. (5), so the accuracy of prediction is degraded. In our experiments, 99.78% of the sets S(u) and 98.4% of the sets S(m) were empty when the threshold values of [6] were used unmodified. Second, several constants are used to keep the intermediate results of Eqs. (1), (2), (3), and (4) adequate, and another constant is used as the ratio for combining the final terms in Eq. (5). There is no absolute criterion for determining these constants, so they can only be determined experimentally and may be partially improper. As a result, the accuracy of the rating prediction is limited.

In [13], two methods were proposed to partially solve these problems. First, the constants γ and δ in Eqs. (1) and (2) were omitted; however, the similarity may then be overestimated in the beginning. Second, the user-based prediction was simplified to the weighted average of the ratings for item m given by the users a in the similar group of user u, with Sim(a,u) as the weight, and the item-based prediction was simplified in the same way. However, such a simplification seems to limit the achievable accuracy.

3. Proposed Method

In this paper, we propose a method that effectively combines the user-based and item-based methods for movie recommendation by solving the two problems of [6] analyzed above. The basic idea of this paper is briefly described in [14].

3.1 Solution to Problem of Sparsity

First, we need a way to replace Sim(a,u) and Sim(i,m), calculated in Eqs. (1) and (2), in order to solve the problem of sparsity. These values are used as weights in the rating prediction of Eq. (5): Sim(a,u) weights user a's rating in the user-based prediction, and Sim(i,m) weights the rating for item i in the item-based prediction. In many cases, the prediction result is distorted by an improper constant or by a zero denominator in Eqs. (1) and (2). Therefore, in this paper these weights are replaced by 1, which prevents incomplete data from being used for prediction.

The set S(u) in Eq. (3) potentially includes all users, because it is hard to distinguish the users similar to user u: S(u) includes every user when the threshold η is low, and becomes empty when η is high. Conceptually, however, if S(u) is defined as the set of users who are helpful for the user-based filtering, it can include exactly the users who evaluated item m. Therefore, it is appropriate to use $U_m$ instead of S(u). In the same way, S(m) in Eq. (4) can be defined as the set of items that are helpful for the item-based filtering, so $I_u$, the set of all items that have been evaluated by user u, is appropriate instead of S(m). Eventually, Eq. (5) is simplified into the following equation by replacing Sim(a,u) with 1, Sim(i,m) with 1, S(u) with $U_m$, and S(m) with $I_u$.
(6)
$$\hat{r}_{u,m} = \lambda \cdot \frac{\sum_{a \in U_{m}}\left(\overline{r}_{u}+r_{a,m}-\overline{r}_{a}\right)}{\left|U_{m}\right|} + (1-\lambda) \cdot \frac{\sum_{i \in I_{u}}\left(\overline{r}_{m}+r_{u,i}-\overline{r}_{i}\right)}{\left|I_{u}\right|}$$

3.2 Determination of the Combination Ratio

To combine the results of the user-based and item-based methods, the ratio must be determined from the reliability of the two results. A rational and simple way is to use the ratio of the amounts of data used to calculate each result, because a prediction is in general more reliable when it is based on more data. Therefore, the combination ratio is calculated by the following equation.
(7)
$$\lambda = \frac{\left|U_{m}\right|}{\left|U_{m}\right|+\left|I_{u}\right|}, \qquad 1-\lambda = \frac{\left|I_{u}\right|}{\left|U_{m}\right|+\left|I_{u}\right|}$$

For example, if 30 users have rated item m and user u has rated 10 items, then λ = 30/40 = 0.75, so the user-based prediction carries three times the weight of the item-based one. Applying this ratio, Eq. (6) can be rewritten as the following equation.
(8)
$$\hat{r}_{u,m} = \frac{\sum_{a \in U_{m}}\left(\overline{r}_{u}+r_{a,m}-\overline{r}_{a}\right)+\sum_{i \in I_{u}}\left(\overline{r}_{m}+r_{u,i}-\overline{r}_{i}\right)}{\left|U_{m}\right|+\left|I_{u}\right|}$$

Eq. (8) calculates the rating prediction in two ways. First, in the user-based part, the difference between the average ratings of user u and user a is taken, and user a's rating for item m is added to it; these values are averaged over all users a who have rated item m. Second, in the item-based part, the difference between the average ratings of item m and item i is taken, and user u's rating for item i is added to it; these values are averaged over all items i that user u has rated. Finally, the two predictions are combined in proportion to the amount of data used in each. Rating predictions can be calculated faster and more accurately with Eq. (8) than with Eq. (5).
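A minimal sketch of Eq. (8) follows, again assuming a users × items matrix with np.nan for missing entries; the cold-start fallbacks are our own choice, as the paper does not specify them.

```python
import numpy as np

def predict_eq8(ratings, u, m):
    """Sketch of Eq. (8): one pass over U_m (users who rated m) and
    I_u (items rated by u), with no similarity weights or tuned constants."""
    in_Um = np.where(~np.isnan(ratings[:, m]))[0]   # U_m
    in_Iu = np.where(~np.isnan(ratings[u]))[0]      # I_u
    r_bar_u = np.nanmean(ratings[u]) if in_Iu.size else 0.0      # cold-start fallback (ours)
    r_bar_m = np.nanmean(ratings[:, m]) if in_Um.size else 0.0   # cold-start fallback (ours)
    total = 0.0
    for a in in_Um:                                 # user-based terms of Eq. (8)
        total += r_bar_u + ratings[a, m] - np.nanmean(ratings[a])
    for i in in_Iu:                                 # item-based terms of Eq. (8)
        total += r_bar_m + ratings[u, i] - np.nanmean(ratings[:, i])
    n = in_Um.size + in_Iu.size
    return total / n if n else r_bar_u              # both sets empty: nothing to average
```

Note that λ never appears explicitly: dividing the single combined sum by |U_m| + |I_u| realizes the weights of Eq. (7) automatically, which is part of what makes the method both simple and fast.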
3.3 Algorithm Analysis

The algorithm of the proposed method is described in Fig. 2. It is compared with the existing algorithm of Fig. 1 as follows. First, steps 2–9 of the existing algorithm, whose time complexity is O(R*U*M*logR), are unnecessary in the new algorithm. Step 11 of the existing algorithm corresponds to step 3 of the new algorithm; the time complexity of this part is O(R*(U+M)). Step 10 of the existing algorithm corresponds to step 2 of the new algorithm, where the formula to be calculated changes from Eq. (5) to Eq. (8); the time complexity is O(R*(U+M)*logR) in both cases. Therefore, the overall time complexity of the new algorithm is O(R*(U+M)*logR), which is much lower than the O(R*U*M*logR) of the existing algorithm.

4. Experimental Result

4.1 Experimental Environment

The experimental data used in this study are the MovieLens 100K dataset [15], which contains 943 users, 1,682 movies, and 100,000 ratings. Each rating is a tuple (user ID, item ID, rating, time stamp). In [6], the dataset is divided into training and test data, and the performance of the trained model is measured on the test data. The experimental method of this paper differs: the tuples are read one by one from the beginning, and the rating of each newly read tuple is predicted using only the previously read tuples. This is because pre-training a model with sufficient data is not practical in a real movie recommendation service, where new users and new movies are added continuously. Therefore, the dataset was sorted by time stamp and the ratings were predicted in that order; that is, the prediction at time t is based on the data received from time 1 to time t-1.

4.2 Accuracy of Prediction

The performance measure used in the experiment is the mean absolute error (MAE), defined by the following equation.

(9)
$$\mathrm{MAE} = \frac{\sum_{u,i}\left|r_{u,i}-\hat{r}_{u,i}\right|}{N}$$

It is the average of the absolute differences between $r_{u,i}$ and $\hat{r}_{u,i}$, where $r_{u,i}$ is taken from each sample in the dataset, $\hat{r}_{u,i}$ is the prediction calculated by Eq. (8), and N is the number of predictions.

Fig. 3 compares the prediction accuracy of the method of [6] and the method proposed in this paper. The vertical axis is the MAE computed over the data from the beginning up to each point on the horizontal axis. The curve labeled 'former method' represents the method of [6], and the curve labeled 'proposed method' represents the method of this paper. In the graph, the MAE of the 'former method' continues to increase over time, whereas the MAE of the 'proposed method' stays at a constant level. The 'proposed method' curve also shows low prediction accuracy in the early stage, because the amount of data available for prediction is small; as experience accumulates, more accurate predictions become possible. Over the full set of 100,000 ratings, the 'proposed method' achieves an MAE 0.0494 lower than the 'former method', a reduction in prediction error of about 6%.

4.3 Efficiency of Prediction Algorithm

The efficiency of the prediction algorithm is no less important than its accuracy, because the algorithm may be applied to systems in which many users deal with many items, such as an Internet shopping mall. It is therefore important how fast the processing time grows as data accumulate over time. Fig. 4 compares the cumulative processing time of the 'former method' and the 'proposed method'. The processing time of the 'former method' increases rapidly as the amount of data grows, while that of the 'proposed method' increases very slowly. Processing all 100,000 ratings takes 472.496 seconds with the 'former method' and 26.142 seconds with the 'proposed method', about 1/18 of the former.
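The evaluation protocol of Section 4.1 can be sketched as a single pass over the time-ordered tuples, reusing predict_eq8 from above; the function name and the way the dimensions (943 × 1,682 for MovieLens 100K) are passed in are illustrative choices of ours.

```python
def evaluate_stream(tuples, n_users, n_items):
    """MAE of Eq. (9) under the protocol of Section 4.1: tuples are
    (user, item, rating, timestamp) sorted by timestamp, and each rating
    is predicted from the earlier tuples only, then revealed."""
    ratings = np.full((n_users, n_items), np.nan)
    abs_err = 0.0
    for u, m, r, _t in tuples:
        abs_err += abs(r - predict_eq8(ratings, u, m))  # predict at time t ...
        ratings[u, m] = r                               # ... then add the observation
    return abs_err / len(tuples)                        # Eq. (9) with N = len(tuples)
```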
5. Conclusion

In this study, we propose a greatly simplified method of combining the item-based and user-based methods used in movie recommendation. Owing to the small number of parameters to be calculated, the method is relatively free from the problem of sparsity, and it can predict new cases more accurately by using users' ratings to calculate the parameters. In the experiments, the performance of the proposed method stabilizes quickly even though it is inaccurate in the early stage, and its average error is about 6% lower than that of the former method based on similarity measures. In terms of processing speed, it is about 18 times faster than the former method when predicting 100,000 cases while learning each case, and the processing-time graph shows that this speed gap widens as more data accumulate. Our recommendation method can be applied not only to movies but to any type of item in e-Commerce. To apply a recommendation method to a typical e-Commerce system with a vast number of users and items, however, processing speed is as important as accuracy, and in this respect the method proposed in this paper has a clear advantage. Applying this method to various recommender systems remains as future work.

Biography

Se-Chang Oh
https://orcid.org/0000-0003-0899-7207
He received the M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology (KAIST) in 1990 and 1997, respectively. He worked at the LG Corporation Institute of Technology for four years and at Ajou University for three years. He is now a professor in the Department of Computer Software at Sejong Cyber University. His research interests include machine learning, pattern recognition, and data science.

Biography

Min Choi
https://orcid.org/0000-0002-9204-5665
He received the M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology (KAIST) in 2003 and 2009, respectively. From 2008 to 2010, he worked for Samsung Electronics as a senior engineer. Since 2011, he has been a faculty member of the Department of Information and Communication of Chungbuk National University. His current research interests include embedded systems, microarchitecture, and cloud computing.

References