Alleviating Item-Side Cold-Start Problems in Recommender Systems Using Weak Supervision

Recently, recommender systems have been used in various fields. However, they are still plagued by many issues, including cold-start and sparsity problems. The cold-start problem occurs when a system is unable to make recommendations owing to a complete lack of information about certain items or users. This problem can exist on both the user side and the item side: user-side cold-start problems occur when new users access the systems, and item-side cold-start problems occur when new items are added to databases. In this study, we addressed the item-side cold-start problem using the concept of weak supervision. First, a new process for identifying feature-based representative reviewers in a rater group was designed. Then, we developed a method to predict the expected preferences for new items by combining content-based filtering with the preferences of representative users. Through extensive experiments, we first confirmed that, in comparison to existing methods, the proposed approach provided enhanced accuracy, evaluated by the mean absolute error (MAE) of the predicted average ratings. We then compared the proposed scheme with the collaborative filtering (CF) and neural collaborative filtering (NCF) approaches. The estimation by the proposed approach was 21% and 38% more accurate, in terms of MAE, than CF and NCF, respectively. In the future, the proposed framework can be applied in various recommender systems as a core function.


I. INTRODUCTION
Since the late twentieth century, the number of Internet users has increased rapidly; many of them interact through online social networks, effectively making the web community similar to actual society. Over the years, there have also been some important changes in online social networks. In the past, the general public mainly received information from the mass media; in recent times, they also receive information from opinion leaders who exert significant influence over others. Consequently, online articles posted by influential bloggers are often exploited for corporate marketing, generating unwanted data such as spam. For these reasons, web-based recommendation systems have been actively developed through various research efforts; such systems generally utilize the similarities between users or items [7], [8], or associative methods exploiting the similarities of both [9], [10]. These efforts have significantly improved recommendation accuracy. However, many unsolved issues remain in recommendation systems, including cold-start and sparsity problems [11], [12]. The cold-start problem occurs when the system is unable to make recommendations owing to a complete lack of information about certain items; the sparsity problem occurs when the database contains only a small amount of information [11]-[14], which affects the recommendation accuracy. The cold-start problem in recommender systems exists on both the user side and the item side [11]-[13], [15]. The user-side cold-start problem occurs when new users access the systems. As new users do not have registered preferences, they are excluded from recommendation processes and do not receive any recommendation results [15]-[17]. The item-side cold-start problem occurs when new items are added to databases. As these items are not registered as preferred by any users, they are excluded from recommendation processes and cannot be promoted to users.
In this study, we focused on addressing the item-side cold-start problem.

A. RELATED WORK
1) ITEM-SIDE COLD-START PROBLEMS IN RECOMMENDER SYSTEMS
Various research studies have analyzed the item-side cold-start problem; one of the best-known approaches is the use of item features such as category information [13], [18], [19]. Choi and Han [20] studied the prediction of preferences for new items using the opinions of representative users extracted from user rating networks. Gunawardana and Meek [21] addressed the item-side cold-start problem based on the relations between item attributes and proposed the use of the Boltzmann machine for recommendation. Using a movie database as a case study, Gantner and Griffiths [22] attempted to cluster new items using attribute data; this attribute-based clustering method mitigated the item-side cold-start problem. Anand et al. [23] proposed a metric method based on the user action history to alleviate the item-side cold-start problem. Sun et al. [24] clustered items using attribute data and preferences and created a decision tree that can be applied to both new and existing items and can predict preferences for new items. Feil et al. [25] examined different gamification patterns and their effect on the number of ratings provided by the average user, and reported a positive effect of instantiating gamification patterns. Volkovs et al. [26] combined content- and neighbor-based models to address the cold-start problem in recommender systems; the approach produced consistent results in actual testing. Wang et al. [27] deployed information from an ad platform in an online shopping domain in an attempt to build a cross-domain system.
Based on the users shared between the domains, Wang et al. attempted to resolve cold-start problems.

2) RECOMMENDER SYSTEMS BASED ON DEEP NEURAL NETWORKS
In many studies, supervised machine learning approaches have been used to address the cold-start problem in recommender systems [28]-[30]. However, this is difficult because recommending an item under cold-start conditions presents problems similar to processing unlabeled data. Owing to the recent popularity of deep learning, it has been applied to recommender systems in various studies [31]-[34]. A typical example is neural collaborative filtering (NCF), which utilizes a neural network as a collaborative filtering (CF) technique [35]. The approach learns not only the parameters of the model, as in traditional deep learning methods [31], [32], but also the embedding vectors, similar to the Word2vec technique [36]. Based on these embedding vectors, the NCF techniques derive prediction results through two methods: generalized matrix factorization (GMF) and multi-layer perceptron (MLP). The GMF predicts preferences from the element-wise product of the user and item vectors. The MLP, in contrast, concatenates the user and item vectors and passes them through a stack of fully connected layers, after which predictions are derived. Some studies have also focused on a neural matrix factorization (NeuMF) approach [35], which incorporates both the GMF and the MLP methods; here, a prediction is achieved by concatenating the last layers of the GMF and the MLP. The training of the model begins with the extraction of the user and item vectors. Each vector is initialized with random values and then learned from the training data. The resulting user and item vectors are used to predict results.
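To make the two prediction paths concrete, the following minimal NumPy sketch contrasts the GMF and MLP forward passes. The embedding size, layer widths, and random vectors are illustrative assumptions; a real NCF implementation learns the embeddings and weights with a deep learning framework, and NeuMF concatenates the last hidden layers of the two paths rather than summing scores as done here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative choice)

# Randomly initialized user/item embedding vectors; NCF learns these from data.
user_vec = rng.normal(size=d)
item_vec = rng.normal(size=d)

def gmf(u, v, h):
    """GMF path: element-wise product of embeddings, then a linear output layer h."""
    return float(h @ (u * v))

def mlp(u, v, weights):
    """MLP path: concatenate embeddings, then pass through fully connected layers."""
    x = np.concatenate([u, v])
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)  # ReLU hidden layers
    return float(weights[-1] @ x)

h = rng.normal(size=d)
weights = [rng.normal(size=(8, 2 * d)),  # hidden layer over the concatenation
           rng.normal(size=(4, 8)),      # second hidden layer
           rng.normal(size=4)]           # linear output layer

# Simplified fusion of the two paths (NeuMF instead concatenates their last layers).
score = gmf(user_vec, item_vec, h) + mlp(user_vec, item_vec, weights)
```

With trained embeddings, `score` would be passed through an output activation and compared against observed preferences during training.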
The advantage of this approach is that the MLP can learn higher-dimensional latent interactions, rather than the simple ones that the matrix factorization (MF) method captures by predicting results as the product of user and item vectors [37]. Recommender systems based on deep learning have also been applied to the YouTube platform [38]. The landing pages of YouTube are configured differently for each user account, i.e., a personalized recommender system is used to provide content. The technique used on YouTube consists of two modules: a candidate generation module and a ranking module. The candidate generation module extracts video candidates using the user profile information and video viewing history; it treats recommendation as a classification problem in which each video is one class, and can therefore handle extremely large multiclass classification. To improve learning efficiency, the module employs importance sampling. The ranking module then predicts the recommendation results for the candidate videos extracted by the candidate generation module, based on the user information and video viewing history. The prediction process is similar to the NCF because the ranking module predicts recommendation results through an MLP over user and item embeddings. These techniques have exhibited considerable performance improvements. Thus, deep learning-based research is currently very active in the field of recommender systems and has already found real applications.

B. CONTRIBUTIONS AND ORGANIZATIONS
Currently, most recommender systems apply CF approaches to predict user preferences. However, CF methods based on user preferences do not make use of weak supervision [39], [40], which has the advantage of reducing the time and cost of building training sets for unlabeled data by empirically generating training data from an external knowledge base [41]. Furthermore, weak supervision can be used when user preferences cannot be employed or when the recommender system needs to be constructed using information other than preferences [39], [42], [43]. Therefore, unlike previous works, this study addresses the item-side cold-start problem using the concept of weak supervision, which can be used in content-based recommendation systems such as news recommendation. This prediction can provide useful, previously inaccessible information on new items and minimize problems related to the exclusion of new items from the recommendation process. Our contributions can be summarized as follows.
• We proposed a novel weak supervision-based approach that exploits content-based filtering and the activities of representative users to predict the preferences for new items in cold-start situations. First, we employed the content-based filtering approach, which utilizes item features, such as category information, to recommend items. This study focuses only on the genre information as the item feature because its uncertainty (or entropy) is the smallest in the movie data [44], which means that it comparatively guarantees stable results. Second, we applied a method that identifies representative users for one of the item features; that is, we identified the representative users for each genre in this study. We conjectured that the average ratings for new items calculated from the choices of representative users enable more reliable predictions than the average of the ratings of other users.
• Through various tests, we showed that exploiting the ratings of the proposed representative users generates more precise results than other approaches. Furthermore, we found the optimal number of representative users through extensive simulations.
• We proved that the proposed approach outperforms the typical collaborative filtering-based approaches, CF and NCF. The proposed approach was 21% and 38% more accurate than CF and NCF, respectively, in terms of mean absolute error (MAE). This paper is organized as follows. The proposed algorithm is presented in Section II. The experiments and results are detailed in Section III. Section IV concludes the paper.

II. PROPOSED APPROACH
In this section, we address how to alleviate the item-side cold-start problem for new items within recommendation systems using the concept of representative reviewers within the rater groups. Toward that end, in this study, we employed a movie database provided by GroupLens, as shown in Table 1, which comprises 3,883 items and 6,040 users. The only item feature provided by the GroupLens movie database is the genre information. Each movie in the database belongs to at least one genre; e.g., the movie Toy Story falls under the animation, children's, and comedy genres. Table 2 shows all 18 genres of the database.
The proposed approach can be summarized in the three steps shown in Fig. 1: deriving genre preferences, identifying representative users, and generating the average preference for an item.

A. DERIVING EACH USER'S GENRE PREFERENCES FROM USER-ITEM RATINGS
1) SELECTION TENDENCY
We derived the genre preferences by utilizing the user ratings in the GroupLens database. To achieve this, each user's genre selection tendency was first calculated; subsequently, based on this, each user's genre preferences were derived. Fig. 2 shows an example of deriving the selection tendency using movie items. User A has selected eight movies, each of which belongs to a combination of genres. The first movie, Movie 1, is associated with Genres G1, G2, and G4; for example, G1 may denote animation, G2 children's films, and G4 comedy. Every movie on the list has its own genre combination. The selection tendency of User A was obtained by counting the number of times each genre appears on the list. For example, the selection tendency for Genre G1 is three because three movies (Movies 1, 4, and 6) fall under it. The selection tendency for Genre G2 is six, as Movies 1, 2, 4, 5, 6, and 7 fall under it. We repeated this counting for all the genres on the list and thereby organized the selection tendencies of User A, as illustrated in Table 3.
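The counting procedure above can be sketched in a few lines of Python. The movie-genre lists here are illustrative stand-ins mirroring the example (G2 appears in six of User A's movies and G1 in three), not the actual GroupLens records:

```python
from collections import Counter

# Genre combinations of the movies selected by User A (illustrative data).
selected_movies = {
    "Movie 1": ["G1", "G2", "G4"],
    "Movie 2": ["G2", "G3"],
    "Movie 3": ["G4"],
    "Movie 4": ["G1", "G2"],
    "Movie 5": ["G2", "G5"],
    "Movie 6": ["G1", "G2"],
    "Movie 7": ["G2", "G3"],
    "Movie 8": ["G5"],
}

def selection_tendency(movies):
    """Count how many of the selected movies fall under each genre."""
    counts = Counter()
    for genres in movies.values():
        counts.update(genres)
    return counts

tendency = selection_tendency(selected_movies)
# With this data: tendency["G1"] == 3 and tendency["G2"] == 6
```

Running this for every user yields one selection-tendency table per user, as in Table 3.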

2) GENRE PREFERENCE
After computing the selection tendency, we used it, together with the ratings for each movie, to determine the genre preferences. Fig. 3 illustrates the process: the selection tendency of User A, shown in Table 3, was examined to determine the preference for Genre G2. We first checked User A's rating for each movie and then assigned each movie's rating to the genres it falls under. In this example, the rating contributed to G2 by Movie 1 is 5. In the movie selection list in Fig. 3, Genre G2 appears a total of six times (Movies 1, 2, 4, 5, 6, and 7). Following the same procedure, we calculated the average rating for each genre over all the selected movies; this average is referred to as the genre preference.
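This averaging step can be sketched as follows; the movies, genre lists, and rating values are assumptions for illustration only:

```python
def genre_preferences(ratings, movie_genres):
    """Average a user's ratings over the movies that contain each genre."""
    totals, counts = {}, {}
    for movie, rating in ratings.items():
        for g in movie_genres[movie]:
            totals[g] = totals.get(g, 0.0) + rating
            counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

# Illustrative selections and ratings by one user (values assumed for the sketch).
movie_genres = {"Movie 1": ["G1", "G2"], "Movie 2": ["G2"], "Movie 3": ["G1"]}
ratings = {"Movie 1": 5, "Movie 2": 4, "Movie 3": 3}

prefs = genre_preferences(ratings, movie_genres)
# G1 averages the ratings of Movies 1 and 3: (5 + 3) / 2 = 4.0
# G2 averages the ratings of Movies 1 and 2: (5 + 4) / 2 = 4.5
```

Applied to all users, this produces one genre-preference vector per user, which later forms the rows of the user-genre matrix.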

B. IDENTIFYING REPRESENTATIVE USERS FOR EACH GENRE
We considered the users who were able to represent their group as representative users, i.e., the preferences of a few representative raters can represent those of all other raters. We used the average ratings to identify the representativeness of these users' ratings because average ratings are a convenient way to determine the views of the majority. Hence, many e-commerce web applications utilize average ratings to indicate the quality of items.
In their previous studies, Choi and Han [2], [20] extracted representative reviewers from the rater groups for movies or music using (1):

D(S) = (1/|A|) \sum_{i \in A} |R_S(i) - R_\mu(i)|,   (1)

where A is the set of items rated by a user S, |A| is the cardinality of A, R_S(i) is the rating of item i by user S, and R_\mu(i) is the average rating of item i in the database. The result of (1) is the average difference between the item ratings given by one user and the average ratings of those items. If a user's result for (1) is 0, the user's ratings coincide with the average ratings calculated from the other users' ratings. From this, we concluded that users whose results are close to 0 can represent the other users' ratings for the same items; hence, a small group of such users can represent the entire rater group. However, unlike (1), in this study we identified representative users for each genre using (2). We obtained a total of 18 representative groups, as the GroupLens database provides 18 genres. The representative users for the genres were identified as follows.
We first organized the genre preferences of each user into a user-genre matrix. In Fig. 4, the bottom table shows the user-genre matrix for n users and 18 genres; each row indicates the genre preferences of one user. After composing the user-genre matrix, we applied (2) to each genre:

D_i(S) = |P_S(i) - P_\mu(i)|,   (2)

where P_S(i) is the preference for genre G_i by user S and P_\mu(i) is the average preference for genre G_i over all users. We obtained a total of 18 result graphs after applying (2) to all the users in the database, as shown in Fig. 5. In the figure, the y-axis indicates the result of (2), and the x-axis represents the 6,040 users, in ascending order of their results. For each genre graph in Fig. 5, we selected the users with low scores for (2) as the representative users of that genre. The selection process, using a sample genre G1, is summarized in Fig. 6. The user-genre matrix is the same as in the bottom table of Fig. 4. We first calculated the average genre preference for Genre G1. Second, we calculated the difference between each individual user's genre preference and the average; this yields the result of (2) for Genre G1, and the scores were organized in ascending order, as illustrated in Fig. 6. Finally, we selected the users with low scores as representatives for Genre G1.
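The per-genre selection can be sketched as follows: score each user by the absolute difference between their genre preference and the average preference (in the spirit of (2)), sort ascending, and keep the k lowest scorers. The user-genre matrix entries and the choice of k here are illustrative assumptions:

```python
def representatives_for_genre(user_genre_prefs, genre, k):
    """Rank users by |P_S(i) - P_mu(i)| for one genre and return
    the k users whose preferences are closest to the average."""
    prefs = {u: p[genre] for u, p in user_genre_prefs.items() if genre in p}
    avg = sum(prefs.values()) / len(prefs)          # average genre preference
    scored = sorted(prefs, key=lambda u: abs(prefs[u] - avg))
    return scored[:k]

# Illustrative user-genre matrix entries for Genre G1.
user_genre_prefs = {
    "A": {"G1": 4.0},
    "B": {"G1": 3.9},
    "C": {"G1": 1.0},
    "D": {"G1": 5.0},
}

reps = representatives_for_genre(user_genre_prefs, "G1", k=2)
# Average is 3.475, so Users B and A are closest and are selected.
```

Repeating this for each of the 18 genres yields the 18 per-genre representative groups described above.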

C. GENERATING AVERAGE PREFERENCE FOR A NEW ITEM
Each new movie falls under at least one genre. Fig. 7 illustrates the entire process involved in predicting the average preference for a new item, which encompasses three steps. The first step is to extract the representative users for each genre associated with the new item. In Fig. 7, the genre combination of the new item is G1, G2, and G8; we therefore extracted the low scorers, i.e., the users with low scores based on (2), for Genres G1, G2, and G8. One important finding in this context was that each genre had different representative users. For example, in the user-genre matrix of Fig. 4, User A had a different level of preference for Genres G1 and G2, and the overall average preferences for the two genres also differed. This indicates that User A can be a representative user for Genre G1 but not for Genre G2. Thus, we had different representative users for each genre.
The second step is to extract the representative users' genre preferences and calculate the average preference for each genre of the new item. In the example, we extracted the genre preferences of the representative users for G1, G2, and G8. The preference for each genre varies because each genre has different representative users.
The final step is to compute the average preference for the new item. In the example, after calculating the average preference for each genre from the genre preferences of its representative users, we determined the average preference for the new item by dividing the sum of these per-genre averages by the number of genres of the new item, as shown in (3):

P(n) = (1/|GC_n|) \sum_{i \in GC_n} (1/|RU_i|) \sum_{S \in RU_i} P_S(i),   (3)

where GC_n is the genre combination of a new item n, RU_i is the set of representative users of the ith genre, S is one representative user of the ith genre, and P_S(i) is the genre preference extracted from the representative user S for the ith genre.
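The three steps can be sketched end-to-end as follows, mirroring the averaging in (3). The genre combination, representative sets, and preference values are illustrative assumptions:

```python
def predict_new_item_preference(genre_combo, representatives, user_genre_prefs):
    """Average, over the genres of the new item, of the mean genre
    preference of each genre's representative users (cf. (3))."""
    genre_means = []
    for g in genre_combo:
        prefs = [user_genre_prefs[u][g] for u in representatives[g]]
        genre_means.append(sum(prefs) / len(prefs))  # per-genre average
    return sum(genre_means) / len(genre_combo)       # divide by |GC_n|

# Illustrative inputs: a new item with two genres, two representatives each.
user_genre_prefs = {
    "A": {"G1": 4.0, "G2": 3.0},
    "B": {"G1": 3.0, "G2": 5.0},
}
representatives = {"G1": ["A", "B"], "G2": ["A", "B"]}

pred = predict_new_item_preference(["G1", "G2"], representatives, user_genre_prefs)
# G1 mean = 3.5, G2 mean = 4.0, so the predicted preference is 3.75
```

The returned value serves as the predicted average rating for the new item, obtained without any ratings for the item itself.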

III. EXPERIMENTS AND ANALYSIS
A. EXPERIMENT DESIGN
To test the proposed approach, we randomly selected 100 items from the total of 3,883 items and considered them as a set of new items. First, we evaluated the accuracy of the proposed approach by comparing the results obtained through our approach to the real average ratings. Our comparisons utilize the mean absolute error (MAE) [45], as shown in (4):

MAE = (1/m) \sum_{n \in PI} |R_n - \mu_n|,   (4)

where m is the number of test items, PI is the set of test items, and R_n and \mu_n are the average rating calculated by the proposed approach and the real average rating for item n, respectively. Fig. 8 illustrates how the test results are obtained for a test item. Here, we assume that an item I1, selected as the test item, has no user responses, i.e., no user-generated ratings. The example consists of three steps: (i) calculating the average rating for Item I1 from the ratings in the user-item matrix; (ii) generating the average preference for the test item with the proposed approach; and (iii) calculating the MAE between the average rating drawn from the user-item matrix and the average preference generated by the proposed approach.
Then, we evaluated the proposed approach using different groups of representative users. These tests enabled a more precise analysis of the number of representative users needed to predict the average preferences for new items. In this analysis, each group configuration has a different number of users per group, which is given by (5):

|Ug_n| = |U| / |Ug|,   (5)

where U is the set of all the users in the database, Ug is the set of groups, and |Ug_n| is the number of users in the nth group. For example, if |Ug| is 10, we created ten groups by dividing all the users; thus, each group had approximately 600 users, as the database comprises 6,040 users. Subsequently, we generated ten average preferences for new items, one for each group. Fig. 9 provides an example consisting of ten groups and the ten average preferences generated for the tests. We divided the users into groups based on their positions determined by (2); thus, in Fig. 9, we utilized the 18 graphs derived in the previous section, with ten groups of users for each genre, and generated the average preference for new items based on these groups. In a previous study [46], the top 10% and bottom 10% of users in the sorted results (the representative scores sorted in ascending order) showed distinct differences with regard to how well they represented their group. In this test, we created configurations of 600, 500, 400, 300, 200, 100, 50, 40, 30, and 20 groups, comprising approximately 10, 12, 15, 20, 30, 60, 120, 150, 200, and 300 users per group, respectively. Finally, we compared our method with the CF and NCF approaches to verify the accuracy of our results. For the CF approach, we utilized the user-based CF strategy [7], [10], which recommends items in two steps. The first step selects similar users using popular similarity measures such as the Pearson correlation coefficient or cosine similarity [8]; we used 0.5 as the threshold for both. After selecting the similar users, the CF approach predicts the user preferences for the test items.
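The MAE evaluation in (4) reduces to a few lines of code. The predicted and real average ratings below are made-up values for illustration, not the experimental data:

```python
def mae(predicted, actual):
    """Mean absolute error between predicted and real average ratings (cf. (4))."""
    return sum(abs(predicted[n] - actual[n]) for n in predicted) / len(predicted)

# Illustrative test items: predicted vs. real average ratings (assumed values).
predicted = {"item1": 3.8, "item2": 4.1, "item3": 2.9}
actual = {"item1": 4.0, "item2": 4.0, "item3": 3.5}

error = mae(predicted, actual)
# (0.2 + 0.1 + 0.6) / 3 = 0.3
```

In the experiments, `predicted` holds the average preferences generated by a given representative-user group and `actual` the real average ratings, so each group configuration yields one MAE value.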
For the NCF approach, we applied the stochastic gradient descent algorithm for optimization and set the learning rate to 0.01. We used five MLP layers, each with 64 dimension factors. We ran 100 epochs in each test and selected the lowest MAE as the result.
When learning from implicit data in NCFs [35], for example with Bayesian personalized ranking [47], the label for unseen data is considered 0. However, when the NCF derives its prediction results, including unseen data whose user preferences are treated as 0 can degrade the accuracy of the predictions. Therefore, we excluded negative sampling, which can produce distorted predictions, from the experiments.

B. RESULTS AND ANALYSIS
1) EVALUATION OF THE PROPOSED APPROACH BY GROUP CONFIGURATION
Fig. 10 shows the MAE graphs for each group configuration. In Fig. 10, the y-axis represents the MAE, and the x-axis represents the nth group. In each configuration, the MAE gradually increases along the x-axis. This indicates that a large difference was observed between the average preference generated by the high scorers and the real average rating. These test results confirm our assumption that the low scorers identified by (2) can be used to predict the average preference for new items more precisely. Fig. 11 shows the detailed MAE values derived for each group of users. Fig. 12 shows the MAE values for the top five and bottom five groups in each group configuration. As shown in Fig. 12, the following were experimentally verified: (i) the lowest MAE was found in the top group of the 50-group configuration; (ii) the MAE values in the 50-group configuration were lower than those of the other configurations, indicating that the most precise average ratings were obtained by selecting 120 users as the set of representative users; and (iii) the bottom groups were associated with higher MAE results than the top groups. The bottom groups, as opposed to the top groups, contain users whose evaluation scores for each item are not close to the average. This means that the predictions obtained through the users close to the average ratings were more accurate than the results obtained from the bottom groups. Thus, we can conclude that focusing on the representative users, as opposed to other users, leads to a more precise prediction of the average rating for a new item.
Fig. 13 shows the comparison of the CF, NCF, and proposed approaches. In our approach, we predicted group preferences by selecting the representative users. The CF approaches, on the other hand, predict personalized preferences rather than group preferences. Therefore, the proposed and compared approaches were evaluated with the number of groups and users as a criterion, as shown in Fig. 13, to ensure a fair comparison. For instance, if the proposed approach was evaluated on 600 groups, which indicates that 10 users formed the representative user group, we chose 10 users in the CF approach, i.e., 600 groups in the proposed approach match 10 users in the CF approach.

2) COMPARISONS OF PROPOSED ALGORITHM WITH THE CF AND NCF APPROACHES
When the proposed scheme was compared to the CF approach with the Pearson correlation coefficient, the two had similar MAE values; however, in the 40- and 50-group configurations (i.e., 150 and 120 users per group, respectively), the proposed approach provided more precise results. When the proposed scheme was compared to the CF approach with cosine similarity, the proposed approach yielded more accurate MAE values for all the group configurations. Moreover, the MAE values of the NCF were between 0.7 and 0.8, which means that the proposed approach also yielded more accurate MAE values than the NCF approach for all the group configurations. In general, the NCF learns using many more parameters than the proposed approach or the matrix factorization method; thus, it is clearly better at learning existing information. However, its predictions for cold-start items exhibit high deviation. That is, the recommendation accuracy of the NCF can be better in normal situations than under cold-start conditions, and its predictions were more accurate with pairwise learning than with regression-based learning.
These results indicate that the proposed approach was more effective than the conventional CF and NCF in determining the average ratings for new items. The estimation by the proposed approach was 21% and 38% more accurate than CF and NCF, respectively, in terms of MAE. In addition, the predictions of new items obtained by the proposed approach can be utilized, as a form of weak supervision, in deep learning-based approaches such as the NCF. By utilizing the proposed approach in a deep learning-based approach, the deviation of the results can be reduced with regard to the various learning parameters.

IV. CONCLUSION
In this study, we addressed the cold-start problem for new items in recommendation systems by proposing a new approach that can predict average ratings using weak supervision. First, we proposed a method to identify representative reviewers in the rater group. Average ratings can be predicted more precisely using the ratings of the proposed representative users than by using the average ratings of other users. Second, we applied content-based filtering to generate the average ratings of new items; this helped to determine the selection tendencies and category preferences of users. Finally, through extensive experiments, we showed that the proposed content-filtering-based representative users generated precise results for cold-item recommendation. We compared the proposed approach to the CF and NCF approaches; the results showed that the proposed approach was 21% and 38% more precise than the typical CF and NCF approaches, respectively. As future work, we will apply this approach to more diverse databases to analyze the results. We will also utilize MF to extract representative users from the user-genre matrix through a more diversified analysis.