An Improved Dynamic Collaborative Filtering Algorithm Based on LDA

Currently available collaborative filtering (CF) algorithms typically generate recommendations from user behavior data. The similarity between users is mostly computed from ratings alone, without considering explicit user attributes from profiles, which are difficult to obtain, or the evolution of user preferences over time. This paper proposes a collaborative filtering algorithm named hybrid dynamic collaborative filtering (HDCF), which is based on the topic model. Considering that a user's evaluation of an item changes over time, we add a time-decay function to the topic model and give its variational inference model. For the collaborative filtering score, we generate a hybrid score for the similarity calculation with the topic model. The experimental results show that this algorithm has better performance than currently available algorithms on the MovieLens, Netflix and last.fm datasets.


I. INTRODUCTION
With the advent of the Internet, the availability of information continues to rapidly increase. This abundance makes it challenging for users to effectively find the information they need in the high volume of data they can access. To address this problem, recommendation systems have been proposed to help users quickly obtain useful information depending on their past preferences or other sources.
Most established recommendation algorithms start with finding a set of customers. This set of customers has purchased and rated products that overlap with the products that the current user has purchased and rated. The algorithm aggregates the products from these similar customers, excludes the products that the user has purchased or rated, and recommends the remaining products to the user.
The purpose of collaborative filtering (CF) is to suggest new items or to predict the utility of a certain item for a particular user based on the user's previous preferences and the opinions of other like-minded users. Researchers have proposed many CF algorithms that can be divided into two main categories: user-based and item-based. In this research, we focus on user-based methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Ting Li.

In recent years, an increasing number of studies have used topic models and review texts to generate recommendations. In a recommendation algorithm, finding the nearest neighbors of the target user requires measuring the similarity between users. However, the numbers of users and items in a recommendation system are very large, users often rate only a small number of items, and the rating data are extremely sparse. As a result, the nearest neighbor set obtained by traditional similarity measures is not accurate enough, which reduces the recommendation quality of the algorithm. In addition, the time factor is important context information in recommendation systems: users' interests vary greatly across different periods, and the topics in a document collection evolve over time. We compare the inaugural addresses of two presidents from different periods and present them as word clouds in Figure 1; the higher the word frequency, the larger the displayed size. Although both are inaugural addresses, they use different words, and their high-frequency words also differ considerably. Therefore, when the time span of a corpus is large, the topic model clearly needs to consider the time factor and the dynamic change of language. However, most methods ignore the time factor. We therefore propose a hybrid dynamic collaborative filtering algorithm (HDCF) that captures the evolution of topics within a collaborative filtering algorithm. Our contributions address two problems. (1) The traditional CF algorithm assigns the same weight to all items of interest, ignoring the influence of time on user interest when calculating the similarity. In fact, a user's attention to an item changes over time, and this attention affects the user's interest; the algorithm should therefore consider the time dimension.
To address this, the paper introduces a time-decay function into the LDA model that assigns different weights to items according to the time at which users viewed them. The corpus is divided into time slices according to the time attribute of the documents; the current topic-word distribution is determined by the distribution at the previous time, the time-decay rate and the word weights. We also give the approximate variational posterior of this model. (2) In the similarity computation, we calculate a hybrid user similarity score that integrates the topic model with traditional similarity measures.
In this way, our approach differs from established recommendation algorithms and improves the recommendation quality.
This paper is organized as follows. Section II describes related work, including CF algorithms and topic models in the literature. Section III describes the proposed algorithm, and Section IV presents the results of applying this algorithm to different datasets. In Section V, we conclude and discuss further research directions.

II. RELATED WORK
Recommender systems are often based on CF, which relies only on past user behavior, such as previous transactions or product ratings, and does not require the creation of explicit profiles. Notably, CF techniques require no domain knowledge and avoid the need for extensive data collection. In addition, relying directly on user behavior allows the discovery of complex and unexpected patterns that would be difficult or impossible to profile using known data attributes. As a consequence, CF has attracted much attention in the past decade, resulting in significant progress and adoption by some successful commercial systems.
As research on recommendation systems has deepened, researchers have introduced the context environment into recommendation systems, such as time, location, mood, activity state, and network condition. Traditional CF techniques cannot track users' preferences over a period of time [1]-[9]. Therefore, temporal dynamics emerged in recommendation systems. Considering the time interval between purchases, Wang et al. proposed an opportunity model that determines the items to be recommended and the best time to recommend a specific product [10]. Mustansar Ali Ghazanfar et al. presented a novel structure learning technique called the kernel-mapping recommender (KMR) to make reliable recommendations under sparse, cold-start, and long-tail scenarios. In 2011, they presented a fast incremental algorithm for building the model, and in 2012, they constructed user-based and item-based KMR, combining the two and incorporating kernels from feature information [11], [12]. Cheng JiuJun proposed a user spatiotemporal behavior pattern method combining mobile personalized attributes and context information; considering the popularity of users' interests, the popularity of subjects and the impact of users, they built a model based on users' interests [13]. Nour El Islem Karabadji considered the recommendation range when selecting the appropriate group, which helped to make accurate and diverse recommendations simultaneously [14]. Niu Z proposed a new knowledge-based topic model that combined the Dirichlet tree and integrated must-links into topic modeling for object discovery. In particular, to better handle the polysemy of the visual vocabulary, a must-link was redefined so that it constrained only one or more specific topics rather than all topics, significantly improving topic coherence [15]. Zhang Xiong proposed a method that further improved recommendation accuracy: within the semantic environment of a document, entities of different topics are present.
Another entity appearing in the same document at the same time was used to help disambiguate the referenced content to a certain extent [16]. Zhang and Liu presented a cross-domain recommender system based on kernel-induced knowledge transfer. This method effectively transfers knowledge through overlapping entities and alleviates data sparsity issues [17], [18]. Das J proposed a CF method based on clustering, which used two kinds of hierarchical space-partitioning data structures: the k-d tree and the quadtree. They clustered or partitioned the user space according to the users' locations and then used the resulting clusters to predict the target user's score [19]. Zhang, Pengfei, et al. proposed decomposing the rating matrix into two nonnegative matrices and then integrating time weighting into the evaluation matrix of the collaborative filtering algorithm [20]. Yu X et al. proposed two auxiliary domains, i.e., a user-side domain and an item-side domain, to solve the sparsity problem. In this method, the user side and item side can not only share information but also infer domain-independent user and item features. They also proposed another cross-domain collaborative filtering algorithm to alleviate the sparsity problem: it first formulates the recommendation problem as a classification problem in the target domain, then uses Funk-SVD decomposition to extract extra user and item features, and finally uses the C4.5 decision tree algorithm to predict missing ratings [21], [22]. Wang combined LDA and a listwise model to generate collaborative filtering results [23]. Zhang used a multichannel feature vector to calculate the similarity between items [24].

III. PROPOSED APPROACH
The basic idea of CF algorithms is to provide item recommendations or predictions based on the opinions of other like-minded users. The opinions of users can be obtained explicitly from the users or by using implicit measures. In a typical CF scenario, there is a list of m users U = {u_1, u_2, ..., u_m} and a list of n items I = {i_1, i_2, ..., i_n}. Each user u_i has a list of items r_{u_i} about which the user has expressed an opinion. CF algorithms represent the entire m×n user-item data as a ratings matrix. Each entry a_{i,j} represents the preference score (rating) of the ith user for the jth item. Each individual rating is within a numerical scale; the value zero indicates that the user has not yet rated the item.
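As a minimal sketch with hypothetical toy data, the m×n ratings matrix described above can be built from (user, item, rating) triples, with zero marking unrated entries:

```python
import numpy as np

# Hypothetical toy ratings: (user_index, item_index, rating) triples.
ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1), (2, 2, 2)]

m, n = 3, 3               # m users, n items
R = np.zeros((m, n))      # zero means "not yet rated"
for u, i, r in ratings:
    R[u, i] = r

print(R)
```

Each row of `R` is then one user's rating vector, which the similarity computations below operate on.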
A key step in CF algorithms is to calculate the similarity between users. This is typically based on a cosine or correlation measure. However, such measures cannot find deeper relations between words. A topic model is another option. In this section, we first introduce the basic idea of LDA, then introduce our proposed method and give its variational inference model, then introduce the similarity computation of the HDCF model, and finally the collaborative filtering prediction model.

A. UNDERLYING EXISTING LDA
LDA is an unsupervised machine learning technique that can be used to identify potentially hidden topic information in a large-scale corpus. This method assumes that each word is extracted from a hidden topic.
The LDA model can be expressed as a probabilistic graphical model, as shown in Figure 2. A shaded circle in the figure represents an observed variable, an unshaded circle represents a latent variable, an arrow represents the conditional dependency between two variables, and a box represents replicated sampling, with the number of replicates in the lower right corner of the box.
The LDA assumes that the prior distribution of document topics is the Dirichlet distribution. That is, for each document d, its topic distribution is

θ_d ∼ Dir(α),

where α is the vector of proportion parameters. The LDA likewise assumes that the prior word distribution of a topic is a Dirichlet distribution. That is, for each topic k, its word distribution is

ϕ_k ∼ Dir(β),

where β is the vector of topic parameters. For each word W_{d,n} in document d, its topic number Z_{d,n} is drawn from the topic distribution θ_d:

Z_{d,n} ∼ Multinomial(θ_d).

Given this topic number, the word W_{d,n} is drawn as

W_{d,n} ∼ Multinomial(ϕ_{Z_{d,n}}).

A k-dimensional Dirichlet random variable θ takes values in the (k − 1)-simplex and has the following probability density on this simplex:

p(θ | α) = (Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i)) θ_1^{α_1 − 1} · · · θ_k^{α_k − 1},

where the parameters α_1, α_2, ..., α_k > 0 and Γ(x) is the gamma function. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of topics z and a set of N words w is

p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β).

Integrating over θ and summing over z, the marginal distribution of a document is

p(w | α, β) = ∫ p(θ | α) Π_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) dθ.

Taking the product of the marginal probabilities of single documents, the probability of a corpus D of M documents is

p(D | α, β) = Π_{d=1}^{M} p(w_d | α, β).

Each topic z corresponds to a distribution over items, and each user has a distribution θ over the latent topics. The approach for rating items can be described as follows: a user chooses a topic based on their interests and then chooses an item according to the distribution of that topic. More specifically, the generative process is:
1. For each document d, choose θ ∼ Dir(α).
2. For each word w_n of document d:
(a) Choose a topic z_n ∼ Multinomial(θ).
(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
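The two-step generative process above can be sketched in a few lines; the topic count K, vocabulary size V and hyperparameter values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8            # topics and vocabulary size (toy values)
alpha, beta = 0.1, 0.1

# Topic-word distributions: phi_k ~ Dir(beta) for each topic k.
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(n_words):
    theta = rng.dirichlet(np.full(K, alpha))   # 1. theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # 2a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])            # 2b. w_n ~ p(w | z, beta)
        words.append(w)
    return words

doc = generate_document(20)
print(doc)
```

Inference (fitting θ and ϕ from observed documents) runs this process in reverse, which is what the variational method in Section III-C approximates.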

B. GENERATING THE HDCF ALGORITHM
Most current topic models are based on the assumption that documents in a corpus are not sequential; in other words, documents in a corpus are interchangeable. In fact, this simplified assumption is inappropriate and inconsistent with the actual situation. 1) A large number of corpora, such as scientific and technological literature databases and news databases, are temporal corpora, and the documents in them have the time attribute. Some specific text information can only appear in a certain time period. In addition, many corpora span hundreds of years; for these corpora, it is obviously inappropriate to ignore the time order attribute. 2) Language changes over time, and the topic is bound to evolve over time.
The traditional CF algorithm assigns the same weight to all items of interest, ignoring the influence of time on user interest when calculating the similarity. This is obviously unreasonable, because a user's attention to an item changes with the time the item has been retained, and this attention affects the user's interest. Therefore, the algorithm should consider the time dimension to improve the accuracy of the recommendation prediction. Generally, recently followed items with high user attention reflect users' recent interests and hobbies, so users have a high degree of interest in such items, whereas items followed long ago receive a low degree of interest.
To address the above problems, this paper introduces the HDCF model on the basis of reference [25] and gives different weights to items according to the time at which users viewed them. Based on the weight of the time attribute, changes in users' interests can be reflected, realizing a time-based similarity calculation of interests. The shorter the time since a user consumed an item, the higher the corresponding interest and thus the higher its weight, and vice versa. Our model differs from reference [25] as follows. The HDCF algorithm is divided into four phases, as shown in Figure 3. The method not only uses a time-decay function but also gives the derivation of the variational inference in phase 2. In phase 3, the nearest neighbor set is used instead of the whole dataset to reduce data sparsity. To verify the reliability and accuracy of the algorithm, the number of baselines is increased in the experiments, and the standard deviations of the experimental results (with error bars) are given.
According to the time attribute of the documents, the corpus is divided into T time slices. Time slice 1 is the oldest, time slice T is the latest, and the subsets of documents on the time slices are D = {D_1, D_2, ..., D_T}. We introduce a weight r (r > 0) for the time attribute. Let the last browsing time of a topic be t_i and the current time be t. The distribution of the current topic words is then determined by the distribution at previous times, the time-decay rate and the word weights: the topic-word distribution of the previous slice is combined with the word frequencies of the current slice, where W(t) is the number of words on time slice t and r(t) is the time-decay rate. Additionally, the distribution of topics is affected by the time-decay rate.
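The paper's exact decay formula is not reproduced here; as one hedged illustration, an exponential decay with rate r assigns weights close to 1 to recently browsed topics and weights near 0 to old ones:

```python
import math

def time_decay(t_i, t, r=0.5):
    """Exponential time-decay weight for a topic last browsed at time t_i,
    evaluated at current time t. r > 0 is the decay rate (an assumed form;
    the paper's exact function may differ)."""
    return math.exp(-r * (t - t_i))

print(time_decay(9, 10))   # recent browse: weight close to 1
print(time_decay(0, 10))   # old browse: weight close to 0
```

Any monotonically decreasing function of (t − t_i) would serve the same role; the exponential is a common default because it is bounded in (0, 1] and composes multiplicatively across slices.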
Thus, the dynamic topic model is established as shown in Figure 4. Under the LDA model, each document on a time slice corresponds to a ϕ_t and a θ_{d,t}. For our HDCF method, the generation process of the documents on time slice t is as follows:

C. VARIATIONAL INFERENCE
According to Bayes' rule, the posterior probability of the documents at time slice t is proportional to the joint distribution of the latent variables and the observations. Because of the coupling between θ, z and β, this posterior cannot be computed directly, so we introduce variational inference. We assume that the latent variable θ_{d,t} is governed by an independent variational distribution with parameter γ_t and that the latent variable Z_{d,n,t} is governed by an independent variational distribution φ_{d,t}. In reference [25], the evolution of the topic-word probabilities with Gaussian noise is expressed as β_t | β_{t−1} ∼ N(β_{t−1}, σ²I). With Gaussian ''variational observations'', we denote β̂_{k,1}, ..., β̂_{k,T} as the variational parameters for β_{k,1}, ..., β_{k,T} that retain the sequential structure of the topic. Thus, using the standard mean-field approximation, the approximate variational posterior is

q(β, θ, z) = Π_{k=1}^{K} q(β_{k,1}, ..., β_{k,T} | β̂_{k,1}, ..., β̂_{k,T}) × Π_{d} ( q(θ_d | γ_d) Π_{n} q(z_{d,n} | φ_{d,n}) ).

The remaining iterative updates of our model are very similar to those of traditional LDA, but the two differ in an important way: in our method, the distribution of topics depends on the degree of time decay, which links the iteration and update process on each document to the previous time slice.

D. SIMILARITY COMPUTATION OF HDCF MODEL
An important step in CF algorithms is to calculate the similarity between users. This result is used to establish a proximity-based neighborhood between the target user and a number of similar users. The main goal of neighborhood formation is to find, for each user u, an ordered list of l users N = {n_1, n_2, ..., n_l} such that sim(u, n_1) is the maximum, sim(u, n_2) is the next largest, and so on. The proximity between two users is usually measured by a cosine-based similarity. However, both the cosine and correlation approaches treat the data as a bag of words and cannot capture the semantic relationships between words.
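The neighborhood formation step above can be sketched as follows, using cosine similarity on rating rows (toy matrix; helper names are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors; 0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def top_l_neighbors(R, u, l):
    """Indices of the l users most similar to user u, ordered by similarity."""
    sims = [(v, cosine_sim(R[u], R[v])) for v in range(len(R)) if v != u]
    sims.sort(key=lambda x: x[1], reverse=True)
    return [v for v, _ in sims[:l]]

R = np.array([[5, 3, 0], [5, 3, 1], [1, 0, 5]], dtype=float)
print(top_l_neighbors(R, 0, 2))   # user 1 ranks above user 2 for user 0
```

The ordered list returned here is exactly the N = {n_1, ..., n_l} used later in the prediction step.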
In our method, we first define the matching degree δ_t between document d and a topic over word w_n on time slice t, where Z_{d,n,t} is the distribution of the word over topics and W_t is the number of words on time slice t.
Then, the rating r_{ij} of document d_i for word w_j in the CF algorithm is modified accordingly. We regard the topic distribution as the rating in the document-word matrix and use the KL (Kullback-Leibler) divergence to evaluate the similarity of users.
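The KL-divergence-based user similarity can be sketched as follows; mapping the divergence to a (0, 1] similarity via 1/(1 + KL) is an assumption of this sketch, since the paper's exact mapping is not spelled out here:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def topic_similarity(p, q):
    # Assumed mapping: identical distributions give 1, divergent ones approach 0.
    return 1.0 / (1.0 + kl_divergence(p, q))

p = [0.7, 0.2, 0.1]
print(topic_similarity(p, p))              # identical topic profiles
print(topic_similarity(p, [0.1, 0.2, 0.7]))  # dissimilar topic profiles
```

Note that KL divergence is asymmetric; a symmetrized variant (averaging both directions) is another common choice when a true similarity is needed.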
Considering the advantages of traditional similarity and topic similarity, we use a hybrid similarity for words in this paper.
where the similarity sim^c_{i,j} is calculated by the cosine measure, the adjusted cosine similarity sim^{ac}_{i,j} improves on it by subtracting the user's average score, the correlation-based similarity sim^p_{i,j} measures similarity with the Pearson correlation coefficient, r̄_i is the average of the ratings r_i, and λ is an adjusting parameter.
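One plausible reading of the hybrid score, assuming a convex combination controlled by the single parameter λ (the exact combination rule is what Equation (16) defines in the original):

```python
def hybrid_similarity(sim_traditional, sim_topic, lam=0.5):
    """Hybrid user similarity: a convex combination of a traditional similarity
    (cosine / adjusted cosine / Pearson) and the topic-model similarity.
    The convex form is an assumption consistent with a single parameter lambda."""
    return lam * sim_topic + (1.0 - lam) * sim_traditional

# lam = 0 falls back to the traditional score; lam = 1 uses only the topic score.
print(hybrid_similarity(0.8, 0.4, lam=0.5))
```

With this form, the parameter sweep over λ in Section IV simply trades off how much weight the semantic (topic) signal gets against the rating-based signal.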

E. PREDICTION COMPUTATION OF HDCF MODEL
Another important step in the CF algorithm is to generate recommendations based on predictions. After calculating the similarities, the ratings of unrated items can be predicted for each user.
To solve the problem of data sparsity, we use the target user's nearest neighbors for prediction [27]. Let NBS_u be the nearest neighbor set of user u; the prediction r̂_{u,i} of user u for item i is obtained from the ratings of the users in NBS_u:

r̂_{u,i} = r̄_u + Σ_{n∈NBS_u} sim(u, n)(r_{n,i} − r̄_n) / Σ_{n∈NBS_u} |sim(u, n)|,
where sim(u,n) is the similarity between user u and user n, r_{n,i} is the rating of user n for item i, and r̄_u and r̄_n are the average ratings of user u and user n, respectively. This method calculates the prediction for user u on item i from the ratings of similar users. Regarding the influence of the time context, a rating the user made long ago should be less influential than a rating made recently. To accomplish this, a time-decay function f_{u,i} of the active user u at time t_i is used in the prediction computation.
In this method, the prediction for user u on item i is calculated from the weighted average of the neighbors' ratings, with the weights given by the similarities. However, recently followed items with high user attention reflect users' recent interests and hobbies, so users have a high degree of interest in such items and a low degree of interest in items followed long ago. Therefore, considering the influence of time on the prediction results, the time-decay function is introduced into the prediction process.
The recommendation prediction is modified by weighting each neighbor rating with the time-decay function:

r̂_{u,i} = r̄_u + Σ_{n∈NBS_u} f_{n,i} · sim(u, n)(r_{n,i} − r̄_n) / Σ_{n∈NBS_u} |f_{n,i} · sim(u, n)|.
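A sketch of this time-weighted prediction, assuming an exponential decay weight on each neighbor rating (data layout and names below are illustrative, not the paper's implementation):

```python
import math

def predict(u, i, R, means, neighbors, sim, last_time, now, r=0.1):
    """Mean-centered neighborhood prediction with an exponential time-decay
    weight on each neighbor's rating. Falls back to the user's mean when
    no neighbor has rated item i."""
    num, den = 0.0, 0.0
    for n in neighbors:
        if R[n][i] == 0:                              # zero = unrated
            continue
        f = math.exp(-r * (now - last_time[n][i]))    # time-decay weight
        w = f * sim[(u, n)]
        num += w * (R[n][i] - means[n])
        den += abs(w)
    return means[u] + num / den if den else means[u]

# Toy example: one neighbor (user 1) who rated item 0 at the current time.
R = [[0, 0], [4, 2]]
means = [3.0, 3.0]
print(predict(0, 0, R, means, [1], {(0, 1): 1.0}, [[0, 0], [10, 10]], now=10))
```

Because the decay weight sits in both the numerator and the denominator, old ratings are down-weighted relative to recent ones rather than shifted toward zero.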

IV. EXPERIMENTS
Our experiment is divided into two parts. The first part determines all kinds of parameters required by the algorithm, including the number of topics, neighbor size, and hybrid similarity parameter. We compare performance with other traditional methods in the second part.

A. DATASET
We use the MovieLens dataset for the first part of the experiments. This dataset consists of 100,000 ratings (1-5) from 943 users on 1,682 movies. We use 80% of the dataset as the training set and 20% as the test set. In addition to the MovieLens dataset, we also use the Netflix dataset and the last.fm dataset in the comparison experiments.
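The 80/20 split described above can be sketched as a seeded random partition of the rating triples (toy data; a real run would load the MovieLens files instead):

```python
import random

def split_ratings(ratings, test_frac=0.2, seed=42):
    """Shuffle rating triples deterministically and split into train/test sets."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

# Synthetic stand-in for (user, item, rating) triples.
ratings = [(u, i, (u + i) % 5 + 1) for u in range(10) for i in range(10)]
train, test = split_ratings(ratings)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across the parameter optimization and comparison experiments.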

B. EVALUATION METRICS
There are numerous measures available to evaluate the recommendation quality. We use the mean absolute error (MAE) and the root-mean-square error (RMSE) to evaluate our algorithm; both are common measures in recommender systems. MAE is the average of the absolute errors between predictions and actual outcomes, while RMSE is the square root of the average of the squared errors:

MAE = (1/N) Σ_{i=1}^{N} |p_i − r_i|,   RMSE = √( (1/N) Σ_{i=1}^{N} (p_i − r_i)² ),

where p_i is the predicted rating, r_i is the actual rating and N is the number of predictions. Obviously, the lower the value of MAE or RMSE, the higher the accuracy of the recommendation results.
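The two metrics can be computed directly from paired prediction/actual lists:

```python
import math

def mae(preds, actuals):
    """Mean absolute error between predictions and actual ratings."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)

def rmse(preds, actuals):
    """Root-mean-square error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))

preds, actuals = [3.5, 4.0, 2.0], [4.0, 4.0, 1.0]
print(mae(preds, actuals))
print(rmse(preds, actuals))
```

Because RMSE squares the errors before averaging, RMSE ≥ MAE always holds, with equality only when all errors have the same magnitude.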

C. PARAMETER OPTIMIZATION EXPERIMENTS
We started our experiments by dividing the dataset into training and test sections (80% for training and 20% for testing). Then, we measured the sensitivity of several parameters before running the main experiment: the number of topics K, the neighborhood size, the number of clusters, and the hybrid user similarity parameter. Based on the results, we fixed the optimal values of these parameters and used them for the rest of the experiments.
There are three main parameters in LDA: α, β, and K. Parameter α controls the distribution of topics for each document, β controls the distribution of words for each topic, and K is the number of topics. We set α = 0.1 and β = 50/K in our experiment. We vary the number of topics K from 10 to 100 in increments of 10. The MAE results are shown in Figure 5.
It is clear from Figure 5 that the MAE fluctuates slightly with different numbers of topics. The minimum MAE occurs at K = 50. As a result, we set the number of topics to K = 50 in the following experiments.
We also give the first five topics and their word distributions in TABLE 1 for K = 20. Interestingly, even though the topics are latent and have no direct real-world explanation, the top items corresponding to each topic allow some interpretation of what the model has found.
The first row lists the first five topics with their probabilities. The remaining rows give the word distribution over the corresponding topic. The topic names and words are in bold.
The neighborhood size and the number of clusters have a significant influence on the prediction quality. To determine these parameters, we vary the neighborhood size from 10 to 160 and the number of clusters from 5 to 50. Figure 6 depicts our experimental results.
As shown in Figure 6, the MAE is lower when the number of clusters is larger. Therefore, we choose 50 clusters to compare our proposed algorithm with the other algorithms.
Our proposed algorithm uses the topic model and some traditional similarity approaches to calculate a hybrid user similarity, computed as Equation (16). We set the parameter λ = 0.1, 0.2, 0.3, 0.4, and 0.5. Figure 7 shows the results.
From Figure 7, we determine that the hybrid parameter λ is 0.5.
We present our experimental results on the MovieLens 10M dataset, the Netflix dataset and the last.fm dataset after obtaining the optimal parameter values. The last.fm dataset contains social networking, tagging, and music artist listening information from a set of 2,000 users of the Last.fm online music system. The Netflix dataset contains over 100 million ratings, collected between October 1998 and December 2005; each rating has a customer id, a movie id, the date of the rating, and the rating value. For these two datasets, we generate the topic model from the ''title'' and ''tag'' fields with the number of topics K = 5. We use three baselines to demonstrate the ability to semantically interpret LDA: CF-cos denotes CF with cosine as the similarity measure, CF-pcc denotes CF with the Pearson correlation coefficient, and CF-ac denotes CF with adjusted cosine. We also compare with three further methods, CF-LDA, CF-MCFV and ROST [23], [24], [28]. The MAE results on the different datasets are shown in TABLE 2, TABLE 3 and TABLE 4, and the RMSE results in TABLE 5, TABLE 6 and TABLE 7.
In addition, we use the t-test to assess significance and give one-sample test results in Table 8. We also give the standard deviations (with error bars) of the MAE in Figure 8, Figure 9 and Figure 10. From these results, we conclude that the MAE and RMSE of the proposed method are significantly lower than those of the other methods. The experimental results show that the proposed method has high recommendation accuracy.

V. CONCLUSION
Collaborative filtering (CF) is a popular recommendation algorithm that makes predictions and recommendations based on the ratings or behaviors of other users in the system. The traditional CF algorithm assigns the same weight to all items of interest, ignoring the influence of user time on interest when calculating the similarity. This paper introduces the HDCF model and gives different weights to items according to the time users look at the items. In the future, we will continue to study how to improve the dynamic topic model to improve the quality of recommendations.