Knowledge Discovery and Recommendation With Linear Mixed Model

We give a concise tutorial on knowledge discovery with linear mixed model in movie recommendation. The versatility of mixed effects model is well explained. Commonly used methods for parameter estimation, confidence interval estimate and evaluation criteria for model selection are briefly reviewed. Mixed effects models produce sound inference based on a series of rigorous analysis. In particular, we analyze millions of movie rating data with LME4 R package and find solid evidences for a general social behavior: the young tend to be more censorious than senior people when evaluating the same object. Such a social behavior phenomenon can be used in recommender systems and business data analysis.


I. INTRODUCTION
With the dramatic evolvement of digital concept and data analysis, humans are constantly attaching new physical meanings to the massive data. Main sources of such data can be the figures in checkup reports, the specific grade in academic transcripts, or even inconspicuous behaviors derived from individuals. As a result, people nowadays might be shocked by their huge amount of personal data, such as their historical visits to some places, or detailed evaluations about some commodities on the internet. The volume of such data is bound to jump up exponentially in the data centric-era [1]. The main focus of our study is how to unearth useful information or knowledge from digital data that grows rapidly. Such an exploring process is called knowledge discovery [2].
Recommender systems (RS), a popular and updated method for knowledge discovery, is achieving a resounding success in e-commerce nowadays [3]. Taobao, a Chinese on-line shopping platform, has the ability to learn from customers' consumption behaviors and shopping records. With those data its system can apply models and corresponding The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Messina . algorithms to predict. The prediction from models provides recommendation of products for customers. From consumer's perspective, when confronting overwhelming items online, users can readily find and purchase desirable items they like with the help of recommender system in Taobao [4]. Obviously, one of the key challenges of a recommender system should be how to offer accurate recommendations with high confidence. In order to achieve this goal, the optimization of algorithms and repeated experiments will be unavoidable steps.
One application of recommender systems is in entertainments. Watching movie has become a popular entertainment and an increasingly number of audiences are getting used to posting their remarks about movies online [5]. In this circumstance, the database of movie ratings is created and its volume is expanding rapidly. Relying on the recommendation system, business corporations will gain lots of commercial benefits from their efforts on creating sophisticated algorithms and models. Netflix emphasized on the importance of a better recommender system by setting up the Netflix Price, which is a contest providing 1 million dollars for the team that can make the best improvement in recommender system [5]. To put it simply, the developers should consider what kind of information the databases can indicate and how to obtain these information by using a simpler knowledge discovery method. Moon, S. (2010) analyzed the movie database and revealed the relationship between movie genres and movie ratings. He concluded that sequel movies can achieve lower rating than the original ones based on the decline of viewers' interests in sequels [5], [6]. The database delivers the information that original and innovative movies can be more attractive so that they inspire movie firms to produce more original works. However, such a conclusion is too general to give detailed and accurate recommendations since it does not consider the fact that individuals possess diverse traits and tastes.
Traditional recommender systems for movies base on users' watching history. For example, if The Avengers is in one's watching history, then some similar movies might be recommended to this user. The similarity can be explored in various perspectives: when it comes to names, it is possible to recommend The Avengers 2 and relate to sequential movies; from producers' standpoint, they are more likely to offer movies manufactured by Marvel. We can even find something from genre angle: some science fiction movies, such as Interstellar, are likely to be recommended. Such a recommendation is based on profile inference. And currently there are various methods for profile inference, ranging from collaborative filtering [7], [8], knowledge graph embedding to heterogenous information network [9].
However, it is difficult to recommend movies to new users with no watching history. Such difficulty is referred to as the cold start problem. The introduction of linear mixed model (LMM) can help us with predicting the preference of these new users if we have some information other than the watching history [10]. Thinking about the favorite movies of people of different ages, we may have the intuition that people of different ages are likely to show distinct tastes or preferences. For example, the adventure or science fiction movies may be teenagers' favorites whereas the senior possibly prefer comedy films. Such group profiles/preferences once available can be used to solve the cold start problem. Such group preferences are usually hypothesized and then tested by rigorous statistical inference procedures. These tested and verified hypotheses can be used for various business development. After mining millions of movie rating data, we can provide statistically significant evidence that the youth on average are pickier and more censorious than the senior when rating the same movie [11].
It should be noted that the linear mixed model has long been used for knowledge/science discovery in breeding [12], and now widely used in ecology [13], genetics [14], [15], and genome-wide association study [16], [17]. However, the application of LMM in recommendation systems was examined just recently by sparse publications, for example [10], [11], [18]. The underlying mathematics foundation and theoretical details were inadequate or even omitted in all such recommendation literatures. In addition, so far the corresponding experiments are somewhat incomplete since some essential constituents, such as computational costs and interval estimation, are missing. In light of this, our main contribution in this paper is to provide relevant theoretical details based on our understanding of linear mixed models [20], [21], [23], [25], as well as to show the comprehensive experimental process. Some newly introduced techniques, such as Wald confidence interval, will be applied to show the credibility of our conclusion. In this way, we will show the power of LMM and related packages in handling cold start problem.
In the remaining part of this paper, we shall first introduce the basic idea of linear mixed models, and then discuss how to estimate the parameters in the model, how to do interval estimate, how to select a proper model, and finally we shall apply this model to analyze movie rating data.

II. LINEAR MIXED-EFFECTS MODEL (LMM)
The traditional linear model is written as where Y is the n × 1 vector of responses, β is the p × 1 vector of fixed effects, X is the n × p matrix of fixed effects, ε is the n × 1 vector of random errors. There are three important assumptions on traditional linear models: 1) Normality, which means responses in vector Y follow normal distribution from population. 2) Independence, which means responses are independent so that all their correlation coefficients are zero. 3) Homogeneity of variance, which means every response has the same variance. However, when we analyze the actual data, it is very common to find limitations of traditional linear models because the data sets mostly violate these assumptions.
The linear mixed-effects model can be used to overcome these limitations by introducing additional random effects. Besides, it suffices to take account of the correlation of observations contained in a data set. Therefore, it is reasonable to model the relationship between users' traits and ratings of movies.

A. THE STRUCTURE OF LMM
Linear mixed-effects model (LMM) is a more useful and realistic model to analyze real data sets. It is also called Hierarchical Linear model and as its name implies, this model divides data sets into several levels according to certain grouping factors [26]. For multilevel data, we are able to show the expression of the classical linear mixed-effects at a given factor as follow: where y i is a n i × 1 vector of responses, X i is a n i × p design matrix of fixed effects, β is a p × 1 vector of fixed effects, b i is a q × 1 vector of random effects in factor i, Z i is another n i ×q design matrix of covariates which shows the correlation between responses y i and random effects b i , ε i is the vector of residual errors for factor i. What should be emphasized is VOLUME 8, 2020 that Z i contains known values of q covariates corresponding to q random effects chosen from its distribution [26]: Moreover, b i is unobservable. It implies that random effects lack patterns, which causes difficulties for researchers to figure out its real value. In a LMM, observations are considered not necessarily independent and have heteroscedasticity. The correlation between each pair of observations in the same level is reflected in the distribution of b i and ε i . Since they are in the same level, they are able to follow bivariate normal distribution: where b i is independent of ε i . Moreover, G and R i can be specified as: where G and R i are variance functions which represent weights of each observation's variance decided by parameter θ G and θ R i respectively. Therefore, when random effect b i is known, then the conditional distribution of y i can be formulated as: When b i is not given, the unconditional distribution of y i can be defined as: Combing data sets from all factors, we can get the classical formula of LMM for all data: where Y = (y 1 , y 2 , · · ·, y N ) is the n × 1 vector of responses, where n 1 + n 2 + · · · + n N = n, β is the p × 1 vector of fixed effects, X is the n × p design matrix for fixed effects, Z is the n × (q 1 + q 2 + ... + q N ) matrix of random effects, b is the (q 1 + q 2 + ... + q N ) × 1 vector of random effects, where b = (b 1 , b 2 , · · ·b N ), ε is the n × 1 vector of errors, ε = (ε 1 , ε 2 , ···, ε N ) . Therefore, the unconditional distribution and conditional distribution can be expressed respectively as and This model can take level influence as random effects so that Y can be expressed in a multivariate normal distribution form. On the other hand, according to Y 's variance presented in variance-covariance form, we can know that observations in Y are not independent and error terms are also divided into different levels, better considering real data's features.
Linear mixed-effects models have been widely used in software, such as SAS, SPSS, Matlab as well as R. This article will focus on linear mixed-effects models using R and the lme4 package [34] to discover knowledge.

III. PARAMETER ESTIMATION
In order to understand how linear mixed-effects model is obtained, we need to figure out the parameters' estimation. Two common ways to estimate parameters in linear mixed-effects model are maximum likelihood (ML) estimation and restricted maximum likelihood (REML) estimation [26]. The conditional distribution of y i given b i is not appropriate for constructing the likelihood function since we don't know the real value of random effects b i . Therefore, marginal distribution of y i is applied to build up ML and REML function.

1) MAXIMUM LIKELIHOOD ESTIMATION
Summarizing the parameters contained in linear mixed-effects model above, we get three types of parameters: the fixed effects β = (β 1 , β 2 , · · ·, β p ) T ; the random effects b = (b 1 , b 2 m · · · , b p ) T ; and the variance parameters σ 2 , θ = (θ G , θ R ). Their estimators can be obtained by simultaneously maximizing the log-likelihood function with respect to these parameters. However, it is a numerically complex work which needs to find an optimum in a multidimensional parameters space. Fortunately it can be simplified by profile likelihood technique.
With parameters β, σ 2 , θ, we have likelihood expression given that: where . Ignoring the constant part and taking log operation, we get the log-likelihood function of the form: Assume that the variance parameters are known, the fixed effects and random effects can be determined by solving the following mixed model equations [22]: In particular, By plugging (13) into (12), we gain the log-profile likelihood function: where In this way, the function does not depend on β, which means the parameter space has lower dimension than previous one. Then use the same method, maximizing l * ML (σ 2 , θ) with respect to σ 2 for every known value of θ leads to the estimation of σ 2 : By plugging (13) into (14), we get a log-profile likelihood function for θ : Therefore, there are fewer parameters in the parameter space again. Then maximization of l * ML (θ ) can yield an estimatorθ ML of θ . Pluggingθ ML into (12) and (14) produces estimatorβ ML of β andσ 2 ML of σ 2 that: However, there is a significant limitation on maximum likelihood estimation. ML estimatorsσ 2 ML andθ ML are both biased because they don't adjust for the uncertainty in estimation of β. In other words, the values ofσ 2 ML andθ ML may vary with the change of β so that we cannot make accurate estimations of these two parameters. However, σ 2 and β can be better estimated by restricted maximum likelihood estimation, which will be discussed in next section.

2) RESTRICTED MAXIMUM ESTIMATION
In order to obtain unbiased estimates of σ 2 and θ , we should use an estimation that is orthogonal to the estimation of β, which means using a way to make estimates of σ 2 and θ that are independent of estimation of β [6]. To achieve this goal, we can consider the likelihood function of a set of n − p independent contrasts of y, where p is the dimension of β. After obtainingβ(θ ), the restricted log likelihood function is given by: From this function, maximizing of l * REML (σ 2 , θ) with respect to σ 2 leads to an estimator of σ 2 that: Plugging (18) into (19), we get a function with respect to θ only: Estimator of θ can be obtained by maximization from (21), which can be applied to get estimators of β and σ 2 , respectively.

IV. CONFIDENCE INTERVAL FOR LMM
Confidence Interval (CI) gives a range for a random variable based on a certain confidence level, that is, the credibility of the estimator. Therefore, people can consider the values in this confidence interval to have the same or similar influence. Constructing confidence interval has significant meanings either in theory or empirical problem. When estimating parameters or predicting responses, we can get certain values of parameters or responses. Admittedly, it can not be denied that when considering the preciseness of science and mathematics, these values undoubtedly lack certainty or accuracy. However, if we can locate estimation and prediction in some certain ranges with confidence level attached, the conclusion tend to be more convincing.
There are three methods to compute the confidence intervals: the profile likelihood confidence interval, Wald confidence interval and bootstrap confidence interval. Each method possesses disparate ideas and assumptions. Next we will discuss the underlying concepts of these CIs and their applications in LMMs.

3) PROFILE LIKELIHOOD CONFIDENCE INTERVAL
One of the main conditions of profile likelihood confidence interval (PLCI) is that the estimator does not necessarily have to follow normal distribution [27]. The concept of PLCI is VOLUME 8, 2020 very similar to the profile likelihood technique previously mentioned. In a model, we assume β is our interest parameter and b is the vector of all nuisance parameters. Thus, L(β, b) is the maximum likelihood function based on two random variables β and b, also the profile likelihood function of β is defined as which means the maximum likelihood function of β with MLE value of b.
With this concept, we can consider the confidence interval next. In the hypothesis test, the null hypothesis is constructed like this: H 0 : β = β 0 . In this circumstance, building a confidence interval is equivalent to finding all β 0 , which can make the H 0 not be rejected under the 100(1−α)% confidence level. Then we use the likelihood ratio test [27]: where L(β) is the maximum likelihood with MLE of all parameters and L 1 (β 0 ) handles one fewer parameters, that's why the left hand side of this formula follows Chi-square distribution. Therefore, all β 0 satisfying above formula can form a confidence interval for β. Since log L(β) and χ 2 1−α (1) are constant, we can rearrange the expression: It is likely to get a graph like this below. Therefore, the part of the curve above the red line forms the confidence interval we want.

4) WALD CONFIDENCE INTERVAL
Wald confidence Interval takes Wald Test into account: with the assumption that the difference between the two will be approximately normally distributed. According to this test, when considering the confidence Interval in LMM, we only need to know the estimates and variance of the parameters included. For the fixed effects β, we get the estimate above and its variance-covariance that: Thus, we extract the diagonal of variance-covariance matrix as variance ofβ, the wald confidence interval forβ can be expressed as follow:

5) BOOTSTRAP CONFIDENCE INTERVAL
Bootstrap confidence interval comes from the idea that ''pulling itself up by its own bootstrap'' [28]. In other words, it means doing large number of bootstraps from the original data. Assume we have original data {y 1 , y 2 , y 3 , . . . , y n }, and we build LMM for this data which contains parameter = (β, θ, σ 2 ). After maximum likelihood estimation or restricted maximum likelihood estimation, we are able to obtain esti-matesˆ = (β,θ ,σ 2 ). In order to find the confidence interval, we should calculate the variation ofˆ around , that is, δ = − . Hence, confidence interval based on α% confidence level can be shown as: To find δ, we process bootstrap operation. Firstly, we take resamples from the original data {y 1 , y 2 , y 3 , . . . , y n } and receive n new observations notated as y 2 , . . . , y (1) n which has the same distribution as the original data. After the LMM estimation for the new data, we obtain new parameter estimatesˆ (1) and the first variation is calculated as δ (1) = −ˆ . Then this kind of resampling operation should run repeatedly for m times, normally more than 1000 times and a matrix of resample can be formed:  Each column represents one resampling and produces one δ, thereby a sequence of δ is generated finally δ (1) , δ (2) , δ (3) , . . . , δ (m) and we sort them from the smallest to the biggest. Thus, δ α/2 is at the α/2 percentile and δ 1−α/2 is at the 1 − α/2 percentile. In this way, the confidence interval of parameters can be calculated. Bootstrap introduces us a simple and straightforward angle to observe the variation of estimates. With the law of large number, the resample distribution can be a good approximation to the true distribution.

6) COMPARISON
Since these three methods have different concepts, we should consider carefully about which method should be applied in different circumstances.
Profile confidence interval has very wide applications because of its moderate restriction, so it is still available for the estimators not normally distributed [29]. Due to this reason, the default demand confint() in R software adopts this profile likelihood method and it is useful when analyzing LMMs. Although Wald confidence interval is very common, it is difficult to apply such a method in LMM because it is not feasible to account for the parameters contained in random effects, that is, σ 2 and θ. The Wald confidence interval cannot be calculated for these parameters in LMM. As for bootstrap confidence interval, the CI from this method is relatively valid. This kind of sampling cannot improve point estimates. It is obvious that every bootstrap is chosen from the same data pool and follows the same steps, there is no new information reflected even after all bootstraps [28]. This is also the reason why confidence interval calculated from bootstrap is more wider than profile likelihood method and Wald test, so bootstrap confidence interval is hardly used due to its less preciseness.

V. CRITERIA FOR MODEL SELECTION
As for a data set, there might be several models applied to analyze it. Then how to measure which model is better has become a main problem. There are some important and useful criteria to evaluate those models, such as log-likelihood, Akaike information criterion (AIC), Bayesian information criterion (BIC) and p-value. Obviously, different models possess different focus and purposes which result in different values.

7) LOG-LIKELIHOOD
It is the simplest criterion which has the expression shown below.
where Y = (y 1 , y 2 , ···, y n ) is the vector of observations; = (β, σ 2 , θ) represents the vector of all parameters contained in linear mixed-effects model where β = (β 1 , β 2 , · · ·, β k ) is the vector of fixed effects and θ = (θ G , θ R ) is the vector of random effects; f (Y | ) is the likelihood function of observations. According to this expression, we can see clearly the design of this criterion: is the key component of our model so that it can represent our model to some extent, Y is the data set observed. Therefore, f can quantify how much the data fits our candidate models and adding log is to avoid zero value in the likelihood [30]. The higher value of the log-likelihood achieves, the better the data fits our models.
However, log-likelihood criterion has a huge problem that it doesn't consider the number of parameters. Ideally we prefer a model with high log-likelihood and small number of parameters. Such models can not only guarantee high level of fitting but also require less calculations and operations. Normally, larger number of parameters would increase the value of log-likelihood and the effects are significant when the number of parameters is few. However, when the number of parameters is large enough, each parameter added in model would has little influence to log-likelihood value, which is shown FIGURE 2.
That is, it makes little sense to have too many parameters. On the contrary, we prefer a model with fewer parameters when its log-likelihood value is just a little bit less than the value of the model with more parameters. In this circumstance, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) both consider the number of parameters and overcome log-likelihood's limitation.

8) AKAIKE INFORMATION CRITERION (AIC)
Akaike information criterion is a standard to measure the goodness of statistical model fitting. It was found and developed by Japanese statistician Akaike. This criterion suffices to quantify the complexity of the estimated model and the goodness of the fitting data of the model. When we use maximum likelihood estimation (MLE) in linear-mixed effect model, log-likelihood l(ˆ ) can be achieved whereˆ = (β,σ 2 ,θ ) contributes to maximum log-likelihood.
AIC can be expressed as l(ˆ ) − p. However, normally, in R software package, the AIC formula is defined as

VOLUME 8, 2020
where p is the number of parameters, l(ˆ ) is the maximum log-likelihood. Both expressions have the same structure with slight difference including sign and multiplies. To be clear, in this essay, we use formula (32). The lower AIC value a model has, the better the model is.

9) BAYESIAN INFORMATION CRITERION (BIC)
In statistics, there are two ways to optimize models. On the one hand, adding more parameters to models can increase their complexity. On the other hand, collecting more observations or data suffices to improve models' ability to describe data sets. AIC considers the parameter problems whereas the number of observations is not included. However, BIC considers both of them and takes them as measurement for models. BIC provides an algorithm to approximate the log marginal likelihood of candidate models and chooses the one having smaller value as the better model. The formula of BIC is: where p is the number of parameters, n is the number of observations.

10) F-TEST
Sometimes AIC and BIC will give different choices between two models. So under this circumstance, we need to refer to other criteria. F-test is able to calculate p-value which can be used to judge whether these two models are significantly different and make decisions about model selection. The logic behind F-test is the comparison of models' deviance which is defined that: whereˆ max is the MLE for the parameter vector in the saturated model which has N parameters,ˆ is the MLE for the parameter vector in the candidate model. Assume there are two models m0 and m1 with degree of freedoms, p and q respectively where p < q and the set of parameters of m1 contains m0's parameters. We can calculate their deviance denoted as D 0 and D 1 which are applied for F test: After we obtain the F value, p-value also can be achieved by referring to F table. Next the process of making decisions depends on our own confidence degree to this test. Normally 95 percent confidence level and 99 percent confidence level are preferred choices, so based on 95 percent confidence level, if the p-value is less than 5 percent, there is significant difference between m0 and m1. Because of the 'useful' parameters in m1, we tend to choose m1 with more parameters. On the contrary, when the p-value is more than 5 percent, there is no significant difference between these two models so that the model with fewer parameters m0 is our choice.

VI. APPLICATION:KNOWLEDGE DISCOVERY FROM HIGH-THROUGHPUT MOVIE RATING DATA
Understanding the knowledge of model selection criteria and parameter estimation, firstly we need a database about detailed information of users, including age, gender, occupation, nation, hobby, etc. Moreover, some movie information, such as movie names, movie genres and movie ratings, is also the necessity. For this purpose, The ml-1m data set is used in this experiment. It provides us with 6040 users, 3952 movies and movie rating made on a 5-star scale. Among these ratings, each user has at least 20 ratings. Furthermore, users' features including age, gender, occupation and zip-code are all in category types. Movie information including Movie ID, title and genres is listed and movie genre can be found in Appendix. Therefore, this is a real big data and the pattern excavated from it will be highly convincing. Although some false information exists in the data set, the huge volume allows us to ignore its influence on our data analysis. The data structure is shown in FIGURE 3 and FIGURE 4. Our task is to discover the age influence on movie rating, thereby we need to rearrange the data provided above. Since UserID from User data set is linked with the ratings in movie data set, we can ignore zip-code which makes little sense in this experiment and construct a data chain aiming for one movie or types of movies (see FIGURE 5).
According to this data structure, we can construct linear mixed-effects models. Since age is our interest variable, it is treated as fixed effect. In addition, gender and occupation are our candidate random effects. Three different models in Table 1 can be designed.  As for how to choose the optimal candidate from these models, we should focus on the real data analysis as well as our model selection criteria.
Given the data as well as candidate models above, our study on age discovery starts from one single movie to different types of movies. During this process, we are willing to discover the implicit general pattern.

A. ONE SINGLE MOVIE: LIFE IS BEAUTIFUL
Firstly, we are willing to see the relationship between users' ages and ratings for one specific movie. Here we choose the film Life Is Beautiful labeled as comedy and romantic movie which won the best foreign language film at the 71st Oscar Awards. Since it is a known work, it received lots of rating records from users so that there are 1152 ratings available.
In our model, user's age is regarded as our interest, i.e. the fixed effect and we need to consider how to choose random effects based on the models provided above. When we try to build candidate models for comparison, the computational time for constructing Model 1, Model 2 and Model 3 are 0.036s, 0.025s and 0.026s respectively, which are fairly short. After using Anova() test, the values of model selection criteria are calculated in FIGURE 6.
According to these results, Model 3 has the lowest AIC and BIC value so Model 3 is the best model based on AIC and BIC criteria, but Model 1 obtains the largest log-likelihood value. Literally, Model 1 is the best one referring to log-likelihood and it has more parameters. However, after the comparison on p-value, we can make our decision. Under 95% confidence interval, the p-value of Model 1 and Model 2 is larger than 0.05 which means there is little difference between Model 1 and Model 2. Since Model 2 has the fewer parameters, it is better than Model 1. Due to the same reason, Model 3 is better than Model 1. In the meantime, observing from AIC, BIC and log-likelihood criteria, we can see the superiority of Model 3 compared with Model 2. To conclude, Model 3 is the best model for the data Life is beautiful.
lmer command is capable of estimating the parameters contained in models and we can have the result in FIGURE 7.
Results in FIGURE 7 show the standard deviation of random effect, gender and residual as well as what we desire: fixed effects of all age levels. To be clear, the radar chart    in FIGURE 8 is able to show the fixed effects and their differences intuitively.
In this graph, visually we can see the rating from users in '18-24' group is much higher than those from other age groups. However, this deduction is unreasonable since we are not sure about whether the rating of users from 18-24 is significantly higher than others. To judge the significance of differences, we need to operate hypothesis test. The effects of every age group, that is, the parameters of age levels can be represented in FIGURE 9.
Therefore, our task is to prove whether β 2 is significantly the largest effect. From the result in FIGURE 6, β 3 is the closest one to β 2 which leads to our hypothesis test that: • H 0 : β 2 = β 3 • H 1 : β 2 = β 3 H 1 represents the original model and we can replace all data belonging to 25-34 age level to 18-24 age level (i.e. combining β 2 and β 3 together), which brings new model. We compare these two models and obtain result in FIGURE 10. The low p-value indicates that it is significant to reject H 0 under 95% confidence interval and there is a big difference between β 2 and β 3 , thereby the conclusion that 18-24 users give higher ratings for Life is Beautiful than any other age groups can be drawn.
Besides observing the estimates of age effects, we are able to use package lm4 to calculate the confidence interval of age effects. Based on three types of confidence interval (profile CI, Wald CI, bootstrap CI), their ranges can be expressed respectively in one graph as shown in FIGURE 11 and FIGURE 12.
As shown in FIGURE 12, the differences among these three confidence intervals are negligible so that the ranges shown in the picture almost overlap with each other. As a result, it is reasonable for us to treat all confidence intervals as the same. Therefore, taking the Wald confidence interval, we are capable of building a radar chart for CI of age effects where the area between red line and green line is the CI place showed above.
Back to FIGURE 12, it is not difficult to find that the CI ranges of 18-24, 25-34, 35-44 are narrower than those for 45-49, 50-55, 56+ and under 18. According to formula of the Wald confidence interval, it can be deduced that the standard deviations of age effects of 18-24, 25-34, 35-44 are smaller than those of 45-49, 50-55, 56+, under 18. The narrower a confidence interval is, the more new information is reflected. Therefore, the age effects of 18-44 for Life Is Beautiful are more meaningful and it can better display the range where the true age effects lie.  Nonetheless, the analysis above only comes from one movie, which may not have a high reliability to generate a tenable conclusion for future application and reference. What we want to explore is the general pattern of different audiences' rating behaviors. Therefore, although we now have some interesting discoveries of Life Is Beautiful, they still have poor influence on our main target. On the other hand, if the general pattern of one specific movie genre can be excavated, then this knowledge would be more plausibly used for recommending movies for people.

B. ONE GENRE: COMEDY
Comedy is a very big genre in ml-1m because it has the largest number of rating data (107009 ratings) among all movie genres. Therefore, the analysis of this big data set will be more meaningful than one certain movie. The target is still to find the relationship between age and rating whereas the research range expands to comedy, one certain genre.
The first thing we need to do is constructing candidate models. Since this time the volume of data set is much larger than the previous case, the computational time become longer: 25s, 9.6s and 8.2s for building Model 1, Model2 and Model 3 respectively. Then we still need to choose the optimal model according to the results shown in FIGURE 13: The comparison is very clear that Model 1 has the smallest AIC and BIC values and the largest log-likelihood value. Moreover, the p-values of Model 1 with Model 2 and Model 1 with Model 3 are very small which gives the information that although Model 1 contains one more parameter than Model 2 and Model 3, the deviance of it is significantly lower than the other two. Therefore, we should choose Model 1 as our optimal model. According to the result from lmer, we can obtain the radar chart of age effects for comedy movies showed FIGURE 14.
Since age effect mentioned here is just the mean value of corresponding parameter, we cannot be fully confident to make an assertion that the age effect to one's rating for comedy is the number displayed in the graph above. One convincing way is to provide a confidence interval for each age effect, here we choose 99% confidence interval. Note that the comedy data is much larger than the data of Life Is Beautiful, that is why in this experiment we want a higher confidence level. Surprisingly, these confidence intervals here perform great differences in FIGURE 15 and we should make a choice among them. Firstly, there is a clear observation that profile confidence interval has much wider ranges for every age effects compared with the other two so that it contains less new information for the location of age effects' estimates, so we will not use this type of confidence interval here. Moreover, although the Wald confidence interval and bootstrap one look very similar, one problem on bootstrap CI is remarkable on the age level 18-24. At this age level, bootstrap CI is fairly narrow whereas it does not contain the estimate which is a serious issue indeed. Considering the previous definition of bootstrap CI, every bootstrap is random and its merit is just based on the large number of repetitions to simulate whereas no improvement is produced. Hence, it is reasonable for the existence of the estimate exclusion. Then let's have a look at the wald CI, for every age level, the range performs high proximity and it demonstrates the significance of every age effect CI is similar. Thereby, we can trust the CIs of different age levels evenly. In this circumstance, the Wald confidence interval should be taken to account for the range of age effects and its radar chart is illustrated in FIGURE 16.
From age effect of comedy graph, on the first cursory glimpse, age above 45 seems to have higher effects and people in 18-24 age group are likely to have relatively lower effect, whereas the whole effects look like a round pie.  To investigate the difference between levels of effects, we are trying to process hypothesis test for all differences. Since it would be a tedious task, we use a simple method to operate it.
In the first step, we choose two effects having the largest difference i.e. in our two candidate models, they will share the same parameters but one model contains one fewer parameter. Then, we use Anova() to compare previous model and the changed model with 99% confidence level. If the result shows they are not significant, then all differences between age effects are of no significance. On the contrary, when the result reflects significance, we need to contrast the value of   the second high difference and do hypothesis test again until the result shows the difference of models are not significant. In this way, we are able to sort the age effects according to significance.
In this experiment, we know the effects β 1−7 are 3.607, 3.490, 3.573, 3.679, 3.715, 3.757, 3.756 respectively. Therefore, we choose β 6 , β 2 and do the first hypothesis test: where H 0 shows comedyModel2, H 1 is the comedy-Model1 defined previously. we can get the Anova() result in FIGURE 17 The p-value is quite small and less than 0.01, so we should do the second hypothesis test with β 7 and β 2 : where H 1 indicates comedyModel3 and we obtain the Anova() result in FIGURE 18. The p-value still shows the non-significance. Then, according to this rule and after several runs of tests, we come to the difference between β 4 and β 5 : where H 0 indicates comedyModel4 so the result is in FIGURE 19.
The p-value is larger than 0.01, so the difference is significant and loop hypothesis tests are finished. According to the tests, there are some surprising discoveries which can be mentioned: • The differences of effects of rating from people with age more than 35 are insignificant so that they can be regard as a unit with high rating on comedy which we can notate as ''The high rating group''. In the meantime, the differences of effects of rating from people with age less than 35 are also insignificant which we notate as ''The low rating group''.
• The difference between every member from ''The high rating group'' and every member from ''The low rating group'' is significant. In this circumstance, we can define young people as the ones with age 1-34 and senior people as the ones with age more than 35. Therefore, the hidden pattern from comedy analysis is that the young people are more picky and particular than the senior for comedy. Thus, it provides a strategy for some video websites so that they can recommend more comedy movies to the senior and show this type of programs less frequently to young people. In this way video websites may receive more positive feedback. However, how about movies of other genres? We have not done researched on the movies of other types and the whole movies recorded. If they give the similar patterns in the radar chart, then it can deliver the information that people from the same age group have the same attitudes to all movies of different genre. Thus, now we continue to analyze other types of movies and observe the related outcomes.

C. OTHERS MOVIE GENRES
Furthermore, we use linear mixed-effects models to analyze data of other genres where age is regarded as fixed effect, occupation and gender are considered as random effects.
Here we choose science fiction movie, children movie and adventure movie to show in radar chart. The age effects for these models are shown from FIGURE 20 to FIGURE 23. After hypothesis tests respectively, these three graphs all give patterns that young people who are defined as ones in age groups ''Under 18'', ''18-24'', ''25-34'' mark the movies lower than the old people which are defined as those in age groups ''45-49'', ''50-55'', ''56+''. It contributes to our idea that as for ranking movie, young people are pickier than old ones. This conclusion is also supported by comedy analysis which is displayed above. Moreover, in order to provide more evidences, rather than analyzing other types of movies, we directly investigate age effects distribution in all movies recorded, i.e 3,000,000 pieces of users' rating for movies, from which we get the chart below.  Clearly, by visual observation and hypothesis test, age effect differences between young people (Under 34) and old people (Beyond 45) are considerable. Therefore, to summarize, by analyzing the data from ml-1m, we are able to come to a conclusion that young people are pickier than old people to mark movies.

VII. DISCUSSION
In this paper, we introduced how to use linear mixed model for movie recommendation, in order to address cold start problem where there are few users' historical ratings available for data analysis. Compared with traditional algorithms, this model can explore the general pattern of users' rating behaviors, based on their features in different group levels. After some careful analysis we can dig out some implicit information or interesting social behaviors from millions of rating data. Linear mixed models are conceptually simple, but the underlying computation is mathematically involved. In this essay, we showed the mathematical background knowledge of LMM. For readers who are interested in the details of related techniques, they can refer to [22] and [24]. Thanks to the lme4 R package, we can apply these mixed models to analyze huge recommendation data. In our future research, other properties of rating behaviors might be discovered, if we are given more types of users' traits and features. We shall also combine such a model with the traditional SVD approach [40], and consider how to do recommendation with multi-source data set [39]. TIANYU ZUO is currently pursuing the bachelor's degree majoring in applied mathematics with the Department of Applied Mathematics, Xi'an Jiaotong-Liverpool University. He is currently very interested in the application of data science and machine learning. VOLUME 8, 2020