Feature Integration through Semi-supervised Multimodal Gaussian Process Latent Variable Model with Pseudo-labels for Interest Level Estimation

This study presents a novel feature integration method for interest level estimation using a semi-supervised multimodal Gaussian process latent variable model with pseudo-labels (semi-MGPPL). Semi-MGPPL is an extension of the multimodal Gaussian process latent variable model (mGPLVM) that integrates features calculated from multiple modalities to predict users' interest levels in content. Reflecting the known interest levels of users in the latent space is known to improve the accuracy of interest level estimation. However, previous methods have difficulty reflecting the interest levels when the number of labeled samples is insufficient. Semi-MGPPL efficiently reflects interest levels in the latent space by pseudo-labeling the unlabeled samples, thereby increasing the number of available pairs of labeled samples. In addition, behavior features are difficult to obtain for a new test sample, yet previous mGPLVM-based methods require the features of all modalities, which makes calculating the latent variables of a test sample challenging. Semi-MGPPL solves this problem by training a projection function from the original feature space to the latent space. Experimental results on real data demonstrate the effectiveness and robustness of semi-MGPPL.

proposed [3], [4]. Nevertheless, these methods do not reflect the individual interests of users because they use information obtained from the content instead of from the users. Therefore, to obtain more user-centered recommendations, an approach using users' behavior information while watching content has been proposed [5], [6]. This approach requires no queries and can reflect users' interests by constructing a latent space from multiple modalities. The latent space enables highly accurate interest estimation by adequately representing the preferences of each user. Therefore, we focus on a novel feature integration method that utilizes both content and users' behavior information.
To achieve accurate interest level estimation using multiple modalities, such as users' behavior and content information, it is essential to calculate effective latent variables that capture the potential relationship between these modalities. The traditional approach [6] to estimating latent variables adopts canonical correlation analysis (CCA) [7], one of the typical latent variable models. CCA can capture the potential relationship by calculating latent variables that maximize the correlation between multiple modalities through a linear transformation. Furthermore, deep learning-based latent variable models have been proposed that can flexibly construct the latent space through nonlinear transformations [8]-[10]; unfortunately, these models have problems that are difficult to solve. In particular, although behavior information is useful, it is not practical to collect sufficient training data. In addition, users' behavior is influenced by the content and the surrounding environment, along with other external factors and the users' emotional state. Therefore, behavior information is likely to contain much noise that is unrelated to the users' interests. Deterministic models, such as CCA or deep learning-based models, that require a large amount of data may suffer from overfitting since it is difficult to integrate features without the influence of noise in the data [11], [12]. Consequently, the performance of interest level estimation may decrease since these feature integration methods are not suitable for determining the relationship between content and behavior information.
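As a concrete point of reference for the linear baseline discussed above, classical CCA can be sketched in a few lines of NumPy. This is a generic textbook formulation (whitened cross-covariance plus SVD), not the implementation used in [6] or [7]; the data, dimensions, and regularization constant below are illustrative.

```python
import numpy as np

def linear_cca(X, Y, k=2, reg=1e-6):
    # Classical CCA: linear projections of X and Y that maximize correlation.
    n = len(X)
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    # SVD of the whitened cross-covariance gives directions and canonical correlations.
    T = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T    # Lx^{-1} Cxy Ly^{-T}
    U, s, Vt = np.linalg.svd(T)
    A = np.linalg.solve(Lx.T, U[:, :k])                      # projection for modality X
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])                   # projection for modality Y
    return A, B, s[:k]

rng = np.random.default_rng(0)
Zl = rng.standard_normal((500, 2))                           # shared latent signal
X = Zl @ rng.standard_normal((2, 6)) + 0.5 * rng.standard_normal((500, 6))
Y = Zl @ rng.standard_normal((2, 4)) + 0.5 * rng.standard_normal((500, 4))
A, B, corrs = linear_cca(X, Y, k=2)
```

Because the projection is a single linear map fitted deterministically, every noise component in the behavior modality contributes to the estimated covariances, which is exactly the fragility the probabilistic approach in the next paragraph addresses.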
This problem has been solved by proposing a multimodal Gaussian process latent variable model (mGPLVM) [13], which is one of the probabilistic generative approaches. mGPLVM constructs a common latent space by assuming that multiple modalities are probabilistically generated from the common latent space. In particular, mGPLVM maximizes the likelihood of multiple modalities against latent variables. By calculating the common latent space based on the probabilistic approach, correlations between modalities can be represented accurately even in noisy data.
Because mGPLVM is a powerful feature integration method, many researchers have constructed extended versions of it [14]-[16]. In addition, to improve the ability to construct the latent space, several mGPLVM-based methods employing label information have been proposed [17], [18]. However, labeled data are often partial because users assign labels to only some of the content they watch. In contrast, semi-supervised methods can calculate latent variables even when labels are assigned to only a part of the samples. Specifically, semi-GPLVM, a semi-supervised version of GPLVM, has been proposed [19]. Semi-GPLVM is a feature integration method that uses both labeled and unlabeled samples to calculate the latent space of the data. Furthermore, this method preserves label information in the latent space using pairwise relationships between labeled samples and reflects these relationships in the construction of the latent space. The constraint used in this method places sample pairs with the same label close together and those with different labels far apart in the latent space. Moreover, an extended method that considers the ordinality of the labels has been proposed [20]; it reflects the similarity of the labels and the distance between the labels in the latent space. However, these methods do not consider the following two problems. Problem (i): The content labeled by users often amounts to a small percentage of the content they watch. In previous methods, a decrease in the number of labeled samples rapidly reduces the number of pairs of labeled samples; therefore, it may be difficult to efficiently reflect a small amount of label information in the latent space. Problem (ii): A recommendation system must estimate users' interest levels for new test data. In these methods, it is necessary to reconstruct the latent space using all data when calculating the latent variables for test data, which requires the features of the test data in all modalities.
However, it is impossible to obtain the behavior features of the test data because the users have not yet watched the corresponding content. Thus, since the latent variables of the test data cannot be calculated, estimating the interest levels with these methods is difficult.
To obtain an effective latent space for estimating users' interest levels, a novel probabilistic generative model that simultaneously solves the above problems should be constructed.
Therefore, a semi-MGPPL is proposed in this paper to solve the above problems. The proposed semi-MGPPL introduces the following two approaches. Approach (i): The proposed method assumes that unlabeled samples have pseudo-labels [21], [22], which are generated using the labeled samples. Thus, the proposed semi-MGPPL increases the number of pairs using the pseudo-labeled samples and efficiently reflects the label information in the latent space even when the number of labeled samples is small. Approach (ii): The proposed method introduces a mapping scheme that learns the projection from the observation space to the common latent space [23]. This mapping scheme yields latent variables for test data even when it is difficult to obtain behavior information.
Because the proposed method is based on mGPLVM, it can integrate features even when noisy modalities, such as behavior features, are included. In addition, the above two novel points allow the proposed method to project new samples to appropriate latent variables even when the number of labeled samples is small. Therefore, a highly accurate recommendation system can be realized by using the proposed method for interest level estimation. The experimental results show that semi-MGPPL can improve the prediction accuracy of users' interest levels. This study is an extension of our previous study [20].

II. RELATED WORKS
mGPLVM is a probabilistic model capable of integrating nonlinear features and is applicable to various types of data. Furthermore, it does not easily overfit even for small samples or noisy data.
Let X^m = [x^m_1, x^m_2, . . . , x^m_N]^T ∈ R^{N×D_m} denote the m-th modality, where D_m is the dimension of the m-th modality and N is the number of samples. The aim of mGPLVM is to train the latent variables Z = [z_1, z_2, . . . , z_N]^T ∈ R^{N×Q}, where Q denotes the dimension of the latent variables. The hyperparameters are Θ = {θ_1, θ_2, . . . , θ_M}, where θ_m comprises the parameters of the kernel function and the variance of the noise for the m-th modality. This model assumes that each dimension of the observed values is generated from a latent function f^m_d (d = 1, . . . , D_m) as follows:

x^m_{nd} = f^m_d(z_n) + ε^m_{nd},  (1)

where x^m_{nd} is the (n, d) element of the matrix X^m, and ε^m_{nd} ~ N(0, σ^2_m) denotes Gaussian noise. mGPLVM assumes that these latent functions {f^m_d}^{D_m}_{d=1} follow a Gaussian process, which is defined as follows:

f^m_d ~ GP(0, k_m(·, ·)),  (2)

where k_m(·, ·) is a kernel function. Depending on the purpose of the task, the kernel function can be selected freely, e.g., from the linear and radial basis function kernels.
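The generative process above can be sketched in NumPy: draw latent variables, build a kernel matrix over them, sample each observed dimension as a Gaussian-process draw plus noise. The RBF kernel choice, all dimensions, and the noise level below are illustrative, not settings from this study.

```python
import numpy as np

def rbf_kernel(Z, lengthscale=1.0, variance=1.0):
    # k(z_i, z_j) = variance * exp(-||z_i - z_j||^2 / (2 * lengthscale^2))
    sq = np.sum(Z**2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * Z @ Z.T, 0.0)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
N, Q, D_m = 30, 2, 5                     # samples, latent dim, modality dim (illustrative)
Z = rng.standard_normal((N, Q))          # latent variables z_n
K = rbf_kernel(Z)                        # Gram matrix of k_m(., .)
sigma_m = 0.1                            # noise standard deviation
# Each dimension d of the modality is a draw f^m_d ~ N(0, K) plus Gaussian noise.
F = rng.multivariate_normal(np.zeros(N), K, size=D_m).T      # (N, D_m)
X_m = F + sigma_m * rng.standard_normal((N, D_m))            # observed modality X^m
```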
The maximum a posteriori (MAP) method can be used to calculate the hyperparameters Θ and latent variables Z. By Bayes' theorem,

p(Z, Θ | X^1, X^2, . . . , X^M) ∝ p(X^1, X^2, . . . , X^M | Z, Θ) p(Z),  (3)

where p(X^1, X^2, . . . , X^M | Z, Θ) is the joint marginal likelihood of {X^m}^M_{m=1} and p(Z) is the prior distribution of Z. To maximize the posterior distribution, the right-hand side of Eq. (3) should be maximized. Thus, the following maximization problem is obtained:

{Z*, Θ*} = arg max_{Z, Θ} p(X^1, X^2, . . . , X^M | Z, Θ) p(Z).  (4)

Generally, mGPLVM assumes that each z_n follows the Q-dimensional standard Gaussian distribution and defines p(Z) as follows: p(z_n) = N(0, I_Q).
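A minimal sketch of the MAP objective, assuming a single modality with an RBF kernel and the standard Gaussian prior on each z_n; in practice Eq. (4) is maximized over both Z and Θ with a gradient-based optimizer, and the function below is the negative log posterior (up to constants) that such an optimizer would minimize.

```python
import numpy as np

def neg_log_posterior(Z, X, lengthscale, variance, noise):
    # -log p(X | Z, theta) - log p(Z), up to additive constants, for one modality.
    N, D = X.shape
    sq = np.sum(Z**2, axis=1, keepdims=True)
    K = variance * np.exp(-0.5 * np.maximum(sq + sq.T - 2.0 * Z @ Z.T, 0.0)
                          / lengthscale**2)
    Ky = K + noise * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, X))              # Ky^{-1} X
    nll = 0.5 * np.sum(X * alpha) + D * np.sum(np.log(np.diag(L)))   # fit + complexity
    log_prior = -0.5 * np.sum(Z**2)                                  # p(z_n) = N(0, I_Q)
    return nll - log_prior

rng = np.random.default_rng(0)
Z0 = rng.standard_normal((20, 2))        # random initialization of latent variables
X0 = rng.standard_normal((20, 4))        # one observed modality
obj = neg_log_posterior(Z0, X0, lengthscale=1.0, variance=1.0, noise=0.1)
```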
Different forms of prior distributions can be introduced in mGPLVM to increase the model's flexibility [23], [24]. Based on the above, mGPLVM performs the latent variable calculation to integrate multiple modalities from the observations.

III. FEATURE INTEGRATION VIA SEMI-MGPPL
In this section, the proposed semi-MGPPL method is explained. Using the proposed semi-MGPPL, common latent variables are calculated from multiple modalities. The proposed method considers label information efficiently by assuming pseudo-labels [21], [22] for the unlabeled data, thereby increasing the number of available pairwise relationships between samples. In addition, the back constraint (BC) method [23] is used to estimate a mapping function from the observation space to the latent space. Therefore, the proposed semi-MGPPL can calculate latent variables even in real-world scenarios in which no behavior features are available for the test data.

A. OBJECTIVE FUNCTION OF SEMI-MGPPL
In this subsection, the objective function of the proposed semi-MGPPL is described. Semi-MGPPL uses two kinds of features. First, labeled content features

X^c_L = [x^c_1, x^c_2, . . . , x^c_{N_L}]^T ∈ R^{N_L×D^c}  (6)

(N_L being the number of labeled samples) and unlabeled content features

X^c_U = [x^c_{N_L+1}, . . . , x^c_N]^T ∈ R^{N_U×D^c}  (7)

(N_U being the number of unlabeled samples) are calculated from the content. Second, behavior features

X^b_L ∈ R^{N_L×D^b}, X^b_U ∈ R^{N_U×D^b}  (8)

are calculated from users' behavior while viewing the content. In addition, their concatenated matrices are defined as X^c ∈ R^{N×D^c} and X^b ∈ R^{N×D^b}, where N = N_L + N_U. Similar to the multimodal similarity Gaussian process latent variable model (m-SimGP) [14], the proposed semi-MGPPL assumes that the similarity matrices of the observed values are generated probabilistically from the latent space. The similarity matrix S^m = [s^m_1, s^m_2, . . . , s^m_N]^T ∈ R^{N×N} is defined as follows:

s^m_{pq} = exp(-||x^m_p - x^m_q||^2 / γ_m),  (9)

where s^m_{pq} is the (p, q) element of the matrix S^m (m ∈ {c, b}), and γ_m > 0 is a bandwidth parameter. Each element of the similarity matrices calculated in Eq. (9) is assumed to be generated from a latent function f^m_q (q = 1, . . . , N) as follows:

s^m_{pq} = f^m_q(z_p) + ε^m_{pq}.  (10)

Using the similarity matrices and the partial label information, the objective function of the proposed semi-MGPPL is defined as follows:

{Ψ*, Θ*} = arg max_{Ψ, Θ} [ log p(S^c, S^b | Ψ, Θ) + log p(Ψ) - λ tr(Σ) ],  (12)

where Ψ is the set of parameters of the mapping function g(·), which is introduced according to the BC technique, and z_i = g(s^c_i). Furthermore, p(S^c, S^b | Ψ, Θ) is the joint marginal likelihood of S^c and S^b, p(Ψ) is the prior distribution, Σ is the covariance matrix of the pseudo-labels, and λ is a tradeoff parameter. The definitions of the above terms are discussed in detail in Sections III-B and III-C. Similar to the semi-GPLVM-based methods [19], [20], p(Ψ) is calculated using the label information. However, unlike those methods [19], [20], which use only the obtained labels, the proposed method uses both the obtained labels and the pseudo-labels to calculate p(Ψ). This point corresponds to Approach (i).
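The similarity matrices fed into the model can be computed directly from the features. The exact form of Eq. (9) is not fully legible in this excerpt, so the sketch below assumes a standard Gaussian similarity with bandwidth γ_m, which is consistent with the bandwidth parameter described in the text and with m-SimGP; the toy feature matrix is illustrative.

```python
import numpy as np

def similarity_matrix(X, gamma):
    # s_pq = exp(-||x_p - x_q||^2 / gamma), with bandwidth gamma > 0.
    sq = np.sum(X**2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / gamma)

rng = np.random.default_rng(0)
X_c = rng.standard_normal((8, 3))        # toy content features (N = 8, D^c = 3)
S_c = similarity_matrix(X_c, gamma=2.0)  # N x N content similarity matrix S^c
```

Note that each row s^c_i of this matrix is also what the mapping function g(·) takes as input, so a test sample only needs its content similarities to the training samples.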
Unlike mGPLVM, which directly optimizes latent variables, the proposed method optimizes the mapping function parameters that project the observed values into the latent space. This point corresponds to Approach (ii). Therefore, Eq. (12) is solved by alternating between the following two updates: (a) pseudo-labels update based on Gaussian process regression (GPR) and (b) parameters update based on the MAP method. Each procedure is described in detail in the following subsections.

B. UPDATE OF PSEUDO-LABELS BASED ON GPR
Using the proposed method, pseudo-labels l̂ for the unlabeled samples are generated from the latent variables Z and the known labels l. Because the latent variables Z cannot be observed, they are initialized based on principal component analysis [25]. The parameters of the kernel function k̃(·, ·) used in GPR are optimized as follows:

φ* = arg max_φ p(l | Z, φ),  (13)

where k̃(·, ·) is the radial basis function kernel and φ denotes its parameters. Then, using the optimized φ*, the predictive distribution N(l̂ | μ, Σ) of the pseudo-labels is calculated, where

μ = K_*^T (K + σ^2 I)^{-1} l,  Σ = K_** - K_*^T (K + σ^2 I)^{-1} K_*,  (14)

with K, K_*, and K_** being the kernel matrices among the labeled samples, between the labeled and unlabeled samples, and among the unlabeled samples, respectively. The proposed semi-MGPPL defines the prior distribution to reflect the information of the known labels l and the pseudo-labels l̂ in the latent space. With the vector l̄ ∈ R^N defined as the concatenation of l and μ, the prior distribution p(Ψ) is defined as follows:

log p(Ψ) = -Σ_{i,j} a_{ij} t_{ij} + const.,  (15)

where const. is a constant term, t_{ij} = ||z_i - z_j||^2, and a_{ij} is a weight computed from the labels l̄_i and l̄_j with a parameter α and the normalization factor Δ = (L_max - 1)^2, L_max being the maximum label value. Moreover, Eq. (15) can be rewritten in a matrix form (Eq. (16)), where the matrix A consists of the a_{ij}. However, using the pseudo-labels in Eq. (15) does not necessarily improve the interest level estimation accuracy because errors in the pseudo-labels may limit the performance [26]. To overcome this problem, the predictive covariance matrix Σ is explicitly exploited. In particular, the predictive covariance is minimized by including it in the objective function of Eq. (12), as described in [26]. The generation of the pseudo-labels enables the calculation of a_{ij} for all sample pairs in Eq. (16). Therefore, the problem that the number of sample pairs decreases drastically when the number of labeled samples is small can be solved, and a small amount of label information can be reflected efficiently in the latent space. This corresponds to the contribution of Approach (i) in this study.
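The pseudo-label step is standard GPR prediction: the predictive mean gives the pseudo-labels μ and the predictive covariance gives the uncertainty matrix Σ that enters the objective. The sketch below uses a generic RBF kernel with illustrative hyperparameters (in the proposed method these would instead come from solving Eq. (13)).

```python
import numpy as np

def gpr_predict(Z_lab, l_lab, Z_unlab, lengthscale=1.0, variance=1.0, noise=1e-2):
    # Standard GPR predictive mean (pseudo-labels) and covariance (uncertainty Sigma).
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return variance * np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)
    K = k(Z_lab, Z_lab) + noise * np.eye(len(Z_lab))   # labeled-labeled + noise
    Ks = k(Z_lab, Z_unlab)                             # labeled-unlabeled
    Kss = k(Z_unlab, Z_unlab)                          # unlabeled-unlabeled
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ l_lab                          # pseudo-label means
    Sigma = Kss - Ks.T @ K_inv @ Ks                    # predictive covariance
    return mu, Sigma

rng = np.random.default_rng(0)
Z_lab = rng.standard_normal((15, 2))                   # latent variables, labeled samples
l_lab = rng.integers(1, 6, size=15).astype(float)      # known interest labels in {1,...,5}
Z_unlab = rng.standard_normal((25, 2))                 # latent variables, unlabeled samples
mu, Sigma = gpr_predict(Z_lab, l_lab, Z_unlab)
```

A large diagonal entry of Sigma flags an unreliable pseudo-label, which is why minimizing the covariance term in Eq. (12) counteracts pseudo-label noise.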

C. UPDATE OF PARAMETERS BASED ON MAP METHOD
According to mGPLVM, the joint marginal likelihood is defined as follows:

p(S^c, S^b | Ψ, Θ) = Π_{m∈{c,b}} Π_{q=1}^{N} N(s^m_q | 0, K_m + σ^2_m I_N),  (17)

where K_m is the Gram matrix of the kernel k(·, ·) evaluated on the latent variables. The linear kernel k(·, ·), the simplest kernel function, is used to verify the effectiveness of the proposed semi-MGPPL. By substituting the joint marginal likelihood and the prior into Eq. (12) and taking the logarithm, the maximization problem of Eq. (19) is obtained.
In this study, Eq. (19) is solved using the scaled conjugate gradient method [27], similar to previous methods [24], [28], [29]. By solving Eqs. (13) and (19) iteratively, the parameters of the mapping function Ψ and the hyperparameters Θ are optimized. Then, the estimated mapping function allows the estimation of the latent variable z^(t) of a test sample x^{c(t)} as follows:

z^(t) = g(s^{c(t)}),  (20)

where s^{c(t)} is the content similarity vector of the test sample. Therefore, the proposed semi-MGPPL can calculate the latent variables of test data for which no behavior features are available by using the mapping function g(·) that projects the content features into the latent space. In semi-MGPPL, a multilayer perceptron is used as the mapping function. This corresponds to the contribution of Approach (ii) in this study.
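The test-time data flow of Eq. (20) can be illustrated as follows: build the content similarity vector of the test sample against the training samples, then push it through the MLP g(·). The weights here are random placeholders purely to show shapes and data flow; in semi-MGPPL they are the parameters Ψ obtained by optimizing Eq. (19), and the toy features and bandwidth are assumptions.

```python
import numpy as np

def g(s, W1, b1, W2, b2):
    # One-hidden-layer MLP mapping a content similarity vector s^c to a latent variable z.
    h = np.tanh(s @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
N, Q, H = 30, 2, 16                       # training samples, latent dim, hidden units
W1 = 0.1 * rng.standard_normal((N, H))    # placeholder weights; in semi-MGPPL these
b1 = np.zeros(H)                          # are the learned parameters Psi
W2 = 0.1 * rng.standard_normal((H, Q))
b2 = np.zeros(Q)

X_train = rng.standard_normal((N, 4))     # toy content features of training samples
x_test = rng.standard_normal(4)           # content features of a new test sample
gamma_c = 2.0                             # assumed content bandwidth
# Similarity of the test sample to the training samples (content modality only).
s_test = np.exp(-np.sum((X_train - x_test) ** 2, axis=1) / gamma_c)
z_test = g(s_test, W1, b1, W2, b2)        # latent variable without behavior features
```

Note that no behavior feature of the test sample appears anywhere in this computation, which is precisely what makes the method applicable to unseen content.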

IV. EXPERIMENTAL RESULTS
In this section, experimental results are presented to verify the effectiveness and robustness of the proposed semi-MGPPL. In Section IV-A, the dataset used in the experiments is explained. Then, in Section IV-B, the experimental setup, comparison methods, and evaluation metric are described. Finally, the experimental results are presented in Section IV-C.

A. DATASET
In this subsection, the dataset used in the experiments is explained. In the present experiment, 49 movie trailers obtained from YouTube are used, similar to previous studies [6], [20], [30]. In particular, ten trailers each from the "science," "music," "action," and "comedy" genres and nine from the "sports" genre are used. Each frame of these videos is input to Inception-v3 [31], and the output vector of the intermediate layer is obtained. Then, the average vector over frames is calculated as the content feature x^c_i ∈ R^{D^c} for the i-th (i = 1, 2, . . . , I; I being the number of videos) video, where D^c = 2048.
When acquiring user behavior information, the subjects sat in a chair in front of a screen and watched the videos, as shown in Fig. 1. First, the subjects were given 10 s of preparation time. Then, one of the videos was shown on the display, and the subjects watched it for 30 s. The subjects then had 5 s to record their interest level in the watched video. The subjects were asked to repeat these steps for all videos. Note that the subjects comprised eight men and two women, all approximately 22 years old. In this study, while the subjects were watching the videos, a Tobii Eye Tracker 4C and OpenPose [32] were used to obtain their behavior information. The Tobii Eye Tracker 4C can detect the two-dimensional (2D) eye gaze position of users, and gaze information correlates closely with users' interest [33]. OpenPose is one of the latest methods for estimating 2D body skeleton positions and has recently been used in several studies; it estimates the positions of body parts with deep neural networks based on part affinity fields [32]. The Tobii Eye Tracker 4C and OpenPose are used to obtain the gaze position and the skeletal position of each body part, respectively. Then, the averages and variances of the motions of those positions along the two axes of the 2D space are calculated, and the user behavior features x^b_n ∈ R^{D^b} are obtained for each i-th video watched by each j-th (j = 1, 2, . . . , J; J being the number of subjects) subject, where D^b = 64, as listed in Table 1.
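The per-axis motion statistics described above can be sketched as follows for one tracked point (here the gaze position); the sampling rate and synthetic track are assumptions, and the same function applied to every tracked point in Table 1 would yield the full 64-dimensional behavior feature.

```python
import numpy as np

def motion_stats(positions):
    # positions: (T, 2) array of 2D coordinates over T frames.
    # Returns averages and variances of the frame-to-frame motion per axis (4 values).
    motion = np.diff(positions, axis=0)                  # per-frame displacements
    return np.concatenate([motion.mean(axis=0), motion.var(axis=0)])

rng = np.random.default_rng(0)
gaze = rng.standard_normal((900, 2))      # synthetic gaze track, e.g. 30 s (assumed rate)
eye_feat = motion_stats(gaze)             # 4-dim eye-tracker block of the 64-dim feature
```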
Because the behavior features include facial information, such as the positions of the eyes, ears, and nose, information on the subject's facial response to the video can be obtained. Therefore, similar to biometric devices such as smartwatches, these features are expected to capture information closely related to users' interests.

B. EXPERIMENTAL CONDITIONS
Unlike mGPLVM, the proposed semi-MGPPL can calculate the latent variables of new test data using the projection function from the observation space to the latent space. The effectiveness of the proposed semi-MGPPL is confirmed by comparing the accuracy of interest level estimation on new test data for which only content features are available. Because the proposed method aims at feature integration, interest level estimation itself is performed by a separate method; specifically, the tensor completion used in previous studies [6], [20] is adopted. In particular, an incomplete tensor is constructed using the latent variables of all samples and the known labels, and the labels of the test data are then estimated by tensor completion. To confirm the robustness of semi-MGPPL, experiments were conducted in three situations. First, 10%, 60%, and 30% of the data are randomly selected as labeled training, unlabeled training, and test data, respectively. Second, the corresponding proportions are 20%, 50%, and 30%. Third, they are 30%, 40%, and 30%. Note that all labeled samples are included in the training data. Figure 2 shows the construction of the dataset used in the first condition.
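The three data splits above can be reproduced with a small helper; the random seed and rounding convention are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def split_indices(n, ratios, seed=0):
    # Randomly partition n sample indices into labeled / unlabeled / test sets.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_lab = int(round(ratios[0] * n))
    n_unl = int(round(ratios[1] * n))
    return perm[:n_lab], perm[n_lab:n_lab + n_unl], perm[n_lab + n_unl:]

# First condition: 10% labeled training, 60% unlabeled training, 30% test (n = 49 videos).
lab_idx, unl_idx, test_idx = split_indices(49, (0.1, 0.6, 0.3))
```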

TABLE 1. Details of the behavior features.

Device       | Details of features                                                          | Dimensions
Eye Tracker  | Averages and variances over motions of gaze positions for two axes           | 4
OpenPose     | Averages and variances over motions of the neck, nose, and center of the hip |
             | positions for two axes                                                       | 12
OpenPose     | Averages and variances over motions of both ears, eyes, shoulders, wrists,   |
             | elbows, and hips for two axes                                                | 48
Total        |                                                                              | 64

The performance of the proposed semi-MGPPL is compared with the following six methods: BC-mGPLVM, BC-m-SimGP, BC-semi-supervised ordinal multimodal Gaussian process latent variable model (BC-semi-OMGP), multiview CCA (MVCCA) [34], Bayesian CCA (BCCA) [35], and deep CCA [8]. BC-mGPLVM, BC-m-SimGP, and BC-semi-OMGP introduce the BC technique into mGPLVM [13], m-SimGP [14], and semi-OMGP [20], respectively, to calculate the latent variables of new test data. mGPLVM is the baseline method; m-SimGP and semi-OMGP are mGPLVM-based methods. In addition, three extended versions of CCA are selected because CCA is one of the most widely used feature integration methods. MVCCA is a deterministic method that calculates the projection maximizing the sum of the correlations between multiple modalities. BCCA is a fully Bayesian approach to CCA that assumes appropriate prior distributions for the model parameters; it has the advantage of being robust to small sample sizes. Deep CCA maximizes the correlations between multiple modalities using multilayer perceptrons and has been used for several tasks beyond interest level estimation. Note that the parameters λ, Q, γ_m, and α in semi-MGPPL were set to 100,000, 20, 2, and 100, respectively.
The mean absolute error (MAE), defined by the following equation, was used for evaluation:

MAE = (1 / N_test) Σ_{s=1}^{N_test} | l^PRE_s - l^GT_s |,

where l^PRE_s is the estimated interest level of the s-th test sample, l^GT_s is its ground truth, and N_test is the number of test samples.
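The evaluation metric is straightforward to compute:

```python
import numpy as np

def mae(l_pre, l_gt):
    # Mean absolute error between estimated and ground-truth interest levels.
    diff = np.asarray(l_pre, dtype=float) - np.asarray(l_gt, dtype=float)
    return float(np.mean(np.abs(diff)))
```

For example, estimates [1, 2, 3] against ground truth [1, 3, 5] give an MAE of 1.0.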

C. EXPERIMENTAL RESULTS
In this subsection, the experimental results are presented. Tables 2, 3, and 4 list the MAEs for each subject in the interest level estimation of the test data under the first, second, and third conditions, respectively. The experimental results confirm that the proposed semi-MGPPL is effective in the realistic scenario of interest level estimation for test data without behavior features.
Because the proposed semi-MGPPL and the other mGPLVM-based methods outperform the CCA-based methods, it can be confirmed that feature integration based on mGPLVM is effective for interest level estimation. Furthermore, comparing semi-MGPPL and BC-semi-OMGP with BC-mGPLVM and BC-m-SimGP confirms the effectiveness of reflecting label information in the latent space. In addition, comparing semi-MGPPL with BC-semi-OMGP confirms the effectiveness of pseudo-labeling the unlabeled samples and using them for the latent space construction. Because the estimation accuracy of the proposed semi-MGPPL is better than that of the comparison methods in all situations, the robustness of the proposed method to changes in the numbers of labeled and unlabeled samples is confirmed. This robustness is essential for interest level estimation since it is not easy to collect many labeled samples. Tables 2 and 4 show that the proposed semi-MGPPL is more effective than the comparison methods both when users have labeled very little of the content and when they have labeled nearly half of it. Therefore, the present experiment confirms the effectiveness of the proposed semi-MGPPL.

V. CONCLUSIONS
This study presented a novel feature integration method through a semi-supervised multimodal Gaussian process latent variable model with pseudo-labels for interest level estimation. We defined a new mGPLVM-based framework called semi-MGPPL that is suitable for interest level estimation using users' behavior information. Because the proposed semi-MGPPL assumes pseudo-labels for unlabeled samples and can thereby increase the number of pairwise relationships between labeled samples, the proposed method can efficiently reflect a small amount of label information in the latent space. Furthermore, since semi-MGPPL introduces BC, the proposed method can calculate the latent variables of newly obtained test data. The experimental results confirmed the effectiveness of the proposed semi-MGPPL and its robustness to changes in the numbers of labeled and unlabeled samples.

MIKI HASEYAMA received her B.S., M.S., and Ph.D. degrees in Electronics from Hokkaido University, Japan, in 1986, 1988, and 1993, respectively. She joined the Graduate School of Information Science and Technology, Hokkaido University, as an associate professor in 1994. She was a visiting associate professor at Washington University, USA, from 1995 to 1996. She is currently a professor at the Faculty of Information Science and Technology, Division of Media and Network Technologies, Hokkaido University. Her research interests include image and video processing and its development into semantic analysis. She has been a Vice-President of the Institute of Image Information and Television Engineers (ITE), Japan, an Editor-in-Chief of ITE Transactions on Media Technology and Applications, and a Director, International Coordination and Publicity, of the IEICE. She is a member of IEEE, IEICE, ITE, and the Acoustical Society of Japan.

VOLUME 4, 2016