Novel Kernel Orthogonal Partial Least Squares for Dominant Sensor Data Extraction

Orthogonal Partial Least Squares (OPLS) methods aim to find the dominant factors of predictor variables that maximize the cross-covariance between the factors and the response variables while also maintaining a high correlation between them. Compared with discriminant analysis such as Linear Discriminant Analysis, OPLS simultaneously considers covariance maximization and data fitting. However, unlike discriminant analysis, which focuses on between-group discriminability, OPLS concentrates on cross-covariance, which by itself carries no discriminant information. This makes effective dominant factors harder to find. To rectify this drawback of OPLS, this study proposes 1) successively orthogonal deflation in constrained noisy subspace and 2) an isotropic space transform for enhancing OPLS. The former explores successively orthogonal projective vectors in subspace and iteratively updates the weighted signal space. The latter converts dimensions with unequal influences into dimensions with equal ones to correct distortions. The two proposed rectifications are implemented in three types of Maximum Covariance Analysis (MCA) to examine their incremental effects: i) Successive Subspace-MCA, ii) Isotropic Subspace-MCA, and iii) Successive Isotropic Subspace-MCA. Experiments on open datasets were carried out to compare the proposed approaches with the baseline. The experimental results showed that the proposed rectifications maximized cross-covariance while fitting data well, thereby substantiating the effectiveness of the proposed idea.


I. INTRODUCTION
In meteorological research, data captured by sensor networks usually involve multiple variables. Heterogeneous data like temperature, pressure, and moisture are collected and compiled together in a dataset. These multivariate sensing data are frequently used as a source for reasoning about observed atmospheric phenomena, e.g., sea surface temperature, aerosols, and precipitation. At present, many statistical methods have been developed, such as Ordinary Least Squares (OLS), Factor Analysis, and Analysis of Variance. Among those statistical reasoning methods, Maximum Covariance Analysis (MCA) is widely applied for finding dominant factors from multivariate data, such that harvested weather data can be correlated with observed atmospheric phenomena. MCA discovers the maximum covariance between the dominant hidden components extracted from both predictor variables and response variables. For instance, meteorologists often conduct MCA to analyze the dominant coupled variables that affect sea surface temperature [1], as shown in Fig. 1. Thus, meteorologists can discover which factors have the major influence on sea surface temperature. In statistics, there are many types of factors, such as discriminant factors and redundancy factors [2]. Unlike other factors, dominant factors particularly refer to variables with maximum variance. Compared with discriminant analysis for extracting discriminant factors (e.g., Fisher Linear Discriminant Analysis (LDA), [3], and [4]), finding dominant factors requires considering the tradeoff between maximum variance and high correlations [5]. The Partial Least Squares (PLS) method is an approach dedicated to such a purpose.

The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.

A. PARTIAL LEAST SQUARES
PLS regression is a variation of multilinear regression. Unlike typical regression (e.g., OLS), PLS embeds MCA into least squares concepts. Typical regression separates the hidden component discovery from the main process and therefore does not particularly highlight maximum covariance between the hidden components of the predictor variables and those of the response variables. Moreover, typical regression emphasizes finding a correlation between predictors and responses. PLS [6]-[10] jointly searches projection vectors for both independent variables and dependent responses when computing cross-covariance, so that the extracted hidden components of the independent variables and those of the dependent responses have maximum covariance. In other words, PLS considers the hidden components of both predictor variables and response variables simultaneously and creates a link that maps the hidden components of the predictor variables onto those of the response variables, rather than focusing on the hidden components on one side only.
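The joint search for a pair of projection vectors with maximum covariance can be illustrated with a minimal sketch. It assumes the textbook MCA formulation, in which the leading singular vectors of the sample cross-covariance matrix give the coupled projection vectors; the function name `mca_first_pair`, the row-wise sample layout, and the synthetic data are illustrative assumptions rather than part of the original method.

```python
import numpy as np

def mca_first_pair(X, Y):
    """Leading pair of projection vectors from the SVD of the cross-covariance.

    X: (N, M) predictors, Y: (N, C) responses; rows are samples (an assumption
    of this sketch, the paper stores samples as columns).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S_xy = Xc.T @ Yc / (X.shape[0] - 1)      # (M, C) sample cross-covariance
    U, s, Vt = np.linalg.svd(S_xy, full_matrices=False)
    return U[:, 0], Vt[0, :], s[0]           # unit vectors a, b and max covariance

# Synthetic data: the first response is driven by the first predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

a, b, top_cov = mca_first_pair(X, Y)
t = (X - X.mean(axis=0)) @ a                 # hidden component of the predictors
r = (Y - Y.mean(axis=0)) @ b                 # hidden component of the responses
```

Here the sample covariance between `t` and `r` equals the largest singular value `top_cov`, which is the sense in which both sides are projected jointly.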
To discover the hidden components on both sides, several studies have been devoted to PLS, including Nonlinear Iterative Partial Least Squares (NIPALS), PLS1, PLS2, SIMPLS [11], and probabilistic PLS [12], which are based on the $L_2$ norm. Recent research on PLS has evolved into other types of metrics, such as the $L_1$ norm [13], manifolds [14], [15], and correntropy [16]. This study focuses on the $L_2$ norm. For classification problems, PLS-Discriminant Analysis (PLS-DA) is widely used; PLS-DA is the term for PLS classification, in contrast to PLS regression [17]-[19]. The response variables in PLS classification become categorical variables, frequently called label information in pattern recognition. Literature [20] examined the methodological difference between PLS-DA and LDA and proved that there is a connection between the parameters of PLS-DA and LDA. Despite such a connection, the two are not equivalent because their objectives are fundamentally different.

B. ORTHOGONAL PARTIAL LEAST SQUARES
Orthogonal Partial Least Squares (OPLS), or equivalently, Orthogonal Projection to Latent Structures, was derived from PLS [21], [22]. It is a special case of PLS problems. The difference between OPLS and PLS is that the hidden component analysis in OPLS is applied only to predictor variables, whereas PLS jointly works on both predictor variables and response variables. Thus, only the projection vectors from predictor variables are extracted. Like PLS, which has PLS regression and PLS-DA, OPLS comes in two versions: OPLS regression and OPLS-DA [23] for classification problems. The connection between OPLS-DA and LDA was investigated in [20]. Recent studies [24], [25] investigated the eigenstructures of PLS and OPLS from the perspective of eigenvalue decomposition.
For OPLS, after the hidden components of predictor variables are found, there is another process called data fitting, which employs OLS to map the hidden components onto response variables. Although data fitting involves computing the weight vector of OLS, the process requires a projection matrix (i.e., a set of projection vectors). Searching for such a projection matrix is separate from the computation of the OLS weight vector. Because the two processes are independent, the problem becomes how to maximize the correlation/covariance between hidden components and response variables while the hidden components are effectively discovered at the same time. OPLS solves this problem by calculating a cross-covariance matrix $S_{Xy}$ between predictor variables (e.g., $X = [x_1, \ldots, x_N]$; $X \in \mathbb{R}^M$; N is the number of samples; M signifies the dimensionality) and response variables (e.g., $y = [y_1, \ldots, y_N]$; $y \in \mathbb{R}^1$) prior to the hidden component analysis. This is the fundamental requirement that distinguishes typical OPLS problems from discriminant analysis. Several variants [26], [27] of OPLS have been developed. Reference [26] utilized the objective of OPLS but integrated it with other concepts, such as Nonnegative Matrix Factorization, nonnegative constraints, orthogonal Procrustes solutions [28], [29], and sequential deflation. The purpose was to find a set of projection matrices based on those algorithms. The system then used those projection matrices as filter banks to generate dominant features. A similar application was developed in [30]. More recently, [27] proposed using normalized and redundancy-free measurement projection [31] based on Singular Value Decomposition to replace typical OPLS processes.
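The two separate stages described above, projection search followed by OLS data fitting, can be sketched for the linear case as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it solves the constrained trace problem by whitening with an inverse square root, accepts a multi-column response `Y`, stores samples as rows, and adds a small `ridge` term for numerical stability. The names `inv_sqrt`, `opls_fit`, and `opls_predict` are hypothetical.

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Symmetric inverse square root S^{-1/2} via eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    w = np.maximum(w, eps)                  # guard against tiny eigenvalues
    return (Q / np.sqrt(w)) @ Q.T           # Q diag(w^{-1/2}) Q^T

def opls_fit(X, Y, n_components, ridge=1e-6):
    """Stage 1: projection matrix V. Stage 2: OLS weights U on hidden components."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    S_x = Xc.T @ Xc + ridge * np.eye(X.shape[1])   # predictor (noise) space
    S_xy = Xc.T @ Yc                               # cross-covariance (unnormalized)
    F = inv_sqrt(S_x)
    A = F @ S_xy @ S_xy.T @ F                      # weighted signal space
    w, Q = np.linalg.eigh(A)
    top = np.argsort(w)[::-1][:n_components]
    V = F @ Q[:, top]                              # satisfies V^T S_x V = I
    T = Xc @ V                                     # hidden components
    U, *_ = np.linalg.lstsq(T, Yc, rcond=None)     # OLS data fitting
    return V, U, x_mean, y_mean

def opls_predict(X, V, U, x_mean, y_mean):
    return (X - x_mean) @ V @ U + y_mean

# Synthetic regression problem with two responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
B_true = rng.normal(size=(4, 2))
Y = X @ B_true + 0.01 * rng.normal(size=(100, 2))

V, U, x_mean, y_mean = opls_fit(X, Y, n_components=2)
Y_hat = opls_predict(X, V, U, x_mean, y_mean)
mse = float(np.mean((Y - Y_hat) ** 2))
```

Note that `V` is found without reference to `U`; the weights are fitted only afterwards, which mirrors the independence of the two processes discussed in the text.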
There is a clear difference between OPLS-based hidden component analysis and discriminant analysis (e.g., LDA). The former maximizes cross-covariance matrices, whereas the latter is aimed at maximizing between-group discriminability. It is worth noting that discriminant analysis cannot be directly applied to OPLS-based or PLS-based hidden component analysis despite its success in discriminant pattern recognition. This is because OPLS-based hidden component analysis should maximize the cross-covariance matrix $S_{Xy}$ rather than between-group matrices.

C. CHALLENGES
The first challenge is how to find dominant components that fit response variables while maximizing cross-covariance matrices. In other words, the purpose of OPLS is to fit data well without decreasing covariance too much. Methods such as feature selection, dimensionality reduction (e.g., Principal Component Analysis (PCA) and LDA), and recent manifold learning (e.g., [32]) may work well for data fitting, but they do not simultaneously maximize cross-covariance matrices, which deviates from the target of OPLS.
Secondly, a cross-covariance matrix is actually a mixture of within-group data and between-group data. Directly using the concept of maximizing the between-group discriminability (e.g., via discriminant analysis) is not applicable. This is because it violates the requirements of PLS and OPLS owing to different objectives.
Thirdly, unequal variance may exist in different types of predictor variables during data collection, e.g., heterogeneous data (e.g., vehicular/human/animal temperature) collected at different places (e.g., indoors/outdoors) and at various times (e.g., days/nights). Unequal distributional variance may result in biased data fitting in OPLS. Both PLS and OPLS still operate on the original feature space, which suffers from heterogeneous variance.

D. CONTRIBUTIONS
In light of the aforementioned problems, this study proposes two types of rectifications to improve OPLS: 1) successively orthogonal deflation in constrained noisy subspace and 2) the isotropic space transform. These two rectifications are also the novel contributions of this study. The details are elaborated as follows.
Successively orthogonal deflation in constrained noisy subspace: The objective of OPLS is maximization of cross-covariance matrices, and its implementation is actually trace-norm optimization of the weighted signal space (i.e., $S_\Phi^{-T/2} S_{\Phi Y} S_{\Phi Y}^T S_\Phi^{-1/2}$). This weighted signal space is computed from the signal space (i.e., $S_{\Phi Y} S_{\Phi Y}^T$) and the noise space (i.e., $S_\Phi$) in terms of signal-to-noise ratios. However, the eigenstructural analysis of OPLS simply relies on the entire weighted signal space. After projection vectors are extracted, only the weighted signal space is deflated, whereas $S_\Phi$ remains the same. To correct this drawback, the proposed successively orthogonal deflation updates the feature space $S_\Phi$ every time the weighted signal space is changed, while the targeted maximization of cross-covariance matrices remains intact.
Isotropic space transform: Isotropic space refers to space with equal dimensional variance. As mentioned earlier in the challenges, heterogeneous data collected at different places (e.g., indoors/outdoors) and at various times (e.g., days/nights) may result in unequal distributional variance. This may subsequently cause biased estimations of cross-covariance maximization and data fitting for OPLS. The isotropic space transform corrects the components in feature space that are distorted by unequal variance and remaps them into isotropic vector space, where predictor dispersion is normalized. For multivariate data, especially when heterogeneous sensors are used to capture data, unequal variance easily arises in datasets. The isotropic space transform can offset such a bias.
The rest of this paper is organized as follows. Section II details typical OPLS. Section III then describes the proposed method. Next, Section IV summarizes the performance of the proposed method and the analytic results. Conclusions are finally drawn in Section V along with future recommendations.

II. PREVIOUS METHOD: OPLS
This section introduces typical OPLS and Kernel OPLS (KOPLS). Assume that there is a multivariate dataset $X = [x_1, \ldots, x_N]$ along with its response variables $y = [y_1, \ldots, y_N]$, where $X \in \mathbb{R}^M$, $y \in \mathbb{R}^1$, M represents the number of dimensions, and N denotes the number of samples. Let J denote the dimensionality of the intrinsic space into which the data X are mapped by a kernel function $\phi(\cdot)$, i.e., $X \in \mathbb{R}^M \rightarrow \Phi \in \mathbb{R}^J$. KOPLS is then expressed as (1), where T means the transpose operator, V is an unknown J × D projection matrix (i.e., a set of projection vectors) that yields the dimensionally reduced data, and D is the number of hidden dimensions (D ≤ J). Moreover, the two B terms are residuals.
To represent KOPLS in a concise form, (1) becomes (2), where $\|\cdot\|_2$ calculates the $L_2$-norm distance. The objective of KOPLS is to minimize the least squares error E while finding a projection matrix V and a D-by-1 weight vector u.
An $L_2$-norm regularized version is obtained by adding a penalty term $\rho\|u\|^2$ to the equation, where ρ indicates the ridge parameter. Furthermore, V should satisfy the following condition:

$$\max_{V} \ \mathrm{Trace}\left(V^T S_{\Phi y} S_{\Phi y}^T V\right) \quad \text{subject to} \quad V^T S_\Phi V = I,$$

where the covariance matrices are computed with the centering matrix $I - \frac{1}{N} e e^T$, e represents an N-by-1 vector with all the elements equal to one, and I denotes an identity matrix. For brevity, this study does not distinguish identity matrices of different sizes.

VOLUME 8, 2020

To facilitate the subsequent discussions, the response variable y needs to be extended to a multivariate version Y by using one-hot encoding, where Y denotes an N-by-C matrix, with each row representing the response to the input $\phi(x_i)$. Also, in the following section, the multivariate versions $S_{\Phi Y}$ and U are used to replace $S_{\Phi y}$ and u, respectively. In classification problems, C refers to the number of classes.
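The centering matrix and the one-hot extension of the response can be written out in a few lines. The sketch below is only illustrative; the helper names `centering_matrix` and `one_hot` and the toy labels are assumptions introduced here.

```python
import numpy as np

def centering_matrix(n):
    """Return I - (1/n) e e^T, where e is the n-by-1 vector of ones."""
    e = np.ones((n, 1))
    return np.eye(n) - (e @ e.T) / n

def one_hot(labels):
    """Encode class labels as an N-by-C indicator matrix Y (one row per sample)."""
    classes, idx = np.unique(labels, return_inverse=True)
    Y = np.zeros((len(labels), len(classes)))
    Y[np.arange(len(labels)), idx] = 1.0
    return Y, classes

labels = np.array([2, 0, 1, 2])           # toy labels with C = 3 classes
Y, classes = one_hot(labels)              # Y has shape (4, 3)
H = centering_matrix(len(labels))         # named H here to avoid clashing with C
Y_centered = H @ Y                        # every column now has zero mean
```

Multiplying by the centering matrix removes the column means, so covariance matrices can be formed as plain matrix products of centered data.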

III. PROPOSED METHOD

A. OVERVIEW
As mentioned in the contributions, this study proposes two types of rectifications for improving OPLS: 1) successively orthogonal deflation in constrained noisy subspace and 2) the isotropic space transform. These two rectifications are implemented in a series of related MCAs to demonstrate their incremental effects: i) Successive Subspace-MCA, ii) Isotropic Subspace-MCA, and iii) Successive Isotropic Subspace-MCA. These MCAs are all interconnected: they share the same objective, "finding coupled factors with maximum covariance," but use different strategies (i.e., the aforementioned two types of rectifications). Their interconnections are listed in Table 1. In typical OPLS, the objective is to discover a dominant projection such that the cross-covariance matrix $S_{\Phi Y} S_{\Phi Y}^T$ is maximized while $S_\Phi$ is constrained during trace-norm optimization. The three proposed MCA methods advance OPLS by introducing successive subspace deflation and the isotropic space transform $\mathcal{I}$. The former finds successively orthogonal projective vectors in subspace, whereas the latter converts dimensions with unequal influences into dimensions with equal ones in order to avoid biases across dimensions. As listed in the table, the proposed Successive Subspace-MCA, unlike typical OPLS, iteratively searches for successively orthogonal projective vectors in the residuals. The Isotropic Subspace-MCA explores the dominant projection in isotropic space rather than in the original feature space, where typical KOPLS searches. The Successive Isotropic Subspace-MCA combines the Successive Subspace-MCA and the Isotropic Subspace-MCA, so that successively orthogonal projective vectors are explored in the residual isotropic subspace. In this study, these three proposed MCA methods are discussed along with the baseline.

B. SUCCESSIVE SUBSPACE-MCA
The first proposed MCA is Successive Subspace-MCA (SS-MCA). It enhances KOPLS methods that find coupled factors with maximum covariance. In KOPLS, the weighted signal space (i.e., $S_\Phi^{-1} S_{\Phi Y} S_{\Phi Y}^T$ or equivalently $S_\Phi^{-T/2} S_{\Phi Y} S_{\Phi Y}^T S_\Phi^{-1/2}$) is based on the signal space (i.e., $S_{\Phi Y} S_{\Phi Y}^T$) and the noise space (i.e., $S_\Phi$), but the weighted signal space is deflated without any update on the noise space.
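The baseline behavior described above, deflating the weighted signal space while leaving the noise space untouched, can be sketched as follows. This is a hedged illustration of the KOPLS-style baseline using standard Hotelling deflation, not the proposed SS-MCA update; the names `hotelling_deflate` and `baseline_deflation`, the synthetic data, and the symmetric inverse square root are assumptions of this sketch.

```python
import numpy as np

def hotelling_deflate(A, q, lam):
    """Hotelling deflation: remove the eigenpair (lam, q) from symmetric A."""
    return A - lam * np.outer(q, q)

def baseline_deflation(S_phi, S_phiY, n_components, eps=1e-10):
    """Extract projection vectors one at a time from the weighted signal space.

    Only the weighted signal space A is deflated; the noise space S_phi
    (and hence F = S_phi^{-1/2}) is never updated, as in the baseline.
    """
    w, Q = np.linalg.eigh(S_phi)
    F = (Q / np.sqrt(np.maximum(w, eps))) @ Q.T     # S_phi^{-1/2}
    A = F @ S_phiY @ S_phiY.T @ F                   # weighted signal space
    vectors, eigvals = [], []
    for _ in range(n_components):
        lam_all, Q_all = np.linalg.eigh(A)          # eigenvalues in ascending order
        lam, q = lam_all[-1], Q_all[:, -1]          # dominant eigenpair
        vectors.append(F @ q)                       # map back to feature space
        eigvals.append(lam)
        A = hotelling_deflate(A, q, lam)            # S_phi stays the same
    return np.column_stack(vectors), np.array(eigvals)

# Synthetic intrinsic-space data: J = 6 dimensions, N = 50 samples (columns).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 50))
Y = rng.normal(size=(50, 3)) + Phi[:3].T
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)
Y_c = Y - Y.mean(axis=0)
S_phi = Phi_c @ Phi_c.T
S_phiY = Phi_c @ Y_c

V, eigvals = baseline_deflation(S_phi, S_phiY, n_components=3)
```

The extracted vectors satisfy the constraint $V^T S_\Phi V = I$, and the eigenvalues come out in non-increasing order. A successive scheme of the kind proposed in this study would additionally revise the noise space inside the loop, which this baseline sketch deliberately does not do.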
Let $v_{SS}$ represent a J × 1 projection vector, which is obtained from (5). The equivalent form of (5) involves a Hotelling deflation iterative process, in which a basis is removed from the subspace. Furthermore, ι denotes the ι-th iteration, and λ signifies the eigenvalue. Equation (6) updates the noise space when a new projection vector is found. Then, the process loops back to (5) and keeps searching for another projection vector in the residual until D vectors are discovered. The advantage is that the signal space, i.e., the target of covariance maximization, stays the same during matrix deflation, whereas the noise space (i.e., $S_\Phi$) is updated in each iteration along with the change to the weighted signal space. The proposed SS-MCA is summarized as an algorithm, and it has the following property: SS-MCA is rotationally invariant.

C. ISOTROPIC SUBSPACE-MCA
Isotropic Subspace-MCA (IS-MCA) relies on the isotropic space transform, which corrects the distributional variance of each predictor variable. Let $\mathcal{I}$ represent the isotropic space transform. IS-MCA is expressed in terms of $\mathcal{I}$, which focuses on generating a transform vector $F = S_\Phi^{-1/2}$.
In IS-MCA, $V_{IS}$ and $W_{IS}$ are generated from the data in the isotropic space and the original feature space, respectively. In brief, $V_{IS}$ functions on the isotropic space, whereas $W_{IS}$ works on the original feature space. U is then solved with Φ, Y, $W_{IS}$, and the multivariate version of (2).

D. SUCCESSIVE ISOTROPIC SUBSPACE-MCA
In each iteration where a projection vector is selected, the proposed Successive Isotropic Subspace-MCA (SIS-MCA) performs an update on its residual feature space by invoking (14). Hence, when the subsequent iteration (i.e., (12)) is launched, potential projection vectors are explored from the residual. The procedure of the proposed SIS-MCA is summarized in its algorithm.
The proposed SIS-MCA has rotational invariance. This property can be verified by using the same concept mentioned in rotational-invariant SS-MCA. Therefore, the proof is omitted herein.
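The isotropic space transform used by IS-MCA and SIS-MCA can be illustrated with a short sketch. It assumes the symmetric inverse square root for $F = S_\Phi^{-1/2}$ and, as one plausible reading of the text, relates a projection in isotropic space to one in the original feature space through `W = F @ V`; that relation, the helper name `isotropic_transform`, and the synthetic data are assumptions of this sketch rather than details confirmed by the source.

```python
import numpy as np

def isotropic_transform(S_phi, eps=1e-10):
    """Return F = S_phi^{-1/2} (symmetric inverse square root)."""
    w, Q = np.linalg.eigh(S_phi)
    return (Q / np.sqrt(np.maximum(w, eps))) @ Q.T

# Three dimensions with very different variances; columns are samples.
rng = np.random.default_rng(1)
scales = np.array([10.0, 1.0, 0.1])
Phi = rng.normal(size=(3, 500)) * scales[:, None]
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)
S_phi = Phi_c @ Phi_c.T

F = isotropic_transform(S_phi)
Phi_iso = F.T @ Phi_c                    # data remapped into isotropic space
S_iso = Phi_iso @ Phi_iso.T              # approximately the identity matrix

# Hypothetical projection found in isotropic space, mapped back (assumption).
V_iso = np.linalg.qr(rng.normal(size=(3, 2)))[0]
W_orig = F @ V_iso                       # acts directly on the original data
```

After the transform every dimension has equal variance, so no single sensor dominates the search simply because of its scale. Applying `W_orig` to the original centered data gives the same hidden components as applying `V_iso` to the isotropic data.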

IV. EXPERIMENTAL RESULT
Experiments on six open datasets were carried out to evaluate performance. Information about these datasets is listed in Table 2. The first column shows the name. The remaining columns give the number of classes, samples, and dimensions (after kernel mapping), respectively. All the datasets were downloaded from the UCI machine learning repository. The experiment used 80% of the data for training and 20% for testing. Ten-fold cross-validation was applied to the experiment. For kernels, as RBFs were inapplicable to intrinsic space due to their infinite dimensions, third-order polynomial (poly3) kernels were used herein. For comparison, OPLS was tested along with the proposed methods. The proposed methods included SS-MCA, IS-MCA, and SIS-MCA, which implemented the proposed rectifications: 1) successively orthogonal deflation in constrained noisy subspace and 2) the isotropic space transform. The benchmark criteria were a) fitting accuracy, b) covariance between the response variables and hidden components, and c) average covariance increases. When examining those benchmarks, it is worth noting that the objective of OPLS is to fit data well without decreasing covariance too much. This means that high accuracy with low covariance does not achieve the target of OPLS. Figs. 2-7 show the comparison between typical OPLS and the proposed methods. The horizontal axis denotes the discovered hidden components, whereas the vertical axis represents fitting accuracy or covariance. Only half of the hidden components are displayed in the figures. For accuracy, observations showed that all the proposed methods were better than typical OPLS, and the difference between OPLS and the proposed methods grew as the number of hidden components increased toward half of the hidden components. Moreover, the accuracy of OPLS gradually degraded, whereas that of the proposed methods remained more stable.
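To make the notion of a finite intrinsic space for a third-order polynomial kernel concrete, the sketch below enumerates all monomials of total degree at most 3. It is an illustration under assumptions: the experiments do not specify the exact feature map, the monomials here are unweighted (they span the same space as an inhomogeneous poly3 kernel but omit its multinomial coefficients), and `poly_intrinsic_map` is a hypothetical helper.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def poly_intrinsic_map(X, degree=3):
    """Map rows of X (N, M) to all monomials of total degree <= `degree`."""
    n, m = X.shape
    cols = [np.ones(n)]                                   # degree-0 term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(m), d):
            cols.append(np.prod(X[:, list(idx)], axis=1)) # e.g. x0 * x0 * x2
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))              # N = 5 samples, M = 4 raw dimensions
Phi_poly = poly_intrinsic_map(X, degree=3)
J = Phi_poly.shape[1]                    # intrinsic dimensionality
J_expected = comb(4 + 3, 3)              # number of monomials: C(M + 3, 3)
```

For M = 4 this gives J = 35 intrinsic dimensions, which is finite and can be handled explicitly. An RBF kernel has no such finite expansion, which is the reason given above for using poly3 kernels.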
For OPLS, although it achieved higher accuracy when fewer hidden components were used in data fitting, the covariance between the response variables and the hidden components was not satisfactory. Notably, the purpose of OPLS is to fit data well without decreasing covariance too much. Closely examining Figs. 2-7 reveals that either the proposed SS-MCA or the proposed SIS-MCA generated the maximum covariance. This is because both SS-MCA and SIS-MCA utilized the proposed successively orthogonal deflation in constrained noisy subspace, which was a major rectification for enhancing OPLS. Such a rectification narrowed down the noisy space and enhanced signal-to-noise ratios. The proposed IS-MCA did not generate maximum covariance because it simply used the isotropic space transform; IS-MCA can be regarded as OPLS in isotropic space. Nevertheless, IS-MCA served as a stepping stone for SIS-MCA because SIS-MCA inherited the advantages of both SS-MCA (using successively orthogonal deflation) and IS-MCA (using the isotropic space transform). Table 3 summarizes the average covariance increase when more hidden components were added. As listed in the table, either SS-MCA or SIS-MCA yielded a higher increase in covariance, far better than OPLS. Based on the above experimental results for a) fitting accuracy, b) covariance, and c) average covariance increases, the two proposed rectifications, namely successively orthogonal deflation in constrained noisy subspace and the isotropic space transform, could improve OPLS. This indicated that the proposed method not only fitted data well but also maximized covariance.

V. CONCLUSION
OPLS is usually susceptible to low signal-to-noise ratios owing to its objective: maximizing cross-covariance in the signal space (i.e., $S_{\Phi Y} S_{\Phi Y}^T$) while the noise space (i.e., $S_\Phi$) is minimized at the same time. Even worse, between-group information is not used when cross-covariance matrices are computed. Such a characteristic intrinsically limits the capability to find correlations between predictor and response variables. To enhance signal-to-noise ratios, this work proposes a novel OPLS method by introducing 1) successively orthogonal deflation in constrained noisy subspace and 2) the isotropic space transform for rectifying low signal-to-noise ratios while maximizing cross-covariance. The former seeks potential hidden components in the residual feature space and updates the weighted signal space (i.e., $S_\Phi^{-T/2} S_{\Phi Y} S_{\Phi Y}^T S_\Phi^{-1/2}$), which OPLS ignores. The feature space (i.e., $S_\Phi$) and its residuals are iteratively deflated, and the result is reflected in the weighted signal space. The latter searches for potential hidden components in isotropic space instead of feature space. Isotropic space provides an alternative perspective for successively orthogonal deflation. The two proposed rectifications were implemented and assessed in SS-MCA, IS-MCA, and SIS-MCA.
The experimental results showed that both SS-MCA (using successively orthogonal deflation) and SIS-MCA (using the isotropic space transform and successively orthogonal deflation) satisfied the objective. Cross-covariance was maximized while data were also fitted, far better than with the baseline. As expected, IS-MCA did not achieve the objective. This is because it simply performed the isotropic space transform, and its remaining processes were the same as those of the baseline. Nevertheless, the experimental results indicated that using the isotropic space transform and successively orthogonal deflation simultaneously was still conducive to finding dominant components. This is why SIS-MCA outperformed the other approaches in some results. Such findings indicated that the proposed rectifications were capable of enhancing OPLS effectively.
Future work will concentrate on a more effective eigenstructure, e.g., a new isotropic space transform that utilizes between-group information while maximizing cross-covariance matrices at the same time. The proposed approach devised a transform vector that relied on $S_\Phi^{-1/2}$, but an attempt to use between-group matrices (i.e., $S_{\mathrm{Between}}^{-1/2}$) and within-group matrices (i.e., $S_{\mathrm{Within}}^{-1/2}$) did not yield better results than the proposed one. Finally, as an open issue, it is still a challenge to efficiently decide the optimum number of dominant components. A more efficient way will be developed to avoid iterative matrix deflation for rapid computation.