A Doubly Regularized Linear Discriminant Analysis Classifier with Automatic Parameter Selection

Linear discriminant analysis (LDA) based classifiers tend to falter in many practical settings where the training data size is smaller than, or comparable to, the number of features. As a remedy, regularized LDA (RLDA) methods have been proposed. However, the classification performance of these methods varies depending on the size of the training and test data. In this paper, we propose a doubly regularized LDA classifier that we denote as R2LDA. In the proposed R2LDA approach, two regularization operations are carried out: one involves only the training data set, while the other also includes the given test data sample. Unlike classical RLDA techniques, the proposed R2LDA algorithm caters for errors in the training data as well as possible noise in the test data. Choosing the two regularization parameters in R2LDA can be automated through existing methods based on least squares (LS). In particular, we show that a constrained perturbation regularization approach (COPRA) is well suited for the regularization parameter selection task needed for the proposed R2LDA classifier. Results obtained from both synthetic and real data demonstrate the consistency and effectiveness of the proposed R2LDA-COPRA classifier, especially in scenarios involving noisy test data.


I. INTRODUCTION
The idea of linear discriminant analysis (LDA) was originally conceived by R. A. Fisher [1] and is based on the assumption that the data in each class follow a Gaussian distribution with a common class covariance matrix. Owing to its simplicity, LDA has been successfully applied to various classification and recognition problems such as detection [2], speech recognition [3], cancer genomics [4], [5], and face recognition [6], to mention a few.
The performance of LDA based classifiers depends heavily on accurate estimation of the class statistics in the form of sample covariance matrices and mean vectors. These estimates are fairly accurate when the number of available samples is large compared to the data dimensionality. In practical high-dimensional data settings, the challenge is to cope with the limited number of available samples. In this case, the sample covariance estimates become highly perturbed and ill-conditioned, resulting in severe performance degradation. To alleviate this problem, the sample covariance matrix is replaced with a regularized, or ridge, covariance matrix [7], giving the name regularized LDA (RLDA) [8], [9], [10]. The performance of RLDA classifiers is ultimately dictated by the choice of the regularization parameter. It is essential to judiciously set the value of the regularization parameter to reap the full benefits of RLDA. Towards this end, various regularization techniques have been proposed; e.g., cross-validation [11] has been one of the classical techniques for estimating the ridge parameter, as evidenced in [12], [13], [14], [5], [15]. However, the search mechanisms of these methods lead to high computational complexity. In addition, they are not based on performance-optimizing criteria.
Recently, an optimal regularization method that minimizes the asymptotic classification error was derived in [16], [17]. The method is based on recent results from random matrix theory. In [18], the latter method was extended to a more general class of discriminant analysis based classifiers, with LDA obtained as a special case. Despite being elegant approaches, both [17] and [18] require a grid search mechanism to find the best value of the regularization parameter. In [19], an improved RLDA classifier was proposed that avoids the grid search but is limited to spiked-model covariance structures. It is worth mentioning here that these theoretical results rely strongly on the Gaussian assumption, and so they might not apply equally well to real data. Moreover, the performance of the above-mentioned approaches deteriorates significantly when the test data is contaminated with noise that is not observed during the training stage.
Focusing on binary classification, this paper presents a doubly regularized LDA (R2LDA) classifier, obtained by expressing the LDA score function as an inner product of two vectors that are linearly related to the mean vectors and the data covariance matrices. These vectors are estimated using a perturbation regularization approach [20] in which the regularization parameters can be selected to be optimal in the mean-squared-error (MSE) sense. The proposed method handles both the ill-conditioning of the sample covariance matrices and the uncertainties in the training and test data. In addition, the proposed method has the following distinctive features:
• Two regularization parameters are calculated based on both the training and the test data. These parameters can be tuned independently to cope with the different perturbations, including those in the test data. This is to be contrasted with existing approaches, which utilize a single regularization operation based solely on the training data. This feature makes the proposed approach more robust to noise that is unobserved in the training data but occurs in the test data.
• The regularization parameter selection approach is agnostic to the underlying distribution of the data, contrary to [17], [18], [19], which rely on the Gaussian assumption.

II. RLDA CLASSIFICATION
We consider the binary classification problem of assigning a multivariate observation vector $\mathbf{x} \in \mathbb{R}^{p \times 1}$ to one of two classes $\mathcal{C}_i$, $i = 0, 1$. Let $\pi_i$ be the prior probability that $\mathbf{x}$ belongs to class $\mathcal{C}_i$, and assume that the class-conditional densities $P(\mathbf{x} | \mathbf{x} \in \mathcal{C}_i)$, $i = 0, 1$, are Gaussian with mean vectors $\mathbf{m}_i \in \mathbb{R}^{p \times 1}$ and positive semi-definite covariance matrices $\boldsymbol{\Sigma}_i \in \mathbb{R}^{p \times p}$.
LDA employs the Bayesian discriminant rule, which assigns $\mathbf{x}$ to the class with the maximum posterior probability. Let $S_0 = \{\mathbf{x}_l\}_{l=1}^{n_0}$ and $S_1 = \{\mathbf{x}_l\}_{l=n_0+1}^{n_0+n_1}$ represent the available training samples pertaining to the classes $\mathcal{C}_0$ and $\mathcal{C}_1$, respectively, where $n_i$ is the number of samples in class $\mathcal{C}_i$ and $n = n_0 + n_1$ is the total number of training samples. The LDA score function reads [21]

$W^{\mathrm{LDA}}(\mathbf{x}) = \big(\mathbf{x} - \tfrac{1}{2}(\hat{\mathbf{m}}_0 + \hat{\mathbf{m}}_1)\big)^T \hat{\boldsymbol{\Sigma}}^{-1} (\hat{\mathbf{m}}_0 - \hat{\mathbf{m}}_1),  (1)

where $(\cdot)^T$ is the matrix transpose operation. The unbiased mean vector estimates $\hat{\mathbf{m}}_i$ and the pooled sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ are given by

$\hat{\mathbf{m}}_i = \frac{1}{n_i} \sum_{\mathbf{x}_l \in S_i} \mathbf{x}_l, \qquad \hat{\boldsymbol{\Sigma}} = \frac{(n_0 - 1)\hat{\boldsymbol{\Sigma}}_0 + (n_1 - 1)\hat{\boldsymbol{\Sigma}}_1}{n - 2},  (2)

where the sample covariance matrices $\hat{\boldsymbol{\Sigma}}_i$ are defined as

$\hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1} \sum_{\mathbf{x}_l \in S_i} (\mathbf{x}_l - \hat{\mathbf{m}}_i)(\mathbf{x}_l - \hat{\mathbf{m}}_i)^T.  (3)

The class assignment rule for $\mathbf{x}$ is

$\mathbf{x} \in \begin{cases} \mathcal{C}_0, & \text{if } W^{\mathrm{LDA}}(\mathbf{x}) > \log(\pi_1/\pi_0), \\ \mathcal{C}_1, & \text{otherwise.} \end{cases}$  (4)
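To make the formulation concrete, the following Python sketch evaluates (1)-(3) and leaves the thresholding of (4) to the caller; the function name lda_score and the sample layout (rows of X0 and X1 are training samples) are our choices, not notation from the paper.

import numpy as np

def lda_score(x, X0, X1):
    """Classical LDA score (1); rows of X0 and X1 are training samples."""
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)      # mean estimates in (2)
    # Pooled sample covariance (2), built from the class covariances (3)
    Sigma = ((n0 - 1) * np.cov(X0, rowvar=False) +
             (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    # Solve a linear system instead of forming the explicit inverse;
    # this step fails when Sigma is singular, which motivates regularization
    return (x - 0.5 * (m0 + m1)) @ np.linalg.solve(Sigma, m0 - m1)

Per (4), a test vector is assigned to $\mathcal{C}_0$ whenever lda_score(x, X0, X1) exceeds $\log(\pi_1/\pi_0)$.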
A major source of error in the above formulation is the inversion of the covariance matrix $\hat{\boldsymbol{\Sigma}}$. In many practical setups where $n$ is comparable to $p$, $\hat{\boldsymbol{\Sigma}}$ becomes ill conditioned, or even singular. To get around this issue, $\hat{\boldsymbol{\Sigma}}^{-1}$ in (1) is replaced with a regularized matrix $\mathbf{H} = (\mathbf{I} + \gamma \hat{\boldsymbol{\Sigma}})^{-1}$, where $\gamma \in \mathbb{R}^{+}$ and $\mathbf{I}$ is the identity matrix of dimension $p$. This replacement results in the RLDA score function [17], [16]

$W^{\mathrm{RLDA}}(\mathbf{x}) = \big(\mathbf{x} - \tfrac{1}{2}(\hat{\mathbf{m}}_0 + \hat{\mathbf{m}}_1)\big)^T \mathbf{H} (\hat{\mathbf{m}}_0 - \hat{\mathbf{m}}_1).  (5)

In this work, we employ a form of regularization different from the one in (5). In the proposed regularized LDA classifier, we apply two separate regularization operations, which help in accounting for errors in the training data and in providing robustness against error contributions that are present in the test data.
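For later comparison with the doubly regularized method, here is a minimal sketch of the RLDA score (5); the name rlda_score is ours, and gamma is assumed to be supplied by one of the selection methods cited above.

import numpy as np

def rlda_score(x, m0, m1, Sigma, gamma):
    """RLDA score (5): Sigma^{-1} is replaced by H = (I + gamma*Sigma)^{-1}."""
    p = Sigma.shape[0]
    H = np.linalg.inv(np.eye(p) + gamma * Sigma)   # regularized inverse H
    return (x - 0.5 * (m0 + m1)) @ H @ (m0 - m1)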

III. THE PROPOSED R2LDA CLASSIFIER
Existing RLDA techniques are based on (5), with $\mathbf{H}$ estimated by selecting the regularization parameter $\gamma$ using only the training data. This makes these techniques vulnerable to errors in the test data, especially when the error statistics of the test data deviate from those of the training data. To address this issue, we reformulate the LDA score function (1) as

$W^{\mathrm{LDA}}(\mathbf{x}) = \mathbf{z}^T \mathbf{b},  (6)

where $\hat{\mathbf{m}}_+ \triangleq \tfrac{1}{2}(\hat{\mathbf{m}}_0 + \hat{\mathbf{m}}_1)$, $\hat{\mathbf{m}}_- \triangleq \hat{\mathbf{m}}_0 - \hat{\mathbf{m}}_1$, $\mathbf{x}' \triangleq \mathbf{x} - \hat{\mathbf{m}}_+$, $\mathbf{z} \triangleq \hat{\boldsymbol{\Sigma}}^{-1/2} \mathbf{x}'$, and $\mathbf{b} \triangleq \hat{\boldsymbol{\Sigma}}^{-1/2} \hat{\mathbf{m}}_-$. Based on the last two definitions, our proposed R2LDA method aims to obtain regularized estimates of $\mathbf{z}$ and $\mathbf{b}$ to improve the computation of the score function in (6). To this end, we utilize the linear models

$\mathbf{x}' = \hat{\boldsymbol{\Sigma}}^{1/2} \mathbf{z} + \mathbf{v}_x,  (7)

$\hat{\mathbf{m}}_- = \hat{\boldsymbol{\Sigma}}^{1/2} \mathbf{b} + \mathbf{v}_m,  (8)

where $\mathbf{v}_x$ and $\mathbf{v}_m$ are additive noise vectors. Each of (7) and (8) can be represented by the generic model

$\mathbf{y} = \hat{\boldsymbol{\Sigma}}^{1/2} \mathbf{c} + \mathbf{v},  (9)

where $\mathbf{v}$ represents the model noise. To simplify our derivations, we make the following assumptions:
1) The noise vector $\mathbf{v}$ is zero mean with an unknown covariance matrix $\sigma_v^2 \mathbf{I}$.
2) The unknown random vector $\mathbf{c}$ is zero mean with an unknown positive semi-definite diagonal covariance matrix $\boldsymbol{\Sigma}_{cc}$.
3) The vectors $\mathbf{v}$ and $\mathbf{c}$ are mutually independent.
In Section V, we will see that these simplifying assumptions still work well across different classification examples.
Focusing on (9), regularization methods, commonly known as ridge regression or Tikhonov regularization [22], [23], [24], can be applied to obtain a stabilized estimate of $\mathbf{c}$. This estimate can be expressed in closed form as [10]

$\hat{\mathbf{c}} = (\hat{\boldsymbol{\Sigma}} + \gamma \mathbf{I})^{-1} \hat{\boldsymbol{\Sigma}}^{1/2} \mathbf{y}.  (10)

Based on (10), we can estimate $\mathbf{z}$ and $\mathbf{b}$ and substitute the results in (6) to obtain the R2LDA score function in the form

$W^{\mathrm{R2LDA}}(\mathbf{x}) = \hat{\mathbf{z}}^T \hat{\mathbf{b}} = \mathbf{x}'^T \mathbf{U} \mathbf{D} (\mathbf{D}^2 + \gamma_z \mathbf{I})^{-1} (\mathbf{D}^2 + \gamma_b \mathbf{I})^{-1} \mathbf{D} \mathbf{U}^T \hat{\mathbf{m}}_-,  (11)

where $\gamma_z \in \mathbb{R}^{+}$ and $\gamma_b \in \mathbb{R}^{+}$ are the regularization parameters associated with the linear systems (7) and (8), respectively. The second equality in (11) follows directly from substituting in (10) the eigenvalue decomposition (EVD) of $\hat{\boldsymbol{\Sigma}}$, given by $\hat{\boldsymbol{\Sigma}} = \mathbf{U} \mathbf{D}^2 \mathbf{U}^T$, where $\mathbf{U}$ is the matrix of eigenvectors and $\mathbf{D}^2$ is the diagonal matrix of eigenvalues of $\hat{\boldsymbol{\Sigma}}$. Now, it only remains to set the values of the regularization parameters $\gamma_z$ and $\gamma_b$. In the following section, we present a robust method to compute the regularization parameter for the regularized least-squares (RLS) solution in (10).
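The score (11) can be evaluated directly from the EVD of the pooled covariance. The sketch below assumes the two parameters gamma_z and gamma_b are already available (they are selected in Section IV); the function name r2lda_score is ours.

import numpy as np

def r2lda_score(x, m0_hat, m1_hat, Sigma_hat, gamma_z, gamma_b):
    """R2LDA score (11) for a single test vector x."""
    x_prime = x - 0.5 * (m0_hat + m1_hat)       # x' = x - m_+
    m_minus = m0_hat - m1_hat                   # m_- = m_0 - m_1
    d2, U = np.linalg.eigh(Sigma_hat)           # Sigma_hat = U diag(d2) U^T
    # Spectral filter D (D^2 + gamma_z I)^{-1} (D^2 + gamma_b I)^{-1} D,
    # i.e., the two RLS estimates (10) of z and b combined in one step
    filt = d2 / ((d2 + gamma_z) * (d2 + gamma_b))
    return (U.T @ x_prime) @ (filt * (U.T @ m_minus))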
Remark 1: Compared to the conventional RLDA score function (5), the new formulation (11) involves two regularization operations. Note that the estimation of the class mean vectors $\mathbf{m}_i$ results in perturbations in both $\hat{\mathbf{m}}_-$ and $\mathbf{x}'$. In addition, $\mathbf{x}'$ carries errors coming from the test data. By carrying out two independent estimations to obtain regularized estimates of $\mathbf{z}$ and $\mathbf{b}$ in (6), we can optimize the choice of two different regularization parameters to cope with the different perturbations in $\mathbf{x}'$ and $\hat{\mathbf{m}}_-$. This is the key advantage of the proposed R2LDA method over the classical RLDA based on (5), which uses a single regularization operation involving only the training data. It will also become clear that the proposed R2LDA still uses statistics from the training data only, which is a fundamental requirement of any machine learning algorithm.

IV. REGULARIZATION PARAMETER SELECTION
Several methods have been proposed in the literature for selecting the regularization parameter $\gamma$ required in (10), e.g., the L-curve [25], the generalized cross-validation (GCV) [26], and the quasi-optimal method [27], [28], to mention a few. These methods use different criteria, which result in different values of the regularization parameter (see [29]).
In this work, we adopt the constrained perturbation regularization approach (COPRA) [20], which selects the regularization parameter in a way that optimizes the MSE, and we adapt this algorithm to the setting of the problem at hand. COPRA works by introducing an artificial perturbation into the linear model to improve the singular-value structure of the model matrix $\hat{\boldsymbol{\Sigma}}$; hence, it is well suited to the naturally perturbed model at hand. To proceed, we start by replacing $\hat{\boldsymbol{\Sigma}}^{1/2}$ in (9) with a perturbed version to obtain the model

$\mathbf{y} = (\hat{\boldsymbol{\Sigma}}^{1/2} + \boldsymbol{\Delta}) \mathbf{c} + \mathbf{v},  (12)

where $\boldsymbol{\Delta} \in \mathbb{C}^{m \times n}$ is an unknown perturbation matrix that is norm bounded by a positive number $\lambda$, i.e., $\|\boldsymbol{\Delta}\|_2 \leq \lambda$. One can view $\boldsymbol{\Delta}$ as a means of perturbing $\hat{\boldsymbol{\Sigma}}^{1/2}$ to stabilize the solution of (12) [20]. The perturbation $\boldsymbol{\Delta}$ can also be thought of as a genuine error in the model due to the noisy nature of $\hat{\boldsymbol{\Sigma}}^{1/2}$, which is the case for (9). To obtain an estimate of $\mathbf{c}$, we minimize the worst-case residual error; namely, we pursue the optimization

$\min_{\hat{\mathbf{c}}} \max_{\|\boldsymbol{\Delta}\|_2 \leq \lambda} \big\| \mathbf{y} - (\hat{\boldsymbol{\Sigma}}^{1/2} + \boldsymbol{\Delta}) \hat{\mathbf{c}} \big\|_2.  (13)

Interestingly, as shown in [30], [20], [31], the min-max problem (13) can be converted to a minimization problem whose solution is given by (10) with the constraint

$\gamma = \lambda \frac{\|\mathbf{y} - \hat{\boldsymbol{\Sigma}}^{1/2} \hat{\mathbf{c}}\|_2}{\|\hat{\mathbf{c}}\|_2}.  (14)

We observe that the solution of (13) depends on the bound $\lambda$ (in addition to the other linear-system parameters) and is agnostic to the structure of the perturbation matrix $\boldsymbol{\Delta}$. Note that both $\lambda$ and $\hat{\mathbf{c}}$ are unknown. However, we can substitute (10) and the EVD of $\hat{\boldsymbol{\Sigma}}$ in (14) and manipulate to obtain

$\lambda^2 = \frac{\mathrm{tr}\big[\mathbf{D}^2 (\mathbf{D}^2 + \gamma \mathbf{I})^{-2} \mathbf{U}^T \mathbf{y}\mathbf{y}^H \mathbf{U}\big]}{\mathrm{tr}\big[(\mathbf{D}^2 + \gamma \mathbf{I})^{-2} \mathbf{U}^T \mathbf{y}\mathbf{y}^H \mathbf{U}\big]},  (15)

where $\mathrm{tr}(\cdot)$ is the matrix trace operator. Since $\lambda$ in (15) is stochastic in nature, we consider a value of $\lambda$ that represents the average case. To this end, we replace $\mathbf{y}\mathbf{y}^H$ with its expected value $\mathbb{E}[\mathbf{y}\mathbf{y}^H]$, which, based on (9) and Assumptions 1-3, takes the form

$\mathbb{E}[\mathbf{y}\mathbf{y}^H] = \hat{\boldsymbol{\Sigma}}^{1/2} \boldsymbol{\Sigma}_{cc} \hat{\boldsymbol{\Sigma}}^{H/2} + \sigma_v^2 \mathbf{I}.  (16)

Owing to the ill-conditioning of $\hat{\boldsymbol{\Sigma}}$, some of its eigenvalues are likely to be very close to, or even equal to, zero. Therefore, the EVD of $\hat{\boldsymbol{\Sigma}}$ can be written in the partitioned form

$\hat{\boldsymbol{\Sigma}} = [\mathbf{U}_1\ \mathbf{U}_2] \begin{bmatrix} \mathbf{D}_1^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{D}_2^2 \end{bmatrix} [\mathbf{U}_1\ \mathbf{U}_2]^T,  (17)

where $\mathbf{D}_1^2$ and $\mathbf{D}_2^2$ are diagonal matrices containing the $n_s$ most significant and the remaining least significant eigenvalues, respectively. This partitioning is introduced for generality; for the special case where all eigenvalues are significant, we set $n_s = n$ and no partitioning is required. A threshold-based approach for locating the partitioning point is recommended in [20]; here, however, we use the simple and intuitive rule of setting $n_s$ to the smaller of $p$ and $n$, i.e., $n_s = \min(n, p)$. The rationale behind (17) will be explained subsequently (see Remark 2). Now, we substitute (16) and (17) in (15) and manipulate to obtain (18), shown at the top of the next page. Next, we eliminate the dependency of $\lambda$ on the unknowns $\sigma_v^2$ and $\boldsymbol{\Sigma}_{cc}$ in (18) by invoking the MSE criterion. The MSE of the RLS estimator (10) can be written as [10]

$\mathrm{MSE}(\gamma) = \gamma^2\, \mathrm{tr}\big[(\hat{\boldsymbol{\Sigma}} + \gamma \mathbf{I})^{-2} \boldsymbol{\Sigma}_{cc}\big] + \sigma_v^2\, \mathrm{tr}\big[(\hat{\boldsymbol{\Sigma}} + \gamma \mathbf{I})^{-2} \hat{\boldsymbol{\Sigma}}\big].  (19)

Differentiating (19) with respect to $\gamma$ and setting the derivative to zero, while approximating the diagonal matrix $\boldsymbol{\Sigma}_{cc}$ by its average value $(\mathrm{tr}(\boldsymbol{\Sigma}_{cc})/n) \mathbf{I}$, yields the MSE-minimizing regularization parameter

$\gamma^{o} = \frac{n\, \sigma_v^2}{\mathrm{tr}(\boldsymbol{\Sigma}_{cc})}.  (20)

By substituting (20) in (18), we obtain (21), shown at the top of the next page, which gives a bound $\lambda$ that does not depend on the statistics of $\mathbf{c}$ or those of the noise. Note that the derivation of (16)-(18) relies largely on Assumptions 1-3. Ultimately, by using (21), we can eliminate $\lambda$ from (15) to obtain (22), where $\mathbf{d} = \mathbf{U}^T \mathbf{y}$ and $\beta = n/n_s$. Equation (22), which is nonlinear in $\gamma$, can be solved using Newton's method [32] to obtain the optimal value of $\gamma$. The iterations should be initialized from a positive initial guess close to zero to avoid missing the positive root, as explained in [20].
Remark 2: Equation (22) is based on the contribution of only the significant eigenvalues of $\hat{\boldsymbol{\Sigma}}$, which occupy the diagonal of the matrix $\mathbf{D}_1^2$. In this case, if Newton's method iterations start from a small initial value of $\gamma$, the (diagonal) matrix inversion required to compute the right-hand side of (22) is numerically stable, since the diagonal elements of $\mathbf{D}_1^2$ are not overly small. This highlights the benefit of partitioning and truncating the insignificant eigenvalues in (17).
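Since (22) is a scalar equation in gamma, the root finding itself is simple. The generic Newton iteration below abstracts the left-hand side of (22) as a black-box function F (its exact expression appears in the paper's equation (22) and is not reproduced here) and reflects the small positive initialization recommended in [20]; all names are ours.

import numpy as np

def newton_root(F, gamma0=1e-6, tol=1e-9, max_iter=100, h=1e-8):
    """Find a positive root of F(gamma) = 0 by Newton's method."""
    gamma = gamma0                        # start near zero, per [20]
    for _ in range(max_iter):
        f = F(gamma)
        fp = (F(gamma + h) - f) / h       # forward-difference derivative
        if fp == 0.0:
            break                         # flat spot; stop gracefully
        step = f / fp
        gamma = max(gamma - step, 1e-12)  # keep the iterate positive
        if abs(step) < tol:
            break
    return gamma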

A. Summary of the Proposed R2LDA-COPRA Algorithm
The main steps of the proposed R2LDA algorithm based on COPRA are summarized as follows:
1) Estimate the class statistics $\hat{\mathbf{m}}_i$, $\hat{\boldsymbol{\Sigma}}_i$, and $\hat{\boldsymbol{\Sigma}}$ from the training data by using (2) and (3).
2) Compute $\hat{\mathbf{m}}_+$, $\hat{\mathbf{m}}_-$, and the EVD of $\hat{\boldsymbol{\Sigma}}$ to determine $\mathbf{D}_1$ and $\mathbf{U}_1$ corresponding to the $n_s$ most significant eigenvalues.
3) Set $\mathbf{y} = \hat{\mathbf{m}}_-$ in (22) and solve using Newton's method to obtain $\gamma_b$.
4) For a given test sample, compute $\mathbf{x}'$; then repeat step 3 with $\mathbf{y} = \mathbf{x}'$ to obtain $\gamma_z$.
5) Compute the R2LDA score function given in (11) and classify the given test sample according to (4).
By using (19), COPRA guarantees that we obtain the best (in the MSE sense) regularized estimates of $\mathbf{z}$ and $\mathbf{b}$ required to form the R2LDA score function in (11). This does not guarantee optimal classification performance based on (11); however, our results show that the proposed R2LDA algorithm still outperforms classical RLDA classifiers of the form given by (5). It is worth mentioning that COPRA can be replaced with other regularization methods to compute the parameters $\gamma_b$ and $\gamma_z$. Furthermore, the proposed R2LDA algorithm uses only statistics from the training data (step 1); the computations in steps 4 and 5 use the given test sample, not the statistics of the test data or of the noise.
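The following end-to-end sketch ties steps 1)-5) together. The callback select_gamma stands for any parameter selector operating on the model (9) (COPRA in this paper; GCV or BPR are drop-in alternatives); the function names are ours, not the paper's.

import numpy as np

def fit_stats(X0, X1):
    """Step 1: class statistics from the training data (rows are samples)."""
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sigma = ((n0 - 1) * np.cov(X0, rowvar=False) +
             (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    return m0, m1, Sigma

def r2lda_classify(x, m0, m1, Sigma, select_gamma, log_prior_ratio=0.0):
    """Steps 2)-5): doubly regularized score (11) and decision rule (4)."""
    d2, U = np.linalg.eigh(Sigma)              # step 2: EVD of Sigma
    m_minus = m0 - m1
    x_prime = x - 0.5 * (m0 + m1)
    gamma_b = select_gamma(d2, U, m_minus)     # step 3: y = m_-
    gamma_z = select_gamma(d2, U, x_prime)     # step 4: y = x'
    filt = d2 / ((d2 + gamma_z) * (d2 + gamma_b))
    W = (U.T @ x_prime) @ (filt * (U.T @ m_minus))   # step 5: score (11)
    return 0 if W > log_prior_ratio else 1     # class index per (4)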

V. RESULTS
We demonstrate the performance of the proposed R2LDA classifier against the RLDA techniques based on the asymptotic error estimator (Asym) [17] and the improved error estimator (Impr) [19]. We also consider GCV [26] and bounded perturbation regularization (BPR) [33] as alternatives to COPRA for selecting the two regularization parameters of the R2LDA classifier. Both synthetic and real data are considered for the performance comparison.
The synthetic data were generated using a Gaussian data model with a common class covariance matrix $\boldsymbol{\Sigma}_0 = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}$ of dimensionality $p \times p = 100 \times 100$, having 1 on the main diagonal and 0.1 as the off-diagonal elements, and mean vectors $\mathbf{m}_0 = [0, 0, \ldots, 0]^T$ and $\mathbf{m}_1 = [a, a, \ldots, a]^T$. The parameter $a$ was chosen according to the Mahalanobis distance $\delta$ between the classes, defined as $\delta^2 = (\mathbf{m}_0 - \mathbf{m}_1)^T \boldsymbol{\Sigma}^{-1} (\mathbf{m}_0 - \mathbf{m}_1)$ [17]. We set $\delta^2 = 9$. In each trial, a training set $S_i$ of size $n_i$ was generated for class $i$, with $n_0 = n_1$. For the test data, an independent set of samples was generated for each class. A total of 500 training trials were carried out, each followed by 500 test trials.
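A sketch of this data generation follows. The zero mean for class 0 and the common covariance across the two classes are our reading of the setup above; the calibration of a from delta^2 = 9 follows directly from the Mahalanobis-distance definition.

import numpy as np

rng = np.random.default_rng(0)
p, delta2 = 100, 9.0

# Covariance with 1 on the diagonal and 0.1 off the diagonal
Sigma = np.full((p, p), 0.1) + 0.9 * np.eye(p)

# Choose a so that (m0 - m1)^T Sigma^{-1} (m0 - m1) = delta2
ones = np.ones(p)
a = np.sqrt(delta2 / (ones @ np.linalg.solve(Sigma, ones)))
m0, m1 = np.zeros(p), a * ones

# Balanced training sets (n0 = n1) drawn from the two Gaussian classes
n0 = n1 = 50                        # illustrative; n is varied in Fig. 1
X0 = rng.multivariate_normal(m0, Sigma, size=n0)
X1 = rng.multivariate_normal(m1, Sigma, size=n1)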
For real data, we use the MNIST dataset of 20 × 20 grayscale images of handwritten digits [34] and the phoneme dataset considered in [35]. The latter is based on log-periodograms (of length p = 256) of digitized speech frames extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, U.S. Department of Commerce) [35], which is widely used in speech recognition. The MNIST images are vectorized, resulting in data of dimensionality p = 400. For binary classification, selected pairs of digits were used. For the phoneme data, we used only the two phonemes transcribed as "sh" (as in "she") and "dcl" (as in "dark"). Real-data results were obtained from 100 training trials. In each trial, the training samples were chosen randomly from the dataset, and each trained model was tested using 500 examples, also randomly selected from the dataset.
For both the synthetic and the real datasets, zero-mean Gaussian noise with standard deviation $\sigma_n$ was added during the test phase. The properties of this noise were not known to the proposed R2LDA classifier or to any of the benchmark methods.
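In code, this test-phase contamination amounts to one extra line per test sample; the value of sigma_n below is illustrative, since the experiments consider several noise levels, and the stand-in test sample is hypothetical.

import numpy as np

rng = np.random.default_rng(1)
p, sigma_n = 100, 0.5               # example dimension and noise level
x_clean = rng.multivariate_normal(np.zeros(p), np.eye(p))  # stand-in test sample
x_noisy = x_clean + sigma_n * rng.standard_normal(p)       # noise unseen in training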

A. Discussion
Figs. 1-3 show the classification error versus the training data size n for different scenarios. Fig. 1 presents the results for the (synthetic) Gaussian data, while Fig. 2 and Fig. 3 present the results for the MNIST and phoneme datasets, respectively. The MNIST results are based on the digit pairs (1,7), (5,8), and (7,9). From the results in Figs. 1-3, we observe the following:
• On average, the R2LDA methods outperform the RLDA methods.
• R2LDA remains more consistent and stable than the RLDA methods as the level of noise in the test data increases. This is most visible with the MNIST and phoneme datasets.
• Among the R2LDA classifiers, R2LDA-COPRA is the most consistent; R2LDA-GCV and R2LDA-BPR falter occasionally, as in Fig. 2(a) and Fig. 2(g).

VI. CONCLUSIONS
We presented a novel regularized LDA classifier based on a dual regularization approach to provide robustness against both training and test data perturbations. In the proposed classifier, the regularization parameters are obtained by solving a nonlinear equation using Newton's method. Results based on both synthetic and real data demonstrate the effectiveness of our method, especially when noise is present in the test data. Although the proposed method is presented for binary classification, it can be easily extended to multi-class problems.