A Dynamic Discriminative Canonical Correlation Analysis via Adaptive Weight Scheme

In multi-view learning, a massive body of literature is devoted to exploring the intrinsic structure shared across views. Canonical correlation analysis (CCA) is a well-known conventional multi-view learning method that considers the correlation between two views. However, it fails to utilize class information and can hardly adapt to different tasks to extract discriminative features. In this paper, we propose a novel cross-view discriminative feature learning method called dynamic discriminative canonical correlation analysis, which captures class information to yield discriminative features. More specifically, we develop an adaptive weight scheme over the cross-view within-class and between-class scatters to make full use of the class distribution information. In addition, an iterative algorithm based on the Cauchy-Schwarz inequality and the Lagrange multiplier is proposed to handle the non-smooth objective function. Our method is applied to face recognition and multi-linguistic text classification tasks. Extensive experimental results reveal that the adaptive weight scheme plays a beneficial role and that our method is an effective feature learning method.


I. INTRODUCTION
One object can possess one or multiple data representations, and data can be classified as single-view data [1] or multi-view data [2] according to the number of data representations corresponding to the same object. Single-view data directly reflect one aspect of an object, while multi-view data explore the same latent contents from different viewpoints. In fact, most of the data in many real-world applications (e.g., hyperspectral anomaly detection [3], face recognition, and multi-linguistic text classification) are multi-view data. For instance, hyperspectral and multi-sensor data carry different discriminative information with subtle spectral differences. In addition, an article is often translated into different languages so that readers from different regions can understand the contents the authors want to deliver; the text representations of the article in different languages constitute multi-view data. Such multi-view data can benefit a variety of practical applications including classification, retrieval, and label prediction tasks [4]. Naturally, multi-view feature learning has become a popular and valuable field. Scholars show a strong interest in the study of multi-view methods and have proposed many meaningful algorithms based on effective approaches and rigorous theory. With the success of convolutional neural networks (CNNs [5], [6]), the computer vision community has witnessed a huge shift from conventional methods to deep learning in multi-view feature learning. Zhao et al. proposed a series of effective methods to promote joint classification using a hierarchical random walk network (HRWN) [7], a fractional Gabor convolutional network [8], a dual-tunnel CNN [9], and a Patch-to-Patch CNN [10].
On the other hand, in recent years traditional hashing has also demonstrated the superiority of short binary representations for high-dimensional data. To make full use of the correlations of cross-view data, many effective cross-view hashing methods, e.g., semi-paired discrete hashing (SPDH) [11] and multi-view discrete hashing (MvDH) [12], have been proposed. Nowadays multi-view feature learning is divided into three kinds: co-training, subspace learning, and multi-kernel learning. Canonical correlation analysis (CCA) [13], which belongs to subspace learning, has been widely regarded as a classic method with the capability of exploiting the linear correlation between two series of features. Based on multivariate correlation statistics, CCA maximizes the correlation between two canonical variates in a common low-dimensional feature space [14]. The canonical projections of CCA are conjugate orthogonal projections; specifically, the ith projection pair is conjugate orthogonal with the jth (i ≠ j) projection pair to preserve useful content in each feature space. Nevertheless, classic CCA fails to cope with the small sample size (SSS) problem [15]. The SSS problem and noise disturbance usually lead to poor performance in recognition and classification tasks. Orthogonality [16] has been proved a useful criterion for solving the SSS problem. Shen et al. [17] proposed an extended version of CCA, orthogonal CCA (OCCA), which computes the first canonical projection pair and constrains the other canonical projection direction pairs with orthogonality. OCCA gives effective solutions by using twin eigen decompositions and by replacing the conjugate orthogonality constraint; a large number of experiments have shown its validity. Another effective approach to improving robustness is probabilistic CCA (PCCA) [18].
By extracting canonical correlation features under a distribution assumption in the shared subspace, PCCA has been proved effective in improving the robustness of the model.
In addition, a number of existing multi-view methods are committed to exploiting both the mutual information toward the same object and the unique character of each view, such as cross-view locality preserved diversity and consensus learning (CvLP-DCL) [19], learning a joint affinity graph for multi-view subspace clustering [20], and cost-sensitive canonical correlation analysis (CSCCA) [21]. CCA and OCCA are unsupervised feature learning methods [22], which fail to utilize label information to enhance their discrimination power for better recognition performance. In theory, supervisory information [23] reveals the real distribution, and more of the real structure information can be explored to train a more discriminative model for feature learning.
Sharma et al. [24] presented a general multi-view feature learning approach, generalized multi-view analysis (GMA), for cross-view classification and retrieval, which is supervised, discriminative, and effective. GMA pointed out that some existing feature learning methods can be regarded as special forms of quadratically constrained quadratic programming (QCQP). GMA first adds a balance parameter to extend a single-view model to a multi-view model, and then adds certain constraints for the convenience of computation in the latent low-dimensional feature space. Single-view feature learning methods such as principal component analysis (PCA) [25], linear discriminant analysis (LDA) [26], neighborhood preserving embedding (NPE) [27], and locality preserving projection (LPP) [28] can be extended through the GMA framework to multi-view versions that still preserve their inherent characteristics. For instance, numerous experiments on generalized multi-view locality preserving projection (GMLPP) [24] have shown its validity: GMLPP makes the low-dimensional features of multi-view data maintain the discriminative local structural relationships of the original high-dimensional feature space. Although the GMA framework solves the multi-view problem very well, it does not take into account the correlation between views. Different from GMA, discriminative canonical correlation analysis (DCCA [29]) makes full use of the advantage of supervision and takes the global correlation into account. The canonical projection directions in DCCA maximize within-class similarity [30] and minimize between-class similarity to eliminate redundant information within the same view, while also attaining the maximum between-view correlation.
However, the GMA framework and DCCA suffer from a lack of consideration of label scatter [30]. Under some specific application backgrounds, it is significant to highlight the class information that contributes most to discrimination. For instance, to judge whether an article is suitable for publication in a scientific journal, a training model based on articles published in that journal should be more sensitive to the semantics of science and technology in text classification [31]. For vast amounts of labeled data, it is a cumbersome task to annotate the practical category weights that modify model performance, rather than simply calculating the probability of each category. Besides, it is nearly impossible to obtain consistent practical category weights from two different views. Since the problem above is a serious challenge, in this paper we propose a novel supervised feature learning method called dynamic discriminative canonical correlation analysis (DDCCA). DDCCA sets an adaptive weight [32] for each category, automatically learns the contribution of the categories, and finds the primary classes to be identified in order to improve classification performance. In addition, to simplify the solving process, DDCCA decomposes a multivariable problem into two univariate problems with mutual dynamic constraints [33]. The decomposition, based on the Cauchy-Schwarz inequality [34] and the maximum correlation between the two views, allows DDCCA to fully learn the dispersion of the supervisory information, realize class sensitivity, and achieve a fast, dynamic, globally optimal solution. Numerous experiments on image recognition and text classification reveal the superiority of our method.
We summarize the main contributions of this work as follows: 1) To better exploit the inner structure of classes and the practical contribution of categories hidden in vast labeled data, a dynamic discriminative canonical correlation analysis via an adaptive weight scheme is designed to learn the consistent contribution of different classes and preserve the class scatter in the joint low-dimensional subspace of the two views, which helps improve the discrimination power of the features. 2) In order to adjust the adaptive weights and obtain discriminative features quickly, we propose an effective iterative solution with alternating optimization based on the Cauchy-Schwarz inequality and the Lagrange multiplier, which has low computational complexity. 3) We conduct extensive experiments on several face and multi-linguistic text datasets to demonstrate the superiority of our method over existing related methods. The rest of this paper is organized as follows: In Section 2, we briefly review conventional CCA. In Section 3, a dynamic discriminative canonical correlation analysis via an adaptive weight scheme is formulated and elaborated in detail. In Section 4, we perform several experiments on public datasets against other methods to demonstrate the effectiveness of our method. Finally, in Section 5, we conclude the paper.

II. REVIEW OF CCA
In univariate statistical analysis, the correlation coefficient [35] measures the linear correlation between two random variables, and the multiple correlation coefficient measures the linear correlation between one random variable and several others. CCA, by contrast, is a multivariate statistical analysis method that uses the correlation between canonical variable pairs to reflect the overall correlation between two groups of indicators. More specifically, given N pairs of samples {(x_i, y_i)}_{i=1}^{N} coming from c classes in two views, CCA aims to find pairs of projection directions α and β that maximize the correlation between the joint projections α^T x and β^T y. The canonical correlation ρ is defined as

ρ = (α^T C_xy β) / sqrt((α^T C_xx α)(β^T C_yy β)),  (1)

where C_xx and C_yy are the covariance matrices of X and Y, and C_xy is the cross-covariance matrix between X and Y. From Eq. (1), we can see that the value of ρ is independent of the lengths of α and β; it depends only on the angular relationship between them [36]. As in LDA, it is therefore natural to constrain the denominator by α^T C_xx α = 1 and β^T C_yy β = 1. The resulting optimization problem of CCA can be solved by the Lagrange multiplier method.
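As a concrete illustration, the CCA solution above can be sketched in a few lines of NumPy. The whitening-plus-SVD route and the small ridge term `eps` (added to keep the covariance matrices invertible, e.g. under the SSS problem) are implementation choices of ours, not part of Eq. (1):

```python
import numpy as np

def cca(X, Y, d, eps=1e-6):
    """Classical CCA sketch. X (d_x, N) and Y (d_y, N) hold column samples.

    Returns projection matrices A (d_x, d) and B (d_y, d) whose columns
    maximize corr(A[:, i]^T x, B[:, i]^T y), plus the canonical correlations."""
    N = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)   # zero-mean each view
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T / N + eps * np.eye(X.shape[0])
    Cyy = Y @ Y.T / N + eps * np.eye(Y.shape[0])
    Cxy = X @ Y.T / N

    def inv_sqrt(C):
        # inverse square root of a symmetric positive-definite matrix
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :d]          # satisfies A^T Cxx A = I
    B = Wy @ Vt[:d].T          # satisfies B^T Cyy B = I
    return A, B, s[:d]         # s holds the canonical correlations
```

The whitened SVD is algebraically equivalent to the Lagrange-multiplier derivation: the unit-variance constraints become orthonormality after whitening, and the singular values are the canonical correlations ρ.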

III. A DYNAMIC DISCRIMINATIVE CANONICAL CORRELATION ANALYSIS VIA ADAPTIVE WEIGHT SCHEME
Although CCA can reduce the dimension of raw data to extract features, it fails to utilize supervised information such as labels, which has been proved a superior strategy. To overcome this shortcoming of CCA, Bai et al. [37] proposed a supervised combined feature learning method known as DCCA. The features extracted by DCCA are provably more discriminative, which benefits image recognition. However, DCCA fails to exploit the distribution of classes [38] and to respond to the differences between classes.
In this section, we explore the discriminative structures [39] of the classes to attain sensitivity to the contribution of each class through the natural structure hidden in the data rather than artificial classification logic. The novel cross-view discriminative feature learning method, DDCCA, is then proposed. DDCCA utilizes the Cauchy-Schwarz inequality [34] to realize effective self-learning, and a theoretical analysis is given to prove its validity.
A. FORMULATION
Suppose X = [x_1, . . . , x_N] and Y = [y_1, . . . , y_N] are two sets of N pairs of zero-mean samples across different views, drawn from c classes, where n_i is the number of samples in the ith class and N = n_1 + · · · + n_c. The greater the within-class correlation of the canonical variates, the more profound and obvious the characteristics of the classes; the smaller the between-class correlation, the greater the difference and the higher the degree of discrimination between classes, which is conducive to classification and recognition. Aiming at extracting discriminative features, we can construct an objective function by maximizing the cross-view within-class scatter and minimizing the cross-view between-class scatter:

max_{α,β} α^T C_w β − α^T C_b β, s.t. α^T C_xx α = 1, β^T C_yy β = 1,  (2)

where C_w = Σ_{i=1}^{c} Σ_{k,l∈π_i} x_k y_l^T and C_b = Σ_{i≠j} Σ_{k∈π_i, l∈π_j} x_k y_l^T denote the cross-view within-class and between-class scatter matrices, with π_i the index set of the ith class. Eq. (2) constrains the within-class structures and between-class relationships of the canonical variables. The class weights in this formula default to 1 for each class, which implies equal effectiveness of every class in discriminative feature learning. However, different classes have different sensitivities for classification or recognition tasks. To better explore the inner structure of each class and the relationships between different classes, we seek a feasible linear joint combination of the column vectors of X and Y to represent the cross-view within-class and between-class scatters, and we introduce a tunable parameter θ that reveals the relative sensitivity between the within-class and between-class scatters. The correlation criterion of the pairs of canonical variates is then modified to

max_{α,β} Σ_{i=1}^{c} δ_i α^T C_w^{(i)} β − θ Σ_{i=1}^{c} η_i α^T C_b^{(i)} β,  (3)

where δ_i and η_i denote per-class weights affected by the distribution of the classes, and C_w^{(i)} and C_b^{(i)} are the contributions of the ith class to the within-class and between-class scatters. Although more details about the different classes can be explored in this way, it results in higher overall complexity. The problem can be simplified through a mathematical connection between the cross-view within-class and between-class scatters, since the samples have been mean-normalized.
Because the samples are zero-mean, the sum of all cross-view sample products vanishes, so the between-class scatter matrix can be rewritten as

C_b = −C_w.  (4)

Obviously, the between-class scatter matrix can be represented by the within-class scatter matrix, which indicates the connection between the local and global distributions of the classes. Furthermore, in order to adjust the contributions of the classes with smaller scatter and the classes with larger scatter, an adaptive weight is assigned to each class, ω = {w_i, i = 1, . . . , c}, globally, instead of simply adding regularization for specific classes manually. Thus C_w is denoted as

C_w = X W Y^T,  (5)

where W = blkdiag(w_1 E_{n_1}, . . . , w_c E_{n_c}), with E_{n_i} the n_i × n_i all-ones block, is an adaptive weight matrix that is symmetric, positive semi-definite, and blocked diagonal [40]; rank(W) = c is always less than the number of samples. We then minimize the weight summation function and, using Eq. (4), simplify the criterion by analyzing only the cross-view within-class scatter so as to obtain a higher correlation:

f(α, β, w) = Σ_{i=1}^{c} (1/w_i) α^T C_w^{(i)} β, s.t. Σ_{i=1}^{c} w_i = 1.  (6)

We notice that each term of the correlation increases as the weight 1/w_i decreases, which intentionally reduces the contribution of the classes with small correlations. The projection directions can be obtained by maximizing f, and the objective function can be written as

max_{α,β,w} Σ_{i=1}^{c} (1/w_i) α^T C_w^{(i)} β.  (7)

f(α, β, w) is independent of the lengths of α and β [36]; that is, f(α, β, w) possesses the property of scale invariance. Thus the criterion function is rewritten as

max_{α,β,w} Σ_{i=1}^{c} (1/w_i) α^T C_w^{(i)} β, s.t. α^T C_xx α = 1, β^T C_yy β = 1, Σ_{i=1}^{c} w_i = 1.  (8)

Specifically, Eq. (8) approximately reduces to a supervised canonical correlation analysis when all the elements of w are equal. It is easy to see that the optimal solutions α and β enhance not only the intra-class aggregation but also the inter-class separability. Since the optimal projection directions depend on the dynamic weight w, the objective function is difficult to solve directly. Thus an iterative approach is constructed to deal with the problem in the next subsection.
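To make the scatter construction concrete, the following sketch builds the blocked-diagonal weight matrix and the cross-view scatters. The helper name and the explicit indicator-matrix form of the between-class scatter are our own illustration; for zero-mean samples it numerically confirms that the unweighted scatters satisfy C_b = −C_w:

```python
import numpy as np

def class_scatters(X, Y, labels, weights=None):
    """Cross-view within-/between-class scatter sketch.

    With samples as columns and per-class weights w_i, C_w = X W Y^T
    where W is blocked-diagonal (value w_i on every same-class pair of
    the ith class).  C_b sums over pairs from different classes; for
    centered data the unweighted scatters satisfy C_b = -C_w."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if weights is None:
        weights = np.ones(len(classes))
    N = X.shape[1]
    W = np.zeros((N, N))
    for w_i, cl in zip(weights, classes):
        idx = np.flatnonzero(labels == cl)
        W[np.ix_(idx, idx)] = w_i            # blocked-diagonal weight matrix
    Cw = X @ W @ Y.T
    # between-class scatter: indicator of pairs from different classes
    M = 1.0 - (labels[:, None] == labels[None, :]).astype(float)
    Cb = X @ M @ Y.T
    return Cw, Cb
```

With unit weights, C_w + C_b = X J Y^T where J is the all-ones matrix, and X J Y^T = 0 for centered data, which is exactly the identity behind Eq. (4).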

B. OPTIMIZATION
In this subsection, a capable and fast iterative approach is proposed to reach the optimal stable solution of Eq. (8).
Firstly, the adaptive weights ω = {w_i : w_i = 1/c, i = 1, . . . , c} are initialized, and then ω and (α, β) are updated iteratively by updating one while fixing the other.
When ω is fixed, the objective function of Eq. (8) simplifies to

max_{α,β} α^T C_w β, s.t. α^T C_xx α = 1, β^T C_yy β = 1,  (9)

where C_w here denotes the weighted within-class scatter built from the current ω. It is not difficult to solve this with the Lagrange multiplier method; the corresponding Lagrange function is

L(α, β) = α^T C_w β − (λ_1/2)(α^T C_xx α − 1) − (λ_2/2)(β^T C_yy β − 1).  (10)

Setting the partial derivatives of L with respect to α and β to zero yields

C_w β = λ_1 C_xx α,  (11)
C_w^T α = λ_2 C_yy β.  (12)

Multiplying both sides of Eq. (11) and Eq. (12) by α^T and β^T respectively, and combining with the constraints of Eq. (9), we get

λ_1 = λ_2 = λ = α^T C_w β.  (13)

Since the adaptive weight matrix is a symmetric blocked-diagonal matrix, that is,

W^T = W,  (14)

Eq. (11) and Eq. (12) can be converted into two generalized eigen-decomposition problems:

C_w C_yy^{−1} C_w^T α = λ^2 C_xx α,  (15)
C_w^T C_xx^{−1} C_w β = λ^2 C_yy β.  (16)

Obviously, the eigen decompositions of Eq. (15) and Eq. (16) give the projection matrices A and B consisting of the optimal projection direction pairs (α_i, β_i)_{i=1}^{d}. When A and B are fixed, Eq. (8) can be reformulated with the matrix trace as

max_w Σ_{i=1}^{c} (1/w_i) tr(A^T C_w^{(i)} B), s.t. Σ_{i=1}^{c} w_i = 1.  (17)

According to the property of the matrix trace, tr(mA + nB) = m tr(A) + n tr(B). This property makes the solution of Eq. (17) feasible by utilizing the Cauchy-Schwarz inequality, as shown in Theorem 1.
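A hedged sketch of the whole alternating scheme follows. The whitened SVD stands in for the generalized eigen decompositions, and the closed-form update w_i ∝ √s_i (with s_i the per-class correlation contribution, under the constraint that the weights sum to one) is our Cauchy-Schwarz-based reading of the weight step, not the authors' exact algorithm:

```python
import numpy as np

def _inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    e, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(e)) @ V.T

def ddcca_alternating(X, Y, labels, n_iter=20, eps=1e-6):
    """Alternating-optimization sketch: fix w -> solve the eigen step
    for the leading pair (alpha, beta); fix (alpha, beta) -> update the
    per-class weights w in closed form and renormalize."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    c, N = len(classes), X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T / N + eps * np.eye(X.shape[0])
    Cyy = Y @ Y.T / N + eps * np.eye(Y.shape[0])
    Wx, Wy = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    w = np.full(c, 1.0 / c)                     # initialize w_i = 1/c
    for _ in range(n_iter):
        # step 1: fix w, build the weighted within-class scatter C_w = X W Y^T
        W = np.zeros((N, N))
        for w_i, cl in zip(w, classes):
            idx = np.flatnonzero(labels == cl)
            W[np.ix_(idx, idx)] = w_i
        Cw = X @ W @ Y.T / N
        # leading canonical pair via whitened SVD (equivalent to the
        # generalized eigen problems under unit-variance constraints)
        U, s, Vt = np.linalg.svd(Wx @ Cw @ Wy)
        alpha, beta = Wx @ U[:, 0], Wy @ Vt[0]
        # step 2: fix (alpha, beta), update w_i proportional to sqrt(s_i)
        s_cls = np.array([abs((alpha @ X[:, labels == cl]).sum()
                              * (beta @ Y[:, labels == cl]).sum())
                          for cl in classes])
        w = np.sqrt(s_cls + eps)
        w /= w.sum()                            # keep sum_i w_i = 1
    return alpha, beta, w
```

Minimizing Σ_i s_i/w_i subject to Σ_i w_i = 1 attains the Cauchy-Schwarz lower bound (Σ_i √s_i)² exactly at w_i ∝ √s_i, which is why each weight step has a closed form.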

Algorithm 1 Dynamic Discriminative Canonical Correlation Analysis
Input: two-view training sets X and Y with class labels; reduced dimension d.
Output: projection matrices A and B; adaptive weights ω.
1: Initialize w_i = 1/c, i = 1, . . . , c.
2: repeat
3:   Fix ω, construct C_w, and solve Eq. (15) and Eq. (16) for A and B.
4:   Fix A and B, update ω by the Cauchy-Schwarz closed form.
5: until convergence
6: return A, B, ω.
The per-iteration cost is dominated by the two eigen decompositions and thus scales cubically with the larger of the two view dimensions d_x and d_y; the overall computation complexity of DDCCA grows linearly with the number of iterations T. Benefitting from pre-processing, the data of the two views have already undergone a proper dimension reduction in practice, so the total computation cost is not greatly affected by the high dimension of the raw data. Thus DDCCA has low computation complexity.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, a variety of face recognition [41] and text classification [42] experiments are carried out on several real-world image and text datasets to evaluate the effectiveness of DDCCA. For the image datasets, we employ the two-view strategy of [41] to obtain two views of each image. Concretely, using the Coiflets and Daubechies wavelet transforms [44], we obtain two low-frequency sub-images of each image, and the K-L transformation [45] is then utilized to reduce the dimensionality of the two sub-images, yielding two views of each image. The performance of DDCCA is compared with that of CCA, DCCA, OCCA, GMLPP, and graph-regularized multiple canonical correlation analysis (GrMCCA) [51]. In the final recognition stage, we utilize the serial feature fusion strategy [41], [43] and the K-nearest-neighbor classifier based on Euclidean distance [46].
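The recognition stage can be sketched as follows; the plain concatenation of the projected views and the 1-NN Euclidean rule are our assumptions about the serial fusion and classification steps:

```python
import numpy as np

def serial_fusion_knn(A, B, Xtr, Ytr, labels_tr, Xte, Yte):
    """Serial feature fusion + 1-NN sketch: project each view with the
    learned matrices A and B, stack the projected features, and assign
    each test sample the label of its nearest Euclidean neighbor."""
    Ztr = np.vstack([A.T @ Xtr, B.T @ Ytr])     # fused training features
    Zte = np.vstack([A.T @ Xte, B.T @ Yte])     # fused test features
    # squared Euclidean distances between every test and training sample
    d2 = ((Zte[:, :, None] - Ztr[:, None, :]) ** 2).sum(axis=0)
    return labels_tr[np.argmin(d2, axis=1)]     # label of nearest neighbor
```

Serial fusion concatenates the d-dimensional projections of the two views into a 2d-dimensional feature, so both views contribute to every distance computation.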

A. INTRODUCTION TO DATA
We perform face recognition on three image datasets and text classification on one text dataset; the details of each dataset are elaborated in this subsection.
UMIST [47] dataset. The dataset contains 575 grayscale face images of 20 different people. The images of the same individual vary in pose from profile to frontal views. The raw images were manually cropped and resized to 112 × 92 pixels by Daniel Graham at UMIST.
Yale [48] dataset. The dataset is made up of facial images of 15 people. Each person has 11 facial images of different kinds (normal frontal, wearing glasses, varying lighting conditions, expressions, eyes closed). The size of each image is 100 × 100 pixels.
YaleB [49] dataset. It includes 38 different individuals, and for each person there are nearly 64 frontal images taken under different lighting directions. Each image is a 32 × 32 pixel grayscale image.
Reuters [50] Multilingual dataset. It contains six subsets of 1,200 documents over six labels or topics (E21, CCAT, M11, GCAT, C15, and ECAT), with each document described in five languages using a vocabulary of 2,000 words per language.
Without loss of generality, we adopt the average accuracy and standard deviation to better evaluate the performance of our method.

B. EXPERIMENTS USING THE UMIST DATASET
In this experiment, we randomly select u (u = 2, 3, 4) images per individual as training samples, and the remaining images are used as testing samples. Ten independent experiments were run, and the average recognition rates over the 10 runs are displayed in Table 1.
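The split protocol just described (u training images per class, the rest for testing, repeated over independent runs) can be sketched as index generation, leaving the classifier pluggable; the function name and interface are illustrative:

```python
import numpy as np

def random_split_eval(labels, u, n_runs=10, seed=0):
    """Generate n_runs random (train, test) index splits with exactly u
    training samples per class; accuracies over the runs can then be
    averaged and a standard deviation reported."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    splits = []
    for _ in range(n_runs):
        train_idx = []
        for cl in np.unique(labels):
            idx = np.flatnonzero(labels == cl)
            # u training samples per class, drawn without replacement
            train_idx.extend(rng.choice(idx, size=u, replace=False))
        train_idx = np.sort(np.array(train_idx))
        test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
        splits.append((train_idx, test_idx))
    return splits
```

Stratifying by class guarantees every identity appears in training even for u = 2, which matters for the supervised methods compared in the tables.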
From Table 1, we can see that the recognition results of DDCCA are the best among all the methods. In addition, Table 1 shows that the cross-view correlation considered by CCA benefits recognition. DCCA utilizes the label information to better approximate the real-world distribution of the data, so DCCA outperforms traditional CCA.
Although GMLPP considers the locality information of the data, 2 training samples per class are too few for GMLPP to construct a reliable manifold for training a discriminative model. Likewise, Table 1 shows that the graph information constructed by GrMCCA with 2 training samples per class fails to yield effective discriminative features. As the number of training samples per class increases, the recognition rates of all methods improve accordingly.

C. EXPERIMENTS USING THE YALE DATASET
In this experiment, we randomly select u (u = 2, 3, 4) images per individual as training samples, and the remaining images are used as testing samples. Ten independent experiments were run, and the average recognition rates over the 10 runs are displayed in Table 2.
It can be seen from Table 2 that DDCCA is superior to CCA, DCCA, OCCA, GMLPP, and GrMCCA. Thanks to the adaptive weight scheme, the features DDCCA extracts have more discrimination power. OCCA still performs the worst while the other methods achieve comparable recognition results, although its standard deviation again shows the stability conferred by orthogonality. The experiments on the Yale dataset again confirm the validity and improvement of DDCCA in face recognition.

D. EXPERIMENTS USING THE YALEB DATASET
In this implementation, we randomly select u (u = 2, 3, 4) images per individual as training samples, and the remaining images are used as testing samples. Ten independent experiments were run; the average recognition rates over the 10 runs are displayed in Table 3, and Fig. 2 shows the accuracy of the methods as the dimension varies.
As can be seen from Table 3, DDCCA still achieves better recognition performance most of the time. The standard deviations of OCCA and GrMCCA in Tables 2 and 3 are relatively low, which also indicates that OCCA and GrMCCA are robust across different face recognition datasets.
From Figure 2, we can see that the recognition rate basically rises as the dimensionality increases. The recognition rate of DDCCA rises rapidly over dimensions 1-20, peaks early in the 20-40 range, and maintains a certain level over dimensions 40-100. This shows that the dynamic strategy helps our method converge quickly to a stable optimal solution. In addition, the results of DDCCA are consistent with those of DCCA, which indicates that DCCA already responds well to the class distribution of this dataset.
In total, thanks to the adaptive weight scheme and the dynamic strategy, DDCCA obtains excellent performance in face recognition and adapts to the different class distributions of different face datasets.

E. EXPERIMENTS ON TEXT CLASSIFICATION
To evaluate the effectiveness of the methods further, we turn to multi-linguistic text classification in this section. The Reuters multilingual dataset has six subsets; following the order on the dataset website, they are denoted Subset1 to Subset6.
First, from Table 4, we can see that DDCCA outperforms the other feature learning methods in most cases. As the number of samples per topic increases, the classification accuracy also increases.
Second, we randomly select 50 articles per topic in their English and French versions from each of the six subsets. Table 5 shows that the classification performance of DDCCA is superior to that of the other methods on all subsets, which proves that DDCCA can adapt to different datasets.
At last, we randomly select 50 articles per topic written in two languages (all 10 combinations of the 5 languages) on the final subset. As shown in Table 6, DDCCA still outperforms the other methods in most cases, which further shows that DDCCA can adapt to different language pairs.
The results in Tables 4, 5, and 6, viewed from the perspectives of training samples, different subsets, and different languages, fully show the superiority of self-adaptation. Besides, the performance of DDCCA and DCCA shows that supervisory information is conducive to text classification. Moreover, the dynamic solution ensures robustness and strong classification performance, as reflected by the standard deviations and average accuracies.

V. CONCLUSION
Supervisory information and multiple correlations are beneficial to the discrimination power of cross-view feature learning. However, most existing methods fail to be sensitive to the class distribution. In this paper, we propose a novel cross-view dynamic discriminative feature learning method, DDCCA. To make full use of the supervisory information, each class is assigned an adaptively learned weight in the low-dimensional feature space, and we adopt the Cauchy-Schwarz inequality and the Lagrange multiplier method to obtain iterative local optimal solutions that converge to a globally stable optimal solution maximizing the correlation between the two views. A series of experiments on both face image datasets and multi-linguistic text datasets demonstrates that DDCCA achieves consistent accuracy improvements over existing competing methods. Future research will focus on extending the method to more views with self-learned correlation rules.

ACKNOWLEDGMENT
(Shuzhi Su and Penglian Gao contributed equally to this work.)