Exponential Multi-Modal Discriminant Feature Fusion for Small Sample Size

Multi-modal Canonical Correlation Analysis (MCCA) is an important information fusion method, and several discriminant variations of MCCA have been proposed. However, these variations suffer from the Small Sample Size (SSS) problem and the absence of cross-modal discriminant scatters. Thus we propose a novel exponential multi-modal discriminant feature fusion method for small numbers of training samples, i.e. Exponential Multi-modal Discriminant Correlation Analysis (EMDCA). In the method, we construct a discriminative integration scatter of all the modalities by constraining the aggregation towards cross-modal discriminative centroids. Besides, the method provides a decomposition-based matrix exponential strategy. The strategy solves the SSS problem and improves robustness to noise, and we further provide corresponding theoretical proofs and some intuitive analysis. The method can learn correlation fusion features with strong discriminative power from a small number of samples. Encouraging experimental results show the effectiveness and robustness of our method.


I. INTRODUCTION
Data collected in real-world applications is divided into single-modal and multi-modal data depending on the number of data representations corresponding to the same target [1]. Single-modal data has only one type of data representation for a single target. Unlike single-modal data, multi-modal data has several categories of data representations corresponding to one and the same target. For example, we can gather face data, fingerprint data, gait data, and so on, which simultaneously characterize a person from different viewpoints; these data can be viewed as the person's multi-modal data. Compared to single-modal data, multi-modal data contains more information about a single target. Multi-modal fusion is divided into three fusion levels [2]: pixel-level fusion, feature-level fusion, and decision-level fusion. Among the three, pixel-level fusion [3] is located at the lowest level and can retain the original information to the greatest extent. Pixel-level fusion is generally divided into four steps: preprocessing, transformation, synthesis, and inverse transformation. Decision-level fusion [4] mainly focuses on the integration of multiple classifiers. Although feature-level fusion [5] started later than the other fusion methods, its advantages are obvious: feature-level fusion not only extracts the discriminant information of each modality but also preserves the complementary information among different modalities as much as possible in the learned fusion subspaces, which is crucial for classification tasks. This paper also focuses on feature-level fusion.

The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.
In all multi-modal feature fusion methods, Canonical Correlation Analysis (CCA) [6] plays an important role. CCA aims at learning a correlation fusion subspace of two modalities in which the between-modal correlation is maximal. CCA has been used in several real-world applications, including fault detection [7], multi-modal biometric analysis [8], and process monitoring [9]. As an unsupervised linear feature fusion method, CCA has some limitations. To preserve the geometry structure information of raw data, local CCA [10] discovers the hidden local common information by a novel metric and develops a data-driven correlation model based on that metric. Gao et al. [11] developed locality-preserving CCA for two-dimensional matrix data, which can capture intrinsic structures with the help of local neighbor information. Moreover, class information is important discriminant information that can significantly improve the classification performance of many feature fusion methods.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To embed class labels into correlation analysis theories, generalized CCA [12] optimizes the within-modal discriminative information while the across-modal correlation is maximized. Inspired by the regularization technique and multi-modal discriminant analysis [13], Han et al. [14] proposed a supervised correlation analysis that considers the discrimination information and correlation information of paired data. Local discriminant CCA [15] can extract nonlinear features with the discriminative structures of neighbor graphs. Similarly, supervised locality-preserving CCA [16], locality preserving randomized CCA [17], local discrimination CCA [18], and hypergraph CCA [19] exploit local structure information and class label information under the correlation analysis framework, and the effectiveness of these methods has been demonstrated in image recognition tasks. The above correlation-based methods only deal with two-modal data, but the number of modalities is usually more than two in many real-world applications. With the development of correlation analysis theories, Multi-modal CCA (MCCA) [20] appeared; the method is also called multi-set CCA, multi-view CCA, and multi-way CCA in different studies. MCCA is a generalized modality extension of CCA, and CCA can be treated as a special case of MCCA. Thus, MCCA has a broader range of applications and stronger theoretical benefits. Up to now, the research focus of many scholars has shifted from CCA-related methods to MCCA-related methods. Discriminative Multiple CCA (DMCCA) [21] maximizes intra-class correlations and reduces inter-class correlations of various modalities to improve the discriminative power of correlation fusion features.
Graph Regularized Multi-set Canonical Correlations (GrMCC) [22] builds an objective optimization function with within-modal discriminative scatters using the graph regularization technique, allowing learned correlation fusion features to maintain discriminative geometry structures. Yuan et al. [23] suggested a Laplacian Multi-view Canonical Correlation Analysis (LapMCCA) method that can learn nonlinear correlation fusion features by using weight constraints of nearest neighbor graphs. Shen et al. [24] devised a generalized graph embedding MCCA framework that can embed various graphs into multi-modal correlation theories, and three graph embedding correlation analysis methods were also given by means of three different graphs. To enhance the discriminative power, Hessian MCCA [25] employs the Hessian technique to capture intrinsic locality geometry structures concealed in raw multi-modal data.
Multi-modal correlation theories with supervised information are an important and active research subject in multi-modal feature fusion. Although some supervised MCCA-related methods have been proposed, the existing methods ignore between-modal discriminative scatter information, which can significantly improve the class separability of low-dimensional fusion features, especially for classes that are easily confused. Furthermore, the existing supervised MCCA-related methods still suffer from the well-known Small Sample Size (SSS) problem [26]: the sample dimensionality exceeds the number of samples, which leads to singular matrices. The singularity of matrices is usually fatal to the optimization of feature fusion methods. To solve these issues, we propose a novel Exponential Multi-modal Discriminant Correlation Analysis (EMDCA) method. In the method, we construct a discriminative integration scatter of all the modalities; the scatter integrates cross-modal discriminant scatter information on top of within-modal discriminant scatter information, which better reveals the global discriminant structure of all the modalities. Besides, the method adopts a decomposition-based matrix exponential strategy to elegantly address the SSS problem and improve robustness to noise, and we further provide corresponding theoretical proofs and some intuitive analysis. To the best of our knowledge, the method is the first exponential-based method for solving the SSS problem in multi-modal correlation theories. By minimizing the exponential discriminative integration scatter and simultaneously maximizing the exponential between-modal correlations, the method obtains correlation fusion features with strong discriminative power from a small number of raw samples. We design extensive experiments on five visible image datasets, two infrared image datasets, and the Reuters multilingual text dataset.
Promising results from these experiments indicate the effectiveness and robustness of our method.

II. A BRIEF REVIEW OF MCCA

Suppose that $\{Z^{(p)} \in \mathbb{R}^{d_p \times N}\}_{p=1}^{M}$ are the $M$ modality datasets and $z_k^{(p)} \in \mathbb{R}^{d_p \times 1}$ $(k = 1, 2, \ldots, N)$ represents the $k$th sample of the $p$th modality dataset $Z^{(p)}$. MCCA is a multi-modal feature fusion method that aims at finding a correlation projection direction $\phi^{(p)} \in \mathbb{R}^{d_p \times 1}$ corresponding to each $Z^{(p)}$ $(p = 1, 2, \ldots, M)$. In more detail, the optimization function of MCCA is as follows:

$$\max_{\phi^{(1)}, \ldots, \phi^{(M)}}\ \sum_{p=1}^{M} \sum_{q=p+1}^{M} \frac{\phi^{(p)T} C^{(pq)} \phi^{(q)}}{\sqrt{\left(\phi^{(p)T} C^{(pp)} \phi^{(p)}\right)\left(\phi^{(q)T} C^{(qq)} \phi^{(q)}\right)}} \qquad (1)$$

where $C^{(pq)} = Z^{(p)} Z^{(q)T}$ denotes the between-modal covariance matrix of the (centered) datasets $Z^{(p)}$ and $Z^{(q)}$, and $C^{(pp)}$ is the within-modal covariance matrix of $Z^{(p)}$. For the convenience of description, we utilize the above representations for the same concepts throughout the paper. As in [27], the optimization function of Eq. (1) can be transformed into the following constraint-based optimization problem:

$$\max_{\phi^{(1)}, \ldots, \phi^{(M)}}\ \sum_{p=1}^{M} \sum_{q=p+1}^{M} \phi^{(p)T} C^{(pq)} \phi^{(q)} \quad \text{s.t.}\quad \sum_{p=1}^{M} \phi^{(p)T} C^{(pp)} \phi^{(p)} = 1 \qquad (2)$$

As pointed out in [28], $\phi^{(p)T} C^{(pq)} \phi^{(q)}$ constrains the between-modal correlation between the correlation features $\phi^{(p)T} Z^{(p)}$ and $\phi^{(q)T} Z^{(q)}$, and $\phi^{(p)T} C^{(pp)} \phi^{(p)}$ reveals the global scatter of the correlation features $\phi^{(p)T} Z^{(p)}$. From this perspective, MCCA can be interpreted as maximizing between-modal correlations while minimizing within-modal global scatters.
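To make the constrained problem of Eq. (2) concrete, it can be solved as a generalized eigenvalue problem over block covariance matrices. The following NumPy/SciPy sketch illustrates this; the function name `mcca_directions` and the small ridge term (added so the constraint matrix stays positive definite) are our own illustrative choices, not part of any reference implementation:

```python
import numpy as np
from scipy.linalg import eigh

def mcca_directions(datasets, ridge=1e-6):
    """One set of MCCA projection directions for M modality datasets.

    datasets: list of arrays Z_p with shape (d_p, N) (same N, any d_p).
    Returns one direction phi_p per modality. Illustrative helper only;
    real MCCA variants differ in normalization and in how multiple
    directions are extracted.
    """
    # center each modality, as covariance-based CCA formulations assume
    datasets = [Z - Z.mean(axis=1, keepdims=True) for Z in datasets]
    dims = [Z.shape[0] for Z in datasets]
    offs = np.cumsum([0] + dims)

    # C: between-modal covariance blocks C^(pq) (p != q) off the diagonal
    # D: block-diagonal within-modal covariances C^(pp)
    C = np.zeros((offs[-1], offs[-1]))
    D = np.zeros((offs[-1], offs[-1]))
    for p in range(len(datasets)):
        for q in range(len(datasets)):
            block = datasets[p] @ datasets[q].T
            if p == q:
                # small ridge keeps D positive definite when d_p > N
                D[offs[p]:offs[p+1], offs[q]:offs[q+1]] = block + ridge * np.eye(dims[p])
            else:
                C[offs[p]:offs[p+1], offs[q]:offs[q+1]] = block

    # maximize between-modal correlations under the within-modal scatter
    # constraint: generalized eigenproblem C v = eta D v (largest eta)
    _, V = eigh(C, D)
    v = V[:, -1]
    return [v[offs[p]:offs[p+1]] for p in range(len(datasets))]
```

A direction for modality $p$ is the corresponding slice of the top generalized eigenvector, matching the stacked-vector view used in the derivations below.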

III. EXPONENTIAL MULTI-MODAL DISCRIMINANT CORRELATION ANALYSIS

A. MOTIVATION
Existing supervised correlation-based methods focus on constraining within-modal discriminant scatters under the correlation analysis framework. These methods have two issues: they ignore cross-modal discriminant scatters, and they suffer from the SSS problem. Therefore, we try to construct a discriminative integration scatter of all the modalities by enhancing the aggregation towards cross-modal discriminative centroids of all the correlation features. However, it is challenging to compute the cross-modal discriminative centroids because the sample dimensionalities of different modalities can differ. For this issue, we follow the idea of subspace feature fusion and reverse the reasoning. By projecting high-dimensional data into a correlation fusion subspace, correlation features are obtained, and the correlation features from different modalities share a beneficial property, i.e. the same dimensionality. Thus, in the correlation fusion subspace, we can first construct the cross-modal discriminative centroids of all modalities and then use specific derivations to obtain an implicit mathematical representation of the centroids in the high-dimensional sample space. The discriminative integration scatter is then constructed using the centroids. Besides, this scatter also usually suffers from the SSS problem, and thus we develop a decomposition-based matrix exponential strategy. The strategy solves the SSS problem, which can be proved in theory.

B. FORMULATION OF EMDCA
Suppose that $z_{gk}^{(p)}$ denotes the $k$th sample of the $g$th $(g = 1, 2, \ldots, G)$ class in the $p$th modality dataset $Z^{(p)}$, and $N_g$ represents the number of samples in the $g$th class of each modality dataset. Besides, $\phi^{(p)T} z_{gk}^{(p)}$ is the correlation feature of $z_{gk}^{(p)}$ in a discriminative integration correlation fusion subspace. We first construct a cross-modal discriminative centroid $\bar{z}_g$ of the $g$th class:

$$\bar{z}_g = \frac{1}{M N_g} \sum_{p=1}^{M} \sum_{k=1}^{N_g} \phi^{(p)T} z_{gk}^{(p)} \qquad (3)$$

The centroid is the mean of all the samples from the $g$th class in the subspace, and the discriminative information of all the modalities is also integrated in the subspace. Motivated by the global modality idea [13], [14], we try to enhance the aggregation between the correlation features and their corresponding cross-modal discriminative centroids. Thus we employ the centroid $\bar{z}_g$ to construct the discriminative integration scatter $\rho_w$ of all the modalities in the subspace:

$$\rho_w = \sum_{p=1}^{M} \sum_{g=1}^{G} \sum_{k=1}^{N_g} \left\| \phi^{(p)T} z_{gk}^{(p)} - \bar{z}_g \right\|_2^2 \qquad (4)$$

The physical meaning of the scatter is the Euclidean distance between the correlation features and their centroids. Although the scatter has the simple form of a within-modal scatter, its centroids are composed of the intra-class samples from all the modalities. Thus, the scatter also involves the intra-class samples from different modalities in addition to the within-modal intra-class samples. However, since the centroid $\bar{z}_g$ depends on the correlation projection directions of all the modalities, the optimization problem built directly on Eq. (4) is difficult to solve for the optimal projection directions under correlation analysis theories. Thus, some equivalent derivations of the scatter in Eq. (4) are necessary. For the convenience of derivation, we define the class mean $\bar{z}_g^{(p)} = \frac{1}{N_g} \sum_{k=1}^{N_g} z_{gk}^{(p)}$. Then Eq. (4) can be equivalently rewritten as

$$\rho_w = \phi^{T} H \phi \qquad (5)$$

where $\phi = [\phi^{(1)T}, \phi^{(2)T}, \ldots, \phi^{(M)T}]^{T}$ and the matrix $H$ is composed only of the raw samples. The matrix includes the implicit mathematical representation of the cross-modal discriminative centroids in the high-dimensional sample space.
From Eq. (5), we can find that the discriminative integration scatter involves both within-modal and between-modal samples. Similar to the physical meaning [29] of minimizing the within-modal intra-class scatter, minimizing the discriminative integration scatter enhances the aggregation towards the cross-modal discriminative centroids, and this aggregation further constrains the consistency of cross-modal discriminative structures, which better reveals the intrinsic neighbor relationships. Then, by minimizing the cross-modal discriminative scatter and simultaneously maximizing the pair-wise correlations of all the modalities, we construct the objective optimization problem as follows:

$$\max_{\phi}\ \phi^{T} R \phi \quad \text{s.t.}\quad \phi^{T} H \phi = 1 \qquad (6)$$

where $\phi^{T} R \phi = \sum_{p \neq q} \phi^{(p)T} C^{(pq)} \phi^{(q)}$. $H$ is called the discriminative integration matrix of all the modalities, and $R$ is named the correlation matrix. In Eq. (6), the discriminative integration matrix $H$ cannot avoid the SSS problem, especially for raw high-dimensional samples, which is usually fatal to the optimization of many feature fusion methods. Thus, we exploit a decomposition-based matrix exponential strategy to solve the SSS problem. More concretely, we decompose the matrix $H$ by eigenvalue decomposition [30]:

$$H = U \Lambda U^{T} \qquad (7)$$

where $U = [u_1, u_2, \ldots, u_{\tilde{d}}]$ is the eigenvector matrix of $H$, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{\tilde{d}})$ represents the diagonal matrix of eigenvalues, with $\tilde{d} = \sum_{p=1}^{M} d_p$. Inspired by the eigenvalue correction idea [31], [32], we construct an exponential discriminative integration matrix $\tilde{H}$ as follows:

$$\tilde{H} = U \exp(\Lambda)\, U^{T} \qquad (8)$$

where $\exp(\Lambda)$ represents the exponential eigenvalue matrix, defined as follows:

$$\exp(\Lambda) = I + \Lambda + \frac{\Lambda^{2}}{2!} + \frac{\Lambda^{3}}{3!} + \cdots \qquad (9)$$

where $I$ is the identity matrix, $2!$ denotes the factorial of 2, and $\Lambda^{2}$ is the second power of $\Lambda$, i.e. $\Lambda^{2} = \Lambda \Lambda$. Concretely, matrix exponentials are computed by the scaling and squaring method [33], [34]. Similar to Eq. (8), the exponential correlation matrix $\tilde{R}$ of $R$ is constructed as follows:

$$\tilde{R} = V \exp(\wedge)\, V^{T} \qquad (10)$$

where $V$ and $\wedge$ are respectively the eigenvector matrix and the eigenvalue matrix of the correlation matrix $R$, and $\exp(\wedge)$ has the same definition as $\exp(\Lambda)$ in Eq. (9). As a result, the EMDCA objective optimization problem is represented as follows:

$$\max_{\phi}\ \phi^{T} \exp(R)\, \phi \quad \text{s.t.}\quad \phi^{T} \exp(H)\, \phi = 1 \qquad (11)$$

When the sample dimensionality is larger than the number of samples, the constraint matrices of many optimization problems become singular. This SSS problem is avoided in the objective optimization problem of Eq. (11). For the theoretical proof, we define the matrix exponential of the discriminative integration matrix $H$:

$$\exp(H) = I + H + \frac{H^{2}}{2!} + \frac{H^{3}}{3!} + \cdots \qquad (12)$$

The matrix exponential of the correlation matrix $R$ is defined analogously:

$$\exp(R) = I + R + \frac{R^{2}}{2!} + \frac{R^{3}}{3!} + \cdots \qquad (13)$$

Theorem 1: $\exp(H) = U \exp(\Lambda)\, U^{T}$ and $\exp(R) = V \exp(\wedge)\, V^{T}$.

Proof: Let $u_k$ be an eigenvector of $H$ and $\lambda_k$ the corresponding eigenvalue. According to the properties of eigenvalue decomposition, we obtain $H u_k = \lambda_k u_k$, $H^{2} u_k = \lambda_k^{2} u_k$, $H^{3} u_k = \lambda_k^{3} u_k$, and so on. By summing the above equations, we can obtain

$$\left( I + H + \frac{H^{2}}{2!} + \cdots \right) u_k = \left( 1 + \lambda_k + \frac{\lambda_k^{2}}{2!} + \cdots \right) u_k \qquad (14)$$

Based on the definition of the matrix exponential, Eq. (14) can be reformulated into

$$\exp(H)\, u_k = \exp(\lambda_k)\, u_k \qquad (15)$$

From Eq. (15), we can find that $H$ and $\exp(H)$ have the same eigenvector matrix and that $\exp(\lambda_k)$ is an eigenvalue of $\exp(H)$. Therefore, we can obtain

$$\exp(H) = U \exp(\Lambda)\, U^{T} \qquad (16)$$

Besides, the proof of $\exp(R) = V \exp(\wedge)\, V^{T}$ is the same as the above. Thus the theorem is proved.

Theorem 2: $\exp(H)$ is non-singular, and its inverse matrix is $\exp(-H)$.

Proof: By Theorem 1, the eigenvalues of $\exp(H)$ are $\exp(\lambda_1), \ldots, \exp(\lambda_{\tilde{d}})$, and $\exp(\lambda_k) > 0$ for any real $\lambda_k$. Hence all the eigenvalues of $\exp(H)$ are positive, and $\exp(H)$ is non-singular even when $H$ itself is singular (i.e. some $\lambda_k = 0$). Moreover, $\exp(H)\exp(-H) = U \exp(\Lambda)\exp(-\Lambda)\, U^{T} = U U^{T} = I$, so the inverse matrix of $\exp(H)$ is $\exp(-H)$. The same arguments hold for $\exp(R)$. Thus the theorem is proved.
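The decomposition-based matrix exponential strategy is easy to verify numerically. The sketch below, assuming a symmetric scatter-like matrix built from fewer samples than dimensions (the SSS regime), shows that the matrix itself is singular while its exponential is full-rank with inverse exp(−H); `sym_expm` is an illustrative helper name, and the eigendecomposition route is a didactic alternative to the scaling-and-squaring method used in practice:

```python
import numpy as np

def sym_expm(H):
    """Matrix exponential of a symmetric matrix via eigendecomposition:
    exp(H) = U diag(exp(lambda_k)) U^T."""
    lam, U = np.linalg.eigh(H)
    return (U * np.exp(lam)) @ U.T

# A rank-deficient (singular) scatter-like matrix: d = 5 but only 2 samples.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
H = Z @ Z.T                      # rank 2, hence singular
print(np.linalg.matrix_rank(H))  # 2

expH = sym_expm(H)
# exp(H) is full-rank: every eigenvalue exp(lambda_k) > 0 even when lambda_k = 0
print(np.linalg.matrix_rank(expH))  # 5
# and its inverse is exp(-H), so no explicit matrix inversion is needed
print(np.allclose(expH @ sym_expm(-H), np.eye(5)))  # True
```

The last line also illustrates the inverse property exploited later: applying exp(−H) undoes exp(H) exactly, without computing a matrix inverse.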
Theorem 2 has proved the non-singularity of exp(H), which ensures the evasion of the SSS problem in theory. Additionally, the computation of inverse matrices is essential in the optimization of Eq. (11), and the time cost of computing inverse matrices is usually very high. However, Theorem 2 also proves that the inverse matrix of exp(H) is exp(−H), which significantly reduces the time cost of computing the inverse.

C. OPTIMIZATION
To obtain the analytical solutions of Eq. (11), we first construct the Lagrangian function:

$$L(\phi) = \phi^{T} \exp(R)\, \phi - \eta \left( \phi^{T} \exp(H)\, \phi - 1 \right) \qquad (17)$$

where $\eta$ is a Lagrange multiplier. Setting the partial derivative of $L(\phi)$ with respect to $\phi$ to zero, we obtain

$$\exp(R)\, \phi = \eta \exp(H)\, \phi \qquad (18)$$

Since the inverse matrix of $\exp(H)$ is $\exp(-H)$ (Theorem 2), we can obtain

$$\exp(-H) \exp(R)\, \phi = \eta\, \phi \qquad (19)$$

Theorem 3: The Lagrange multiplier $\eta$ is equal to the value of the objective function of Eq. (11), i.e. $\eta = \phi^{T} \exp(R)\, \phi$.

Proof: By multiplying $\phi^{T}$ on the left of Eq. (18), we obtain

$$\phi^{T} \exp(R)\, \phi = \eta\, \phi^{T} \exp(H)\, \phi \qquad (20)$$

According to the constraint of Eq. (11), $\phi^{T} \exp(H)\, \phi$ is equal to one. Then Eq. (20) reduces to $\eta = \phi^{T} \exp(R)\, \phi$. Thus the theorem is proved.
Eq. (19) is an eigenvalue decomposition problem, and $\eta$ is its eigenvalue. Using Theorem 3 and the above derivations, the optimization problem of Eq. (11) is reformulated into the computation of the eigenvectors corresponding to the largest eigenvalues of Eq. (19). Thus the optimal correlation projection directions are the eigenvectors of $\exp(-H)\exp(R)$ corresponding to the $d$ largest eigenvalues $\eta_1 \geq \eta_2 \geq \cdots \geq \eta_d$, and they form the projection matrix $\Phi = [\Phi^{(1)T}, \Phi^{(2)T}, \ldots, \Phi^{(M)T}]^{T}$. Since the multi-modal samples $\{z_k^{(p)}\}_{p=1}^{M}$ $(k = 1, 2, \ldots, N)$ correspond to the same target, their correlation features need to be fused in the final recognition stage. Based on the simple fusion strategy [35], [36], the correlation fusion feature $y_k$ can be obtained as

$$y_k = \sum_{p=1}^{M} \Phi^{(p)T} z_k^{(p)} \qquad (21)$$

where $y_k$ belongs to the correlation fusion feature training set $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{d \times N}$. For the testing sample sets $\{\tilde{Z}^{(p)}\}_{p=1}^{M}$, the correlation fusion features are computed in the same way:

$$\tilde{Y} = \sum_{p=1}^{M} \Phi^{(p)T} \tilde{Z}^{(p)} \qquad (22)$$

where $\tilde{y}_k$ is the $k$th correlation fusion feature of $\tilde{Y}$. Finally, the nearest neighbor classifier is applied to the correlation fusion features, and the recognition results of the correlation fusion features are taken as the recognition results of the corresponding raw high-dimensional multi-modal data. Besides, we give detailed descriptions of the method steps in Algorithm 1, and the flowchart of EMDCA is also exhibited in Fig. 1.
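The optimization and fusion stages described above can be sketched in a few lines of NumPy/SciPy, assuming the discriminative integration matrix H and correlation matrix R have already been built. `emdca_like_fusion` and `sym_expm` are illustrative names, and this is a didactic sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def sym_expm(A):
    """exp(A) for symmetric A via eigendecomposition: U exp(Lambda) U^T."""
    lam, U = np.linalg.eigh(A)
    return (U * np.exp(lam)) @ U.T

def emdca_like_fusion(datasets, H, R, d):
    """Solve exp(R) phi = eta exp(H) phi, then fuse features by summation.

    datasets: list of arrays Z_p with shape (d_p, N); H, R: symmetric
    (sum d_p) x (sum d_p) matrices standing in for the discriminative
    integration and correlation matrices of the text.
    """
    dims = [Z.shape[0] for Z in datasets]
    offs = np.cumsum([0] + dims)
    # exp(H) is positive definite even when H is singular, so the
    # generalized symmetric eigensolver applies directly (no SSS problem)
    _, Phi = eigh(sym_expm(R), sym_expm(H))
    Phi = Phi[:, ::-1][:, :d]  # eigenvectors of the d largest eigenvalues
    # simple fusion strategy: y_k = sum_p Phi^(p)T z_k^(p)
    return sum(Phi[offs[p]:offs[p+1]].T @ datasets[p]
               for p in range(len(datasets)))
```

The returned array has shape (d, N): one d-dimensional fused feature per sample, ready for a nearest neighbor classifier.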

D. ANALYSIS OF EMDCA
Our method considers the within-modal and between-modal discriminative structures by constraining the discriminative integration scatter of all the modalities. To further understand the discriminative integration scatter of Eq. (5), we equivalently reformulate it into the following form:

$$\rho_w = \sum_{g=1}^{G} \frac{1}{2 M N_g} \left( \sum_{p=1}^{M} \sum_{k=1}^{N_g} \sum_{l=1}^{N_g} \left\| \phi^{(p)T} z_{gk}^{(p)} - \phi^{(p)T} z_{gl}^{(p)} \right\|_2^2 + \sum_{p=1}^{M} \sum_{\substack{q=1 \\ q \neq p}}^{M} \sum_{k=1}^{N_g} \sum_{l=1}^{N_g} \left\| \phi^{(p)T} z_{gk}^{(p)} - \phi^{(q)T} z_{gl}^{(q)} \right\|_2^2 \right) \qquad (23)$$

From Eq. (23), we can observe that the discriminative integration scatter $\rho_w$ captures the intra-class variations of within-modal and cross-modal correlation features. As $\rho_w$ is minimized, the second term of Eq. (23) compacts the intra-class correlation features from different modalities, and the first term of Eq. (23) renders the intra-class correlation features from the same modality as similar as possible.
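The reformulation rests on the standard identity that the scatter of a point set around its mean equals a scaled sum of all pairwise squared distances, which then splits into within-modal pairs and cross-modal pairs. A quick numerical check of the identity, using arbitrary illustrative values:

```python
import numpy as np

# Scatter around the cross-modal centroid vs. scaled sum of pairwise
# squared distances, for one class with M = 2 modalities and N_g = 4
# one-dimensional correlation features per modality.
rng = np.random.default_rng(1)
feats = rng.normal(size=(2, 4))       # row p = features of modality p
x = feats.ravel()                     # all M * N_g features of the class
centroid = x.mean()

scatter = np.sum((x - centroid) ** 2)
pairwise = np.sum((x[:, None] - x[None, :]) ** 2) / (2 * x.size)
print(np.allclose(scatter, pairwise))  # True
```

Splitting the double sum over pairs into same-modality and different-modality pairs yields exactly the two terms of the reformulated scatter.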
How to avoid the SSS problem is an important research subject in feature fusion of raw high-dimensional data. Our decomposition-based matrix exponential strategy effectively solves the SSS problem of EMDCA, and the corresponding theoretical proofs have been given in Section III-B. Next, we further discuss the robustness of EMDCA to noise in raw high-dimensional data. As pointed out in [37], noise in raw data mainly influences the small eigenvalues of the scatter and correlation matrices built from the raw data. However, by means of the matrix exponential, the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{\tilde{d}}$ are changed to $\exp(\lambda_1), \exp(\lambda_2), \ldots, \exp(\lambda_{\tilde{d}})$. According to the properties of the matrix exponential, the exponentials of the large eigenvalues become proportionally larger, while the exponentials of the small eigenvalues become proportionally smaller. To make this easy to understand, we give an intuitive illustration in Fig. 2. From Fig. 2, we can find that the smallest eigenvalue $\lambda_1$ accounts for 8% of the total sum of the eigenvalues, while its exponential $\exp(\lambda_1)$ accounts for only 2% of the sum of all the eigenvalue exponentials; the smaller the eigenvalue, the smaller the proportion of its exponential. For the large eigenvalue $\lambda_6$, the proportion of its exponential rises from 26% to 56% in Fig. 2. This means that, with the help of the matrix exponential, the small eigenvalues carrying much noise information are weakened, and the large eigenvalues are enhanced. Thus the decomposition-based matrix exponential strategy ensures the non-singularity of the matrices in theory and improves robustness to noise.
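The reweighting effect described above can be reproduced with a few lines of NumPy. The eigenvalues below are arbitrary illustrative values (not taken from Fig. 2 or any experiment); exponentiation shrinks the share of the small, noise-dominated eigenvalues and grows the share of the large ones:

```python
import numpy as np

# How exponentiation reweights eigenvalue proportions.
lam = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])  # illustrative eigenvalues
share = lam / lam.sum()                          # proportions before
exp_share = np.exp(lam) / np.exp(lam).sum()      # proportions after

print(np.round(share, 3))
print(np.round(exp_share, 3))
# the smallest eigenvalue's share drops, the largest one's share rises
print(exp_share[0] < share[0], exp_share[-1] > share[-1])  # True True
```

The second line of output shows the softmax-like behavior: proportion mass migrates from the small (noise-dominated) eigenvalues to the large (signal-dominated) ones.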

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we compare our method with six representative multi-modal feature fusion methods: Graph Multi-view Canonical Correlation Analysis (GMCCA) [38], Labeled Multiple Canonical Correlation Analysis (LMCCA) [39], Discriminative Multiple Canonical Correlation Analysis (DMCCA) [21], Multi-view Discriminant Analysis (MvDA) [13], Laplacian Multi-set Canonical Correlation Analysis (LapMCCA) [23], and Graph Regularized Multi-set Canonical Correlations (GrMCC) [22]. This section evaluates the recognition performance on visible image datasets, infrared image datasets, and the Reuters multilingual text dataset. For the image datasets, we employ a common multi-modal strategy [22], [40] to obtain three modalities of each image. In all the methods, the nearest neighbor classifier with Euclidean distance [41] is used in the final recognition stage, and the best recognition rates over all possible dimensionalities are reported in the tables below. For all the experiments, the code of every method is written in the MATLAB programming language, and the running environment is MATLAB 2012a under the Windows 10 operating system.

A. EXPERIMENTS ON THE VISIBLE IMAGE DATASETS
Visible images are common in real-world applications, and this subsection employs five representative visible image datasets, i.e. the Coil20 object image dataset, the Georgia Tech (GT) face image dataset, the ORL face image dataset, the YaleB face image dataset, and the Semeion handwritten digit image dataset. For each dataset, we randomly choose four images per class for training, and the remaining images are treated as testing samples. The random sample experiments are repeated ten times in all, and the average recognition rates are tabulated in Table 1.
On the basis of correlation analysis theories, LMCCA further considers the intra-class scatter of every modality. DMCCA maximizes the between-modal intra-class correlations and meanwhile reduces the between-modal inter-class correlations. MvDA is a multi-modal extension of linear discriminant analysis [29], and a discriminant common subspace is learned by optimizing a generalized Rayleigh quotient. These three methods only consider the embedding of class labels from different viewpoints, and their recognition rates are also similar in Table 1. In LapMCCA and GrMCC, the correlation features preserve the nonlinear structure information hidden in discriminative neighbor graphs as much as possible. For GMCCA, the prior knowledge of common sources is encoded by graphs, and graph-induced geometry structures are considered by minimizing distances between correlation features and common low-dimensional representations. The three graph-based methods further preserve the nonlinear geometry structures of the raw samples. Due to sample noise, however, the sample-based geometry structures usually deviate from the real geometry structures hidden in the raw samples. In most cases, this deviation reduces the class separability of the correlation features, which is why the three compared graph-based approaches have lower recognition rates than the other methods. Using the discriminative integration scatter, our method minimizes the distances between the correlation features and the cross-modal discriminative centroids and meanwhile maximizes the pair-wise correlations between the modalities. Besides, the decomposition-based matrix exponential strategy effectively solves the SSS problem of our method and enhances robustness to noise in the raw samples. In Table 1, our method exhibits the best recognition rates. The experimental results in the table are the average recognition rates over ten random sample experiments.
The smaller the variation of the recognition rates across different random sample experiments, the better the robustness to sample randomness. Good robustness to sample randomness enhances the adaptability of a method in real-world applications. To analyze the sample robustness, Fig. 3 displays the recognition rates of each method under each random sample experiment, where ''Sample Random No.'' in Fig. 3 denotes the sequence number of the ten random sample experiments. In most instances, as seen in Fig. 3, our method outperforms the compared methods in terms of robustness to sample randomness.

B. EXPERIMENTS ON THE INFRARED IMAGE DATASETS
Infrared images are captured by infrared cameras, and we conduct experiments on the CBSR NIR image dataset. The random sample experiments are repeated ten times, and the average recognition rates are reported in Table 2. LMCCA and MvDA still have similar recognition rates, and the two methods are superior to DMCCA in Table 2, which is consistent with the experimental results on the above visible image datasets. In Table 2, GMCCA and GrMCC display lower recognition rates than the other compared methods. On the infrared image datasets, EMDCA still has the best recognition performance, and the sample robustness of EMDCA is better in most cases, as can be found in Fig. 4.

C. EXPERIMENTS ON THE REUTERS MULTILINGUAL TEXT DATASET
On the Reuters multilingual text dataset, each document has five language versions, and the different languages of one document describe the same content. Thus each language description can be treated as one modality of the document. The five languages are English (abbreviated as EN), French (FR), German (GR), Italian (IT), and Spanish (SP). In this subsection, we utilize three modalities, so ten combinations of different modalities can be obtained. Ten samples in each class are chosen randomly as training samples, while the remaining samples are used for testing. The random sample experiments are performed ten times, and the average recognition rates and standard deviations are tabulated in Tables 3 and 4, respectively. Table 3 shows that the differences between the methods are relatively minor compared to the experimental results on the visible and infrared image datasets. LMCCA and LapMCCA have similar recognition rates on the Reuters dataset, and their recognition rates are superior to those of the other compared methods. In the vast majority of cases, our method also has the highest recognition performance. The standard deviations in Table 4 reveal the robustness to sample randomness: the sample robustness improves as the standard deviation decreases. Compared with the other methods, our method still possesses smaller standard deviations in most cases, which reflects the excellent adaptability of EMDCA in real-world applications. The text dataset requires a transformation from word-based documents to vector samples, and all the methods are implemented on the vector samples. This transformation loses much effective information of the original documents, which is a potential reason why the recognition rates on the text dataset are lower than those on the image-based datasets.
The extensive experimental results on the above three categories of datasets reveal the effectiveness and robustness of our method in the recognition and classification tasks.

V. CONCLUSION
Many discriminant variations of MCCA have been proposed to utilize class information to improve the discriminative power of correlation fusion features. Although these variations have been widely applied in many real-world applications, the Small Sample Size (SSS) problem and the absence of cross-modal discriminant scatters are inevitable in them. Aiming at these issues, we construct a discriminative integration scatter of all the modalities by enhancing the aggregation towards cross-modal discriminative centroids, and we give a decomposition-based matrix exponential strategy that solves the SSS problem and improves robustness to noise. We then develop the novel EMDCA method by minimizing the exponential discriminative integration scatter and simultaneously maximizing the exponential between-modal correlations, and the method can learn correlation fusion features with good discriminative power from a small number of raw high-dimensional multi-modal samples. For the method, we not only provide corresponding theoretical proofs and some intuitive analysis, but also design extensive experiments on the visible image datasets, the infrared image datasets, and the Reuters multilingual text dataset. Encouraging experimental findings show the effectiveness and robustness of the method.