CMR-CNN: Cross-Mixing Residual Network for Hyperspectral Image Classification

With the development of deep learning, various convolutional neural network (CNN)-based methods have been proposed for hyperspectral image (HSI) classification. Although most of them achieve good classification performance, their prediction maps still contain many misclassifications when fewer training samples are available. To address this shortcoming, this article proposes to use pixels' spatial and spectral information simultaneously for HSI classification. Briefly speaking, a new cross-mixing residual network, denoted CMR-CNN, is developed, wherein one three-dimensional residual structure responsible for extracting the spectral characteristics, one two-dimensional residual structure responsible for extracting the spatial characteristics, and one assisted feature extraction (AFE) structure responsible for linking the first two structures are, respectively, designed. Experiments performed on five different datasets (Indian Pines, the University of Pavia, Salinas Scene, KSC, and Xuzhou) with different numbers of training samples show that, compared to some state-of-the-art methods, CMR-CNN achieves higher overall accuracy (OA), average accuracy (AA), and Kappa values. In particular, compared with the recently proposed HSI classification method OCT-MCNN, CMR-CNN improves OA, AA, and Kappa by 4.13%, 3.67%, and 2.75% on average.


I. INTRODUCTION
Different from conventional optical images [1], infrared images [2], or synthetic aperture radar images [3], HSI has higher spectral resolution, meaning a larger number of spectral bands [4]. This implies that much more information about a scene can be obtained from HSI. By virtue of this advantage, related works such as object detection [5] and geological exploration [6] have been carried out and have achieved some progress so far. In particular, how to use HSI for classification has become one hotspot in recent years.
In the early works, machine learning-based methods were often used for HSI classification, such as support vector machine [7], logistic regression [8], random forest [9], k-means clustering [10], and kernel-based methods [11]. However, these traditional techniques easily yield more misclassifications, resulting in unsatisfactory classification accuracy. Deep learning [12] can extract more relevant features than manually designed ones. In this regard, how to use a convolutional neural network (CNN) for HSI classification has become one research hotspot due to its strong ability to extract high-level semantic features of HSIs.
Up to now, various CNN-driven HSI classification methods have also been proposed. For example, Cheng et al. directly explored hierarchical convolutional features for HSI classification in [13]. Lee et al. [14] proposed to extract contextual information contained in HSI for classification, and He et al. [15] used transfer learning methods based on CNN for HSI classification. Xu et al. [16] proposed an unsupervised method to realize HSI classification. Marinoni et al. [17] developed an information maximization method to find the most relevant features among pixels for HSI classification. In [18], Marinoni et al. further made use of mutual information to retrieve the most relevant features for HSI classification. Zhang et al. [19] proposed a deep CNN CloudNet for HSI cloud classification. Yang et al. [20] used a two-channel deep CNN for HSI classification. Gong et al. [21] used the multiscale feature map obtained from CNN for HSI classification. Makantasis et al. [22] utilized a supervised learning-based CNN for HSI classification. Xu et al. made use of a full CNN for HSI classification in [23]. Recently, Xu et al. [24] used a self-attention network (SAC-NET) to address the threat of adversarial attacks on HSI classification. Duan et al. [25] proposed the method of fusing dual spatial information to classify HSIs. In the latest literature [26], the authors argued that more attention should be paid to the relationship between pixels in the feature map. With this guidance, they constructed a network named ENL-FCN. Lin et al. [27] used generative adversarial networks for HSI classification. In a recent HSI classification task, Le et al.
[28] proposed to use a spectral-spatial feature label converter, by which an improved transformer (a densely connected transformer, namely dense-transformer) was developed to capture sequence spectral relations. Bhatti et al. [29] proposed a local similarity-based spatial-spectral fusion method for HSI classification.
In general, the abovementioned methods are mainly built on two-dimensional (2D) convolution. Actually, in recent years, researchers have gradually turned their attention from feature extraction with 2D convolution to that with 3D convolution [30], which enables capturing the spectral characteristics of HSIs. For instance, in [31], He et al. used a 3D deep CNN to obtain multiscale features for HSI classification. Using multiscale feature maps captures much of the information in the feature maps, but it also brings about the problem of information redundancy. In [32], a feature fusion 3D deep CNN was proposed for HSI classification. Chen et al. [33] directly used 3D-CNN for HSI classification. However, both 2D convolution and 3D convolution have intrinsic shortcomings in feature extraction. For example, in [22], while contextual information was obtained through the multiscale method with 2D convolution, the spectral information of the image was lost. In 3D-CNN [30], only the 3D convolution kernel was used to extract spectral information, yet the spatial information in the pixels was lost. Without adding other strategies and structures, a single network model is always unable to extract more effective information. Given the uniqueness of hyperspectral data and previous works, some scholars tried to fuse 3D convolution and 2D convolution together for HSI classification. In a recently presented work [34], a mixed CNN named MCNN-CP with covariance pooling was proposed for HSI classification. Covariance pooling is used to extract second-order information from the spectral-spatial feature maps, and channel shifts and weighting are used to highlight the importance of different spectral bands. Besides, Feng et al. [35] proposed a hybrid CNN (OCT-MCNN) using 3D Octave and 2D Vanilla convolutions for HSI classification.
In brief, the authors first utilized spectral 3D convolution and spatial 2D convolution to obtain hybrid feature maps, and then employed covariance pooling to extract second-order information from the spectral-spatial feature maps for HSI classification. Another recent work constructed a HybridSN model for HSI classification [36], where the 3D and 2D convolution operations were also applied together. Besides, the impact of combining convolution kernels of different dimensions on HSI classification was explored as well. However, its convolution layers are limited, so it cannot achieve satisfactory classification performance.
Recently, the residual module [37] was adopted to increase the number of layers of networks so as to extract more discriminative features for HSI classification. In particular, Zhong et al. proposed the spectral-spatial residual network (SSRN) in [38], where the residual block was used to connect the 3D convolutional layers for improving the classification accuracy. Paoletti et al. [39] developed a network named DPRN for HSI classification, wherein the residual block was utilized as well. Inspired by these methods, here we subtly design a cross-mixing framework of 3D residual and 2D residual structures and develop a new HSI classification network (named CMR-CNN). In this way, CMR-CNN can extract much deeper and more discriminative spatial-spectral features for HSI classification. Overall, the contributions of this article are as follows.
1) In order to further improve the classification accuracy, the 3D residual structure and the 2D residual structure are, respectively, designed based on SSRN and DPRN. The former is responsible for extracting spectral features, while the latter is applied to the extraction of spatial features.
2) An assisted feature extraction (AFE) module is constructed with two convolutional layers, whose goal is to bridge the 3D and 2D residual structures together. AFE enables us to extract spectral-spatial information simultaneously.
3) An end-to-end CNN named CMR-CNN is proposed for HSI classification via fusing the 3D and 2D residual structures with AFE. Experiments carried out on five different HSI datasets verify its effectiveness.
The rest of this article consists of the following parts. Section II introduces the proposed method, Section III presents the experiments and related discussions, and Section IV concludes this article.

II. METHODOLOGY
Traditional neural networks, such as 3D-CNN [30] and 2D-CNN [22], easily lose the structural features of HSIs due to their limited layers. To deal with this problem, HybridSN combines 3D convolution and 2D convolution together for HSI classification, and DPRN introduces the residual block into the 3D convolution for HSI classification. Inspired by them, this section proposes a new network, CMR-CNN, wherein one 3D residual structure, one 2D residual structure, and one AFE module are, respectively, designed, as shown in Fig. 1. Briefly, the 3D residual and 2D residual structures are, respectively, designed for extracting spectral and spatial information. Then, to bridge these two structures together, a module named AFE is further developed. Finally, the network CMR-CNN is built from these three components for classifying HSIs. Note that, in CMR-CNN, principal component analysis (PCA) [40] is also adopted to reduce the redundant spectral information for the purpose of decreasing the computational complexity and avoiding the curse of dimensionality.
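The PCA preprocessing step can be sketched as follows. This is a minimal NumPy illustration of reducing the spectral dimension of an HSI cube; the cube size, component count, and random data are placeholders rather than the paper's actual configuration:

```python
import numpy as np

def pca_reduce(cube, k):
    """Reduce the spectral dimension of an HSI cube (H, W, B) to k components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                         # center each band
    cov = flat.T @ flat / (flat.shape[0] - 1)         # band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k leading components
    return (flat @ top).reshape(h, w, k)

cube = np.random.rand(145, 145, 200)   # an Indian-Pines-sized cube of random data
reduced = pca_reduce(cube, 30)         # keep 30 principal components
```

Each pixel's 200-band spectrum is thereby projected onto the 30 directions of greatest spectral variance before entering the 3D residual structure.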
3D residual structure: After removing some unnecessary spectral features by PCA, we propose to use 3D convolution to extract the spectral information of the feature maps. However, an ordinary 3D convolution structure could incur more training errors as the depth of the network increases, a phenomenon also described as network degradation. So, to cure this disadvantage, we introduce the residual structure into it, as shown in Fig. 2. Note that the difference between the proposed 3D residual structure and that in [38] is that an additional convolution layer is employed here. In this way, the new 3D residual structure allows us to extract more effective and diverse features for HSI classification. The details are given in the following.
The sizes of the convolution kernels in Fig. 2 are set as annotated there; for example, 8@(3 × 3 × 3) represents eight 3D convolution kernels of dimensions 3 × 3 × 3. Then, the steps used in [38] are adopted. That is, the convolution kernel (k_i × s_i × s_i) sequentially acts on the input feature maps to perform a dot product with their weights and biases. The corresponding output is the feature map P = (b, c, d, m, n), with b the batch size, c the number of channels, d the number of spectral slices, m the width, and n the length. The residual mapping is

P_{i+1} = Φ(Ψ(P_i, W_{i−1}) + P_i)

where P_{i+1} is the output of the ith layer feature map, Ψ(·) is the 3D residual part, and W_{i−1} is determined by the convolution kernels of the residual module. The activation value v at position (x, y, z) in the jth feature map of the ith layer can be expressed as

v_{i,j}^{x,y,z} = Φ( B_{i,j} + Σ_{τ=1}^{r_{l−1}} Σ_{λ=0}^{η−1} Σ_{ρ=0}^{t−1} Σ_{σ=0}^{s−1} W_{i,j,τ}^{λ,ρ,σ} v_{i−1,τ}^{x+λ, y+ρ, z+σ} )

where Φ(·) is a nonlinear activation function, B_{i,j} is the bias of the jth feature map of the ith layer, r_{l−1} is the number of feature maps in the (l − 1)th layer, η is the spectral depth of the convolution kernel, t is the width of the convolution kernel, s is the length of the convolution kernel, and W_{i,j} is the weight of the jth feature map of the ith layer.
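As a sanity check on the activation formula, the nested sums over the previous feature maps and the η × t × s kernel window can be written out directly. This is an illustrative NumPy sketch assuming ReLU for the nonlinearity Φ and small random tensors; map counts and sizes are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def act_value(prev_maps, weights, bias, x, y, z):
    """Activation v at (x, y, z): bias plus the sum over the previous-layer
    feature maps and the eta x t x s kernel window, then a nonlinearity."""
    total = bias
    eta, t, s = weights.shape[1:]
    for tau in range(prev_maps.shape[0]):        # previous-layer feature maps
        for lam in range(eta):                   # spectral offset
            for rho in range(t):                 # width offset
                for sig in range(s):             # length offset
                    total += (weights[tau, lam, rho, sig]
                              * prev_maps[tau, x + lam, y + rho, z + sig])
    return relu(total)

rng = np.random.default_rng(0)
prev = rng.standard_normal((4, 9, 9, 9))    # 4 maps from the previous layer
w = rng.standard_normal((4, 3, 3, 3))       # one 3 x 3 x 3 kernel per input map
v = act_value(prev, w, 0.1, 2, 2, 2)
```

The explicit loops match the vectorized form `relu(bias + np.sum(w * prev[:, x:x+3, y:y+3, z:z+3]))`, which is how a framework would actually compute it.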
Next, we analyze the differences between the proposed 3D residual structure and the traditional 3D convolution structure. The latter first extracts spectral information based on a 3D convolution kernel operation, and then directly sends the extracted features to the classification network. Since the problem of gradient degradation is prone to occur when an ordinary network is too deep, it cannot extract sufficiently deep spectral features for HSI classification. The former, however, is based on the residual structure, thereby solving the problem of gradient degradation well. Moreover, it can also deepen the network so as to further ensure that the model maximizes the extraction of semantic information.
2D residual structure: Li et al. [30] pointed out that 3D convolution alone was unable to extract effective spatial characteristics of HSIs. To tackle this problem, we design a 2D residual structure similar to [39], which is shown in Fig. 3. Note that the difference between the proposed 2D residual structure and that designed in [39] is that the classical residual block is used in this article.
In detail, we first transform P = (b, c, d, m, n)_{i,j} into a 2D feature map by a tensor reshaping operation, whose corresponding size is (b, c × d, m, n)_{i+1,j+1}. Then, the 2D residual structure is expressed as

I_{i+1} = φ(ψ(I_{i−1}, W_{i−1}) + I_{i−1})

where I_{i−1} represents the data input of the 2D residual module ψ(·), W_{i−1} is the weight determined by the convolution kernels of the 2D residual module, and φ(·) is a nonlinear activation function. Different from the common residual structure, we keep the number of channels of the feature map unchanged to reduce the amount of calculation. The activation value v at position (x, y) in the jth feature map of the ith layer is expressed as

v_{i,j}^{x,y} = φ( B_{i,j} + Σ_{τ=1}^{r_{l−1}} Σ_{ρ=0}^{t−1} Σ_{σ=0}^{s−1} W_{i,j,τ}^{ρ,σ} v_{i−1,τ}^{x+ρ, y+σ} )

Noticeably, the parameters of (6) are the same as those of (3); the only difference is that (6) has no spectral-dimension parameter. The original hyperspectral data cube is three-dimensional, having one more spectral dimension than common optical images. To facilitate the 2D convolution operation, we fuse the spectral and spatial information from the feature map into new feature information, which is why (6) has one fewer parameter than (3).
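The reshaping step that fuses the channel and spectral axes can be illustrated with NumPy; the sizes below are placeholders for illustration, not the network's actual dimensions:

```python
import numpy as np

# A hypothetical 3D-residual output: batch 2, 8 channels, 10 spectral slices, 7 x 7 spatial
b, c, d, m, n = 2, 8, 10, 7, 7
p3d = np.arange(b * c * d * m * n, dtype=np.float32).reshape(b, c, d, m, n)

# Fuse the channel and spectral axes into one channel axis for 2D convolution:
# (b, c, d, m, n) -> (b, c * d, m, n)
p2d = p3d.reshape(b, c * d, m, n)
```

No values are copied or lost: new channel k of `p2d` is spectral slice `k % d` of channel `k // d` in `p3d`, so the 2D structure sees every spectral slice as a separate input channel.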
Without loss of generality, hereinafter, we depict the advantage of the proposed 2D residual structure over the traditional 2D convolution structure. That is, the latter only processes HSIs with simple convolution operations, which easily causes discriminative information to be ignored. The former, however, helps the network obtain stronger spatial features. Therefore, compared to the traditional 2D convolution, the proposed 2D residual structure enables us to attain higher classification accuracy.
AFE: The feature maps obtained by the 3D residual structure cannot be directly used as the input of the 2D residual structure due to their different dimensions. To bridge them together, we here propose an AFE structure, by which a cross-mixing 3D-2D residual structure can be correspondingly formed, as shown in Fig. 1. Note that, different from existing networks that mainly achieve the fusion at the feature level, AFE is put forward for the first time to achieve the fusion at the structure level.
In detail, AFE is mainly composed of two convolutional layers, each of which contains a 3 × 3 convolution kernel, as shown in Fig. 4. Its goal is to decrease the number of input channels. That is, it uses tensor reshaping to turn the output of the 3D residual structure into the input format of the 2D residual structure. Mathematically, the relationship between the input and output of these two structures can be expressed as

E_{l+1} = Φ(W_l ∗ E_l) + E_l, i.e., y^{m,n} = Φ( Σ_ρ Σ_σ w^{ρ,σ} x^{m+ρ, n+σ} ) + x^{m,n}

where E_{l+1} is the output of the (l + 1)th layer feature map, the weight matrix W_l is defined by the 3 × 3 convolution kernel, E_l is the output X of the 3D residual module obtained by the reshape operation, Φ(·) is a nonlinear activation function, y is an element of E_{l+1}, x is an element of E_l, w is an element of W_l, and m, n are the width and height of the lth layer feature map. It should be noted that AFE adopts an additive fusion operation to avoid losing the information of the feature map. Generally, the rationale behind AFE is that the 3 × 3 convolutions reduce the number of channels of the feature maps; on the other hand, the AFE structure also helps the network extract more relevant feature information, wherein the additive feature fusion avoids the turbulence problem of the network.
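The AFE idea (two 3 × 3 layers that shrink the channel count, plus additive fusion) can be sketched with a naive NumPy convolution. The channel sizes (80 → 16), random weights, and loop-based "same" convolution are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 'same' 3x3 convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero-pad spatial borders
    out = np.zeros((w.shape[0], h, wd))
    for o in range(w.shape[0]):
        for i in range(c_in):
            for r in range(3):
                for c in range(3):
                    out[o] += w[o, i, r, c] * xp[i, r:r + h, c:c + wd]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((80, 7, 7))             # reshaped 3D-residual output (c*d, m, n)
w1 = rng.standard_normal((16, 80, 3, 3)) * 0.1  # first 3x3 layer: reduce channels
w2 = rng.standard_normal((16, 16, 3, 3)) * 0.1  # second 3x3 layer
reduced = conv3x3(x, w1)                        # 80 channels -> 16 channels
fused = conv3x3(reduced, w2) + reduced          # additive fusion keeps spatial size
```

Additive fusion (`+ reduced`) keeps both shapes identical, so no channel of the reduced map is discarded, which is the stated motivation for preferring addition over concatenation here.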
Batch normalization: To further alleviate gradient vanishing and gradient explosion during the backpropagation of the residual network, we introduce the BN layer [41] into CMR-CNN, viz.,

y = γ · (x − μ)/√(σ² + ε) + β

where x represents a parameter of the feature map, μ and σ² are the mean and variance computed over the batch, ε represents a small preset constant, and γ and β represent learnable parameter vectors. By converting the data of each layer to a state where the mean is zero and the variance is one, the distribution of data in each layer becomes the same. In forward propagation, changing the values of the hidden units reduces the covariate shift so that each layer can learn more independently. It should be pointed out that BN-3D and BN-2D in Figs. 2 and 4 mean that the BN operation is, respectively, performed in a three-dimensional way and a two-dimensional way.
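A minimal NumPy version of the BN transform above (training-mode batch statistics only; running averages and the per-channel layout of BN-3D/BN-2D are omitted for brevity):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize x over the batch axis to zero mean / unit variance,
    then rescale and shift with the learnable gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A batch of 64 feature vectors with mean ~5 and std ~3
x = np.random.default_rng(2).standard_normal((64, 16)) * 3 + 5
y = batch_norm(x, gamma=1.0, beta=0.0)
```

After the transform, every feature column has (approximately) zero mean and unit variance, which is exactly the "same distribution in each layer" property the text describes.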
Based on the three new structures, we next build the network CMR-CNN. In detail, in Fig. 1, the main spectral information of the input HSI is first obtained through PCA, and the resulting cube is input into the 3D residual structure. In the 3D residual structure, the spectral information of the feature map is extracted and subsequently input to AFE. AFE first reshapes the input into a feature map that can be operated on by 2D convolution, then performs dimensionality reduction to decrease the number of channels, and uses the addition operation to fuse spectral and spatial information together to avoid information loss. After this, the feature map is input to the 2D residual structure for further extraction of spatial information. Finally, we downsample the feature map with global average pooling and reshape it into a vector in order to feed it into a fully connected layer for classification. It should be noted that we choose the stochastic gradient method to optimize the network model. The commonly used cross-entropy loss function is utilized as the classification function, defined as

L = − Σ_{i=1}^{N} y_i log( e^{s_i} / Σ_{j=1}^{N} e^{s_j} )

where L is the loss, N is the number of sample classes, y_i is the label, and s_i is the score for class i.
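The softmax cross-entropy for a single sample can be written as follows; this is a numerically stabilized NumPy sketch with toy scores:

```python
import numpy as np

def cross_entropy(scores, label):
    """Softmax cross-entropy for one sample: scores is (N,), label is the true class index."""
    shifted = scores - scores.max()                       # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    return -log_probs[label]                              # -log p(true class)

scores = np.array([2.0, 0.5, -1.0])   # hypothetical class scores s_i
loss = cross_entropy(scores, 0)       # true class is 0
```

The loss shrinks toward zero as the score of the true class dominates the others, which is what the stochastic gradient updates push the network toward.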

III. EXPERIMENTS AND DISCUSSIONS

A. Datasets
In this article, we adopt five publicly available HSI datasets to test the proposed method, including Indian Pines, the University of Pavia, Salinas Scene, KSC, and Xuzhou. Usually, 10% and 90% are selected as the training and testing percentages, as in [36] and [34]. Different from this setting, in this article, we respectively select 5% and 95%, 1% and 99%, 0.5% and 99.5%, 20% and 80%, and 1% and 99% as the training and testing sets of these five datasets. More dataset information is presented in Table I.
1) Indian Pines [42] was recorded by the AVIRIS sensor, and its size is 145 × 145. There are 224 spectral bands in the wavelength range from 400 to 2500 nm, of which 200 are effective, as 24 bands affected by moisture and noise interference are discarded. This dataset has a total of 16 crop categories and 10 366 labeled pixels. In our experiment, we randomly select 5% of each class as the training set. The actual data are shown in Fig. 5(a).
2) Pavia University [43] was acquired by the ROSIS sensor, owning 103 bands after removing 12 noisy bands. Its size is 615 × 345 with nine categories, as shown in Fig. 5(b). For this dataset, 1% of each class is used as the training set.
3) Salinas, located in Salinas Valley, California, was taken by the AVIRIS sensor. The spatial resolution of this dataset is 3.7 m and its size is 512 × 217. After removing the bands with severe water vapor absorption, only 204 bands remain. There are 16 crop categories in this dataset, as shown in Fig. 5(c). For this dataset, 0.5% of each class is used as the training set.
4) The KSC data, imaged at the Kennedy Space Center, were also captured by the AVIRIS sensor, and their size is 512 × 614. After discarding the bands related to water vapor noise, only 176 bands remain. The spatial resolution is 18 m, and there are 13 categories in total, as shown in Fig. 5(d). 20% of each class are chosen as the training set due to its smaller sample numbers.
5) The Xuzhou dataset was collected by an airborne HYSPEX hyperspectral camera over the Xuzhou peri-urban site in November 2014 [44]. It consists of 500 × 260 pixels, with a very high spatial resolution of 0.73 m/pixel. The number of spectral bands used in the experiment is 436, covering the range from 415 to 2508 nm after removing noisy bands. The scene is peri-urban and is characterized by nine categories, including crops, vegetation, man-made structures, and so on, as shown in Fig. 5(e). For this dataset, 1% of each class is used as the training set.

B. Experimental Setup
To prove the effectiveness of our network, we select the following methods for comparison: SVM [7], 2D-CNN [22], 3D-CNN [30], SSRN [38], DPRN [39], HybridSN [36], and the recently proposed OCT-MCNN [35], SAC-NET [24], and MCNN-CP [34]. Meanwhile, to fairly compare the performance of each method, we select the OA, AA, and Kappa coefficients as the evaluation criteria in the experiments. OA represents the overall accuracy, i.e., the ratio of correctly classified pixels to the total number of pixels; AA represents the average accuracy over the classes; and Kappa represents the ratio of error reduction between the classification and a completely random classification, which combines the diagonal and off-diagonal entries of the confusion matrix into a robust consistency measure [33]. All experiments are performed on a Tesla V-100 in a PyTorch environment. Note that the learning rate is set to 0.001, and the number of epochs is set to 100 so as to compare the convergence speed of the different network models.
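The three evaluation criteria can all be computed from a confusion matrix. A small NumPy sketch with a toy 2-class matrix (the matrix values are made up for illustration):

```python
import numpy as np

def oa_aa_kappa(cm):
    """OA, AA, and Kappa from a confusion matrix cm (rows: true class, cols: predicted)."""
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    oa = np.trace(cm) / total                            # correctly classified / total
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))           # per-class accuracy, averaged
    expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / total ** 2  # chance agreement p_e
    kappa = (oa - expected) / (1 - expected)             # agreement beyond chance
    return oa, aa, kappa

cm = [[50, 0],    # class 0: 50 correct, 0 wrong
      [10, 40]]   # class 1: 10 wrong, 40 correct
oa, aa, kappa = oa_aa_kappa(cm)   # -> 0.9, 0.9, 0.8
```

Kappa is lower than OA here because part of the 90% agreement would occur by chance given the marginal class frequencies; it is this chance-corrected value that makes Kappa a robust consistency measure.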

C. Hyperparameters Setting
For CMR-CNN, three hyperparameters, i.e., the principal component value of PCA, the number of 3D residual layers, and the number of 2D residual layers, directly affect its performance. Thus, it is necessary to show how to set them optimally. To this end, in the following, we give the detailed decision process of the hyperparameters. When tuning hyperparameters for each dataset, we randomly select 1% as the validation set and the rest as the training and testing sets.
First, we fix the principal component value of PCA and vary the number of 3D residual layers, which is selected from {1, 2, 3, 4} with corresponding numbers of output channels of 8, 16, 32, and 64, respectively. Table II lists the OA values obtained with different layer numbers. Through experiments, we found that when the number of 3D residual layers reaches 4, the OA value begins to decrease, indicating that the best number of 3D residual layers is 3. For the number of 2D residual layers, obviously, the OA value begins to decrease when it reaches 2.

D. Experimental Results
Tables IV-VIII and Figs. 6-10 are the results of different methods on these five datasets.
1) The experimental analyses on Indian Pines: Table IV shows the quantitative results of different methods on Indian Pines. Clearly, our method CMR-CNN performs best on OA, AA, and Kappa. In particular, compared with HybridSN, CMR-CNN improves these three metrics by 4.19%, 8.53%, and 4.56%, respectively. Moreover, Fig. 6 exhibits the classification results of each network on this dataset. Through visual analyses, it is easily found that CMR-CNN has the smallest areas of prediction error. In detail, the classification result of SVM in Fig. 6(c) is the worst among these methods since lots of misclassifications are present, and its Kappa value is also the lowest (73.41%) in Table IV. Compared to it, the classification results of 2D-CNN and 3D-CNN are better in Fig. 6(d) and (e). Different from these methods, SSRN has fewer misclassifications in Fig. 6(f). The classification result of DPRN is shown in Fig. 6(g). Obviously, in comparison with SSRN, DPRN has a better classification performance. Unfortunately, OCT-MCNN's classification result in Fig. 6(i) is unsatisfactory in the case of fewer training samples, which can also be verified by its Kappa value (88.90%) in Table IV. Fig. 6(k) displays the classification result of MCNN-CP, from which one can see that most of the categories are correctly classified. From Fig. 6(j), it can be seen that the recently proposed method SAC-NET also performs well. Compared to MCNN-CP, the misclassifications caused by CMR-CNN in Fig. 6(l) are slightly fewer, and much fewer than those of HybridSN. Therefore, this directly demonstrates that the strategy used to construct CMR-CNN is effective.
2) The experimental analyses on Pavia University: Table V lists the quantitative results of different methods on the Pavia University dataset, and Fig. 7 shows the prediction maps corresponding to these methods. It can be seen from Table V that the proposed method CMR-CNN achieves the best classification results on the evaluation indicators OA and Kappa. In detail, compared with HybridSN, CMR-CNN improves by 6.09% and 8.07% on OA and AA, and 8.13% on Kappa, respectively. Through observing Fig. 7(h) and (l), we can also confirm that CMR-CNN is more useful for HSI classification than HybridSN. It should be pointed out that, among the five datasets, the University of Pavia dataset contains more outliers and more indistinguishable small regions. For some HSI classification methods proposed earlier, i.e., SVM, 2D-CNN, and 3D-CNN, the values of OA, AA, and Kappa are all lower in Table V. In addition, they all yield more misclassified regions in the prediction maps of Fig. 7(c)-(e) when the training ratio is lower. At the same time, compared with these three methods and SSRN, the method DPRN with a more complex network structure achieves better classification performance in Table V, and the misclassified areas caused by it are also fewer in Fig. 7(g). In Table V, OCT-MCNN achieves the highest value on the AA evaluation metric among all methods. In the case of few training samples, the recently proposed method SAC-NET achieves better classification performance in Fig. 7(j). Compared to the result of MCNN-CP in Fig. 7(k), CMR-CNN achieves a better visual result in Fig. 7(l). This agrees with the quantitative result in Table V, that is, CMR-CNN has the greatest OA value.
3) The experimental analyses on Salinas: Table VI shows the quantitative results of different methods on the Salinas dataset, and Fig. 8 shows the corresponding prediction maps. Compared with the other datasets, the sample distribution of this dataset is more regular. In order to better reflect the classification performance of different methods, we here choose 0.5% as the training ratio. In comparison with the other methods in Table VI, the proposed method CMR-CNN achieves the best classification results on OA and Kappa. However, on AA, it is 0.08% lower than DPRN. In detail, compared with HybridSN, the proposed network CMR-CNN improves by 1.26%, 2.25%, and 1.40% on OA, AA, and Kappa, respectively, in Table VI. The values obtained by the three methods SVM, 2D-CNN, and 3D-CNN in Table VI do not differ much from each other, so their classification results are also similar in Fig. 8(c)-(e). Compared with these first three methods, both SSRN and DPRN achieve better classification results with fewer misclassified regions in Fig. 8(f) and (g). Obviously, compared to SSRN and DPRN, the visual result of HybridSN is better in Fig. 8(h). In Fig. 8(i) and (j), in the case of fewer training samples, the classification effect of OCT-MCNN is better than that of SAC-NET. Unfortunately, for the recently proposed method MCNN-CP, the values in Table VI are not ideal, and there also exist many misclassified regions in Fig. 8(k).
4) The experimental analyses on KSC: Table VII reports the experimental results of different methods on this dataset. To visualize the performance of different methods, we further zoom in on the rectangles of the prediction maps in Fig. 9, where the classes are harder to distinguish than elsewhere. Compared with the other HSI classification methods, the proposed method CMR-CNN achieves the highest scores on the three indicators OA, AA, and Kappa. Besides, the prediction map obtained by CMR-CNN in Fig. 9(l) is more accurate in visual terms. In detail, compared with HybridSN, OA, AA, and Kappa are improved by 1.51%, 2.45%, and 1.68% by CMR-CNN, respectively. Similarly, the classification results of 2D-CNN in Table VII are the worst among these methods, and its prediction in Fig. 9(d) also has a large number of misclassifications. In contrast, the classification results of SVM and 3D-CNN are better in Fig. 9(c) and (e). Compared with 3D-CNN, SSRN uses the residual structure in its network architecture and obtains better classification performance in Table VII. With the same training samples, the classification result of OCT-MCNN is worse than that of MCNN-CP. On the contrary, SAC-NET performs better than OCT-MCNN. It is worth noting that, compared to the other eight methods except CMR-CNN, DPRN achieves better classification performance in Table VII, and there are fewer misclassifications in the area framed in Fig. 9(g).
5) The experimental analyses on Xuzhou: To save space, hereinafter, we just briefly analyze the results of these ten methods. Table VIII reports the quantitative results of different methods on this dataset. Compared to the other methods, the proposed method CMR-CNN still achieves the highest scores on the three evaluation metrics, so, once again, the effectiveness of our method is verified. In addition, Fig. 10 shows the visual results of different methods on this dataset. Clearly, the proposed method CMR-CNN yields the fewest classification errors in Fig. 10(l). Fig. 11 shows the confusion matrices of CMR-CNN on Indian Pines, Pavia University, Salinas, KSC, and Xuzhou, respectively. According to the distribution of the confusion matrices, we can easily see that the proposed method suffers a few individual prediction errors on the first two datasets, but the prediction results on the latter three datasets are better. Fig. 12 shows the test results of different methods on the Indian Pines dataset at different training ratios. Obviously, when the training ratio increases, the accuracy of the different HSI classification methods increases as well. However, no matter whether the training ratio is high or low, the accuracy of traditional methods, such as SVM, 2D-CNN, and 3D-CNN, is always lower than that of the methods proposed in recent years, like SSRN. When the training ratio is high (such as 20%), the methods show little difference in experimental results. But SSRN and DPRN show a large drop as the training ratio decreases. When the training ratio is low (5%), the proposed method CMR-CNN shows better classification performance, which indirectly reflects that CMR-CNN can still extract effective discriminant information with few training samples. Table IX further lists the average values of OA, AA, and Kappa of these ten methods on the five datasets. Overall, from this table, we can intuitively see that the proposed method achieves the best classification result.
Particularly, compared with 2D-CNN and DPRN, wherein only the spatial information is used, the OA, AA, and Kappa values are, respectively, increased by 5.34%, 4.93%, and 5.15%, and by 1.22%, 1.24%, and 1.54%. Compared with 3D-CNN and SSRN, wherein only the spectral information is used, the OA, AA, and Kappa values are, respectively, increased by 7.43%, 8.75%, and 7.89%, and by 3.54%, 1.84%, and 3.64%. So, this directly verifies that, in comparison with the strategy that only adopts the spatial or spectral information, the strategy using spatial and spectral information at the same time is more appropriate for HSI classification.

6) Confusion matrix and network performance under different training ratios:
In detail, Table IX shows that HybridSN achieves higher OA, AA, and Kappa values than 2D-CNN and 3D-CNN, which directly indicates that, without additional strategies, a single convolutional network framework cannot fully extract the discriminative information in the feature maps. Compared with 2D-CNN and 3D-CNN, SSRN and DPRN, which use residual structures, also achieve better quantitative results, proving that residual structures enable the network to extract more effective classification features. The classification performance of CMR-CNN is superior to that of SSRN and DPRN, owing to its simultaneous utilization of 3D and 2D residual structures. CMR-CNN also outperforms MCNN-CP and OCT-MCNN, which proves the effectiveness of the proposed method once again. Even so, CMR-CNN is more time-consuming than some methods, such as SVM, DPRN, and SAC-NET. From Figs. 13 and 14, we can see that, on the same dataset, the proposed method converges faster than MCNN-CP: for the same epoch, the OA value of CMR-CNN is higher, and its curve has fewer inflection points. We analyze this as follows. In a CNN, many factors affect the convergence speed and robustness of the network. On the actual loss surface, some local minima slow down convergence. Saddle points also hinder convergence: around a saddle point, shaped like a saddle, the gradient is smallest along one direction and largest along another, so the optimizer easily oscillates back and forth along the steep direction, which slows convergence and can even cause incorrect convergence. The residual structure provides a "shortcut" for gradient propagation, allowing gradients to skip intermediate layers and pass directly to deeper layers; in effect, it is the well-known skip connection. This alleviates the vanishing-gradient problem and thus speeds up convergence.
The proposed network model requires less training time to reach the desired value, i.e., CMR-CNN just takes almost half as long as MCNN-CP. Moreover, the curve in Fig. 13 has fewer inflection points than the curve in Fig. 14, which further verifies that the residual structure can not only ensure the accuracy of the network model, but also improve the convergence speed and robustness of the model.
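The gradient-shortcut argument above can be made concrete with a scalar toy example (a hypothetical saturating "layer", not our actual network): for a plain layer y = f(x) the gradient is f'(x), while for a residual layer y = x + f(x) it is 1 + f'(x), so the identity path keeps the gradient near 1 even when f'(x) nearly vanishes:

```python
import math

def f(x):
    # Hypothetical saturating sub-network: a scaled tanh, whose
    # derivative nearly vanishes for large |x|.
    return 0.1 * math.tanh(x)

def f_prime(x):
    # Derivative of f: 0.1 * (1 - tanh(x)^2).
    return 0.1 * (1.0 - math.tanh(x) ** 2)

x = 4.0  # deep in the saturated regime, where f'(x) is tiny

grad_plain = f_prime(x)          # plain layer   y = f(x):     dy/dx = f'(x)
grad_residual = 1 + f_prime(x)   # residual layer y = x + f(x): dy/dx = 1 + f'(x)

# Rough picture of depth: local gradient factors multiply along the chain,
# so repeating the same factor 10 times shows the qualitative difference.
depth = 10
chain_plain = f_prime(x) ** depth           # shrinks toward zero
chain_residual = (1 + f_prime(x)) ** depth  # stays close to 1

print(grad_plain, grad_residual, chain_plain, chain_residual)
```

The plain chain's gradient collapses toward zero with depth, while the residual chain's stays near 1, which is the mechanism behind the faster, smoother convergence observed in Figs. 13 and 14.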
2) Ablation experiments: The proposed network CMR-CNN is mainly constructed on SSRN and HybridSN. To better understand where its gains come from, we conduct some ablation experiments in the following. Table XI shows the contribution of each module of CMR-CNN on the Indian Pines dataset when the training ratio is 10%. It should be noted that, since our aim is to detect the role of each module within CMR-CNN, we only compare the ablated variants of CMR-CNN against each other and do not compare them with other methods. The first column, 3D-Conv+2D-Conv, is the classification result of CMR-CNN with the residual structures and AFE removed, leaving only the 3D and 2D convolutions; the second column, 3D-Res, removes the 2D residual structure and AFE, leaving only the 3D residual structure; the third column, 2D-Res, removes the 3D residual structure and AFE, leaving only the 2D residual structure; the fourth column, 3D-Res+2D-Res, combines the 3D and 2D residual structures after removing AFE; the fifth column, 3D-Res+AFE, combines the 3D residual structure and AFE; the sixth column, 2D-Res+AFE, combines the 2D residual structure and AFE; the seventh column, CMR-CNN Non-Res, is the result obtained by removing the residual structures from the network; and the eighth column is the result of the complete CMR-CNN.
In Table XI, by comparing the first and second columns, and the first and fourth columns, it is not difficult to find that the classification performance of the network is very poor without the residual structure: 3D-Conv+2D-Conv obtains OA, AA, and Kappa of 85.93%, 79.66%, and 84.01%, respectively, whereas 3D-Res+2D-Res correspondingly obtains 98.26%, 97.35%, and 98.66%, reflecting that the residual structure helps the network extract deep information and improves classification performance. The same conclusion can be drawn by comparing CMR-CNN Non-Res with CMR-CNN. Comparing 3D-Conv+2D-Conv with CMR-CNN Non-Res also easily demonstrates the effectiveness of AFE for HSI classification. Compared with the complete CMR-CNN, when only the 3D-Res structure is retained, OA and Kappa are reduced by nearly 1.5%, and AA is reduced by nearly 2.3%. Comparing the 2D-Res structure (which uses only the spatial information) with the complete CMR-CNN, we can also find that CMR-CNN attains higher classification accuracies, with OA increased by 3.63%, AA by 4.06%, and Kappa by 4.15%. Hence, using spectral and spatial information together is more apt for HSI classification.

IV. CONCLUSION
In this article, we proposed a novel CNN named CMR-CNN for HSI classification. First, we used a 3D residual structure to extract the spectral information of HSI and a 2D residual structure to extract the spatial information of HSI. Subsequently, two layers of 3 × 3 convolution kernels were used to form the AFE module, which bridges the 3D and 2D residual structures together and allows us to further extract more hidden features of pixels. CMR-CNN then performs HSI classification by fusing the extracted spectral and spatial features. Experiments show the following: 1) classification accuracy can be significantly improved when the spectral and spatial information are used simultaneously; 2) residual structures enable the network to extract more effective classification features; and 3) the proposed CMR-CNN achieves better classification performance than the other state-of-the-art methods. In spite of this, future work is still needed on removing the influence of noise, as the spectrometer is easily affected by factors such as weather and light when collecting images. Besides, we will also try to further optimize CMR-CNN to reduce its time consumption.