MSLAENet: Multiscale Learning and Attention Enhancement Network for Fusion Classification of Hyperspectral and LiDAR Data

The effective use of multimodal data to obtain accurate land cover information has become an interesting and challenging research topic in the field of remote sensing. In this article, we propose a new method, the multiscale learning and attention enhancement network (MSLAENet), to implement hyperspectral image (HSI) and light detection and ranging (LiDAR) data fusion classification in an end-to-end manner. Specifically, our model consists of three main modules. First, we design the composite attention module, which adopts self-attention to enhance the feature representations of HSI and LiDAR data, respectively, and cross-attention to achieve cross-modal information enhancement. Second, the proposed multiscale learning module combines self-calibrated convolutions and a hierarchical residual structure to extract information at different scales and further improve the representation capability of the model. Finally, the attention-based feature fusion module fully considers the complementary information between different modalities and adaptively fuses their heterogeneous features. To test the performance of MSLAENet, we conduct experiments on three multimodal remote sensing datasets and compare it with state-of-the-art fusion models; the results demonstrate the effectiveness and superiority of the proposed model.


I. INTRODUCTION
Land cover classification has important applications in the fields of agricultural monitoring, production layout, and urban planning. Compared with traditional ground survey methods, remote sensing technology can acquire land cover information with a broader perspective and at a faster speed [1]. With the continuous development of remote sensing sensor technology, multiplatform and multimodal remote sensing data for the same area are continuously generated, making it possible to use multisensor data to jointly describe land cover information. Different sensors provide remote sensing data with different advantages and complementary characteristics. For example, hyperspectral images (HSIs) simultaneously acquire spatial and spectral information for the observed targets and are now widely used in land cover classification tasks, but their high spectral resolution and low spatial resolution to some extent limit many applications that depend on spatial resolution and on sensitivity to radiometric information [2]. Unlike HSI, light detection and ranging (LiDAR) images are acquired through active sensing techniques, are less subject to atmospheric interference, contain rich height and shape information, and can provide complementary information for HSI [3]. Therefore, the land cover classification effect can be further improved by combining remote sensing data of different modalities and making full use of the complementary advantages of multisource information.
To extract effective information from multisource remote sensing data and perform classification, scholars have proposed many methods. Filtering is an early and commonly used approach to fusing multimodal remote sensing data for classification. It extracts contextual and spatial features from remote sensing images by reducing redundant spatial information and uses these features to complete the classification task [4]. Typical filtering algorithms are morphological profiles (MPs) [5], attribute profiles (APs) [6], and extinction profiles (EPs) [7]. Liao [8] et al. used MPs to extract HSI and LiDAR data features, used support vector machines for feature-level classification, and finally applied joint decision-level fusion for classification. Ghamisi [9] et al. used APs to extract spatial features from HSI and LiDAR data and achieved better classification results by concatenating the extracted features. To further improve the classification effect, Ghamisi [10] et al. further exploited the information in HSI and the elevation information in LiDAR data. Another commonly used family of methods is based on subspace learning. Liao et al. [11] proposed a graph-based subspace embedding method to combine spectral, spatial, and elevation information for classification. Yan et al. [12] proposed an angle-based discriminant analysis approach based on Euclidean, classifying multisource features by composite kernel-based subspace learning. Hong et al. [13] proposed a cross-modal feature learning framework based on a common subspace to achieve joint classification of HSI and multispectral image (MSI).
Although these traditional shallow models have been successfully applied in multimodal remote sensing classification, they usually choose shallow algorithms such as support vector machines [14] and random forests [15] as classifiers. Because multimodal remote sensing data are acquired through different imaging mechanisms, the intrinsic relationships between them are complex, which makes it difficult for shallow algorithms to use these features in an integrated way. As a result, traditional feature-level fusion classification methods exhibit some shortcomings. For example, HSI contain very complex information and have nonlinear characteristics, and traditional feature extraction methods destroy the original spatial and spectral structure of the image, making it difficult to extract these features comprehensively and causing a large amount of implicit but useful information to be ignored. In addition, the number of HSI features is large, and combining them with the features of other remote sensing sources leads to an even larger feature scale.
In recent years, deep learning techniques have been widely used in the field of computer vision and have shown excellent feature extraction capabilities, so some researchers have started to apply them to the field of remote sensing [16], [17]. Many studies have shown that deep learning achieves remarkable results in single-source remote sensing image (e.g., HSI, MSI, and LiDAR) classification [18], [19], [20], semantic segmentation [21], [22], and super-resolution [23]. To fully utilize the complementary information of multimodal remote sensing images, many excellent deep learning methods have been proposed, and typical models include convolutional neural networks (CNNs), recurrent neural networks, and autoencoder networks. Among them, since CNNs can better extract features from 2-D image data, many researchers have adopted them as the backbone model for classification of multimodal remote sensing data. Chen [24] et al. first used a two-branch CNN to extract features from MSI/HSI and LiDAR data, respectively, and concatenated these heterogeneous features to achieve joint classification of multisource remote sensing data. Based on the two-branch network, Feng [25] et al. introduced residual connections and an adaptive fusion mechanism to achieve joint HSI-LiDAR classification. Xu [26] et al. designed a CNN with cascaded blocks, and Hang [27] et al. proposed a coupled CNN that reduces model complexity and improves classification performance through weight sharing. Zhang [28] et al. designed an unsupervised feature extraction framework based on CNN, and some scholars also introduced 3-D CNNs in the HSI branch to better extract the spatial-spectral information of HSI [29]. Different from the CNN approach, Hong [30] et al. built a deep network based on an autoencoder for classification of hyperspectral and LiDAR data.
Although these methods achieve better classification results than shallow algorithms, they still suffer from limited feature extraction and insufficient utilization of complementary information. To solve this problem, many methods based on attention mechanisms have been proposed. FusAtNet [31] uses self-attention to enhance the feature representation of each modality and cross-attention to assign the spatial mask of LiDAR to HSI, so that LiDAR data enhance the spatial feature representation of HSI. A3CLNN [32] constructs a spatial, spectral, and multiscale attention mechanism and designs an efficient fusion strategy to fully fuse multisource data features.
However, there are still some problems with HSI-LiDAR fusion classification. First, in complex scenes, multiscale information is crucial to the representation of multimodal data, while existing studies pay less attention to multiscale information and have limitations in extracting multiscale features from remote sensing images. Second, how to use attention mechanisms to extract spectral and spatial information from HSI and LiDAR data more accurately, and how to fully exploit the spatial information of LiDAR data in cooperation with HSI, remain questions to be studied further. More importantly, the feature fusion (FF) approach based on simple feature concatenation often fails to achieve better classification performance because it ignores the complementarity between multimodal data. This approach also increases the feature dimensionality, which may lead to the curse of dimensionality.
To address these problems, this article proposes the multiscale learning and attention enhancement network (MSLAENet). Specifically, the network adopts a two-branch CNN structure. A multiscale learning (MSL) module built on self-calibrated convolutions and a hierarchical residual structure extracts spectral and spatial information at different scales, which enriches the feature representation of multisource data. Two attention mechanisms (self-attention and cross-attention) enhance the spatial and spectral feature representation in each branch and the information interaction between modalities. An attention-based FF module then achieves better fusion classification of HSI and LiDAR data. Experiments are conducted on three real hyperspectral and LiDAR datasets, and the effectiveness of the method is demonstrated by comparison with existing models.
In summary, the main contributions of this article are threefold.
1) To improve the classification performance of multimodal remote sensing data by using multiscale information, an MSL module is constructed by combining self-calibrated convolutions and hierarchical residual networks. It can extract spatial and spectral information from different receptive fields and enhance the multiscale information representation capability of the whole model.
2) Considering the rich spectral and spatial information in HSI and LiDAR data, composite attention (CA) is constructed to obtain enhanced spectral and spatial representations. Specifically, the spatial information representations in LiDAR data and the spectral information representations in HSI are adaptively learned and enhanced by self-attention, and cross-attention is used for cross-modal information enhancement (CME) so that the information of the two modalities is used in a complementary way.
3) A new attention-based FF module is proposed that takes location information into account and fully considers the information complementarity between the two modalities to achieve efficient fusion of multimodal features.

A. Architecture Overview
The overall framework of the proposed MSLAENet is shown in Fig. 1, which uses CNN as the backbone network and contains three layers. Unlike the traditional CNN network, we build a multiscale feature learning module based on self-calibrated convolution in the second layer of the network to extract multiscale information, and in order to obtain enhanced spectral and spatial feature representations, we add an attention mechanism to the network and realize the information interaction between modalities by CME method. In addition, to avoid the problem of high feature dimensionality and insufficient FF caused by traditional FF using concatenation operation, we propose a novel attention-based FF method that can effectively fuse heterogeneous features between different modalities. It is worth noting that we set the padding and stride parameters of the convolution operation to keep the feature map size constant during the operation, and add batch normalization and rectified linear unit (ReLU) after the convolution operation in each of the three layers in turn for accelerating training and learning nonlinear representation.
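The remark about padding and stride can be checked with the standard output-size formula for a convolution without dilation. The sketch below is illustrative only: the function names are hypothetical, and the kernel sizes in the usage lines are examples rather than the settings of every layer in MSLAENet.

```python
def conv_output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution (no dilation) on an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


def same_padding(kernel: int) -> int:
    """Padding that keeps the size unchanged for stride 1 and an odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("size-preserving symmetric padding needs an odd kernel")
    return (kernel - 1) // 2


# A 3 x 3 kernel with stride 1 and padding 1 keeps a 7 x 7 patch at 7 x 7,
# whereas the same kernel without padding shrinks it to 5 x 5.
kept = conv_output_size(7, kernel=3, stride=1, padding=same_padding(3))
shrunk = conv_output_size(7, kernel=3, stride=1, padding=0)
```

With stride 1, a 1 × 1 kernel needs no padding and a 3 × 3 kernel needs a padding of 1 to keep the feature map size constant.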
The input of the network consists of HSI and LiDAR data covering the same area. Fixed-size image patches centered on the pixels to be classified are selected for network training and testing. The inputs of the HSI and LiDAR branches can be expressed as $X_h \in \mathbb{R}^{h \times w \times b_h}$ and $X_l \in \mathbb{R}^{h \times w \times b_l}$, where $b_h$ and $b_l$ denote the number of HSI and LiDAR bands, respectively, and $h$ and $w$ represent the height and width of the input patch, respectively. Considering the redundancy of high-dimensional spectral information in the hyperspectral data, principal component analysis is used to reduce the dimensionality of the input data, and the reduced-dimensional input can be expressed as $X_h \in \mathbb{R}^{h \times w \times p}$, where $p$ is the number of bands after dimensionality reduction.
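The patch selection described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_patch` is hypothetical, the image is stored as nested lists indexed `[row][col][band]`, and zero padding at the image border is an assumption, since the text does not say how border pixels are handled.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [row][col][band]


def extract_patch(image: Image, row: int, col: int, size: int) -> Image:
    """Return the size x size patch centered on (row, col).

    Positions that fall outside the image are filled with zeros
    (an assumption; the text does not specify the border handling).
    """
    if size % 2 == 0:
        raise ValueError("patch size must be odd so that a center pixel exists")
    half = size // 2
    n_rows, n_cols, n_bands = len(image), len(image[0]), len(image[0][0])
    patch: Image = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < n_rows and 0 <= c < n_cols:
                patch_row.append(list(image[r][c]))
            else:
                patch_row.append([0.0] * n_bands)
        patch.append(patch_row)
    return patch


# Toy single-band 3 x 3 image whose pixel value equals 3 * row + col.
toy = [[[float(3 * r + c)] for c in range(3)] for r in range(3)]
corner_patch = extract_patch(toy, row=0, col=0, size=3)
```

The same routine would be applied to the HSI (after dimensionality reduction) and to the LiDAR raster with identical center coordinates, so that both branches see the same spatial neighborhood.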

B. CA Module
Inspired by the human vision system, attention mechanisms have been introduced into computer vision systems, and more and more deep learning models based on attention mechanisms have been proposed and improved feature representation in many research areas (e.g., image classification, object detection, image generation, etc.) [33]. With the rapid development of deep learning techniques, attention mechanisms are now widely used in remote sensing image classification tasks [34], [35], [36]. In this article, the attention mechanism will be used to guide deep learning networks to learn more accurate feature representations.
1) Spectral Attention for HSI: HSI contain rich spectral information, and the effective use of this spectral information can improve the performance of multimodal remote sensing land cover classification. At present, a number of attention mechanisms have been used to enhance the spectral representation of HSI. But most of them only consider the internal channel information, thus ignoring the location information. However, in the HSI classification task, location information is crucial to capture the structure of the objects. Channel attention with embedded location information, coordinate attention, is a simple and efficient attention model that captures not only cross-channel information, but also orientation-aware and location-sensitive information, which helps the model to locate and identify objects of interest more accurately [37]. To obtain a more discriminative representation of spectral information in HSI, we use a spectral attention model with embedded location information in the HSI branch. As shown in Fig. 2, this module uses global average pooling to encode features along the horizontal and vertical coordinate directions for each channel of the input features, allowing the attention module to capture not only channel information, but also orientation-aware and position-aware information, which in turn improves the classification results. The coordinate attention consists of two steps: coordinate information embedding and coordinate attention generation.
The first step is coordinate information embedding. Assume that the output obtained from the HSI branch after layer1 is $x_{h1} \in \mathbb{R}^{h \times w \times c_1}$. Average pooling kernels of size $(h, 1)$ and $(1, w)$ are used to encode each channel along the horizontal and vertical directions, respectively, which yields a pair of direction-aware descriptors: $z^h$, encoded along the horizontal direction, and $z^w$, encoded along the vertical direction.
The second step is coordinate attention generation. To make better use of the representation with a global receptive field and precise location information generated by the coordinate information embedding, the coordinate attention generation module is designed to generate the channel attention map. First, $z^h$ and $z^w$ are concatenated, and the shared $1 \times 1$ convolution operation $F_1$ then performs the feature transformation to obtain $f$. Here, "[·, ·]" represents the concatenation operation, and δ is a nonlinear activation function. $f \in \mathbb{R}^{c/r \times (h+w)}$ is the intermediate feature map that encodes spatial information in both the horizontal and vertical directions, and $r$ is the reduction ratio for controlling the block size. $f$ is then split along the spatial dimension into $f^h \in \mathbb{R}^{c/r \times h}$ and $f^w \in \mathbb{R}^{c/r \times w}$. Two $1 \times 1$ convolutions $F_h$ and $F_w$ expand the number of channels of $f^h$ and $f^w$ by a factor of $r$, so that they have the same number of channels as $x_{h1}$, and a sigmoid function then produces the attention weights $g^h$ and $g^w$. Finally, $g^h$ and $g^w$ are used as attention weights to recalibrate the input $x_{h1}$, which gives the output of the spectral attention module.
2) Spatial Attention for LiDAR Data: LiDAR data contain rich elevation information and can convey rich information in the spatial domain. In this article, we use spatial attention to generate spatial attention weights that enhance the feature representation of the LiDAR branch. In remote sensing image classification, the pixels to be classified are often more strongly correlated with their surrounding pixels because of the limited image resolution. Therefore, unlike the general spatial attention mechanism that uses a global pooling operation to obtain the attention map, we use two consecutive convolution operations to obtain more accurate spatial attention weights. The structure of the adopted spatial attention is shown in Fig. 3. Assuming that the output obtained by the LiDAR branch after layer1 is $x_{l1} \in \mathbb{R}^{h \times w \times c_1}$, the spatial attention module consists of two convolution operations and a sigmoid function. It first uses two $3 \times 3$ convolutions $f_1(\cdot)$ and $f_2(\cdot)$ to generate a non-normalized attention map of size $h \times w \times 1$, then uses the sigmoid function "δ" to generate the attention weight map, and finally enhances the input $x_{l1}$ of the module through a residual skip connection.
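The equations for the coordinate attention and the spatial attention do not appear in this text. Following the coordinate attention formulation of [37] and the definitions given in the two subsections above, they can be sketched as below. The channel index $c$, the position indices $(i, j)$, the hat notation for module outputs, and the symbol $\sigma$ for the sigmoid (written δ in the text, and renamed here only to keep it distinct from the nonlinear activation δ) are notational assumptions. The exact form of the residual connection in the last line is one plausible reading, not a formula confirmed by the source.

```latex
\begin{align}
z^h_c(i) &= \frac{1}{w} \sum_{0 \le j < w} x_{h1,c}(i, j), &
z^w_c(j) &= \frac{1}{h} \sum_{0 \le i < h} x_{h1,c}(i, j) \\
f &= \delta\big(F_1([z^h, z^w])\big) \\
g^h &= \sigma\big(F_h(f^h)\big), &
g^w &= \sigma\big(F_w(f^w)\big) \\
\hat{x}_{h1,c}(i, j) &= x_{h1,c}(i, j) \cdot g^h_c(i) \cdot g^w_c(j) \\
\hat{x}_{l1} &= x_{l1} + x_{l1} \odot \sigma\big(f_2(f_1(x_{l1}))\big)
\end{align}
```

The first four lines belong to the spectral (coordinate) attention of the HSI branch, and the last line to the spatial attention of the LiDAR branch, where $\odot$ broadcasts the single-channel weight map over all $c_1$ channels.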

3) Cross-Modal Enhancement: The rich spatial information in LiDAR data can assist HSI to obtain more accurate classification results. Therefore, we enrich the spatial information of HSI through a cross-modal spatial attention enhancement mechanism that assigns the attention weights obtained from the LiDAR branch to the HSI branch. Thus, after the CA module, the output of the HSI branch carries both its own spectral attention enhancement and the spatial attention weights of the LiDAR branch.

C. MSL Module
1) Self-Calibrated Convolutions: Traditional CNNs are limited by the size of predefined convolutional kernels and lack large receptive fields, which makes it difficult to capture enough high-level semantic information in remote sensing images and leads to less discriminative feature maps [38]. To obtain more discriminative feature representations, Liu [39] et al. proposed self-calibrated convolutions. Whereas a traditional convolution applies the same operation uniformly to the original input, self-calibrated convolutions perform convolutional feature transformations on the input data in two different scale spaces, that is, the original scale space and a downsampled latent space with larger receptive fields. This approach allows each spatial location to adaptively encode contextual information from distant regions, thus breaking the tradition of performing convolution only in small regions (e.g., 3 × 3) and producing more discriminative features.
The workflow of the self-calibrated convolutions is illustrated in Fig. 4. First, the input feature map $X \in \mathbb{R}^{c \times h \times w}$ is split into $X_1$ and $X_2$, each of size $c/2 \times h \times w$. Convolutional transformations are performed on $X_1$ and $X_2$ in the self-calibrated branch and the conventional convolutional branch, respectively, to collect different types of contextual information. Then, given four filters $\{K_1, K_2, K_3, K_4\}$, the self-calibrated branch uses $\{K_2, K_3, K_4\}$ to perform the self-calibration operation on $X_1$ to obtain $Y_1$, while the conventional convolution branch uses $K_1$ to perform a simple convolution on $X_2$ to obtain $Y_2 = f_1(X_2) = X_2 * K_1$. Finally, $Y_1$ and $Y_2$ are concatenated as the final output $Y$. The self-calibration process is described as follows.
Given the input $X_1$, we implement downsampling using average pooling, which expands the receptive field at each spatial location and gives the pooled feature map $T_1$.
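The self-calibration equations do not appear in this text. Following the formulation of self-calibrated convolutions in [39], the steps described here and in the next paragraph can be sketched as below. The primed intermediate symbols and the shorthand $F_k(\cdot)$ for convolution with the filter $K_k$ are notation assumed for this sketch; δ is the sigmoid, as in the text.

```latex
\begin{align}
T_1 &= \mathrm{AvgPool}_r(X_1) \\
X_1' &= \mathrm{Up}\big(F_2(T_1)\big) = \mathrm{Up}(T_1 * K_2) \\
Y_1' &= F_3(X_1) \cdot \delta(X_1 + X_1'), \qquad F_3(X_1) = X_1 * K_3 \\
Y_1 &= F_4(Y_1') = Y_1' * K_4
\end{align}
```

The sigmoid term acts as a gate computed from a larger receptive field, so each location of $F_3(X_1)$ is reweighted with context gathered in the downsampled space.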
$r$ represents the downsampling rate and the stride of the pooling operation. Next, $K_2$ performs a feature transformation on $T_1$, and the bilinear interpolation operator $\mathrm{Up}(\cdot)$ upsamples the result by a factor of $r$, which restores the feature map to its original scale. The self-calibration operation then combines the transformed features by element-wise multiplication, written "·", with calibration weights produced by the sigmoid activation function $\delta(\cdot)$, and the result is transformed into the final output $Y_1$ of the self-calibrated branch.
2) MSL Module: Different land cover types in remote sensing images exist at different scales, and representing features at multiple scales is crucial for multimodal remote sensing image classification tasks. Classification models that extract features at a single uniform scale can no longer meet the demand for multiscale information. In Res2Net [40], the hierarchical residual structure represents multiscale features at a finer granularity and increases the receptive field of the network. Self-calibrated convolution implements feature transformation in different scale spaces and obtains a rich feature representation. Therefore, this article combines the ideas of hierarchical residuals and self-calibrated convolution to build an MSL module.
The structure of the MSL module is shown in Fig. 5. Taking the HSI branch as an example, let $x_{h1}$ and $x_{h2}$ denote the input and output of the MSL module. The input $x_{h1}$ first goes through a $1 \times 1$ convolution for dimensional transformation to get $x'_{h1}$. The feature map $x'_{h1}$ is then equally split by channel into $m$ feature subsets, denoted as $x'_{h1,i}$ with $i \in \{1, 2, \ldots, m\}$, so that each feature subset has the same feature dimension. Each $x'_{h1,i}$ corresponds to a self-calibrated convolution operation $K_i$, and the output after the $K_i$ transformation is defined as $y_i$. To obtain a larger receptive field, for $i > 1$ we add $y_{i-1}$ to $x'_{h1,i}$ before feeding it into $K_i$. Finally, the outputs $y_i$ obtained from each layer are concatenated to obtain $Y = [y_1, y_2, \ldots, y_m]$, where "[·]" denotes the concatenation operation. To avoid the vanishing gradient problem in the network, we add a residual connection, so that the feature map $x'_{h1}$ obtained from the $1 \times 1$ convolution before the hierarchical residual learning is passed on to the deeper layers of the network and contributes to the output $x_{h2}$ of the MSL module.

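The hierarchical residual rule of the MSL module ($y_1 = K_1(x_1)$ and $y_i = K_i(x_i + y_{i-1})$ for $i > 1$, followed by concatenation and a residual connection) can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: it operates on the channel vector of a single spatial location, the callables stand in for the self-calibrated convolutions $K_i$, the function names are hypothetical, and the residual connection is assumed to be a plain element-wise addition of the input.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Transform = Callable[[Vector], Vector]  # stand-in for a self-calibrated convolution


def split_channels(x: Sequence[float], m: int) -> List[Vector]:
    """Split a channel vector into m equally sized subsets."""
    if len(x) % m != 0:
        raise ValueError("number of channels must be divisible by m")
    size = len(x) // m
    return [list(x[k * size:(k + 1) * size]) for k in range(m)]


def hierarchical_residual(x: Sequence[float], transforms: Sequence[Transform]) -> Vector:
    """Apply y_1 = K_1(x_1), y_i = K_i(x_i + y_{i-1}), concatenate, add the input."""
    subsets = split_channels(x, len(transforms))
    outputs: List[Vector] = []
    previous: Vector = []
    for subset, k_i in zip(subsets, transforms):
        if previous:  # every subset after the first also receives y_{i-1}
            subset = [a + b for a, b in zip(subset, previous)]
        previous = k_i(subset)
        outputs.append(previous)
    concatenated = [value for y in outputs for value in y]
    # Residual connection (assumed to be element-wise addition of the input).
    return [a + b for a, b in zip(concatenated, x)]


def double(v: Vector) -> Vector:
    """Toy transform used only to make the data flow easy to follow."""
    return [2.0 * t for t in v]


result = hierarchical_residual([1.0, 2.0, 3.0, 4.0], [double, double])
```

In the toy call, the second subset is transformed after the output of the first has been added to it, which is how later subsets see a progressively larger receptive field when the transforms are real convolutions.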
D. FF Module
After obtaining the feature representations of HSI and LiDAR data, how to combine them for classification tasks remains a critical issue. Most existing approaches use the concatenation operation to aggregate them. However, this approach not only increases the feature dimensionality but also ignores the contextual information, making the fusion ineffective. Inspired by the visual attention mechanism, we propose an attention-based FF module, as shown in Fig. 6. The FF module contains three inputs: $x_{h3}$ and $x_{l3}$, which denote the HSI features and LiDAR data features obtained after layer3, respectively, and $x_{hl}$, which is the result of the element-wise summation of the HSI and LiDAR data features. As mentioned in Section B, location information is crucial to the multimodal remote sensing image classification task, so we apply coordinate attention to $x_{hl}$ to perform feature enhancement. Direct summation of two feature maps cannot maximize the interaction between them. Therefore, inspired by attentional FF [41], and to achieve complementary utilization of HSI and LiDAR data features in the FF stage, we use two $1 \times 1$ convolution operations $f_1$ and $f_2$ to reduce the dimension of $x_{h3}$ and $x_{l3}$ by half, respectively, and then sum the two modal features in this low-dimensional feature space. A $1 \times 1$ convolution then changes the dimension of the feature map to 1, and the sigmoid activation function is used to obtain the non-normalized attention map, which is multiplied with each of the original inputs; the two results are then added. This gives a feature representation that fully considers the relationship between HSI and LiDAR data features. Finally, these features are normalized by the softmax function and multiplied with $\mathrm{CA}(x_{hl})$ to obtain the fused features, which contain location information and maximize feature interaction. In this process, $\delta(\cdot)$ and $\theta(\cdot)$ are the sigmoid function and the softmax function, respectively, and $\mathrm{CA}(\cdot)$ is the coordinate attention.
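The fusion formula itself does not appear in this text. One reading of the description above is sketched below, and it should be treated as an assumption rather than the authors' exact equation. The names $a$, $\tilde{x}_{hl}$, $x_{\mathrm{fused}}$, and $f_3$ (the $1 \times 1$ convolution that reduces the channel dimension to 1) are hypothetical. The third line applies the same map $a$ to both inputs, as the wording suggests, whereas the attentional FF of [41] weights its two inputs with $a$ and $1 - a$; the source does not show which variant is used.

```latex
\begin{align}
x_{hl} &= x_{h3} \oplus x_{l3} \\
a &= \delta\Big(f_3\big(f_1(x_{h3}) \oplus f_2(x_{l3})\big)\Big) \\
\tilde{x}_{hl} &= \big(a \otimes x_{h3}\big) \oplus \big(a \otimes x_{l3}\big) \\
x_{\mathrm{fused}} &= \theta(\tilde{x}_{hl}) \otimes \mathrm{CA}(x_{hl})
\end{align}
```

Here $\oplus$ and $\otimes$ denote element-wise addition and multiplication, with the single-channel map $a$ broadcast over the channel dimension.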

A. Data Description
To test the effectiveness of our proposed MSLAENet, we conducted experiments on three widely used hyperspectral and LiDAR fusion datasets.
1) Houston Dataset: This dataset was acquired by the National Science Foundation-funded Center for Airborne Laser Mapping in June 2012 over the University of Houston campus and its surrounding area [42]. It includes both HSI and LiDAR data, with 144 bands and 1 band, respectively. Both contain 349 × 1905 pixels with a spatial resolution of 2.5 m. There are 15 different classes, and the pseudocolor image of HSI, the grayscale map of LiDAR data, and the ground truth map are shown in Fig. 10(a)-(c), respectively. Table I shows the detailed classes and the number of samples used for training and testing for each category.
2) Trento Dataset: This dataset also contains one HSI and one LiDAR image, collected from a rural area south of the city of Trento, Italy. The HSI data were collected by the AISA Eagle sensor with 63 bands, and the LiDAR data were collected by the Optech ALTM 3100EA sensor. Both types of data contain 166 × 600 pixels with a spatial resolution of 1 m and cover a total of six different classes. Fig. 11(a)-(c) shows the pseudocolor image of HSI, the grayscale map of LiDAR data, and the ground truth map, respectively. Table II shows the detailed categories and the number of samples used for training and testing for each category.
3) MUUFL Dataset: This dataset was collected over the University of Southern Mississippi Gulf Park campus. Both the HSI and LiDAR data contain 325 × 220 pixels. The HSI initially contained 72 bands; the first four and last four bands were removed because of noise, and the remaining 64 bands were used for the experiments. The LiDAR data contain two bands. There are 11 different classes, and the pseudocolor image of HSI, the grayscale map of the first band of LiDAR data, and the ground truth map are shown in Fig. 12(a)-(c), respectively. Table III shows the detailed classes and the number of samples used for training and testing for each category.

B. Parameter Tuning
Our network is implemented in the PyTorch framework. All experiments in this article were conducted on a personal computer configured with an Intel Xeon W-2133 CPU, 32 GB RAM, an NVIDIA GeForce RTX 2080 graphics card, and Windows 10. In the model training process, the Adam algorithm is used to optimize our network, and cross-entropy is used as the loss function. We choose three commonly used evaluation metrics to assess the classification performance, namely overall accuracy (OA), average accuracy (AA), and the Kappa coefficient.
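For reference, the three metrics can be computed from a confusion matrix as in the sketch below, which uses their standard definitions. The function name is hypothetical, rows are assumed to index the true class and columns the predicted class, and every class is assumed to have at least one test sample. The 2 × 2 matrix in the usage lines is made-up toy data, not a result from this article.

```python
from typing import List, Tuple


def classification_metrics(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """Return (OA, AA, Kappa) for a confusion matrix indexed [true][predicted]."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    # Overall accuracy: fraction of all samples that are classified correctly.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (recalls).
    aa = sum(confusion[k][k] / sum(confusion[k]) for k in range(n)) / n

    # Kappa: agreement corrected for the agreement expected by chance.
    column_sums = [sum(confusion[r][k] for r in range(n)) for k in range(n)]
    expected = sum(sum(confusion[k]) * column_sums[k] for k in range(n)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


# Toy example: 10 samples per class, 17 of 20 classified correctly.
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

For this toy matrix the chance agreement is 0.5, so an OA of 0.85 corresponds to a Kappa of about 0.7, which illustrates why Kappa is usually lower than OA.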
The setting of deep learning network parameters has a great influence on the model performance. In this section, we analyze the influence of the number of feature mappings M, the input patch size s, the number of principal components P, and the learning rate.
2) Analysis on the Input Patch Size: Different input patch sizes contain different amounts of information. To evaluate the impact of this parameter on the model performance, we compare the classification results of different input patch sizes on the three datasets. For all datasets, we set M to its optimal value and fix the other parameters to their default values. Considering that an overly large patch size increases the learning time of the network, we select the value of s from the candidate set {5, 7, 9, 11}. As can be seen in Fig. 7, the size of the input patch has a significant impact on the model performance, especially on the Houston dataset, and the optimal patch sizes for the Houston, Trento, and MUUFL datasets are 7 × 7, 9 × 9, and 7 × 7, respectively.
3) Analysis on the Number of Principal Components: The number of principal components determines the dimensionality of the input HSI. To evaluate the impact of this parameter, we conduct experiments on the three datasets. For all datasets, we set M and s to their optimal values, keep the other parameters at their default values, and select the value of P from the candidate set {10, 20, 30, 40}. Fig. 8 shows the overall accuracy achieved with different values of P on the different datasets. The optimal P values for the Houston, Trento, and MUUFL datasets are 30, 20, and 20, respectively.

4) Analysis on the Learning Rate: The learning rate controls how strongly the network weights are adjusted along the gradient of the loss function, so it has a large impact on the model performance. To evaluate the impact of the learning rate on the performance of MSLAENet, for each dataset we fix the other parameters to their optimal values and select the best learning rate experimentally from the candidate set {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05}. Fig. 9 reports the overall accuracy achieved with different learning rates on the different datasets. The best learning rates for the Houston, Trento, and MUUFL datasets are 0.001, 0.005, and 0.005, respectively.
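The tuning procedure described in this section varies one parameter at a time while keeping already tuned parameters at their best values and the remaining ones at their defaults. A minimal sketch of such a search is given below. The function names, the default values, and the scoring function are hypothetical: `toy_score` is a made-up stand-in for training the network and measuring OA, constructed to peak at s = 7, P = 30, and a learning rate of 0.001 purely for illustration. The candidate set for M is not given in this text, so M is left out.

```python
from typing import Callable, Dict, Sequence, Tuple

Params = Dict[str, float]


def tune_one_at_a_time(
    defaults: Params,
    candidates: Sequence[Tuple[str, Sequence[float]]],
    evaluate: Callable[[Params], float],
) -> Params:
    """Tune parameters sequentially, fixing each one at its best value."""
    best = dict(defaults)
    for name, values in candidates:
        # Score every candidate value with all other parameters held fixed.
        scored = [(evaluate({**best, name: value}), value) for value in values]
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best


def toy_score(params: Params) -> float:
    """Made-up score; a real run would train the model and return its OA."""
    return (
        -((params["s"] - 7) ** 2)
        - ((params["P"] - 30) ** 2) / 100
        - abs(params["lr"] - 0.001)
    )


search_space = [
    ("s", [5, 7, 9, 11]),                                 # input patch size
    ("P", [10, 20, 30, 40]),                              # principal components
    ("lr", [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]),   # learning rate
]
best = tune_one_at_a_time({"s": 5, "P": 10, "lr": 0.0001}, search_space, toy_score)
```

A sequential search evaluates the sum of the candidate set sizes (here 14 configurations) instead of their product (96 for a full grid), at the cost of missing interactions between parameters.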

C. Classification Performance
To highlight the superiority of MSLAENet, we selected seven classification methods for comparison, including two traditional machine learning algorithms, SVM [43] and ELM [44], and five state-of-the-art deep learning methods: the contextual deep CNN model CDCNN [45], the two-branch CNN model TBCNN [26], the encoder-decoder structure-based fusion network EndNet [30], the dual attention-based spectral-spatial fusion network FusAtNet [31], and the spatial-spectral cross-modal enhancement network S2Enet [46]. Among them, CDCNN is a classical network used for HSI classification. For the conventional methods and the CDCNN model, we fused LiDAR data and HSI at the data level and used the result as the input. For a fair comparison, we used the same training and test sets in all methods. Tables V-VII show the OA, AA, Kappa, and per-category accuracies obtained using different methods on the Houston, Trento, and MUUFL datasets, respectively, and the bold values in the tables represent the optimal values of the corresponding rows. From the tables, we can draw the following conclusions.

1) Quantitative Comparison: The performance of deep learning-based methods is generally higher than that of traditional methods. For example, on the Houston dataset, the highest OA achieved by the traditional methods is 5% lower than the lowest OA achieved by the deep learning methods. There are two reasons for this. First, deep learning methods have a stronger feature representation capability than traditional methods. Second, the traditional methods fuse multimodal data at the data level before classification, which cannot effectively fuse the information across modalities.
Among all deep learning-based methods, our proposed network obtains the best classification performance. Specifically, for the Houston, Trento, and MUUFL datasets, we achieved 96.47%, 99.33%, and 92.62% OA, respectively, and the AA and Kappa metrics were higher than those of the other comparison algorithms. For the Houston dataset, our proposed method achieves a more significant improvement: the OA is 9.55%, 8.49%, 7.95%, 6.49%, and 2.28% higher than that of CDCNN, TBCNN, EndNet, FusAtNet, and S2Enet, respectively. The weaknesses of the other methods are not difficult to identify. The CDCNN method stacks HSI and LiDAR data as input to the network, and this data-level fusion ignores the differences between modalities and cannot effectively fuse the information of each modality. In TBCNN, the information of each branch cannot be effectively fused by a simple feature cascade. The learning ability of the encoder-decoder-based feature representation in EndNet is still limited. FusAtNet is the first method to use cross-attention in the multimodal remote sensing classification task to achieve enhancement from one modality to another, and S2Enet proposes cross-modal interaction learning before FF to enhance the information representation of each modality. However, none of them considers the multiscale information in remote sensing images, and they lack efficient FF methods. On the one hand, our model fully extracts the spatial and spectral information in multimodal remote sensing data through the attention mechanism and achieves spatial enhancement of HSI data through the cross-modal attention mechanism. On the other hand, the introduction of multiscale information extracts more scale-related information that helps classification.
In addition, our proposed fusion method will introduce location information and fully consider the relationship between HSI and LiDAR data, which can effectively integrate the complementary information between different modalities and improve the classification accuracy.
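For readers who wish to reproduce the three metrics used in this comparison, the following is a minimal sketch of how OA, AA, and Kappa are conventionally computed from a confusion matrix. The function name `classification_metrics` and the toy matrix are illustrative assumptions and are not taken from the MSLAENet implementation. The sketch also assumes that every class has at least one test sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    confusion[i][j] is the number of test pixels whose true class is i
    and whose predicted class is j (hypothetical layout for this sketch).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all test pixels classified correctly.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies, so that small
    # classes weigh as much as large ones.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Cohen's kappa: observed agreement corrected for chance agreement.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy two-class example with an imbalanced test set.
oa, aa, kappa = classification_metrics([[90, 10], [5, 15]])
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
```

In this toy example, AA is lower than OA because the smaller class is classified less accurately, which is why the three metrics are reported together.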
2) Visual Comparison: To better demonstrate the classification performance of the different methods, Figs. 10-12 show the classification maps obtained by the different classification methods on the Houston, Trento, and MUUFL datasets, respectively. For comparison, we also show the ground truth, in which different colors represent different land cover types. The classification maps obtained by MSLAENet contain fewer misclassified pixels and are more similar to the corresponding ground truth. This is especially visible on Houston, where the classification accuracy for categories C1, C7, C11, C12, and C13 far exceeds that of the other comparison algorithms, which further validates the advantages of the model in this article.

3) Computation Time Comparisons: In general, deep learning methods tend to require more time because of their complex model structures. To quantitatively analyze the computational cost of the different methods, we set the number of training epochs of all methods to 200 and report their training and testing times on all datasets in Table VIII. Since the traditional methods have simple models and consume less time, we omit them from the report. The table shows that more complex datasets tend to require more training time. In addition, training MSLAENet takes more time than every compared method except FusAtNet, which is due to the introduction of several attention modules. However, the increase in time is acceptable because our proposed method achieves the best classification accuracy.
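A timing protocol of this kind can be sketched as follows. This is only an illustrative harness under stated assumptions: `train_one_epoch` and `predict` are hypothetical callables standing in for one training pass and one full test pass of a given method, and the helper names `timed` and `benchmark` are not from the original experiments.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) from a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(train_one_epoch, predict, epochs=200):
    """Return (training_seconds, testing_seconds) for one method on one dataset.

    train_one_epoch and predict are placeholders (assumptions of this sketch)
    for the method-specific training and inference routines.
    """
    train_seconds = 0.0
    for _ in range(epochs):
        _, elapsed = timed(train_one_epoch)
        train_seconds += elapsed
    _, test_seconds = timed(predict)
    return train_seconds, test_seconds
```

When the model runs on a GPU, the device should be synchronized before each clock reading, because kernel launches are asynchronous and the wall-clock time would otherwise be underestimated.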

D. Ablation Study
To further evaluate the contribution of each module in MSLAENet, ablation experiments were carried out. A three-layer CNN is used as the baseline network, in which HSI features and LiDAR features are fused by stacking. We gradually add the CA, MSL, and FF modules to the baseline network and analyze the impact of each module on the network performance through different combinations of modules. Table IX shows the experimental results obtained with individual modules and with different combinations of modules on the different datasets. The analysis of these results shows that the three modules proposed in this article improve the classification results to different degrees.
To test the contribution of each attention branch and of the cross-modal enhancement (CME) approach in the CA module, we conducted an ablation study, and the experimental results are shown in Table X. SpeAtt_H denotes spectral attention for HSI, SpaAtt_L denotes spatial attention for LiDAR data, and SpaAtt_H denotes spatial attention for HSI, namely CME. It can be seen that adding spatial and spectral attention enhances the feature representation of HSI and LiDAR data and achieves better classification results, while the CME approach further improves the classification results. This is because CME lets the HSI branch acquire the spatial information of the LiDAR branch, which strengthens the feature representation of HSI.
To test the contribution of self-calibrated convolution in the MSL module, we compared the classification results of the MSL module with self-calibrated convolution and with vanilla convolution. As shown in Table XI, self-calibrated convolution yields a higher classification OA.
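The idea behind CME, in which spatial information from the LiDAR branch reweights the HSI features, can be illustrated with a deliberately simplified sketch. It is not the CA module of MSLAENet: the real module is learned, whereas here the spatial mask is simply the sigmoid of the channel-wise mean of the LiDAR features. Both this choice and the function names `spatial_mask` and `enhance_hsi` are assumptions made for illustration only.

```python
import math


def spatial_mask(lidar_feats):
    """Collapse LiDAR feature maps (C x H x W nested lists) into an H x W mask in (0, 1).

    Assumption of this sketch: the mask is the sigmoid of the channel mean,
    standing in for a learned spatial attention layer.
    """
    channels = len(lidar_feats)
    height, width = len(lidar_feats[0]), len(lidar_feats[0][0])
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            mean = sum(lidar_feats[k][r][c] for k in range(channels)) / channels
            row.append(1.0 / (1.0 + math.exp(-mean)))  # sigmoid
        mask.append(row)
    return mask


def enhance_hsi(hsi_feats, mask):
    """Multiply every HSI channel (C x H x W) element-wise by the LiDAR-derived mask."""
    return [
        [[value * mask[r][c] for c, value in enumerate(row)]
         for r, row in enumerate(channel)]
        for channel in hsi_feats
    ]


# Toy example: one LiDAR channel and two HSI channels on a 1 x 2 patch.
lidar = [[[0.0, 0.0]]]
hsi = [[[2.0, 4.0]], [[6.0, 8.0]]]
enhanced = enhance_hsi(hsi, spatial_mask(lidar))
```

The same mask is shared by all HSI channels, so the operation changes where the HSI branch attends spatially without altering the relative weighting of its spectral channels.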
Moreover, we conducted additional ablation experiments on all datasets to explore how the classification accuracy changes with the number of training samples. Table XII shows the performance when 30% to 100% of the training samples are used, where 100% corresponds exactly to the numbers of training samples listed in Tables I-III. It can be seen that as the number of training samples increases, the classification OA also increases.
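One way to build such reduced training sets is sketched below. The sketch assumes that the reduction is applied per class, so that every land cover type keeps the same proportion of its samples. The text does not state how the subsets were drawn, so this per-class scheme, the fixed seed, and the function name `subsample_per_class` are all assumptions.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices that keep `fraction` of the training samples of each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one sample so that no class disappears from the training set.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Toy example: 10 samples of class 0 and 20 samples of class 1, reduced to 30%.
labels = [0] * 10 + [1] * 20
subset = subsample_per_class(labels, 0.3)
print(len(subset))
```

With `fraction=1.0` the function returns every index, which corresponds to the full training sets of Tables I-III.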

IV. CONCLUSION
In this article, a network for HSI and LiDAR data fusion classification is proposed. It uses a self-attention mechanism to adaptively extract spectral and spatial information from HSI and LiDAR data, and it uses cross-attention to achieve CME, in which LiDAR data enhance the feature representation of HSI data. Self-calibrated convolution and hierarchical residual connections are used to construct the MSL module, which extracts multiscale information from remote sensing images for classification. In addition, we construct a new attention-based FF module that takes location information into account and fully considers the information complementarity between the two modalities. The effectiveness of the proposed algorithm is demonstrated by experiments on three commonly used HSI and LiDAR classification datasets and by comparison with other state-of-the-art methods. However, the experiments with different numbers of training samples show that our method depends heavily on labeled samples. In future work, we will consider using weakly supervised or self-supervised techniques to alleviate this problem.
Weijun Gong received the bachelor's degree in electronic information engineering from the Wuhan University of Science and Technology, Wuhan, China, in 2009, and the master's degree in computer application technology from the Lanzhou University of Technology, Lanzhou, China, in 2012. He is currently working toward the doctoral degree in computer science and technology with Xinjiang University, Urumqi, China.
His research interests include deep learning, image classification, and emotional expression analysis.
Zhuang Chu received the bachelor's degree in software engineering from Xinjiang University, Urumqi, China, in 2021. He is currently working toward the master's degree in software engineering with Xinjiang University, Urumqi, China.
His research interests include deep learning and remote sensing image classification.
Hui Liu received the bachelor's degree in software engineering from Xinjiang University, Urumqi, China, in 2014, and the Master of Engineering degree from the College of Software, Xinjiang University, Urumqi, China, in 2017. She is currently working toward the Ph.D. degree in computer science and technology with Xinjiang University, Urumqi, China.
Her research interests include deep learning, opportunistic networks, and the processing of remote sensing image data.