Multimodal Attention-Aware Convolutional Neural Networks for Classification of Hyperspectral and LiDAR Data

The attention mechanism is one of the most influential ideas in the deep learning community and has shown excellent efficiency in various computer vision tasks. Thus, this article proposes a convolutional neural network (CNN) method with an attention mechanism to enhance feature extraction from light detection and ranging (LiDAR) data. Meanwhile, our elaborately designed cascaded block contains a short path architecture beneficial for multistage information exchange. With full exploitation of the elevation information from LiDAR data and efficient utilization of the spatial-spectral information underlying hyperspectral data, our method provides a novel solution for multimodal feature fusion. Experiments are conducted on the LiDAR and hyperspectral dataset provided by the 2013 IEEE GRSS Data Fusion Contest and on the multisource Trento dataset to demonstrate the effectiveness of the proposed method. The experimental results show the superiority of the proposed method over several popular baselines on both LiDAR and multimodal remote sensing data.


I. INTRODUCTION
Remote sensing image classification plays an essential role in Earth observation and can be used for analyzing critical information related to urban planning, natural resources management, climate change, environmental monitoring, and so on. Remote sensing data acquired from various sensors can exploit multiple physical characteristics of ground objects [1], [2], [3], [4], [5]. With the blooming development of remote sensing sensors, more and more researchers in the remote sensing community are pursuing algorithmic innovations to better extract the most valuable features from multimodal remote sensing data [6]. However, efficiently extracting the features of ground objects remains complicated and challenging. For example, it is hard to distinguish different ground objects in a downtown area with a high building density. In such cases, combining multiple sources of remote sensing data can help an algorithm produce more precise classification results [7].
Hyperspectral image (HSI) can provide detailed spectral information of various ground cover types due to its broad wavelength coverage and high sampling rate [8], [9]. Usually, an HSI contains dozens or hundreds of spectral bands ranging from visible light (0.4-0.7 μm) to short-wave infrared (up to almost 2.4 μm). Thus, an HSI with sufficient spectral information can discriminate ground objects with similar spatial features [10]. Nevertheless, hyperspectral data contain neither the height information of ground objects nor high-resolution spatial information. Meanwhile, complex mixed pixels and noisy signals hinder precise classification results [11], [12]. With HSI alone, it is hard to distinguish ground objects that share similar spectral and spatial features, and many researchers have tried to improve HSI classification accuracy [13], [14], [15], [16], [17], [18], [19], [20]. To this end, LiDAR data can provide elevation information for extracting more precise features of various ground objects, making LiDAR a beneficial source for complementing the information provided solely by HSI [21].
Researchers have proposed a series of methods in recent years to better address remote sensing image classification tasks using HSI and LiDAR data. To further strengthen the spatial feature, filtering-based methods have been proposed, which can extract regional geometrical features while preserving the most critical spatial characteristics of HSI [22], [23], [24], [25], [26]. However, filtering-based methods mostly increase the dimensionality of the multimodal remote sensing data, which may introduce the curse of dimensionality and decrease classification accuracy. Furthermore, the nonlinear characteristics of the spectral information in HSI are amplified when integrated with LiDAR data. Deep learning-based methods [27], [28], [29], [30], [31] can extract more complex and hierarchical features from multimodal remote sensing data, and in recent years they have achieved better classification results than classical machine learning methods (e.g., support vector machines [32] and extreme learning machines [33], [34]).

A. Motivation
Deep learning-based methods perform better at extracting complicated multimodal features than traditional machine learning methods [25], [31]. Among deep learning-based methods, most use different kinds of convolutional neural networks (CNNs) to extract the features acquired from different modalities. Meanwhile, much work has tried to combine filtering-based methods with CNNs to introduce more expert experience [25]. The extracted features are then fused by concatenation or pointwise addition. Besides, Hong et al. [27] designed common subspace representations to extract an integrated multimodal remote sensing feature with EndNet, which follows a deep encoder-decoder network architecture. Furthermore, researchers have also attempted to combine graph-based methods with CNNs to preserve the spatial edge information of ground objects [35].
Attention mechanism methods have become popular in the natural language processing and computer vision areas in recent years [36]. In the remote sensing community, researchers have also conducted experiments to explore the positive impact of the attention mechanism on deep learning-based methods [37], [38], [39], [40], [41], [42], [43]. When combined with a CNN, the attention mechanism can focus on the most vital features and weaken the impact of unnecessary ones.
Hence, there is potential space for us to explore the impact of attention mechanisms on multimodal remote sensing image classification. In the following section, we illustrate the main contributions of this research.

B. Contribution
The framework of our proposed method is shown in Fig. 1. More concretely, the significant contributions of this article can be summarized in the following two aspects.
1) LiDAR attention module blocks: An attention mechanism module is applied to emphasize the most meaningful information contained in the LiDAR data. In this way, the feature extracted from LiDAR can better contribute to the whole multimodal data feature and the final classification.
2) Cascaded block with short paths: Our elaborately designed cascaded block contains a short path architecture that is beneficial for multistage information exchange.

II. RELATED WORK
In this section, we briefly introduce the background of CNNs and the attention mechanism.

A. Convolution Neural Network
The CNN is an efficient deep learning model for extracting hierarchical features from image data. A CNN contains a series of convolution layers, pooling layers, and activation functions [44]. Researchers have explored the efficiency of CNN-based algorithms on multimodal remote sensing image classification tasks.
Hang et al. [28] designed a simple two-stream CNN to extract features from hyperspectral and LiDAR data separately. Because remote sensing data cover a large area, the input to a CNN is usually a patch derived from the remote sensing data (such as a 5 × 5 LiDAR image patch). Besides, owing to the abundant spectral information in HSI, Xu et al. designed a one-dimensional CNN and a two-dimensional CNN to extract the spectral and spatial features separately.
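To make the patch-based input concrete, the following minimal NumPy sketch extracts a 5 × 5 window centered at a labeled pixel, zero-padding the image borders so every pixel yields a full patch; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def extract_patch(image, i, j, patch_size=5):
    """Extract a patch_size x patch_size window centered at pixel (i, j).

    `image` is an (H, W, C) array; borders are zero-padded so that every
    labeled pixel yields a full patch.
    """
    r = patch_size // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="constant")
    return padded[i:i + patch_size, j:j + patch_size, :]

# e.g., a 5 x 5 DSM patch around the labeled pixel at (10, 20)
lidar_dsm = np.random.rand(349, 1905, 1).astype(np.float32)  # stand-in data
patch = extract_patch(lidar_dsm, 10, 20)                     # shape (5, 5, 1)
```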
However, the features extracted by a CNN are strongly influenced by the network architecture [45]. To further explore potential features, we propose a modified network to better exploit multimodal hierarchical features.

B. Attention Mechanism
As mentioned previously, the attention mechanism can recalibrate the importance of the various features derived from the CNN output. Several researchers have shown that the attention mechanism positively impacts the HSI classification task [46], [47]. Mei et al. [48] proposed a spatial attention CNN and a spectral attention recurrent neural network and proved the effectiveness of the attention mechanism in HSI classification.
Nevertheless, the efficiency of the attention mechanism on the multimodal remote sensing image classification task still needs to be evaluated.

III. METHODOLOGY
In this section, we first introduce the proposed algorithm framework and its training process. Then the cascaded CNN is illustrated. Finally, we focus on the proposed cascaded attention CNN.

A. Method Overview
We introduce hierarchical CNNs to extract the image features contained in the HSI and LiDAR DSM data separately. Each CNN outputs its extracted image feature as a one-dimensional vector. We concatenate the derived features into the multimodal image feature during the fusion stage. Then, a Softmax classifier is applied for the classification task.
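As a minimal sketch of this fusion stage, assuming a PyTorch implementation (the feature dimensions and class count below are placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

# f_hsi and f_lidar stand for the one-dimensional vectors produced by the
# two CNN branches; the dimensions and class count below are placeholders.
f_hsi = torch.randn(100, 96)                      # batch of HSI features
f_lidar = torch.randn(100, 64)                    # batch of LiDAR DSM features

fused = torch.cat([f_hsi, f_lidar], dim=1)        # multimodal feature
classifier = nn.Linear(fused.shape[1], 15)        # e.g., 15 Houston classes
probs = torch.softmax(classifier(fused), dim=1)   # per-class probabilities
```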

B. Hyperspectral CNN
We have designed a Co-CNN hybrid network for feature extraction from the HSI $H \in \mathbb{R}^{M \times N \times K}$ to exploit both the two-dimensional spatial and one-dimensional spectral HSI features. To better capture the HSI spatial feature, we adopt the $9 \times 9$ patch $H^{spatial}_{ij} \in \mathbb{R}^{9 \times 9}$, whose centered pixel $p_{ij}$ has been labeled with ground truth, as the training sample for the 2-D CNN. For the spectral signal data, we take the one-dimensional spectral sample $H^{spectral}_{ij} \in \mathbb{R}^{1 \times K}$ with ground truth as the 1-D CNN input.
Both the 1-D CNN and the 2-D CNN consist of five convolution layers with batch normalization and the exponential linear unit (ELU) activation function. Batch normalization provides the training process with higher training efficiency. Besides, we adopt ELU activation functions to avoid exploding-gradient problems and to accelerate the training process. The spatial feature $F_{spatial} \in \mathbb{R}^{1 \times p}$ derived from the HSI patches and the spectral feature $F_{spectral} \in \mathbb{R}^{1 \times q}$ are concatenated at the feature fusion stage. The fused feature $F_{HSI} = [F_{spectral}, F_{spatial}] \in \mathbb{R}^{1 \times (p+q)}$ then passes through a fully connected layer and the Softmax loss function to predict the classification result as

$$P(y = c \mid F_{HSI}) = \frac{\exp(F_{HSI} W)_c}{\sum_{c'=1}^{C} \exp(F_{HSI} W)_{c'}} \tag{1}$$

where $W \in \mathbb{R}^{(p+q) \times C}$ represents the weight matrix of the prediction layer, $C$ is the number of categories, $\exp(F_{HSI} W)_c$ is the exponential of the element corresponding to class $c$, and the left-hand side gives the probability that the pixel belongs to each category.
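A hedged PyTorch sketch of such a two-branch HSI network is given below. The five Conv-BN-ELU layers per branch, the feature concatenation, and the Softmax prediction follow the description above; the channel widths, the pooling layers, and the single-channel spatial input are assumptions.

```python
import torch
import torch.nn as nn

def conv_block_1d(c_in, c_out):
    # Conv -> BatchNorm -> ELU, as described for both branches
    return nn.Sequential(nn.Conv1d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm1d(c_out), nn.ELU())

def conv_block_2d(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ELU())

class HsiCnn(nn.Module):
    """Two-branch HSI network: a 1-D spectral branch and a 2-D spatial
    branch, five Conv-BN-ELU layers each; widths and pooling are assumed."""

    def __init__(self, n_classes=15, width=32):
        super().__init__()
        self.spectral = nn.Sequential(*[conv_block_1d(1 if i == 0 else width,
                                                      width) for i in range(5)])
        self.spatial = nn.Sequential(*[conv_block_2d(1 if i == 0 else width,
                                                     width) for i in range(5)])
        self.pool1d = nn.AdaptiveAvgPool1d(1)
        self.pool2d = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2 * width, n_classes)   # prediction layer W

    def forward(self, x_spec, x_patch):
        # x_spec: (B, 1, K) spectral signature; x_patch: (B, 1, 9, 9) patch
        f_spec = self.pool1d(self.spectral(x_spec)).flatten(1)
        f_spat = self.pool2d(self.spatial(x_patch)).flatten(1)
        f_hsi = torch.cat([f_spec, f_spat], dim=1)  # fused HSI feature
        return self.fc(f_hsi)   # logits; Softmax is applied inside the loss
```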

C. LiDAR Cascaded Attention CNN
The overall LiDAR neural network contains the cascaded block and the attention block. Following the descending kernel size strategy used in the cascaded block, the raw LiDAR patch data first pass through a 3 × 3 convolution with BN and ELU functions, as shown in Fig. 3.
Given the LiDAR patch image, the cascaded block and attention block help locate the key edge features of ground object height. Then ELU activation, max pooling, and flatten operations yield the one-dimensional LiDAR DSM feature.
1) Cascaded CNN: For the LiDAR image, we designed a cascaded CNN to exploit the ground object height information without losing key height features during propagation. In detail, we follow a descending kernel size strategy with skip connections and drop-out operations to exploit the valuable height features contained in the LiDAR data.
Following the training strategy of the HSI CNN shown in Fig. 4, the cascaded block keeps the combination of batch normalization and the ELU activation function to provide an effective and stable training process and reliable parameter learning. At the same time, a drop-out operation is applied to avoid trained features that lack multiscale characteristics.
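A minimal PyTorch sketch of one possible cascaded block follows; the 7 → 5 → 3 kernel schedule, the channel width, and the residual-style short path are assumptions consistent with the descending kernel sizes, skip connection, BN + ELU, and drop-out described above.

```python
import torch
import torch.nn as nn

class CascadedBlock(nn.Module):
    """One possible cascaded block: a descending 7 -> 5 -> 3 kernel schedule
    (an assumed schedule), BN + ELU after every convolution, drop-out, and a
    residual-style skip connection serving as the short path."""

    def __init__(self, channels=32, p_drop=0.25):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, k, padding=k // 2),
                          nn.BatchNorm2d(channels), nn.ELU())
            for k in (7, 5, 3)])
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        out = x
        for conv in self.convs:
            out = self.drop(conv(out))
        return out + x   # short path for multistage information exchange
```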
2) Attention Module: The attention module is mainly composed of a spatial attention module and a channel attention module. The detailed network architecture is shown in Fig. 5.
The object height feature extracted by the cascaded block is fed into an attention block, which exploits spatial and channelwise attention within an efficient framework. The attention block is mainly composed of the channel attention module and the spatial attention module. Defining the feature exploited by the cascaded block as $F \in \mathbb{R}^{M \times N \times H}$, the whole attention block can be written as

$$f_{channel}(F) = \varsigma\big(f_{MLP}(f_{Avg}(F)) \oplus f_{MLP}(f_{Max}(F))\big) \tag{2}$$

$$f_{spatial}(F) = \varsigma\big(f_{conv}(f_{Avg}(F) \oplus f_{Max}(F))\big) \tag{3}$$

$$F' = f_{channel}(F) \otimes f_{spatial}(F) \otimes F \tag{4}$$

where the operation $\oplus$ represents the elementwise sum between features, $\otimes$ represents elementwise multiplication, $\varsigma$ represents the ELU activation function, $f_{Avg}$ represents the average pooling function, $f_{Max}$ represents the max pooling function, $f_{MLP}$ represents the shared multilayer perceptron, and $f_{conv}$ represents the convolution in the spatial attention module.
In the channel attention part, we apply max pooling and global average pooling separately to the input feature, obtaining different descriptors that capture the edge and smooth features of the ground objects. Each descriptor then goes through a weight-shared multilayer perceptron $f_{MLP}$ with one hidden layer, which yields the channel attention map of size H × 1 × 1. An elementwise summation is then applied to the max-pooled and average-pooled features. Finally, following the same network design strategy, the fused features are activated by the ELU activation function for a smoother model training process.
In the spatial attention part, we generate a spatial attention map that highlights the interspatial object height information and enhances the corresponding spatial features. The feature separately goes through max-pooling and average-pooling layers along the channel axis. Then, these pooled features are fused with an elementwise summation. The fused feature is then convolved and activated by the ELU function to obtain the final spatial attention map $f_{spatial}(F)$. As shown in (4), the input feature is multiplied by $f_{spatial}$ and $f_{channel}$ to obtain the enhanced feature $F'$.
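The following PyTorch sketch implements the attention block as reconstructed in (2)-(4): a shared one-hidden-layer MLP over pooled channel descriptors, channel-axis pooling plus a convolution for the spatial map, ELU activations, elementwise sums for fusing the pooled descriptors, and elementwise multiplication with the input. The reduction ratio and the 7 × 7 convolution kernel are assumed values.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Channel + spatial attention following (2)-(4): a shared one-hidden-
    layer MLP over pooled channel descriptors, channel-axis pooling plus a
    convolution for the spatial map, ELU activations, and elementwise sums.
    The reduction ratio and 7 x 7 kernel are assumed values."""

    def __init__(self, channels=32, reduction=4, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(             # weight-shared MLP (f_MLP)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ELU(),
            nn.Conv2d(channels // reduction, channels, 1))
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)
        self.elu = nn.ELU()

    def forward(self, f):                     # f: (B, H, M, N)
        # channel attention map, (B, H, 1, 1)             -- eq. (2)
        avg_c = torch.mean(f, dim=(2, 3), keepdim=True)   # f_Avg
        max_c = torch.amax(f, dim=(2, 3), keepdim=True)   # f_Max
        ch = self.elu(self.mlp(avg_c) + self.mlp(max_c))
        # spatial attention map, (B, 1, M, N)             -- eq. (3)
        avg_s = torch.mean(f, dim=1, keepdim=True)
        max_s = torch.amax(f, dim=1, keepdim=True)
        sp = self.elu(self.conv(avg_s + max_s))
        return f * ch * sp                    # enhanced feature F' -- eq. (4)
```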

IV. EXPERIMENT
In this section, we introduce the experimental datasets, experimental settings, and final experimental results.

A. Dataset Description
In this experiment, we evaluate our algorithm on the Houston and Trento datasets, both of which contain LiDAR and HSI data, to assess the efficiency of the cascaded CNN and attention modules.
The Houston dataset [49] was captured over Houston, USA. It contains one airborne HSI and LiDAR DSM data with 349 × 1905 pixels, co-registered at a spatial resolution of 2.5 m. The HSI contains rich spectral information with 144 bands, captured by the CASI-1500 hyperspectral sensor over the 0.38-1.05 μm range.
The Trento dataset [50] is composed of HSI and LiDAR DSM data captured in Trento, Italy. The registered image size is 600 × 166 pixels with a 1-m spatial resolution. The hyperspectral data contain 63 bands ranging from 0.42 to 0.99 μm. The HSI and LiDAR data were captured by the AISA Eagle and Optech ALTM 3100EA sensors, respectively.
The original multimodal remote sensing datasets are two images in tiff format together with the ground-truth labels. To convert the raw data into standard model input, we standardized the data and recorded the location indices of the ground-truth samples.
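A minimal NumPy sketch of this preprocessing, assuming per-band standardization and a zero label for unlabeled pixels (the variable names are illustrative):

```python
import numpy as np

def standardize(image):
    # zero mean, unit variance per band
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True) + 1e-8
    return (image - mean) / std

hsi = standardize(np.random.rand(349, 1905, 144))    # stand-in for the tiff
labels = np.random.randint(0, 16, size=(349, 1905))  # 0 = unlabeled
rows, cols = np.nonzero(labels)                      # ground-truth locations
samples = list(zip(rows, cols, labels[rows, cols]))  # (row, col, class)
```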

B. Experimental Setting
To evaluate the efficiency of our proposed method and the compared methods under the same experimental conditions, we conducted all experiments on an Intel(R) Core(TM) i7-7700HQ CPU, a GTX 1060(Ti) GPU, 16 GB of RAM, and Ubuntu 18.04. We adopt the overall accuracy (OA), average accuracy (AA), and Kappa coefficient metrics to assess the algorithms' performance. Meanwhile, to ensure the reliability of the experiments, all reported results are averaged over ten runs with the same parameter settings.
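For reference, the three metrics can be computed from a confusion matrix as in the sketch below (a standard formulation, not code from the paper):

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and the Kappa coefficient from a C x C confusion matrix
    (rows: reference classes, columns: predicted classes)."""
    total = conf.sum()
    oa = np.trace(conf) / total                      # overall accuracy
    aa = (np.diag(conf) / conf.sum(axis=1)).mean()   # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                     # chance-corrected accuracy
    return oa, aa, kappa
```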
For the multimodal datasets, we randomly selected half of the training samples as validation data to help optimize performance. When training on the Houston dataset, we set the batch size to 100 and the number of training epochs to 80; for the Trento dataset, the batch size is 100 and the number of training epochs is 13. We utilized a fine-tuning strategy to improve the multimodal attention algorithm's performance during the training stage: we separately trained the hyperspectral CNN and the LiDAR cascaded CNN, saved the trained models, and then trained the multimodal neural network initialized from the saved models. We selected the Adam optimizer with a learning rate of 0.001 for the LiDAR data and 0.0001 for the hyperspectral data; for fine-tuning, we also chose Adam with a 0.001 learning rate. To avoid overfitting, we apply a dropout operation with a ratio of 0.25 in the fusion stage. Other parameters are listed in the framework.
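A condensed PyTorch sketch of this two-stage schedule follows; the branch modules are stand-ins, and only the optimizer settings (Adam, 1e-3/1e-4 learning rates, 0.25 dropout before the fusion classifier) come from the description above.

```python
import torch
import torch.nn as nn

# Placeholder branches standing in for the HSI CNN and the LiDAR cascaded
# attention CNN; only the optimizer settings follow the text above.
hsi_net = nn.Sequential(nn.Linear(144, 32), nn.ELU())
lidar_net = nn.Sequential(nn.Linear(81, 32), nn.ELU())

# Stage 1: pretrain each branch with its own learning rate, then save.
opt_hsi = torch.optim.Adam(hsi_net.parameters(), lr=1e-4)
opt_lidar = torch.optim.Adam(lidar_net.parameters(), lr=1e-3)
# ... per-branch training loops would run here ...
torch.save(hsi_net.state_dict(), "hsi.pt")
torch.save(lidar_net.state_dict(), "lidar.pt")

# Stage 2: initialize the multimodal network from the saved branches and
# fine-tune end to end, with 0.25 dropout before the fusion classifier.
hsi_net.load_state_dict(torch.load("hsi.pt"))
lidar_net.load_state_dict(torch.load("lidar.pt"))
head = nn.Sequential(nn.Dropout(0.25), nn.Linear(64, 15))
params = (list(hsi_net.parameters()) + list(lidar_net.parameters())
          + list(head.parameters()))
opt_finetune = torch.optim.Adam(params, lr=1e-3)
```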

C. Results and Analysis
This article compares the proposed method with classic machine learning methods, including SVM [32] and ELM [33].
Besides, we also introduce the fundamental CNN-based Co-CNN [28] method to further demonstrate the efficiency of the proposed cascaded attention network. The final experimental results are listed in Tables III and IV. As shown in Tables III to V, under the same training epochs, the proposed methods cost more training time but achieve higher classification performance with nearly the same number of model parameters. The proposed methods achieve better performance on the key OA, AA, and Kappa metrics.
As the classification results in Figs. 8 and 9 show, the deep learning-based methods achieve better classification performance than the classical machine learning methods on both datasets.
Our designed framework highlights the height information of LiDAR ground objects by utilizing an attention mechanism and a cascaded multiscale network. For the Houston data, the proposed multimodal method clearly performs better on trees and Parking lot 2, which are easily misclassified as the similar healthy grass and Parking lot 1 categories. For the Trento data, our proposed method achieves strong results on roads, which share similar spectral and spatial features with buildings owing to the sensors' overhead perspective. As shown in Fig. 9(a), (d), and (f), the Co-CNN method does not perform well on the road class (85.45% accuracy), where several pixels are classified as buildings because a distinctive LiDAR height feature is lacking. Besides, roads are easily predicted as ground owing to the similar heights of the road and ground classes. Our proposed method focuses on the LiDAR contextual spatial information through the multiscale cascaded network and uses the attention mechanism to enhance the valuable LiDAR information, achieving 93.51% accuracy.

V. CONCLUSION
In this article, our proposed multimodal attention-aware convolutional network fully utilizes the height features of ground objects provided by LiDAR data and achieves outstanding performance on easily confused categories as well as in overall accuracy. We also ran classic machine learning methods (SVM and ELM) and deep learning-based methods to compare with the proposed method. Given the substantial feature divergence between modalities, the proposed method demonstrates how to strengthen features derived from a source lacking sufficient original image information. In future work, we will continue to explore more ways to narrow the feature divergence among multiple modalities for better classification performance, not limited to feature augmentation, fusion methods, robustness evaluation, or more efficient training strategies.