A Dual Attention Neural Network for Airborne LiDAR Point Cloud Semantic Segmentation

With the development of airborne light detection and ranging (LiDAR) technology, it has become a common and efficient way to collect large-scale 3-D spatial information. However, efficient and automatic semantic segmentation of LiDAR data, in the form of 3-D point clouds, remains a persistent challenge. To address this, a dual attention neural network (DA-Net) is proposed, consisting of two different blocks, namely, augmented edge representation (AER) and elevation attentive pooling (EAP). First, the AER can adaptively represent local orientation and position, thereby effectively enhancing geometric information. Second, the captured local features of centroid points are utilized to further encode discriminative features using the EAP with the learned attention scores. Finally, a location homogeneity (LH) module is devised to explore the long-range relationship in an encoder–decoder network. Benefiting from the dual attention module, geometric information hidden in unorganized point clouds can be effectively propagated. Besides, the LH forces the network to pay attention to the semantic consistency of elevated objects, which facilitates both point- and object-level point cloud semantic segmentation for scene understanding. A benchmark dataset is used to assess the proposed method, which achieves an overall accuracy of 85.98% and an average F1 score of 72.31%. In addition, comparisons with other latest deep learning methods on the 2019 Data Fusion Contest dataset further demonstrate the robustness and generalization ability of the proposed method.


I. INTRODUCTION
Compared with oblique photogrammetry [1] and very high-resolution optical satellite images [2], airborne light detection and ranging (LiDAR) technology has the advantage of direct 3-D data acquisition with high precision and density. Nowadays, airborne LiDAR has been used in many surveying and geographic mapping applications, such as power corridor detection [3], [4], forest biomass estimation [5], 3-D modeling [6], and road extraction [7]. For these applications, semantic segmentation, in terms of assigning a label to each 3-D point, is a fundamental step in postprocessing. However, efficient semantic segmentation of airborne LiDAR point clouds remains a challenge due to the unordered and irregular nature of the data [8].

In previous years, many methods adopted classic machine learning classifiers and handcrafted features for point cloud semantic segmentation. These classifiers include support vector machine (SVM) [7], [9], random forest (RF) [10], [14], Markov random field (MRF) [11], conditional random field (CRF) [12], [13], expectation maximization (EM) [38], and AdaBoost [39]. Indeed, this type of method is likely to achieve satisfactory performance in certain scenes. However, feature engineering, referring to the creation of a set of handcrafted features, is often time-consuming. Besides, these methods often require prior knowledge of a given study area to choose an optimal classifier [14].

In recent years, deep learning methods, especially convolutional neural networks (CNNs), have shown the capacity to deal with challenging tasks, such as image classification [15], [16] and object detection [17], [18]. Although deep convolutional networks demonstrate outstanding performance in the field of 2-D computer vision, they cannot be directly applied to 3-D point clouds.
Some researchers choose to convert point clouds to images or regular 3-D voxels before feeding them into CNN-like architectures [19], [20], [21]. The key to these methods is to transform such disordered, unstructured data into a regular grid array for easier processing, which, however, could potentially cause spatial information loss. Moreover, determining the projection angle or selecting the voxel resolution remains an issue. To completely solve the disorder and permutation invariance of point clouds, a point-based method called PointNet, using a multilayer perceptron (MLP) and the max-pooling operation, was proposed [22]. Though simple, PointNet served as a milestone that inspired many subsequent studies.

shape distributions [74], and signatures of histograms [75]. Considering that the above 3-D descriptors can be expensive to compute, some work [14], [76] used more general covariance matrix-based features, which can effectively represent a local geometric pattern. Furthermore, a variety of supervised classification methods (e.g., SVM, RF, and AdaBoost) have been applied for ALS point cloud semantic segmentation. However, due to the neglect of contextual information from neighborhoods, the segmentation results of these methods are often noisy, which manifests as label inconsistency [40]. To address this issue, a common strategy is to use individual point class probability as a unary term in probabilistic graphical models, such as CRF and MRF [11], [13]. For example, Niemeyer et al. [41]

Besides, they perform a tradeoff with results obtained from the bagged decision tree classifier to optimize the wrong classification of boundary points [20]. To achieve a more efficient data conversion, Rizaldy et al. [42] suggested converting point clouds into a panoramic multichannel image (e.g., elevation, intensity, the number of echoes, and height difference).
Subsequently, an end-to-end fully convolutional network was applied for pixelwise classification [42]. However, the transformation process from 3-D to 2-D inevitably brings about the loss of geometric information, which leads to misclassification, especially in complex scenes [8].

[70] proposed an attention-based graph convolutional network to explore the relationship between local neighborhoods. Despite remarkable performance on indoor datasets, few works attempt to employ the attention mechanism for semantic segmentation of outdoor point clouds. Considering that the existing convolution operators cannot capture geometric features from the neighborhood, Li et al.
where [·] denotes the concatenation operation and ‖·‖ is the Euclidean distance between p_i and its neighboring point p_i^k.

For multidimensional orientation, we define a vector o_i^k ∈ R^4 to represent the direction from p_i to p_i^k. To model the local geometric information of each edge, a cylindrical reference frame is introduced, as shown in Fig. 1(a). Inspired by [66], an oriented coordinate system is defined at p_i whose reference plane passes through p_i and is oriented perpendicularly to the centroidal vector. To obtain this vector, the centroid p_i^c of the local neighborhood (i.e., KNN) is introduced, and the centroidal vector is defined as the unitized direction vector from p_i to p_i^c. Given a point p_i^k, the coordinate value α is defined as its perpendicular distance to the centroidal vector (the cylinder axis), and β is defined as its signed perpendicular distance to the reference plane.

To form o_i^k, we utilize a partition strategy that divides the feature space into four bins based on the cylindrical reference frame. As shown in Fig. 1(b) and (c), the 2-D space is first partitioned into S_a = 2 areas using half of α_max along the reference plane. Each area is further divided into S_z = 2 zones according to the sign of β along the centroidal vector. It is worth noting that, using this partition strategy, o_i^k is represented as a binary vector with a length of S_a × S_z; in other words, a value of 1 corresponds to the region of the cylindrical reference frame in which the neighbor is located. Overall, this local representation has two distinctive benefits: 1) the cylindrical reference frame enables us to learn rotation-invariant feature representations and 2) the centroidal vector can reduce the effect of disturbances, such as occlusion and uneven sampling, owing to the robustness of the centroid of the local neighborhood.
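As a rough illustration, the four-bin cylindrical encoding described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the paper's exact equations are not reproduced here, so the bin boundary (half of α_max) and the bin ordering are our reading of the text, and the function name is hypothetical.

```python
import numpy as np

def cylindrical_bin_encoding(p_i, neighbors):
    """Encode each neighbor of p_i as a binary 4-bin vector (S_a = 2, S_z = 2)
    in a cylindrical frame whose axis is the unitized centroidal vector."""
    centroid = neighbors.mean(axis=0)
    axis = centroid - p_i
    axis = axis / (np.linalg.norm(axis) + 1e-9)       # unit centroidal vector

    rel = neighbors - p_i                             # edge vectors from p_i
    beta = rel @ axis                                 # signed distance to the plane through p_i
    alpha = np.linalg.norm(rel - np.outer(beta, axis), axis=1)  # radial distance to the axis

    # Partition: two radial areas split at half of alpha_max, two zones by sign of beta.
    a_bin = (alpha > alpha.max() / 2.0).astype(int)   # 0 = inner area, 1 = outer area
    z_bin = (beta > 0).astype(int)                    # 0 = below the plane, 1 = above

    o = np.zeros((len(neighbors), 4))
    o[np.arange(len(neighbors)), a_bin * 2 + z_bin] = 1.0  # one-hot over S_a * S_z bins
    return o
```

Each row of the returned matrix is the binary vector o_i^k for one neighbor, with exactly one entry set to 1.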
For each neighboring point p_i^k with its corresponding feature f_i^k, an aggregated edge feature ḟ_i^k is computed by applying a shared MLP, with a batch normalization layer and a ReLU activation function, to the concatenated edge attributes, where [·] denotes the concatenation operation.

2) Elevation Attentive Pooling: Given the gathered KNN features Ḟ_i = {ḟ_i^1, ḟ_i^2, …, ḟ_i^k}, it is important to design an ensemble block for integrating such features. Meanwhile, extensive work has proved that elevation information derived from LiDAR data plays an important role in the classification task. Instead of using attention weights generated directly from the z coordinate to adjust the final feature map per channel [33], we aim to extract an effective feature representation in the local neighborhood based on geometric distance and the z coordinate. Therefore, an EAP block is proposed to adaptively aggregate Ḟ_i using the attention mechanism.

As illustrated in Fig. 2, to form aggregated features, the EAP first performs a dot product between the z coordinates of the neighboring points and the relative distances computed from their 3-D coordinates using the Euclidean distance, where x_i^k, y_i^k, and z_i^k are the relative coordinates in the Cartesian coordinate system; this yields the encoded elevation feature. The encoded elevation feature is then concatenated with the neighbor features, where [·] denotes the concatenation operation, and a shared MLP is used to form the attentive scores.

Followed by a softmax function, the attentive scores s_i^k are computed and used to weight the neighbor features.

Thereafter, another matrix multiplication of A and F_C is performed. In addition, a scale parameter μ is applied to progressively increase the weight of long-range dependencies. Thus, for a point cloud P, the augmented feature F_out ∈ R^{N×d} is obtained by scaling the product of A and F_C by μ and adding the result elementwise to the input features, where "+" denotes the elementwise sum and "·" is the matrix multiplication. To summarize, the LH encodes long-range dependencies to aggregate geometric features. Specifically, after delivering such information (blue dotted arrow in Fig. 3), a point can automatically capture important geometric patterns coming from p_n. As depicted in Fig. 3, p_3 receives features from p_n, thereby enriching its semantic representation.
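The two aggregation steps above can be sketched in a few lines of NumPy. This is a hedged sketch, not the authors' implementation: the shared MLP that produces the score logits is not reproduced in the text (here the logits are simply passed in), and which feature map receives the residual sum in the LH module is an assumption based on the description.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(feats, score_logits):
    """EAP-style pooling: aggregate K neighbor features (K, d) with attention
    scores normalized by a softmax over the K neighbors.

    `score_logits` (K, d) stands in for the output of the shared MLP that
    produces the attentive scores."""
    s = softmax(score_logits, axis=0)   # attention scores over the K neighbors
    return (s * feats).sum(axis=0)      # weighted sum -> aggregated feature (d,)

def lh_augment(F, A, F_C, mu):
    """LH-style long-range augmentation: scale the attention-weighted features
    A @ F_C by mu and add them elementwise back to the features F.
    (Adding back to F rather than F_C is an assumption.)"""
    return mu * (A @ F_C) + F
```

With uniform logits, `attentive_pool` reduces to plain average pooling over the neighborhood, which makes the role of the learned scores easy to see.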

The ISPRS Vaihingen dataset, obtained by the Leica ALS50 system in Germany, is used to evaluate the proposed method [13]. The information of each point is given by {x, y, z, intensity, number of echoes, total number of echoes}. In addition, a multispectral orthophoto image with a resolution of 0.02 m corresponding to IR-R-G in this area is provided. By sampling the band values of the raster pixel corresponding to each point, color information can be obtained, as shown in Fig. 5(a). In particular, we use spectral features while training our network. The dataset has already been divided into a training set and a test set by the organizer. The training set contains 753 876 points, and the test set contains 411 722 points. The points are classified into nine categories: powerline, low vegetation, impervious surfaces, car, fence/hedge, roof, facade, shrub, and tree.

The F1 score is calculated as F1 = 2 × precision × recall / (precision + recall), in which the precision rate and the recall rate are calculated as precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

2) Implementation Details: The proposed model is implemented using the framework of TensorFlow 1.14.0. We use an RTX 2080 Ti GPU to train the network model, which took about 10 h. During the training stage, the Adam optimizer is used to optimize the model with an initial learning rate of 0.005. We employ cosine decay to adjust the learning rate, and the number of decay steps is 30 000. The batch size, the decay rate, and the max epoch are set to 6, 0.5, and 1000, respectively. In our encoder-decoder network, we set the number of neighbor points for upsampling and downsampling to 32. In the construction of the KNN graph, we use the setting K = 16. These model parameters are derived from a series of comparative experiments.

As shown in Fig. 6, it seems that the categories of most points in the test data scene can be correctly identified. As depicted in Fig. 7, the major misclassified points appear at the border between hedges and shrubs. In particular, our model can accurately distinguish between roof and tree categories even though they are similar in height and have no apparent boundaries.
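The evaluation metrics above (per-class precision, recall, F1 score, and overall accuracy) can all be derived from a confusion matrix, as in this short sketch; the function name and matrix convention are ours, not the paper's.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute per-class precision, recall, F1, and overall accuracy.

    cm[i, j] = number of points of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # TP / (TP + FP), per class
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # TP / (TP + FN), per class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / cm.sum()                            # overall accuracy
    return precision, recall, f1, oa
```

The average F1 score reported in the tables is then simply `f1.mean()` over the nine classes.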

In addition, the quantitative results are displayed in a confusion matrix (see Table II), which shows that the F1 scores of five out of the nine categories are above 75%. The categories with better classification accuracy include low vegetation [Fig. 9(a)].

On the other hand, thanks to the introduced spectral features, roofs, impervious surfaces, and facades can be effectively identified. The line graphs in Fig. 9(b) present the 128-D features of points of each category. Accordingly, it can be concluded that the proposed DA-Net can capture high-level discriminative features for the semantic segmentation of point clouds. These discriminative features enable our method to work robustly in complex scenarios.

As shown in Table III, when K is set to 16, our method achieves the best result. When K is greater than 16, the average F1 score decreases significantly as the parameters of the DA module increase; in this case, redundant information is introduced.

The results show that adding the LH module helps to boost the classification performance. Specifically, most categories benefit from the introduction of the LH module in terms of the F1 score. This is because the LH helps the DA-Net take long-range dependencies into account, thereby correcting some potential classification errors.

The comparison results are listed in Table VI, in which OA and average F1 scores are shown in the last two columns. As indicated in Table VI, the proposed method outperforms all the other methods in terms of OA and average F1 score. Specifically, the OA and average F1 score are 0.7% and 3.0% higher than the best of the other methods, respectively. Note that it achieves state-of-the-art results in six categories (powerline, impervious surfaces, car, roof, facade, and tree).

In comparison, NANJ2 [20] achieves the best results in the two categories of low vegetation and shrub. WHUY4 [19] is 17.4% higher than ours in terms of the F1 score of fence/hedge. However, it does not achieve the best performance in terms of OA and average F1 score. It seems that these methods tend to precisely classify a specific category rather than all categories, resulting in inferior performance on the average F1 score. One can find that our method produces more consistent results across most categories, as a weighted loss is applied to deal with the imbalance of the data. On the other hand, the proposed DA-Net achieves excellent results in categories with significant elevation differences, such as powerline, roof, and car. One possible reason is that the EAP amplifies differences in height distribution and, therefore, improves the recognition ability.

Furthermore, the proposed DA-Net is compared with other advanced deep learning models, including [32] and GACNN [54]; OA and average F1 scores are also listed in the last two columns. As illustrated in the table, the proposed DA-Net has the highest F1 score in five categories (low vegetation, impervious surfaces, car, roof, and tree), outperforming the other models by 1.6%, 1.1%, 1.0%, 1.0%, and 1.0%, respectively. The F1 score of our method on powerline is slightly lower than that of GACNN [54]. A possible reason is that that method introduces a density attention unit, which greatly benefits this category. The A-XCRF model [36] adopts a postprocessing technology to deal with overfitting; thus, it achieves better F1 scores in the two categories of fence/hedge and shrub, which are difficult to distinguish. GADH-Net [33] improves the classification result of fence/hedge (44.2%) by learning discriminative geometry representations. Nevertheless, the proposed DA-Net still achieves the best results in terms of OA (85.9%), which validates the effectiveness of the DA and LH modules.

As shown in Fig. 10 and Table X, the proposed DA-Net achieves the best performance, with an average F1 score of 83.3%. In addition, it improves upon the best results of the other models in the high vegetation and building categories by 0.1% and 0.6%, respectively. The F1 scores of high vegetation and building by RandLA-Net are slightly lower.

In this article, a DA-Net is proposed for the semantic segmentation of airborne LiDAR point clouds. To better capture local topological structure features, a DA module is designed, composed of an AER and an EAP. Between them, the AER constructs a spatial representation that is robust to orientation and position, while the EAP focuses on elevation information to enhance local structural features. Finally, an LH module is designed to explore long-range dependencies. By recursively employing the DA and LH modules, an end-to-end architecture that can be directly applied to raw LiDAR point clouds is constructed.

Xia Tao received the bachelor's degree from Jiangsu Normal University, Xuzhou, China, in 2020. She is currently pursuing the master's degree in photogrammetry and remote sensing with the Key Laboratory of Virtual Geographic Environment, School of Geography, Nanjing Normal University, Nanjing, China. Her research interests include deep learning and high-resolution remote sensing image change detection.

Yaqin Zhou received the bachelor's degree from Nanjing Forestry University, Nanjing, China, in 2020. She is currently pursuing the master's degree in cartography and geographical information engineering with the Key Laboratory of Virtual Geographic Environment, School of Geography, Nanjing Normal University, Nanjing.

Her research interests include dense matching of heterologous satellite imagery.