OFFS-Net: Optimal Feature Fusion-Based Spectral Information Network for Airborne Point Cloud Classification

Airborne laser scanning (ALS) point cloud classification is a necessary step for understanding 3-D scenes and applying them in various industries. However, classification accuracy and efficiency remain limited because 1) existing point cloud classification methods lack effective filtering of the large number of traditional handcrafted features, and 2) ALS point cloud classification suffers from significant category imbalance and coordinate scale problems. To address these problems, this article proposes an airborne LiDAR point cloud classification method based on a deep learning network with optimal feature fusion of spectral information. The method involves the following steps. First, multiscale point cloud features are extracted and filtered with the random forest method, and spectral information is fused to obtain a point cloud feature dataset with fewer but more informative features. Second, to adapt to the characteristics of airborne point clouds, the improved RandLA-Net retains the advantages of random sampling while learning deeper semantic information by fusing the constructed point cloud features with the local feature aggregation module of the network. Third, four fusion models are constructed to verify the effectiveness of the optimal feature fusion-based spectral information network (OFFS-Net) for airborne point cloud classification. Last, these models are trained and tested on the Vaihingen 3-D dataset. OFFS-Net achieves an overall accuracy of 84.9% and an average F1-score of 72.3%, outperforming mainstream methods. This also validates that the proposed OFFS-Net classification method, which exploits the complementary advantages of geometric features and spectral information, performs well.


I. INTRODUCTION
With the development and popularity of light detection and ranging (LiDAR) technology, point cloud data have become easier to obtain. Airborne laser scanning (ALS) is a spatial information acquisition technology for rapidly acquiring 3-D point clouds of ground objects. It has been widely used in many fields, such as forest monitoring [1], [2], powerline detection [3], and 3-D building reconstruction [4], [5]. High-precision classification of point clouds is the basis for understanding and analyzing these 3-D scenes. However, ALS point cloud classification is a challenging problem and a current research hotspot due to the disorder, irregularity, and nonuniform density of the original point cloud. Nowadays, a large number of methods have been proposed for airborne LiDAR point cloud classification. These methods are mainly divided into the following two types.
1) Machine learning point cloud classification algorithms. These methods use single or multiple shallow machine learning algorithms, such as support vector machines [6], [7], AdaBoost [8], random forests (RF) [9], Markov random fields [10], and conditional random fields [11], [12]. They focus on the selection and design of features for point cloud classification. However, these features are generated from the local domain of each point, and the machine learning algorithms use only these local features as the input to classify each point individually, ignoring the spatial correlation between neighboring points. Moreover, such algorithms rely considerably on the researcher's a priori knowledge, which is not satisfactory when faced with complex scenes in airborne LiDAR point cloud classification.
2) Deep learning point cloud classification algorithms. The significant success of deep learning models in applications, such as speech recognition, image semantic segmentation, and natural language processing, has attracted the attention of point cloud classification researchers. Unlike machine learning methods, deep learning can learn features directly during the training process. Most early deep learning point cloud classification studies transformed point clouds into regular 2-D images [13], [14] or 3-D voxels [15], [16]. However, this transformation caused computational inefficiency, high memory consumption, and the loss of 3-D spatial information and inherent physical attributes. At present, many deep learning models are applied directly to the original point cloud. PointNet [17] did pioneering work and was the first to learn features point by point; the global features of the point cloud are integrated with max pooling, but local feature information cannot be extracted effectively. PointNet++ [18] is an improved structure that enhances the extraction of local information based on PointNet. PointConv [19] proposes a spatially continuous convolution method that can effectively reduce memory consumption. PointCNN [20] attempts to learn the X-transform convolution operator to transform the disordered point cloud into a corresponding regular sequence, and then uses a convolutional neural network (CNN) structure to extract local features. The segmentation prediction and guidance [21] network converts the input point cloud into geometrically simple hyperpoint structures by geometrically homogeneous partitioning of the 3-D point cloud and constructs a graph neural network to handle semantic segmentation of large-scene point clouds. 3D-CTN [22] proposes a novel layering framework that combines convolution, with efficient local feature learning, and the transformer, with superior global information integration, for point cloud classification. Although the aforementioned methods can extract local features and achieve good performance in related fields, some of them have excessive memory consumption or high computational complexity. This prevents the design of deeper network structures, which results in insufficient semantic information extraction, especially in large-scale point cloud scenes.
For ALS point clouds, it is difficult to achieve fine classification of terrain with sparse 3-D coordinates alone, which still need to be supplemented by traditional features. Shi et al. [23] proposed a point cloud feature classification method based on PointNet that fused point clouds and remote sensing images. Zhao et al. [24] proposed an airborne point cloud classification method based on deep residual networks, which used four shallow features to construct multiscale point cloud features: 1) normalized elevation, 2) surface change rate, 3) intensity, and 4) vegetation index. Li et al. [25] introduced geometric matrix features in the graph convolution method to enhance the description of point cloud geometric features; in this method, the spectral information from airborne multispectral LiDAR point clouds was also used to further improve the accuracy. Widyaningrum et al. [26] supplemented the airborne point cloud information with color information from aerial orthophoto images based on the DGCNN. Li et al. [27] extracted 40-D features by setting five scaled spherical neighborhoods, and subsequently used a back-propagation neural network to achieve point cloud semantic segmentation. Dai et al. [28] proposed a method based on PointNet++ for airborne LiDAR point clouds by fusing geometric CNNs with multiple shallow-level features. Li [29] proposed an attention map geometric convolution operator for extracting spatial geometric structure features, constructing feature pyramids to integrate features at different scales, while incorporating airborne multispectral LiDAR data for point cloud classification. Wang [30] proposed a weakly supervised framework model based on deep learning while incorporating spectral information for point cloud classification. Researchers have proposed various features for point cloud classification but have lacked a selection process for these features. If all features are directly input to the network for point cloud classification, the running time increases because of the higher feature dimensionality of the sample data. Furthermore, the addition of irrelevant features does not effectively improve the classification accuracy of the point cloud.
Based on the above problems, this article discusses the following two aspects. 1) To overcome the excessive memory consumption and insufficient semantic information extraction of existing networks, this article adopts the RandLA-Net network proposed by Hu et al. [31] in 2020. The network uses random sampling to reduce memory consumption and computational complexity, and it introduces an attention mechanism to integrate local features so that segmentation can be performed in large-scale point cloud scenarios.
2) To resolve irrelevant feature inputs, this article utilizes an RF approach to screen the optimal point cloud feature, which reduces the feature dimensionality and the number of irrelevant input features. Thus, this article proposes an airborne LiDAR point cloud classification method using optimal point cloud features fused with spectral information in a deep learning network (OFFS-Net). The specific contributions are as follows.
1) In this article, we propose to use the RF method to filter various point cloud features, including geometric features, spatial distribution features, and other features. The filtering provides the optimal feature, and spectral information is fused with it to construct a point cloud feature dataset with fewer but better features.
2) The proposed network not only uses random sampling in the encoding stage to reduce computational complexity and memory consumption but also uses nearest-neighbor interpolation in the decoding stage to further improve computational and memory efficiency. The deep learning framework is also improved for airborne point clouds, and methods such as coordinate scale processing are added to the network to enhance the applicability of the model to airborne LiDAR point clouds.

II. METHOD
It is difficult to achieve fine classification relying only on the sparse 3-D coordinates of airborne LiDAR point clouds, which still need to be supplemented with hand-designed features. To reduce the input dimension of the sample data and effectively achieve fine point cloud classification, we first construct multiscale point cloud feature sets. Subsequently, we use the RF method to filter out the optimal feature of the point cloud. Finally, we fuse the spectral information.
The optimal feature and spectral information are input to a neural network with an encoder-decoder structure to verify the effect of the proposed method on the improvement of point cloud classification accuracy.
A. Feature Selection With RF

1) Point Cloud Feature Statistics: Compared with aerial image data, airborne LiDAR point cloud data have higher dimensionality and can express both 3-D spatial and 2-D planar information. In addition to features with an actual physical meaning, geometric features can be extracted from the airborne point cloud data. In this article, four types of features participate in the optimal feature selection: elevation-related features, plane-related features, spatial distribution features, and echo features. Spectral information serves as a supplement to the selected optimal feature.
a) Elevation-related features: The advantage of 3-D point cloud data over traditional 2-D data in spatial environment perception is obvious, and the elevation Z can be obtained directly from the point cloud data. Table I shows the specific expressions and practical meanings, where z_i denotes the elevation of the ith neighborhood point, and z̄ denotes the average elevation.
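As an illustration of how such neighborhood-based elevation features can be computed, the following minimal Python sketch (using NumPy and SciPy; the function name, the chosen statistics, and the default radius are illustrative assumptions, not the exact Table I definitions) derives a few per-point elevation statistics from a spherical neighborhood.

import numpy as np
from scipy.spatial import cKDTree

def elevation_features(points, radius=0.8):
    # points: (N, 3) array of x, y, z coordinates.
    # For every point, gather its spherical neighborhood and compute simple
    # elevation statistics: local height range, elevation variance, and
    # height above the local minimum (a crude stand-in for a normalized height).
    tree = cKDTree(points)
    feats = np.zeros((len(points), 3))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, r=radius)
        z = points[idx, 2]
        feats[i] = (z.max() - z.min(), z.var(), p[2] - z.min())
    return feats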
b) Plane-related features: The plane-related features reflect the fact that different ground objects have different planar characteristics. Table II shows the expressions and practical meanings of the plane-related features. Ax + By + Cz + D = 0 denotes a fitted plane obtained according to the least-squares principle, Γ is the matrix composed of the coefficients of the plane equation, and I is a column vector of ones.
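The plane fitting itself can be sketched as follows (a minimal Python example; the eigenvalue-based surface-variation measure is a common planarity proxy and is an assumption, since the exact Table II expressions are not reproduced here).

import numpy as np

def plane_features(neigh):
    # neigh: (K, 3) neighborhood of a query point (K >= 3).
    # Fit a least-squares plane Ax + By + Cz + D = 0: the eigenvector of the
    # smallest covariance eigenvalue is the plane normal, and the smallest-
    # eigenvalue ratio serves as a surface-variation measure.
    centroid = neigh.mean(axis=0)
    cov = np.cov((neigh - centroid).T)
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = eigvec[:, 0]
    d = -normal.dot(centroid)
    surface_variation = eigval[0] / eigval.sum()
    return normal, d, surface_variation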
d) Echo features: The echo features mainly refer to the fact that the LiDAR echo counts differ among ground objects. Table III shows the specific expressions and practical meanings of these indices, where N_s is the number of single echoes, N_l is the number of last echoes, N_f is the number of first echoes, N_i is the number of intermediate echoes, and N_all is the total number of points in the search neighborhood that form the point set.
e) Spectral information: The optical sensor carried with the airborne LiDAR can provide remote sensing images with rich textural and spectral information, which makes it easier to identify ground object characteristics. Point cloud data with spectral information are obtained by aligning and fusing the airborne point cloud with the remote sensing images. In this article, the spectral information is used as supplementary information to verify whether integrating spectral information on the basis of the optimal feature substantially improves the point cloud classification results.
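One simple way to perform such a fusion, assuming a georeferenced orthophoto in the same coordinate system as the point cloud, is a nearest-pixel lookup, as in the following Python sketch (the function name, the (NIR, R, G) band order, and the image parameters are illustrative assumptions; the article's exact alignment procedure is not specified here).

import numpy as np

def attach_spectral(points, ortho, origin, gsd):
    # points: (N, 3) LiDAR coordinates; ortho: (H, W, 3) orthophoto array
    # ordered (NIR, R, G); origin: (x0, y0) map coordinates of the
    # upper-left pixel; gsd: ground sampling distance in metres.
    cols = np.clip(((points[:, 0] - origin[0]) / gsd).astype(int), 0, ortho.shape[1] - 1)
    rows = np.clip(((origin[1] - points[:, 1]) / gsd).astype(int), 0, ortho.shape[0] - 1)
    return np.hstack([points, ortho[rows, cols]])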
2) Importance Screening of Point Cloud Features Using the RF Algorithm: The purpose of feature selection is to identify key features from the point cloud feature set and remove irrelevant or redundant features. This step reduces the feature dimensionality and, thus, improves the model training speed and learning performance [32], [33]. In this article, we use the RF algorithm to evaluate the importance of features. The basic idea is to calculate the averaged contribution value of each feature to each tree in the RF and to compare and rank the contribution values of different features. The contribution of a feature to each tree can usually be measured with the Gini index or the out-of-bag (OOB) error as the evaluation index. The algorithm uses the bagging sampling technique, which can effectively reduce the risk of overfitting and is robust to noise with good generalization [34]. In this article, the importance of a feature X is measured with the OOB error in the following steps.
1) For each decision tree, select the corresponding OOB data and calculate the OOB data error, denoted as errOOB1. 2) Randomly add noise interference to feature X of all samples of the OOB data, i.e., the values of the samples at feature X are changed randomly. The OOB data error is recalculated and denoted as errOOB2. 3) Assuming that there are N trees in the forest, the importance of feature X is given by (errOOB2 − errOOB1)/N. This value is indicative of the importance of the feature because, if the accuracy of the OOB data decreases substantially after adding random noise, i.e., errOOB2 increases, the feature significantly impacts the prediction results of the sample, which in turn indicates a relatively high level of importance. The spectral information is obtained by fusing the point cloud with the remote sensing image, whereas the other features are computed from the original point cloud. Therefore, when selecting the optimal feature in this article, only the features other than the spectral information are ranked by importance. Combined with the characteristics of the research area, the neighborhood radii are set to 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, and 1.5 m. The importance of the features at these different scales is ranked using the RF method [35]. Fig. 1 shows the experimental results. It can be gathered that the normalized height (NH) has the highest importance score in the whole point cloud feature dataset; therefore, it is selected as the optimal feature in this article. Besides, if multiple optimal features are required, the features with higher importance scores should be selected as additional information for point cloud classification.
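A close equivalent of this screening step can be sketched with scikit-learn as follows (a minimal example; permutation_importance perturbs features on the supplied data rather than on the true OOB samples, so it only approximates the errOOB2 − errOOB1 scheme described above, and the function name and parameters are illustrative).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def rank_features(X, y, feature_names):
    # X: (N, d) multiscale point cloud features; y: (N,) class labels.
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1)
    rf.fit(X, y)
    result = permutation_importance(rf, X, y, n_repeats=5, n_jobs=-1, random_state=0)
    order = np.argsort(result.importances_mean)[::-1]
    # Highest-scoring feature first; the top entry would play the role of NH here.
    return [(feature_names[i], result.importances_mean[i]) for i in order]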

B. Fusion Model
In this article, four models are used for comparison experiments in order to further verify the effectiveness of optimal feature fusion of spectral information. The original point cloud data are first fed into the neural network input layer, as shown in Fig. 2(a), and subsequently passed through a symmetric encoder-decoder architecture. This architecture will be described in detail in Section II-C. Fig. 2(b) demonstrates the original point cloud fused with optimal features and input to the encoder part. Fig. 2(c) shows the fusion of the original point cloud with spectral information, which is subsequently used as input to the encoder part. Fig. 2(d) indicates that the optimal feature is fused with the spectral information and input to the encoder part.
The network models shown in Fig. 2(a)-(d) are referred to as OFFS-Net(B), OFFS-Net(O), OFFS-Net(S), and OFFS-Net, respectively, where OFFS-Net(B) is the baseline that takes only the original point cloud as input. The OFFS-Net(O) model is used as a validity experiment to verify the contribution of the optimal feature to point cloud classification. Its input consists of the 3-D point cloud coordinates and the optimal feature, NH. The input feature matrix used for the semantic segmentation experiment based on the optimal feature is given as follows:

[X, Y, Z, NH].

The OFFS-Net(S) model is used to verify the validity of spectral information in point cloud classification. Its input consists of the 3-D coordinates and the spectral information of the LiDAR point cloud data. The input feature matrix used for the semantic segmentation experiment based on spectral information is

[X, Y, Z, NIR, R, G]

where (NIR, R, G) represents the spectral information.
The OFFS-Net model is used to verify the effectiveness of fusing the optimal feature with spectral information. Its input consists of the 3-D coordinates, the optimal feature, and the spectral information of the LiDAR point cloud data. The input feature matrix used for the point cloud classification experiment based on optimal feature fusion of spectral information is given as

[X, Y, Z, NH, NIR, R, G].
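For clarity, the four inputs can be assembled as in the following Python sketch (column layout and dictionary keys are assumptions consistent with the descriptions above, not the article's exact matrices).

import numpy as np

def build_inputs(xyz, nh, spectral):
    # xyz: (N, 3) coordinates; nh: (N, 1) optimal feature (NH);
    # spectral: (N, 3) spectral information (NIR, R, G).
    return {
        "OFFS-Net(B)": xyz,
        "OFFS-Net(O)": np.hstack([xyz, nh]),
        "OFFS-Net(S)": np.hstack([xyz, spectral]),
        "OFFS-Net":    np.hstack([xyz, nh, spectral]),
    }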

C. Overall Network Structure
The proposed OFFS-Net model is based on the RandLA-Net semantic segmentation network. Fig. 3 shows the overall network, which can be divided into two parts: 1) the encoder and 2) the decoder. There are four encoding layers in the encoder, and each encoding layer consists of a local feature aggregation (LFA) module and a random sampling (RS) module. The former enables the encoding layer to gain a larger receptive field and increases the point cloud feature dimension (8 → 32 → 128 → 256 → 512). The LFA module is described in detail by Hu et al. [31] and is used directly in this article without further introduction. In large-scale scenes, random sampling is the most appropriate strategy; each time sampling is performed, 1/4 of the points are randomly selected as the point cloud for the next stage. There are four decoding layers in the decoding stage, and the input of each decoding layer consists of two parts: 1) one part comes from the encoder at the same stage, and 2) the other part comes from the output features of the previous decoding layer. First, the k-nearest-neighbor (k-NN) algorithm is used to find the nearest neighbor of each query point, and the features are propagated to the denser point cloud of the previous stage by nearest-neighbor interpolation. Second, the feature dimension of the upsampled points is reduced using a multilayer perceptron (MLP), while the decoded features are stacked with the corresponding features from the encoder stage through a skip connection. Last, the features are output using a linear layer and an activation function.
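The nearest-neighbor feature propagation used in the decoding stage can be illustrated by the following minimal Python sketch (an assumption-level illustration of the idea only; the MLP dimension reduction and the skip connection are indicated in the comments).

import numpy as np
from scipy.spatial import cKDTree

def nn_upsample(coarse_xyz, coarse_feat, dense_xyz):
    # Each point of the denser (previous-stage) cloud copies the feature
    # vector of its nearest point in the coarser cloud; the result is then
    # concatenated with the skip-connected encoder features of the same
    # stage and reduced with a shared MLP (not shown here).
    _, nn_idx = cKDTree(coarse_xyz).query(dense_xyz, k=1)
    return coarse_feat[nn_idx]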
The predicted point labels are obtained using SegHead, which is composed of three linear layers and a dropout layer. The final output has size (N × n_class), where N and n_class represent the number of points and the number of categories, respectively.

D. Airborne Point Cloud Model Design
The mainstream point-based deep learning networks for point clouds are usually designed for indoor targets and outdoor site point clouds, while only a few networks are designed for ALS point clouds. LiDAR point cloud data for indoor or outdoor scenes are usually acquired at close range, which differs from the ALS acquisition method. As a result, there are significant differences in point cloud density, scene complexity, and the area covered by the point cloud. Also, since airborne point clouds cover complex and various types of topographic data, regional differences cause variations in ground object types, making the amount of point cloud data for each type of ground object unbalanced. Therefore, in this article, we use the following methods to improve the adaptability and robustness of the above network for airborne point cloud data.
1) Category Balance Processing: Since airborne LiDAR point clouds cover complex and diverse terrain, regional differences cause changes in ground objects, and different point cloud regions therefore exhibit an imbalance in the amount of data among the various categories. To ensure that the trained network is determined only by the learnable weights of each category and is not affected by the differences in category counts in the initial dataset, we assign a weight to each category of points when the data are read, where c denotes the category label of the point cloud, num_c denotes the total number of points in category c, and num_all is the total number of points. 2) Coordinate Scale Processing: The planar coordinates are leveled and downscaled, and the 3-D coordinates of the point cloud are normalized to avoid the influence of the coordinate scale problem on the features extracted by the network; (X, Y, Z) denote the preprocessed 3-D coordinates.
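A minimal Python sketch of both operations is given below; the inverse-frequency weighting and the min-shift normalization are plausible assumptions, since the exact formulas used in the article are not reproduced here.

import numpy as np

def class_weights(labels, n_class):
    # Inverse-frequency weight per category, built from num_c and num_all.
    counts = np.bincount(labels, minlength=n_class).astype(float)
    counts[counts == 0] = 1.0
    return counts.sum() / (n_class * counts)

def normalize_coords(xyz):
    # Shift the planar coordinates to a local origin and scale Z to [0, 1]
    # so that large projected coordinates do not dominate the learned features.
    out = (xyz - xyz.min(axis=0)).astype(float)
    out[:, 2] /= max(out[:, 2].max(), 1e-6)
    return out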

III. EXPERIMENTS AND ANALYSIS

A. Data Introduction
To verify the validity of the proposed method, experiments are conducted using the publicly available point cloud dataset of the Vaihingen area benchmark provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). The data area is an urban center with lush vegetation in the summer season. The average density of the point cloud over the whole area is 6.7 points/m² at 30% heading overlap and 60% side overlap. Fig. 4 shows the data selected for this article, displayed according to the category information. The blank areas in the figure are due to missing airborne LiDAR scans. There are 753 876 and 411 722 points in the training and test sets, respectively. These points are annotated into nine categories: 1) Powerline, 2) Low_veg, 3) Surface, 4) Car, 5) Hedge, 6) Roof, 7) Facade, 8) Shrub, and 9) Tree, as detailed in Table IV. According to the results of the traditional point cloud feature screening module shown in Fig. 1, the selected optimal feature is NH, and the corresponding result is shown in Fig. 5. The results of fusing the airborne point cloud with the remote sensing images for the experimental data in the survey area are shown in Figs. 6 and 7. Fig. 6(a)-(c) show the training set, the remote sensing images corresponding to the training set, and the training dataset after fusing the remote sensing images, respectively. Fig. 7(a)-(c) show the test set, the remote sensing images corresponding to the test set, and the test dataset after fusing the remote sensing images, respectively.

B. Training Settings
The experimental platform is based on an Intel i7-8700K CPU and an NVIDIA RTX 2080 GPU in an Ubuntu 16.04 environment, using CUDA 9.0 to accelerate GPU computing. The deep learning framework is TensorFlow-GPU 1.11.0 with Python 3.5. During network training, K and N in each batch are set to 16 and 4096, respectively. The number of iterations is 100 epochs, the initial learning rate is 0.01, and the learning rate decay coefficient is 0.5.

C. Evaluation Indicator
The performance of a model is usually evaluated with several commonly used metrics, including the overall accuracy (OA), F1-score, and mean intersection-over-union (mIoU). The OA measures the classification performance of all categories as a whole and is calculated as the ratio between the correctly classified points and the total number of points in the test set. The F1-score measures the classification performance of each category based on the precision and recall of the classification model. The intersection-over-union (IoU) is the ratio between the intersection and the union of two sets and can be interpreted as the ratio between the intersection of the segmentation result and the ground truth and their union. The mean IoU over all categories is denoted by mIoU. Each evaluation index is calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
IoU = TP / (TP + FP + FN)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

D. Analysis of Results

1) Comparison of Fusion Models: Analysis of the four experimental results shows that the proposed OFFS-Net model is optimal in terms of the per-category F1-score of the involved ground object categories, except for Hedge and Shrub, and in terms of the comprehensive evaluation indices OA, Avg.F1, and mIoU. Compared with the base reference experiment OFFS-Net(B), there is an improvement of 3.5% and 3.7% in the OA and mIoU evaluation metrics, respectively, and an improvement of 2.9% in the Avg.F1. As Table VI shows, the proposed method has the longest training time among the four models in the same configuration environment; however, the difference in training time is not significant. Moreover, the OFFS-Net model has the optimal performance with respect to all the comprehensive evaluation indices, which indicates to a certain extent that the proposed method improves the classification accuracy while considering computational efficiency.
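All of the indices used above (OA, per-category F1, Avg.F1, and mIoU) can be computed directly from a confusion matrix such as those discussed next; the following minimal Python sketch (function name and return layout are illustrative) summarizes the calculation.

import numpy as np

def metrics_from_confusion(cm):
    # cm: (C, C) confusion matrix, rows = ground truth, columns = prediction.
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / np.maximum(tp + fn, 1e-9)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-9)
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    oa = tp.sum() / cm.sum()
    return oa, f1, f1.mean(), iou.mean()   # OA, per-class F1, Avg.F1, mIoU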
As Fig. 8 shows, most of the energy of the four experimental confusion matrices is concentrated on the diagonal, and most of the categories achieve acceptable performance. Compared with the other three models, the OFFS-Net energy is more concentrated. It can be observed from the classification error plot in Fig. 9 that the method proposed in this article predicts most points in the test region correctly compared with the other three methods. As seen in the partially zoomed-in views of Fig. 10, the misclassification of the proposed method is considerably reduced on impervious ground and roofs. Therefore, it can be concluded that the optimal feature fusion of spectral information proposed in this article enables the deep learning network to focus more on the local features of the point cloud, and misclassification is effectively reduced. This demonstrates the effectiveness of the proposed method in the semantic segmentation of complex scenes.
The aforementioned comparative experimental results underscore the effectiveness of the proposed OFFS-Net method in terms of computational efficiency and classification accuracy. Table VII shows the evaluation indices for each category. The F1-scores of seven out of nine categories are higher than 60%, so most of the categories can be identified effectively, and the OA is 84.9%. Good results are obtained for the Powerline, Low_veg, Surface, Roof, and Tree categories, and the worst classification results are obtained for the Hedge and Shrub categories. According to the confusion matrix shown in Fig. 8, Hedge points are mostly predicted as Shrub, and Shrub points are mostly predicted as Tree. This confusion may arise because the height of a hedge is similar to that of a shrub, the hedge has a small vertical area, and the fused spectral information is not distinctive. The Shrub misclassification may occur because Shrub and Tree have similar topological and spectral information.
2) Comparison With Other Methods: The OFFS-Net method proposed in this article is based on the RandLA-Net semantic segmentation network, whose high efficiency has already been demonstrated experimentally in the original article [31, Sec. 4.2]. Therefore, only the classification results are discussed in the subsequent analysis in this article.
Comparing our method with other state-of-the-art methods, the official ISPRS website provides different experimental results. The error comparison graph shown in Fig. 11 demonstrates that, for complex ground objects prone to prediction errors, the method proposed in this article can considerably suppress misclassification. Table VIII shows the F1-score for each category as well as the OA and Avg.F1 of the proposed and other methods. Although the NANJ2 model is slightly better than the proposed model in terms of the OA value, it uses RGB information and a large number of features, such as intensity and roughness, as inputs for point cloud classification. In contrast, the algorithm proposed in this article uses only the optimal feature and spectral information as inputs. As far as the evaluation metrics are concerned, the OA value focuses excessively on the main categories and ignores the other categories in the dataset, while the Avg.F1 value reflects the model performance with respect to all categories and is more meaningful for evaluation. The Avg.F1 value obtained by the method proposed in this article is 72.3%, which is 3% higher than that of the state-of-the-art NANJ2 model. The proposed method achieves higher accuracy in most categories, especially in Powerline classification, where the number of points is sparse.
A comparative analysis with other deep learning methods is performed, and the statistical validation results on the Vaihingen (Germany) urban semantic dataset are shown in Table IX.
Compared with other point cloud semantic segmentation algorithms, such as PointNet++, PointSIFT, KPConv, and D-FCN, the algorithm proposed in this article obtains better ground object classification results. The OA and Avg.F1 values of the proposed method are improved by 3.6% and 6.7%, respectively, compared with PointNet++. A comparison of the evaluation indices of different ground objects in Table IX shows that the proposed method achieves the optimal F1-score for the Powerline, Low_veg, Roof, Facade, Shrub, and Tree categories. The F1-scores for the Low_veg, Facade, and Hedge categories are also close to the optimal values. The results obtained in the Car category are not satisfactory. On the one hand, the NH feature of cars is not obvious and the surface area of cars is small; therefore, less spectral information is fused. On the other hand, the training set of the Car category is too small for the network to effectively learn the features of cars.

IV. CONCLUSION
In this article, an OFFS-Net method for airborne point cloud classification was proposed. First, the RF approach was adopted to optimally filter various multiscale features of airborne point clouds and thereby enhance the classification accuracy of different ground objects. Spectral information was incorporated to build a point cloud feature dataset with fewer but better features, which reduced the feature dimensionality and removed irrelevant features. Second, the proposed approach was based on the RandLA-Net semantic segmentation network, which was modified to accommodate the coordinate scale of airborne LiDAR point clouds and the differences in the number of points among categories. This modification enhanced the generalization capability of the model. In addition, the proposed method improved point cloud classification efficiency thanks to its improved computational efficiency and reduced memory consumption. Last, to demonstrate the effectiveness of the proposed method, several experiments were performed on the ISPRS Vaihingen airborne point cloud data. The results showed that the proposed OFFS-Net model outperformed the most popular existing point cloud classification models and achieved advanced classification performance in terms of OA and Avg.F1. However, further research is needed for the edges of ground objects and the detailed parts of complex scenes. In future work, we will try to apply new research from close-range scenarios to ALS point cloud classification, for example, by adding density-aware convolution modules to accommodate the effects of uneven point cloud density distribution and to constrain the boundaries of complex ground objects. The method proposed in this article can also be used to process more complex point cloud data acquired by different sensors.