LiDAR Data Classification Based on Improved Conditional Generative Adversarial Networks

Light detection and ranging (LiDAR) data contain the height of different objects and record the elevation information of ground objects, so they play an important role in land-cover classification. In recent years, deep learning has been widely used in LiDAR data classification due to its strong feature extraction ability. However, deep learning methods usually need sufficient training data to achieve good classification results. To solve this problem, a new classification method for LiDAR data that combines a conditional generative adversarial network (CGAN) with residual units and DropBlock, called RDB-CGAN, is proposed here. The CGAN adds generated samples to the training data to improve classification performance when the number of training samples is relatively small. The residual unit increases the network depth of the generator to improve its generation capability, and uses shortcut connections to transfer the input information directly to the output to counteract the degradation caused by the increased network depth. DropBlock improves the generalization of the network by dropping a whole area with correlated spatial information, so that the network must learn from the remaining features. Experimental results on two different LiDAR datasets show that RDB-CGAN significantly improves the classification performance on LiDAR data compared with several state-of-the-art classification methods.


I. INTRODUCTION
LiDAR (light detection and ranging) is a rapidly developing data acquisition technology. With the increasing demand for LiDAR in the earth observation market, both its application fields and its depth of application are expanding, and it has come to represent a new direction in the field of earth observation [1]. It acquires three-dimensional information accurately, quickly and directly, and is widely used in many practical applications, such as environmental monitoring, topographic mapping, urban three-dimensional modeling and coastal surveying.
Different from ground-object classification based on point cloud data [2], [3], this paper studies the classification of LiDAR raster digital surface model (LiDAR-DSM) data. The LiDAR-DSM is obtained by sampling the data into a regular grid, which represents the height of different ground objects and records their elevation information [4]. It has been applied to a variety of scenes. Lo et al. extracted various parameters and indicators of trees from LiDAR data by using a multi-level morphological active contour algorithm [5]. Zhao et al. effectively detected building areas in LiDAR data [6]. Priestnall et al. studied methods of extracting urban features from LiDAR data [7].
The task of LiDAR data classification is usually pixel-based, and accurate classification is of great significance for distinguishing different land cover categories. Ghamisi et al. used extinction profiles and a composite-kernel support vector machine to study the classification of LiDAR data, and obtained high classification accuracy [8]. Wang et al. combined morphological profiles (MPs) with a convolutional neural network (CNN) to improve the classification accuracy of LiDAR data [9]. Lodha et al. used a support vector machine (SVM) to classify LiDAR data and obtained high accuracy [10]. Xia et al. used a new ensemble classifier to combine hyperspectral with LiDAR data, and verified the effectiveness and potential of the ensemble classifier [11]. Sasaki et al. used a decision tree classifier to study the average height of each type of land cover [12]. He et al. combined spatial transformation and a CNN to determine the best input for improving the accuracy of data classification [13]. Zhou et al. used LiDAR data to study urban land cover classification, and accurately drew an urban land cover map [14]. Wu et al. proposed a multi-scale superpixel classification method based on a LiDAR normalized digital surface model data hierarchy, which significantly improved the extraction accuracy of urban impervious surfaces [15]. Khodadadzadeh et al. used multiple feature learning to fuse hyperspectral and LiDAR data for classification, which made it possible to extract multiple types of features from the fused image [16]. Wang et al. combined SqueezeNet with octave convolution to form a dual neural network, which reduced spatial redundancy and improved classification accuracy and efficiency [17]. Xie et al. first used an automatically designed CNN for the classification of LiDAR data [18]. Hong et al. proposed EndNet, a fully-connected-layer dominated network, for the classification of HSI and LiDAR data [19]. Wu et al. proposed a classification algorithm combining octave convolution with the capsule network (OctConv-CapsNet). It made the most of the spatial information and the high- and low-frequency information to obtain high classification accuracy on LiDAR data [20].
LiDAR data classification methods based on deep learning need a large number of training samples, but data acquisition is difficult in the field of remote sensing and high cost is a common phenomenon. When the number of training samples is limited, the classification accuracy is good in the training stage, but the overfitting problem becomes serious in the test stage. The generative adversarial network (GAN) provides a good way to alleviate the overfitting problem [21].
A GAN consists of a generator (G) and a discriminator (D). The goal of the generator is to fit the real data distribution as closely as possible and generate new data to confuse the judgment of the discriminator. The purpose of the discriminator is to judge whether the input data come from the generator or from the real data. Both G and D are continuously optimized to improve their generation ability and discrimination ability, respectively. This adversarial learning is optimized to find a Nash equilibrium between the generator and the discriminator. GANs have been used in many image processing applications. Zhang et al. put forward stacked generative adversarial networks (StackGANs) to generate high-resolution realistic images [22]. Lee et al. used a GAN to study image super-resolution [23]. Quan et al. used a GAN to realize compressed-sensing image reconstruction, which greatly enhanced the quality of the reconstructed images [24]. Li et al. used an asymmetric GAN to realize unpaired image-to-image translation [25].
Compared with traditional machine learning algorithms, a GAN augments the training data with samples produced by the generator, making the network training more sufficient and further improving the performance of the network. Many scholars have applied GANs and their variants in the field of remote sensing data classification. Zhan et al. proposed the hyperspectral GAN (HSGAN) and applied it to the semi-supervised classification of hyperspectral data; the experimental results proved that the network achieved good classification accuracy with few category labels [26]. Lin et al. proposed multiple-layer feature-matching generative adversarial networks (MARTA-GAN), which realized the unsupervised classification of remote sensing data and significantly improved classification accuracy [27]. Zhu et al. proposed a 1-D GAN using spectral features and a 3-D GAN using both spatial and spectral features of hyperspectral data for classification; the experimental results proved that the classification accuracy of the two networks was better than that of existing classification methods [28]. Wang et al. proposed the Caps-TripleGAN classification model, which combines the capsule network with TripleGAN, and verified its classification performance on hyperspectral data [29].
Although GANs have performed well in the field of image processing, they have some disadvantages, such as overly unconstrained training and uncontrollable image generation. Therefore, Mirza et al. proposed the conditional generative adversarial network (CGAN) [30], adding category labels to the generator and discriminator so that the generated images can be controlled. This paper makes improvements on the basis of CGAN: a residual unit and DropBlock are added to the generator and discriminator, respectively, to further improve the classification accuracy of the model.

II. BACKGROUND
A. CGAN
CGAN develops on the basis of the generative adversarial network, changing GAN's unsupervised learning into supervised learning. The structure of CGAN is shown in Figure 1. Generally, CGAN consists of two parts: a generator model G and a discriminator model D. G captures the potential distribution of the real sample data and generates new data, while D is a binary classifier that judges whether an input sample is real or not. The information flow in CGAN is fed forward from the model G, which generates pseudo data, to the second model D, which evaluates the output of G.
To learn the distribution of G over the real data x, we assume that x is sampled from the real data distribution $p_{data}(x)$ and z is sampled from a prior distribution $p_z(z)$ of input noise variables. G accepts random noise z and additional information y as input and generates a mapping to data space, G(z|y). The additional information y can be category labels or data in other modalities. D(x|y) estimates the probability that x is a real sample from the training data. In the optimization process, D is trained to maximize its ability to discriminate the data source, while G is trained to minimize log(1 − D(G(z|y))). Therefore, the ultimate goal of optimization is to solve the minimax game

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))] \tag{1}$$

Through calculation and evaluation, it is found that when D assigns a high probability to real samples, the gradient in G may vanish and the training process will stall. When the classification accuracy of D is high, in order to ensure that G receives an adequate gradient, the loss function of G usually maximizes the probability of classifying generated samples as true rather than minimizing the probability of classifying them as false.
The objective that D optimizes is

$$L_D = \max_D \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))] \tag{2}$$

and the objective that G optimizes is

$$L_G = \max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z|y))] \tag{3}$$

The training of the model is completed by alternately optimizing formulas (2) and (3): D tries to identify the generated images, while G is responsible for generating pseudo-data images that are as realistic as possible.
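As a concrete illustration, the following PyTorch sketch implements formulas (2) and (3) with the non-saturating generator objective; `G`, `D` and their conditioning interface are hypothetical placeholders, not the exact networks of Section III.

```python
import torch

def d_loss(D, G, x_real, y, z):
    """D ascends log D(x|y) + log(1 - D(G(z|y)|y)), formula (2)."""
    p_real = D(x_real, y)                   # probability that x_real is real
    p_fake = D(G(z, y).detach(), y)         # detach: do not update G here
    return -(torch.log(p_real + 1e-8) +
             torch.log(1.0 - p_fake + 1e-8)).mean()

def g_loss(D, G, y, z):
    """Non-saturating form of formula (3): G ascends log D(G(z|y)|y),
    which keeps gradients alive when D is confident early in training."""
    p_fake = D(G(z, y), y)
    return -torch.log(p_fake + 1e-8).mean()
```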

B. RESIDUAL UNIT
CNNs can extract robust features from complex data and are trained on raw pixel values to obtain the final classification results. Many studies have shown that network depth affects performance to a large extent, and a deep CNN usually has better learning ability and image feature expression ability than a shallow one. However, as the number of CNN layers increases, the number of network parameters also increases explosively. At the same time, exploding or vanishing gradients can occur, which makes the training process unstable, makes the network difficult to converge and finally causes network degradation. He et al. proposed the residual network (ResNet) to solve the degradation problem, whose main contribution is deep residual learning [31]. The residual unit learns an identity mapping, ensures that the identity path from input to output is preserved, and does not increase the number of network parameters.
The structure of the residual unit is shown in Figure 2, where x represents the input and H(x) represents the output; the residual function to be learned is F(x) = H(x) − x. Here, F(x) + x is realized by a shortcut connection that performs element-wise addition. This method neither introduces additional parameters nor increases the computational complexity. It avoids the redundancy caused by superfluous network layers and allows the network to effectively overcome degradation under gradient descent.
The operation of the residual unit is shown in formulas (4)-(6):

$$F(x) = W_2\,\sigma(W_1 x) \tag{4}$$

$$H(x) = F(x) + x \tag{5}$$

$$H(x) = F(x) + W_s x \tag{6}$$

Formula (4) is the residual function to be learned by the network, where σ is the non-linear function ReLU, and W_1 and W_2 are the weights of weight layer 1 and weight layer 2, respectively. Formula (5) is the definition of the residual unit. Formula (6) states that when the dimensions of the input vector x and the residual function F are inconsistent, a linear mapping W_s is applied to x to achieve dimensional matching.
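For illustration, a minimal PyTorch residual unit following formulas (4)-(6) might look like the sketch below; the 3×3 kernels, channel counts and placement of BN are assumptions (the configuration actually used in G is described in Section III-C).

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Sketch of formulas (4)-(6): H(x) = F(x) + x, with W_s when shapes differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(           # F(x) = W2 * sigma(W1 * x), formula (4)
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: 1x1 linear mapping for dimension matching, formula (6)
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))   # formula (5)
```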
C. DROPBLOCK
With the development of deep convolutional neural networks, network models have become more and more complex, which brings the disadvantages of slower convergence and reduced stability. Most methods adopt L1 and L2 norm regularization and weight sharing to improve convergence speed and stability, but these techniques cannot reduce the complexity of the network model and only improve the stability of the local network. For this reason, Hinton et al. proposed Dropout regularization [32], which hides some neurons with a certain probability during training and can thereby significantly reduce the number of effective network parameters and improve convergence speed and stability. However, Dropout does not perform as well in convolutional layers, where spatial information matters, as it does in fully connected layers. Therefore, Ghiasi et al. proposed DropBlock [33], which adopts a structured dropping method to better handle features in convolutional layers.

The processing of a feature map by Dropout and DropBlock is shown in Figure 3. The left image represents the feature map of the LiDAR data, the middle image represents the feature map processed by Dropout, and the right image represents the one processed by DropBlock. Dropout uses random dropping to force the network to learn the remaining features. However, in a convolutional layer the image has spatially correlated information, and randomly dropped features are compensated by the surrounding correlated features and passed into the next convolutional layer. As a result, the network cannot fully learn all the features, leading to misclassification. DropBlock instead randomly drops a contiguous block area in the feature map, and this area contains all the correlated information of a particular feature. After it is dropped, that feature information disappears, forcing the convolutional layer to learn the features of the remaining regions, ensuring correct classification and improving network generalization.

DropBlock has two main parameters: block_size and γ. block_size represents the size of the dropped block; when block_size = 1, DropBlock reduces to regular Dropout. γ is the drop probability used for the Bernoulli sampling of block centers. It is calculated as in formula (7), where keep_prob is the probability of keeping a unit in regular Dropout and feat_size is the size of the feature map:

$$\gamma = \frac{1 - keep\_prob}{block\_size^2} \cdot \frac{feat\_size^2}{(feat\_size - block\_size + 1)^2} \tag{7}$$
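A simplified DropBlock sketch in PyTorch is shown below; it follows formula (7) for γ, assumes an odd block_size, and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    """Simplified DropBlock: drop contiguous block_size x block_size regions."""
    def __init__(self, keep_prob=0.8, block_size=5):
        super().__init__()
        self.keep_prob, self.block_size = keep_prob, block_size

    def forward(self, x):
        if not self.training:
            return x
        _, _, h, w = x.shape
        # gamma from formula (7), generalized to an h x w feature map
        gamma = ((1.0 - self.keep_prob) / self.block_size ** 2 *
                 (h * w) / ((h - self.block_size + 1) *
                            (w - self.block_size + 1)))
        # sample block centers with probability gamma (Bernoulli), then
        # expand each center into a full block with a max-pool
        centers = (torch.rand_like(x) < gamma).float()
        block_mask = F.max_pool2d(centers, self.block_size, stride=1,
                                  padding=self.block_size // 2)
        keep_mask = 1.0 - block_mask
        # rescale the kept activations, as in Dropout
        return x * keep_mask * keep_mask.numel() / (keep_mask.sum() + 1e-8)
```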

III. PROPOSED METHOD
A. ALGORITHM DESCRIPTION
In this paper, CGAN, the residual unit and DropBlock are combined into an improved network structure for the classification of LiDAR data, called RDB-CGAN. The model structure is shown in Figure 4. The improvements mainly include three aspects: (1) Due to the strong feature extraction capability of CNNs in the field of image processing, a CNN is applied to the generator G and discriminator D of CGAN. Fractional-strided convolutions and strided convolutions are used to replace the pooling layers in G and D, respectively, forming a fully convolutional network structure that improves the feature extraction ability of the network.
(2) A deep network usually has better ability to learn and express features than a shallow one. Therefore, residual units are added to G to increase the depth of the network, so as to improve its generation capacity and make the generated data more similar to the real data. Meanwhile, the residual units alleviate the vanishing and exploding gradients caused by the increase in network depth. (3) In order to improve the classification accuracy of the discriminator, DropBlock is added to the network. By dropping a whole area with correlated spatial information, it forces the network to learn the remaining features, so as to achieve correct classification and improve the generalization ability of the network. As shown in Figure 4, G uses both noise and category labels as input to generate fake data. D receives labeled LiDAR data and fake data, and its final outputs include the identification of the data source and the classification of the data.

B. FRAMEWORK OF THE PROPOSED METHOD
Our network architecture is built on the original CGAN framework. Both G and D adopt a fully convolutional structure without pooling layers. Additionally, we add residual units to G to increase the network depth and improve the generation ability of G. In order to improve classification accuracy, DropBlock is added to D. The output of D is divided into a Sigmoid classifier and a Softmax classifier, so the final output includes both the real/fake decision for the data and the classification results for the LiDAR data.
The random noise z and the category label c of the LiDAR data are both input into G to generate fake data, defined as X_fake = G(z|c). The real LiDAR data with the corresponding category label are input into D together with the fake data generated by G. The output of D includes both the real/fake decision and the classification results.
Obviously, the additional category label information improves both the generation ability of G and the classification ability of D. Feeding the fake data generated by G into D not only expands the original training data, but also effectively improves the classification accuracy of D. Finally, when the fake data generated by G are so similar to the real data that D can no longer distinguish them correctly, G can be considered to have fully learned the distribution of the data. The two networks have then reached a Nash equilibrium, and theoretically the global optimum is obtained.
In the training classification stage, we assume that the original training data have N classes. At the beginning, each generated sample is given a label and passed to the discriminator, and the fake samples are trained in the network with these labels. But the generated fake data do not belong to any of the N classes, so we create a new class for them, referred to as the (N+1)-th class.
The loss function of the proposed network consists of two parts: L_S judges whether the data source is real or fake, and L_C classifies the real LiDAR data and the fake data. In the network, the fake data and real data are both input into D, and the Sigmoid classifier and Softmax classifier output the real/fake decision and the LiDAR data classification results, respectively:

$$L_S = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|c)))] \tag{9}$$

$$L_C = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, c)] + \mathbb{E}_{z \sim p_z(z)}[\log D(G(z|c), n + 1)] \tag{10}$$

where D(·) denotes the probability output of the Sigmoid head, D(·, c) denotes the probability that the Softmax head assigns its input to class c, and n + 1 indexes the fake class. For D, the ultimate goal is to maximize L_S + L_C, and the purpose of G is to minimize L_S − L_C.
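A sketch of how this two-head objective could be implemented is given below, again in PyTorch. Here `D` is assumed to return a pair (source probability, class logits over N+1 classes), and the usual practice of minimizing negated objectives stands in for the maximization in the text; this is an illustration, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def d_step(D, G, x_real, c_real, z, n_classes):
    """One discriminator update: maximize L_S + L_C (minimize the negation)."""
    p_src_r, logits_r = D(x_real)
    p_src_f, logits_f = D(G(z, c_real).detach())       # detach: freeze G here
    fake_label = torch.full_like(c_real, n_classes)    # index of the (N+1)-th class
    l_s = -(torch.log(p_src_r + 1e-8) +
            torch.log(1.0 - p_src_f + 1e-8)).mean()    # source term, formula (9)
    l_c = (F.cross_entropy(logits_r, c_real) +
           F.cross_entropy(logits_f, fake_label))      # class term, formula (10)
    return l_s + l_c

def g_step(D, G, c, z):
    """One generator update: fakes should look real and match their label c."""
    p_src_f, logits_f = D(G(z, c))
    return (-torch.log(p_src_f + 1e-8).mean()          # fool the Sigmoid head
            + F.cross_entropy(logits_f, c))            # convince the Softmax head
```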

C. STRUCTURE OF GENERATOR AND DISCRIMINATOR
The generator in this paper adopts a fully convolutional network, uses fractional-strided convolutions instead of pooling layers, and removes the fully connected layer to increase stability. First, the category label c and noise z drawn from a uniform distribution are input into G at the same time; then, after a series of convolution operations, the final output is 64×64×1 fake data. Additionally, we add residual units to G to deepen the network. With more layers, the network learns more features, so that the data generated by G are closer to the real data. The residual unit structure used in this paper is shown in Figure 5. Through the shortcut connection, the network changes from simply learning data characteristics to focusing on residual learning from input to output, thereby generating more realistic data. The residual unit also solves the problem of network degradation caused by deepening the network.
We add batch normalization (BN) to the convolutional layers and residual units to stabilize the learning process of the network. BN mitigates poor initialization and the model collapse problem, and it also ensures gradient propagation into the deep layers of the model. In addition, to maintain the stability of the network, BN is not applied to the last convolutional layer at the output of G. The ReLU activation function is added to the convolutional layers and residual units to improve the learning speed of the network, and the Tanh activation function is used in the last convolutional layer.
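Based on this description (five fractional-strided convolutions plus two residual units, BN and ReLU everywhere except the Tanh output), one plausible PyTorch wiring is sketched below; the channel counts and the multiplicative label fusion are assumptions, and `ResidualUnit` refers to the earlier sketch.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_classes, z_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)     # label -> z_dim vector

        def up(i, o, k=4, s=2, p=1, bn=True):
            layers = [nn.ConvTranspose2d(i, o, k, s, p)]
            if bn:
                layers += [nn.BatchNorm2d(o), nn.ReLU(inplace=True)]
            return layers

        self.net = nn.Sequential(
            *up(z_dim, 512, k=4, s=1, p=0),      # 1x1 -> 4x4
            *up(512, 256),                       # 4x4 -> 8x8
            ResidualUnit(256, 256),
            *up(256, 128),                       # 8x8 -> 16x16
            ResidualUnit(128, 128),
            *up(128, 64),                        # 16x16 -> 32x32
            nn.ConvTranspose2d(64, 1, 4, 2, 1),  # 32x32 -> 64x64, no BN
            nn.Tanh(),
        )

    def forward(self, z, c):
        h = z * self.embed(c)                    # fuse label and noise (assumption)
        return self.net(h.view(h.size(0), -1, 1, 1))
```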
In this paper, the structure of D is similar to that of G; it is also a fully convolutional network. Strided convolutions are adopted instead of pooling layers, and the fully connected layer is removed to increase stability. D receives both the fake data generated by G and the real data as input. After the convolution operations, the features reach the output end, which contains two classifiers, Sigmoid and Softmax; these output the real/fake data source and the classification maps, respectively.

In addition, DropBlock is added to D to make full use of the elevation and spatial information of the LiDAR data. DropBlock has two advantages in extracting LiDAR data features: (1) The data in the convolutional layer have strong spatial correlation. When the feature map is processed by ordinary random dropping, the dropped features are made up by the correlated features of the surrounding neighborhood, so the dropped information is still passed to the next convolutional layer, making random dropping ineffective. DropBlock instead randomly drops a whole area containing all the spatially correlated features of a region, eliminating all relevant features and forcing the convolutional layer to learn the remaining ones. (2) A phenomenon often occurs when the convolutional layer extracts features: many different feature maps containing overlapping regions are fed into the convolutional layer, but the features extracted from the overlapping regions are basically the same, which results in feature redundancy. When DropBlock processes different feature maps, by randomly dropping a unit block in the overlapping area, completely different spatial features are left behind. Therefore, it not only removes the redundancy of features but also increases their diversity. Additionally, each time DropBlock processes a feature map, the selected block area varies, so the features dropped from the same feature map differ across training stages, and so do the remaining features. This allows the spatial features to be fully learned and utilized. Based on these two advantages, DropBlock can effectively improve classification accuracy.

Similar to G, we apply BN to all convolutional layers except the output to improve training performance. We also use the LeakyReLU activation function in D, an improved version of ReLU that performs better here.
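A matching sketch of D (four strided convolutions, two DropBlock layers, LeakyReLU, no fully connected layers, and the two output heads) is shown below; channel counts and the positions of the DropBlock layers are assumptions, and `DropBlock2d` refers to the earlier sketch.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1),                        # 64x64 -> 32x32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1),                      # 32x32 -> 16x16
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            DropBlock2d(keep_prob=0.8, block_size=5),
            nn.Conv2d(128, 256, 4, 2, 1),                     # 16x16 -> 8x8
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            DropBlock2d(keep_prob=0.8, block_size=5),
            nn.Conv2d(256, 512, 4, 2, 1),                     # 8x8 -> 4x4
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
        )
        # convolutional heads instead of fully connected layers
        self.src = nn.Sequential(nn.Conv2d(512, 1, 4), nn.Sigmoid())
        self.cls = nn.Conv2d(512, n_classes + 1, 4)           # N+1 class logits

    def forward(self, x):
        h = self.features(x)
        # Softmax is applied implicitly by cross_entropy during training
        return self.src(h).view(-1), self.cls(h).view(x.size(0), -1)
```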

IV. EXPERIMENTS
A. DATA DESCRIPTION
In this paper, two LiDAR datasets were used to evaluate the performance of the proposed classification algorithm: the Bayview Park dataset and the Recology dataset. Both come from the 2012 IEEE GRSS Data Fusion Contest and were collected in the city of San Francisco, CA, USA. Figure 6 shows the Bayview Park dataset, which consists of 300 × 200 pixels with a spatial resolution of 1.8 m. It contains seven land classes: building1, building2, building3, road, trees, soil and seawater. Figure 7 shows the Recology dataset, which consists of 200 × 250 pixels with a spatial resolution of 1.8 m. It contains eleven land classes: building1, building2, building3, building4, building5, building6, building7, trees, parking lot, soil and grass.

B. EXPERIMENTAL SETUP
In order to evaluate the classification performance of the proposed RDB-CGAN network, experiments were conducted with two classical machine learning methods, SVM and Random Forest, and four deep learning methods, CNN, GAN, ResNet and ResCapNet. In the experiments, the total number of samples is 5000, selected randomly to ensure credibility. The samples are divided into training and testing sets. When selecting the training sets, we randomly select samples from each class; the total training set size is 400, 500, 600 or 700, and the rest are used for testing. To reduce computation, the input size is 64 × 64. The datasets were linearly mapped to [−0.5, 0.5] before input. In addition, the classification of the LiDAR data is pixel-based, as in the interpretation of remote sensing images, so the input for each pixel is its neighborhood.
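The input preparation described above could look like the following sketch: a linear mapping of the raster to [−0.5, 0.5] and extraction of the 64 × 64 neighborhood centered on each labeled pixel, with edge padding at the borders. Function names are illustrative.

```python
import numpy as np

def normalize(dsm):
    """Linearly map a LiDAR-DSM raster to [-0.5, 0.5]."""
    dsm = dsm.astype(np.float32)
    return (dsm - dsm.min()) / (dsm.max() - dsm.min()) - 0.5

def extract_patch(dsm, row, col, size=64):
    """Return the size x size neighborhood centered on (row, col), edge-padded."""
    half = size // 2
    padded = np.pad(dsm, half, mode="edge")
    # (row, col) in the original raster maps to (row + half, col + half)
    # in the padded raster, so this window is centered on the target pixel
    return padded[row:row + size, col:col + size]
```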
The structures of G and D in the RDB-CGAN model are shown in Table 1. G consists of five convolutional layers and two residual units. Its input includes the N category labels and uniformly distributed noise z (of size 100). The category label is passed through an embedding layer, connected to the noise, and reshaped into a 100 × 1 × 1 three-dimensional tensor, which is input into G; the eventual output is 64×64×1 fake data. D consists of four convolutional layers and two DropBlock layers. Its input includes the data generated by G and the real data, and its output includes the real/fake decision and the classification results for the LiDAR data. We used the Adam gradient optimization algorithm, where the initial learning rates for the Bayview Park dataset and the Recology dataset were set to 0.0001 and 0.0002, respectively. The batch size was set to 64, the number of training epochs was set to 300, and keep_prob and block_size in DropBlock were set to 0.8 and 5, respectively.
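Under these settings, the optimizer setup might be sketched as follows (the `Generator`/`Discriminator` classes refer to the earlier sketches; the Adam beta values are assumptions):

```python
import torch

n_classes = 7                      # Bayview Park; 11 for Recology
G = Generator(n_classes)
D = Discriminator(n_classes)
lr = 1e-4                          # 1e-4 for Bayview Park, 2e-4 for Recology
opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
batch_size, epochs = 64, 300       # as stated above
```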
The selection of these parameters is the result of experimental verification. In the experiments, the search ranges of the two parameters were keep_prob ∈ {0.6, 0.7, 0.8, 0.9} and block_size ∈ {3, 4, 5, 6, 7}. As different keep_prob and block_size values were selected, the OA generally rose and then fell. The best OA results were obtained when keep_prob is chosen as 0.8 and block_size is chosen as 5.
The kernel function of the SVM was set to the radial basis function (RBF), the RBF coefficient was left at its default value of ''auto'', and the penalty parameter of the error term was set to 100. The number of estimators of the Random Forest was set to 30 for both datasets. When training the GAN model, the initial learning rate for the Bayview Park dataset and the Recology dataset was set to 0.0002, and when training the ResNet and ResCapNet models, the initial learning rate was set to 0.001.
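With scikit-learn, the stated baseline settings correspond to something like the following (training-array names are placeholders):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm = SVC(kernel="rbf", gamma="auto", C=100)       # RBF kernel, penalty C = 100
rf = RandomForestClassifier(n_estimators=30)       # 30 estimators per dataset
# svm.fit(X_train, y_train); rf.fit(X_train, y_train)
```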

C. EXPERIMENTAL RESULTS AND ANALYSIS
The experiments were carried out under the Windows operating system, based on the open-source deep learning framework PyTorch, with Python as the programming language. The experimental equipment includes a 3.2 GHz CPU and an NVIDIA GTX 1060 graphics card. We adopted the overall accuracy (OA), average accuracy (AA), kappa coefficient (K), per-class classification accuracy on the two datasets and classification maps to evaluate the performance of the proposed classification model. Table 2 and Table 3 provide the classification results of different methods on the Bayview Park and Recology datasets when 400, 500, 600 and 700 training samples are selected, respectively.
In order to ensure the accuracy of the experiments and the stability of the model, training and testing were performed 10 times on each of the two datasets, and the mean and standard deviation of the results were calculated.
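The aggregation over the 10 runs reduces to a simple mean and standard deviation, e.g.:

```python
import numpy as np

oa_runs = np.zeros(10)  # placeholder: fill with the 10 per-run OA values (%)
print(f"OA = {oa_runs.mean():.2f} +/- {oa_runs.std():.2f} %")
```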
It can be seen that the discriminator effectively uses the fake samples generated by the generator and makes full use of DropBlock to learn the features of all classes. RDB-CGAN therefore achieved the best OA, 96.98 ± 0.96% for the Bayview Park dataset and 97.25 ± 0.82% for the Recology dataset. For the Bayview Park dataset, the OA achieved by RDB-CGAN was 20.92%, 7.06%, 6.84%, 4.40%, 2.31% and 0.77% higher than SVM, Random Forest, CNN, GAN, ResNet and ResCapNet, respectively. For the Recology dataset, the OA achieved by RDB-CGAN was 19.23%, 5.97%, 4.67%, 3.36%, 1.44% and 0.94% higher than SVM, Random Forest, CNN, GAN, ResNet and ResCapNet, respectively. Figure 8 and Figure 9 show the comparison results of the different classification methods when 400, 500, 600 and 700 training samples were selected from the two datasets. It can be intuitively seen from the figures that the proposed method has the best classification accuracy. Table 4 and Table 5 provide the per-class classification accuracy of the different methods on the Bayview Park and Recology datasets. According to the per-class results shown in the two tables, RDB-CGAN achieved high classification accuracy for the taller classes in the two datasets and significantly improved the accuracy for the lower classes, such as seawater, parking lot and grass. Figure 10 and Figure 11 show the per-class comparison of the different methods on the two datasets. It can be seen that RDB-CGAN achieved nearly the highest accuracy in every class. Figure 12 and Figure 13 show the classification maps of the two datasets produced by the different methods. As a subjective evaluation, the classification maps show the classification effect more intuitively. It can be seen from the maps that, compared with traditional machine learning methods such as SVM and Random Forest, the deep learning methods classify much better and the misclassified areas are greatly reduced. It can also be observed that the method proposed in this paper is more effective in classifying objects of lower height, such as seawater, soil and grass. Therefore, combining the classification accuracy and the classification maps, it can be concluded that RDB-CGAN classifies LiDAR data better than the other methods.

In addition, in order to further demonstrate the advantages of the proposed method, we conducted longitudinal comparative experiments. Table 6 shows the per-class results on the Bayview Park and Recology datasets when the number of training samples is 700. It can be seen that RDB-CGAN has the highest accuracy for each class on both datasets.

V. CONCLUSION
In this paper, a new RDB-CGAN model was proposed to classify LiDAR data. The model combines CGAN, the residual unit and DropBlock to effectively improve the classification accuracy of LiDAR data. The residual unit is introduced into the generator to alleviate the network degradation caused by increasing the network depth, and its shortcut retains complete feature information to further improve the generation capacity of G. Moreover, the fake samples generated by the generator are put into the training set, which effectively expands the amount of data and allows the network to be fully trained. Using DropBlock in the convolutional layers to randomly drop regional features, the discriminator can fully learn the features of each class and effectively improve the classification accuracy.
We carried out experiments on two LiDAR datasets and compared the proposed method with six known methods to verify its effectiveness. The experimental results indicated that when the number of training samples was 700, RDB-CGAN achieved 96.98% and 97.25% OA on the Bayview Park and Recology datasets, respectively, which was better than the other classification methods. In the future, we may introduce GAN and its variants into the semi-supervised classification of LiDAR data.