3D Point Cloud Classification for Autonomous Driving via Dense-Residual Fusion Network

Compared with the state-of-the-art architectures, using the 3D point cloud as the input of a 2D convolutional neural network without preprocessing restricts the feature expression of the network. To address this issue, we propose a high-precision classification network using bearing angle (BA) images, depth images, and RGB images. With the development of unmanned vehicles, determining how to recognize objects from the information collected by sensors has become important. Our approach takes data from LiDAR and a camera and projects a 3D point cloud into 2D BA images and depth images. The RGB image captured by the camera is used to select the region of interest (ROI) corresponding to the point cloud. However, adding input information alone is not enough to improve the classification ability of general convolutional neural networks. In our approach, we use a Dense-Residual Fusion Network (DRF-Net), which consists of Dense-Residual Blocks (DRBs). The DRF-Net achieves 97.92% accuracy with three input formats on the KITTI raw dataset.


I. INTRODUCTION
Object classification is widely used in various fields, such as biomedicine, production processes, home safety, and elderly care. In recent years, with the development and prospects of advanced driving assistance systems, determining how to make effective use of the information obtained by sensors has become an important issue. Dalal and Triggs [1] propose histograms of oriented gradients (HOG) with a linear SVM for human detection in 2D images. The gradient (magnitude and orientation) of each pixel is calculated, the image is divided into cells, and the gradients in each cell are concatenated to obtain the blockwise HOG descriptor. To acquire HOG-like features, R-CNN [2] is the first to apply a CNN to object detection in 2D images. Features are easier to obtain, and the performance is improved.
The 2D images are usually taken by cameras, which are easily affected by other lighting sources. LiDAR emits a laser beam to a target and obtains the 3D point cloud through its reflection. For 3D point cloud object classification, VoxNet [3] divides a point cloud into voxels and transforms them into available features. The MVCNN [4] achieves state-of-the-art performance by rendering images from different angles of the point cloud and combining the features through view pooling. However, these data representations result in a huge number of calculations. PointNet [5] directly uses the raw data from the point cloud to perform both classification and segmentation tasks. PointNet++ [6], which is the advanced version of PointNet [5], uses a hierarchical neural network to extract local features concatenated with high level features. PointGCN [7] transforms a 3D point cloud into graphs. Using graph signal processing techniques like graph convolution and multi-resolution pooling leads to a better classification performance. Since the graph information of the point cloud plays an important role in the classification accuracy, DGCNN [8] adopts a dynamic strategy that considers both local and global features to update the graph before each edge convolution to reach state-of-the-art performance. PointHop [9] adopts k-nearest neighbors to group points in the point cloud. The points in the same group are divided into eight octants around the group center. Attributes are calculated from each octant to obtain local descriptors, of which the features are further reduced by Saab transform [10]. Furthermore, PointHop updates parameters in a feedforward fashion rather than by backpropagation. By so doing, PointHop can achieve comparable classification performance while requiring much lower training complexity.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although targets can be well classified from 3D point clouds, the amount of data in a point cloud is huge and takes a substantial amount of time to process. Douillard et al. [11] propose segmenting ground points and retaining non-ground points, which not only benefits the subsequent segmentation but also reduces the amount of data in a scene. Recent studies have shown that removing the ground points makes it easier to segment the region of interest (ROI), and that projecting the ROI point clouds into 2D images makes it easier to identify the ROIs and reduces computation costs.
Combining RGB images with point clouds as input has become a trend in classification tasks and has exhibited promising performance. Recent studies demonstrate that combining additional information, such as bird's eye views or depth images, with RGB images can further improve accuracy. Börcs et al. [12] project the ROI point cloud into depth images, which show the outline. Lin et al. [13] further project ROIs into bearing angle (BA) images to show more details, including the corners and edges of ROIs. However, the textures of the obtained BA images are sometimes confusing and become counterproductive. Considering these factors, we integrate BA, depth, and RGB images through the preprocessing procedures to boost performance.
During preprocessing, the ground points are removed from the point cloud and the nonground points are then grouped and projected into the BA image and the depth image. The RGB image corresponding to the clustering result can be generated by KITTI's transformation matrix. With these various representations of ROIs as inputs, we propose a dense residual fusion network for classification.
In summary, the contributions of this article are:
1) In addition to the BA image, we add the depth projection and the RGB image corresponding to the point cloud. Combining the RGB image with the depth image and the BA image diversifies the input information of the neural network.
2) We present the Dense-Residual Fusion Network (DRF-Net). The architecture uses the dense residual block, which is more conducive to transferring information and gradients than the residual module. We also explore the fusion structure of the network and determine through experiments the proportion of dense residual blocks required before and after feature fusion, so that the feature maps are fully extracted before fusion and fully integrated after fusion.
The remainder of this article is organized as follows. Related works are introduced in Section II. Our approach, including the preprocessing procedure and the proposed network architecture, is detailed in Section III. The experimental results and the error analysis are discussed in Section IV. Finally, the conclusion and future work are provided in Section V.

II. RELATED WORK
Convolutional neural networks (CNNs) have shown superior performance in object detection tasks. LeCun et al. [14] propose LeNet with 5 layers, which is regarded as the pioneer of CNNs. Krizhevsky et al. propose AlexNet [15], which applies ReLU, dropout, and max pooling. Using ReLU as the activation function alleviates the vanishing gradient problem, makes the training more efficient, and improves the classification accuracy. The dropout mechanism helps prevent overfitting during training. Applying the pooling mechanism not only downsamples the feature map but also extracts higher level features. To achieve better performance, the structure of the CNN grows deeper and deeper. Simonyan and Zisserman propose VGG-Net [16], which has 11 to 19 weight layers.
As a network becomes deeper, the degradation problem follows. In addition, it is hard to ensure that the features transmitted to the next layer represent better than those of the previous layers during the forward pass. He et al. [17] propose the Deep Residual Network (ResNet) which applies skip connections to achieve identity mapping. The goal of residual learning is to learn the difference between the output and the input information. Thanks to the success of the residual learning, ResNet won 1st place in the ILSVRC 2015 classification task with more than 100 layers. Based on the concept of skip connections, Huang et al. [18] propose dense connected network (DenseNet) which uses dense connections to allow feature reuse. Comparing the ways of information combination, ResNet uses element-wise addition, while DenseNet applies concatenation in the direction of the channel dimension. By so doing, DenseNet can increase the variation of input to enhance the representation capability and thus can achieve a lower error rate on the ImageNet dataset than ResNet. In this work, we propose the dense residual block (DRB) to integrate the advantages of feature refinement by the ResNet and feature reuse by DenseNet.
For object detection, R-CNN [2] first introduces region proposals within the image to detect multiple objects. Instead of classifying each region of interest (ROI) separately, Fast R-CNN [19] proposes ROI pooling, which shares the feature maps among the ROIs. Faster R-CNN [20] replaces selective search [21] with the region proposal network (RPN), which locates the ROIs using a CNN and is thus more efficient. YOLO [22] achieves real-time object detection with a one-stage detector that conducts object position detection and object classification in one step. Lin et al. [23] propose the feature pyramid network (FPN) to detect objects of different sizes. The fully convolutional network (FCN) [24] can extract features from input images of different sizes and perform semantic segmentation.
The CNN-based methods have also been widely applied to autonomous driving with LiDAR. The MV3D [25] takes a bird's eye view, front view, and RGB images as inputs to extract feature maps which are then gathered together by a fusion network. Börcs et al. [12] proposed a method to detect vehicles and pedestrians by using depth images. Lin et al. [13] project the clustered point cloud into bearing angle images (BA images) for classification. As such, DRF-Net further takes the advantages of the above methods to make the point cloud classification more accurate.

III. APPROACH
We use part of the KITTI raw dataset [26], whose point clouds were collected with a Velodyne HDL-64E, as training and testing data. The KITTI dataset contains various urban and suburban scenes that are very suitable for our research. The Velodyne HDL-64E is a multi-beam LiDAR with 64 layers, and each layer contains 2,084 points, so there are 133,376 points in a scene and 64 points in each scanline. The preprocessing pipeline shown in Fig. 1 consists of four steps: (a) remove ground points via a ground point detection algorithm; (b) produce regions of interest (ROIs) from the point cloud; (c) select segmented ROIs corresponding to the RGB image; and (d) project the point cloud into bearing angle images and depth images. The BA, depth, and RGB images obtained by the preprocessing are classified by the dense residual fusion network. We detail each step and the network architecture in the following subsections.

A. ADJUSTED THRESHOLD FOR GROUND POINT DETECTION
In a point set, ground points account for 30% to 50% of the point cloud. Because of its efficiency, we follow the method described in [13] for ground point detection. We calculate the height of the scanning point to locate the first ground point on each scanline. As shown in Fig. 3, P_i is the scanning point. With the height H of the sensor, the distance l_i between the LiDAR and the scanning point, and the angle θ_i between l_i and the horizontal plane, we can obtain the height H_p of the scanning point. If H_p is smaller than the threshold height H_th, which is set to 15 cm, the point P_i can be regarded as a ground point. The range of the angle θ_i is fixed by the specification of the Velodyne HDL-64E LiDAR. The ground point detection method considers the height of the scanning point P_i to determine the first ground point. After detecting the initial ground point, the next ground point on the scanline is determined by the slope between the former and the next point.
Suppose a ground point is labeled P_1(x_1, y_1, z_1), and P_2(x_2, y_2, z_2) is the next point. The slope θ between them is defined from their coordinates. We observe that the points near the LiDAR are denser than those far from the LiDAR. When two consecutive points are scanned close to the LiDAR, the distance between them is shorter, and when they are scanned far from the LiDAR, the distance between them is longer. To compensate for the effect of distance on the slope, we adjust the threshold slope T_adjust according to d_(1,2), the distance between P_1 and P_2. Here d_c and d_f are the predetermined distances of the near area and the far area, respectively, T_0 is the threshold slope when the distance is between d_c and d_f, and α and β are constants. If the slope θ is smaller than the threshold slope T_adjust, the next point is considered to be a ground point.
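The two tests in this subsection can be illustrated with a minimal sketch. The exact expressions for the point height, the slope, and the adjusted threshold are not reproduced in this text, so the forms below are assumptions: the height is taken as H - l_i sin(θ_i) with θ_i measured downward from the horizontal plane, the slope is the elevation angle of the segment P_1P_2, and the threshold is scaled by α in the near area and by β in the far area. All function names are hypothetical.

```python
import math

def point_height(sensor_height, l_i, theta_i):
    """Height of a scanning point above the ground (metres).

    Assumed form: theta_i is in radians, measured downward from the horizontal.
    """
    return sensor_height - l_i * math.sin(theta_i)

def is_first_ground_point(sensor_height, l_i, theta_i, h_th=0.15):
    """A point lower than the threshold height (15 cm) starts the ground on a scanline."""
    return point_height(sensor_height, l_i, theta_i) < h_th

def slope_deg(p1, p2):
    """Assumed slope: elevation angle (degrees) of the segment from p1 to p2."""
    dx, dy, dz = (b - a for a, b in zip(p1, p2))
    return math.degrees(math.atan2(abs(dz), math.hypot(dx, dy)))

def adjusted_threshold(d12, t0, d_c, d_f, alpha, beta):
    """Assumed piecewise threshold: alpha*t0 in the near area, beta*t0 in the far area."""
    if d12 < d_c:
        return alpha * t0
    if d12 > d_f:
        return beta * t0
    return t0

def is_next_ground_point(p1, p2, t0, d_c, d_f, alpha, beta):
    """The next point is ground if its slope is below the distance-adjusted threshold."""
    d12 = math.dist(p1, p2)
    return slope_deg(p1, p2) < adjusted_threshold(d12, t0, d_c, d_f, alpha, beta)
```

The values of T_0, d_c, d_f, α, and β are left as arguments because the text does not state them.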

B. ROI PRODUCED BY FLOOD-FILL ALGORITHM
After the nonground points are labeled, they are clustered by the flood-fill algorithm [27]. The flood-fill algorithm is composed of two steps. In the first step, nonground points are clustered in each scanline. A threshold distance d_1 determines whether two consecutive nonground points belong to one cluster. If the distance between the points is smaller than d_1, they are assigned to the same cluster.
In the second step, the clusters in each scanline are grouped into objects. Two threshold distances, d_h and d_v, are used to determine whether the clusters belong to the same object in the horizontal and vertical directions.
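The first clustering step can be sketched as a single pass over the nonground points of one scanline. This is an illustration only: the function name and the point representation (tuples of coordinates in scan order) are assumptions, and the second step, which merges clusters across scanlines using the two further thresholds, is not shown because its matching rule is not specified here.

```python
import math

def cluster_scanline(points, d1):
    """Group consecutive nonground points of one scanline.

    points: list of (x, y, z) tuples in scan order.
    d1: threshold distance; a gap of d1 or more starts a new cluster.
    """
    clusters = []
    for p in points:
        # Compare with the last point of the most recent cluster.
        if clusters and math.dist(clusters[-1][-1], p) < d1:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters
```

For example, five points with a single large gap between the third and the fourth yield two clusters of sizes 3 and 2.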

C. TRANSFORM THE 3D-POINT CLOUD INTO BEARING ANGLE AND DEPTH IMAGE AND OBTAIN RGB IMAGE
The depth image represents the proportion of the distance in gray level. The pixel value is defined from d_farthest, the distance of the farthest point in the point cloud, and d_i, the distance between the current point and the LiDAR. According to [28], the bearing angle image (BA image) represents more details in the point cloud. Fig. 4 shows the angle θ_i between the laser beam and the line segment d_(P_i,P_i+1) of two consecutive points P_i and P_i+1. To transform the point cloud into a BA image, θ_i is computed from l_i and l_i+1, the distances of P_i and P_i+1, respectively, measured from the LiDAR, and from d_(P_i,P_i+1), the length of the line segment connecting P_i and P_i+1. The pixel value of each point is then calculated from θ_i. To obtain the ROI of the RGB image, the transformation matrix provided by the KITTI dataset [26] is adopted. Coordinates of the points are projected onto the RGB image, and we segment the part that corresponds to the ROI. With the RGB image, BA image, and depth image, the next stage is classifying the input images using DRF-Net. Fig. 2 shows an exemplary frame with the ROI along with the depth image, the BA image, and the RGB image. The depth image shows the contour of the frame, and the BA image contains more details that make the picture look more three-dimensional.
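The two projections can be illustrated as follows. The pixel-value formulas are not reproduced in this text, so this sketch assumes a linear mapping to 8-bit gray levels: d_i / d_farthest scaled to 0..255 for the depth image and θ_i / π scaled to 0..255 for the BA image. The bearing angle itself is taken at P_i and follows from the law of cosines in the triangle formed by the LiDAR, P_i, and P_i+1. Function names are hypothetical.

```python
import math

def depth_pixel(d_i, d_farthest):
    """Assumed depth mapping: gray level proportional to distance."""
    return round(255 * d_i / d_farthest)

def bearing_angle(l_i, l_next, d_seg):
    """Angle (radians) at P_i between the laser beam and the segment P_i -> P_i+1.

    l_i, l_next: ranges of P_i and P_i+1 measured from the LiDAR.
    d_seg: length of the segment connecting the two points.
    """
    cos_theta = (l_i ** 2 + d_seg ** 2 - l_next ** 2) / (2.0 * l_i * d_seg)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)

def ba_pixel(theta):
    """Assumed BA mapping: gray level proportional to the angle in [0, pi]."""
    return round(255 * theta / math.pi)
```

With ranges 3 and 5 and a segment of length 4, the triangle is right-angled at P_i, so `bearing_angle(3, 5, 4)` evaluates to π/2.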

D. NETWORK ARCHITECTURE
The information extracted from different input formats needs to be carefully fused. Two fusion strategies, early fusion and late fusion, are considered. The early fusion strategy allows information to be fused at the front feature level. The late fusion strategy combines different local decisions from different sources to avoid dominance by one of the input formats. We find in extensive experiments that early fusion performs better than late fusion, so we adopt early fusion in our DRF-Net. The DRF-Net architecture is shown in Fig. 5. The network consists of three dense-residual blocks to extract features from each input source. The features are concatenated and processed by three convolution layers to learn higher level representations. Fig. 6 shows the structure of the dense residual block (DRB). Each dense-residual block is composed of three residual blocks [17] followed by one max-pooling layer and one convolution layer with a 1 × 1 kernel, which is used to reduce the dimensionality. Batch normalization and ReLU are attached after each convolution layer. Batch normalization (BN) was proposed in [29] to solve the problem of internal covariate shift. BN reduces the sensitivity of the model to network parameters, makes the network learning more stable, and speeds up training.
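The difference between the two fusion strategies can be shown on toy data. In this sketch, early fusion concatenates the feature vectors of the input branches before any classifier sees them, whereas late fusion combines per-branch class probabilities. Averaging is an assumed combination rule for late fusion, and plain lists stand in for feature maps; neither is taken from the DRF-Net implementation.

```python
def early_fusion(features_per_input):
    """Concatenate the feature vectors of all inputs into one vector."""
    fused = []
    for features in features_per_input:
        fused.extend(features)
    return fused

def late_fusion(probs_per_input):
    """Average the class probabilities predicted by each input branch (assumed rule)."""
    n = len(probs_per_input)
    return [sum(column) / n for column in zip(*probs_per_input)]

# Toy example with three branches (BA, depth, RGB) and three classes.
ba_feat, depth_feat, rgb_feat = [0.2, 0.7], [0.5], [0.1, 0.9]
fused = early_fusion([ba_feat, depth_feat, rgb_feat])

branch_probs = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.5, 0.3, 0.2]]
decision = late_fusion(branch_probs)
```

In early fusion, the layers after concatenation can still learn interactions between the inputs. In late fusion, each branch has already committed to a decision before the sources are combined.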
The residual block includes two convolution layers with a shortcut. The output of the ith residual block is computed from x_i, the input of the residual block, and W_i,c, the cth convolution layer in the ith residual block, where BN denotes batch normalization and σ is the ReLU activation function. As the network gets deeper, low-level information tends to disappear after multiple stacked layers. To reuse feature maps and increase information flow, we add dense connections. The output of the lth dense residual block, DRB_l, is obtained by applying the 1 × 1 convolution layer W_l,(1×1) to M_l, where M_l = MaxPool(x_l ⊕ R_l,i ⊕ R_l,i+1 ⊕ R_l,i+2) is the output of the max-pooling layer and x_l is the input feature map of the lth DRB. The symbol ⊕ denotes the concatenation operation, and R_l,i is the output of the ith residual block in the lth DRB.
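For reference, the block definitions can be written out in one plausible form. This is a reconstruction from the symbol definitions in this subsection and not necessarily the exact published equations. In particular, placing the final ReLU after the shortcut addition follows the original residual block of [17] and is an assumption, as is the use of BN and ReLU after the 1 × 1 convolution.

```latex
% Residual block i (two convolution layers with a shortcut); assumed ordering of BN and ReLU
R_i = \sigma\Big( \mathrm{BN}\big( W_{i,2} * \sigma\big( \mathrm{BN}( W_{i,1} * x_i ) \big) \big) + x_i \Big)

% Dense connections inside the l-th DRB: concatenate the input with all three residual outputs
M_l = \mathrm{MaxPool}\big( x_l \oplus R_{l,i} \oplus R_{l,i+1} \oplus R_{l,i+2} \big)

% Output of the l-th DRB: 1x1 convolution reduces the channel dimension of M_l
\mathrm{DRB}_l = \sigma\big( \mathrm{BN}( W_{l,(1\times 1)} * M_l ) \big)
```

If the input and every residual output have C channels, the concatenation inside M_l has 4C channels, which is why the 1 × 1 convolution is needed to bring the dimensionality back down.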
The network is trained with the focal loss, FL(P_t) = -(1 - P_t)^γ log(P_t), where P_t denotes the probability of the final prediction and γ is the parameter that downweights the well-classified examples. We set γ = 2 in our experiments.
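The effect of γ can be checked numerically. The sketch below assumes the standard focal loss form FL(P_t) = -(1 - P_t)^γ log(P_t) without a class-balancing weight; function names are hypothetical.

```python
import math

def cross_entropy(p_t):
    """Cross entropy for a single example, given the probability of the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by (1 - p_t) ** gamma."""
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)

# Ratio of focal loss to cross entropy for an easy and a hard example.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # (1 - 0.9) ** 2, about 0.01
hard_ratio = focal_loss(0.3) / cross_entropy(0.3)   # (1 - 0.3) ** 2, about 0.49
```

With γ = 2, a well-classified example (P_t = 0.9) keeps about 1% of its cross-entropy loss, while a hard example (P_t = 0.3) keeps about 49%, so training gradients are dominated by the hard examples.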

IV. EXPERIMENTS
Five scenes from the KITTI dataset [26] are adopted in our experiments. The raw data from Residential 2011_09_26_drive_0035 and Campus 2011_09_28_drive_0021 comprise the training set. The testing set contains the raw data from Residential 2011_09_26_drive_0020, 2011_09_30_drive_0027, and Campus 2011_09_28_drive_0039. The input images are divided into three categories: pedestrians, cars, and street clutter. In our dataset, we classify cyclists as pedestrians. In total, there are 2,000 images in the training set and 1,200 images in the testing set (400 images in each category). The network is trained with the Adam optimizer (AdamOptimizer in TensorFlow), with the parameters β_1 and β_2 set to 0.9 and 0.99, respectively. The learning rate is set to 0.0005. We run the proposed DRF-Net on a GTX 1080 Ti GPU with a batch size of 16 for 200 epochs. The input images are resized to a fixed resolution of 96 × 96.

A. ABLATION STUDIES
In Table 1, we investigate the impact of various input combinations on the output accuracy. When taking the RGB image as the single input, the performance of our proposed network is better than the other alternatives. The depth image shows the texture of the point cloud in 2D. The BA image enhances the details of outlines and corners. The RGB information makes it easier to recognize the object in each ROI. The fusion of the BA image with the RGB image improves the accuracy from 90.25% to 96.50%. The fusion of the depth image with the BA image makes the extracted features more robust, and hence improves the accuracy from 87.17% to 92.08%. The fusion of the depth image with the RGB image improves the accuracy from 87.17% to 96.80%. Fusing all three types of features leads to the best performance of accuracy 97.75%. In short, entering three types of images simultaneously for classification can indeed improve overall detection accuracy.
To compare the DRB with the residual block, Table 2 shows the results of the 2-input fusion networks that apply the residual block [17] and the dense residual block, respectively. We use only 2 input sources to save training time. The model using dense residual blocks achieves better average accuracy than the model using residual blocks by 1.4% to 6.4%. Dense connections not only transmit more information within a block but also efficiently prevent both vanishing and exploding gradients.
After concatenating the feature maps extracted from different input sources, we need to combine them for feature integration. As Table 3 shows, adding a dense residual block after feature fusion improves the accuracy from 97.75% to 97.92%. We accordingly take our DRF-Net with an extra DRB as the baseline model in the following experiments. Fig. 7 illustrates different combinations of DRB numbers used before and after fusion that may affect the accuracy. In Table 4, the accuracy improves as the number of DRBs before fusion increases from (a) to (d). It can be observed that the more DRBs are used before fusion, the better the extracted features. In (e), the decreased accuracy shows the importance of feature integration after fusion. In (f), a decision fusion structure is used, which combines the predictions from each input and obtains worse accuracy than (d) and (e). As a result, fusing the information at the feature level is more appropriate than fusing at the decision level. We find that model (d) has the best accuracy of 97.92%, and thus we choose (d) as our final model.
The loss function plays an important role in training. Compared with the cross entropy, the focal loss pays more attention to hard examples. Table 5 compares the effects of the focal loss and the cross entropy used to train our best model (Fig. 7(d)). Although the focal loss lowers the precision for classifying cars by 1.2%, it improves the accuracy for pedestrians and street clutter by approximately 3.25% and 0.75%, respectively. In general, the focal loss appears to be more suitable than the cross entropy for training the proposed model.
We also investigate the impact of the attention mechanism. We adopt the attention module [31] proposed by Hu et al. to learn which channels in the DRBs deserve more weight. Table 3 shows that the scheme with the channel attention (CA) mechanism decreases the accuracy by 0.09%. The reason for the accuracy drop may be that the channel weights computed by the sigmoid function are always smaller than 1, and consecutive multiplications with these weights make the feature values smaller and smaller. Consequently, the global average pooling may fail to extract features as the basis, and hence we do not adopt the attention mechanism in the proposed DRF-Net.
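The shrinking effect described above can be reproduced with a toy version of channel attention. This sketch is a deliberate simplification and not the module of [31]: it keeps the global average pooling and the sigmoid gate but omits the two fully connected layers, and it represents a feature map as a list of channels with flat lists of activations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_map):
    """Scale each channel by a sigmoid gate computed from its global average."""
    scaled = []
    for channel in feature_map:
        squeezed = sum(channel) / len(channel)   # global average pooling
        weight = sigmoid(squeezed)               # gate strictly between 0 and 1
        scaled.append([weight * value for value in channel])
    return scaled

def mean_abs(feature_map):
    """Mean absolute activation over all channels."""
    values = [abs(v) for channel in feature_map for v in channel]
    return sum(values) / len(values)

# Apply the gate three times in a row and track the activation magnitude.
fmap = [[1.0, 2.0], [0.5, -0.5]]
magnitudes = [mean_abs(fmap)]
for _ in range(3):
    fmap = channel_attention(fmap)
    magnitudes.append(mean_abs(fmap))
```

Because every gate is below 1, `magnitudes` decreases at each step, which is consistent with the explanation given for the accuracy drop. In a real network, batch normalization and learned weights may counteract this effect, so the sketch illustrates the mechanism rather than proving the cause.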
With the widespread use of LiDAR, many new architectures have been proposed to classify point clouds. For comparison, the point sets segmented by our preprocessing procedure are input to these models. Table 6 lists the accuracies of the models proposed by PointGCN [7], PointNet [5], PointNet++ [6], PointHop [9], DGCNN [8], and Lin et al. [13]. The model by Lin et al. [13] projects the point clouds into BA images, while those of DGCNN and PointGCN convert the point clouds into graph signals. The other models input point clouds to the neural networks without preprocessing. For a fair comparison, we also list the results of our DRF-Net without the RGB input image in Table 6. Compared with other models, our model maintains stable classification accuracy for all three classes and achieves an average accuracy of 92.08%. The inclusion of RGB information further enhances the accuracy for street clutter and pedestrians and reaches the best accuracy of 97.92%.

B. ERROR ANALYSIS
We analyze the classification error of our proposed 3-input model by tracking the distribution of prediction results in each category in Fig. 8. Our proposed model performs well in identifying pedestrians. Most of the mispredictions in this category occur when pedestrians are classified as street clutter, accounting for 2 mispredictions (0.5%). As shown in the RGB image of Fig. 9(c), the pedestrian overlaps with the other person's hand. This misclassification comes from the reduced resolution due to the long distance between the LiDAR and the object. The error rates of regarding cars as pedestrians and as street clutter are similar. Most of the false predictions take place in scenes where street clutter is regarded as a car, accounting for 11 mispredictions (2.75%). Fig. 10 shows an example of street clutter that is misclassified as a car due to the overlapping of different objects. To improve accuracy, the preprocessing may need to include more semantic information from RGB images to help segment the ROIs.

V. CONCLUSION
In this article, we propose a framework that projects a 3D point cloud into 2D images as input to the DRF-Net. The DRF-Net leverages dense residual blocks to extract features from multiple input sources, which in turn leads to better classification performance. We also explore the fusion structures to further improve the accuracy of classification. Compared to other classification models, the proposed DRF-Net achieves better accuracy by transforming the point cloud into BA and depth images. For future work, we need to tackle similar issues faced by the R-CNN. For example, there are too many ROI selection processes that may incur excessive computations. Since all selected ROIs have to perform classification by a neural network, the system memory may run out quickly. Inspired by the Faster R-CNN, we would attempt to combine RPN with ROI to implement a more practical classification scheme.