Transformable Dilated Convolution by Distance for LiDAR Semantic Segmentation

LiDAR semantic segmentation is essential to autonomous vehicle safety. A rotating 3D LiDAR projects more laser points onto nearby objects and fewer points onto farther objects. Therefore, when the points are projected onto a 2D image, such as in spherical coordinates, nearer objects appear larger than more distant objects. Recognizing a nearer object requires a larger receptive field, whereas recognizing a farther object requires a smaller receptive field. However, existing CNNs use the same receptive field everywhere, making it difficult to express objects of various sizes with a single-sized receptive field and restricting performance on larger (or nearer) objects that require a larger receptive field. In response to these limitations, we propose a transformable dilated convolution (TD Conv) that adjusts the convolution filter's size according to the input distance. Leveraging the distance information of LiDAR and dilated convolution, a large kernel is applied to nearby objects and a small kernel to farther objects. The proposed method yielded good performance when recognizing nearer or larger objects, such as roads and buildings, and performed similarly to the conventional method on farther or smaller objects. We evaluated the proposed method on the SemanticKITTI dataset.


I. INTRODUCTION
Semantic segmentation is used to recognize various complex traffic environments. Several studies [1], [2], [3], [4] on 2D image-based road environments were conducted following the release of the Cityscapes [5] and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [6] datasets. However, 2D image-based semantic segmentation methods are limited in measuring the exact distances between vehicles and road obstacles.
Accurate measurements of these distances are essential to autonomous vehicle safety. With the recent release of the LiDAR point-based semantic segmentation datasets SemanticKITTI [7] and nuScenes [8] for road environments, research on point-based semantic segmentation of road environments is being actively conducted [9]. In this study, we consider 3D LiDAR-based semantic segmentation to overcome the limitations of 2D methods.
LiDAR point semantic segmentation methods can be divided into point-based, voxel-based, and range-based methods. Other methods process LiDAR data by converting them into graph-type data [32], [33], [34], [35]; however, these methods are not widely used for autonomous vehicles because processing becomes time consuming when a scene contains many points.
Point-based methods extract and directly process each point's features through a multilayer perceptron (MLP)-based network. Methods such as PointNet [10] and PointNet++ [11] typically process indoor point clouds, while RandLA-Net [12] and kernel point convolution (KPConv) [13] process outdoor point clouds. However, MLP-based networks slow down when many points are processed, as in a road environment. Real-time performance is difficult to achieve with these methods because the high-channel LiDAR mainly used in autonomous vehicles produces many points to process. If the data are reduced to minimize computation, performance degrades and the results become difficult to use.
The voxel-based method uses a 3D convolution by dividing the 3D space into a predetermined grid. This type of method includes VoxelNet [14], attentive feature fusion with adaptive feature selection for sparse semantic segmentation network ((AF)2-S3Net) [15], TORNADO-Net [16], and sparse point-voxel neural architecture search (SPVNAS) [17]. However, the data suitable for this method are limited, and the convolution operation over a mostly empty grid is inefficient. In addition, 3D convolution is slower than 2D convolution because it requires more computation.
The range-based method processes points by projecting the 3D point data onto a 2D image using spherical coordinates so that 2D convolutional neural networks (CNNs) can be applied directly [18], [19]. It is the most widely used of the three approaches because it requires the least computation. The RangeNet [20] and SalsaNext [21] methods use an encoder-decoder structure and improve their performance with CNN-based postprocessing. However, the standard CNN has a fixed-size kernel, i.e., the same receptive field regardless of the object's size. In addition, 3D spatial information is lost when the data are projected to 2D.
Therefore, it is difficult to express various types of objects, and convolution methods that are more flexible than the standard convolution have been proposed. Deformable convolutional networks [22] dynamically learn the sampling locations by predicting offsets for each image location. Pixel-adaptive convolution [23] multiplies the filter weights by a spatially varying kernel to make the standard convolution content adaptive. ADCNN [24] acquires varied spatial information by learning a dilation coefficient for each location. Since these methods are adaptive convolutions designed for 2D images, they do not consider the properties of 3D point cloud data.
In this paper, we propose the TD (transformable dilated) convolution method for LiDAR semantic segmentation. This method takes advantage of the characteristics of 3D LiDAR data: the data are radial, objects closer to the sensor appear larger, and objects farther from the sensor appear smaller. 1) Considering these features, TD Conv uses dilated convolution [25] to apply a large kernel to near objects and a small kernel to far objects. 2) In addition, the proposed TD Conv is inserted into the spatial attention block structure of SqueezeSegV3 [26] to compensate for the 3D information lost in the projected 2D data, improving performance. Moreover, the inference time is reduced by replacing the 7 × 7 convolution kernel of the spatial attention block [26] with the 3 × 3 TD Conv proposed in this study. The contributions of this paper can be summarized in two parts. 1) We propose TD convolution, which is computationally efficient because it considers the characteristics of 3D LiDAR data. 2) By replacing the SAC (spatially adaptive convolution) module of SqzV3 with the proposed TD block, the speed of the network and the mIoU for objects close to the LiDAR are improved.
The remainder of this paper is organized as follows. Section II describes the data transformation process, Section III presents the proposed convolution, Section IV presents the network structure, and Section V describes our experimental results on SemanticKITTI.

II. DATA TRANSFORMATION
The SemanticKITTI dataset consists of 3D Cartesian coordinate data, where the LiDAR data are a set of points scanned by laser scanners. Let p_k ∈ R^{1×4} (k = 1, · · · , K) be the k-th point of the LiDAR data:

p_k = [x_k  y_k  z_k  i_k],

where x_k, y_k, z_k are the Cartesian coordinates of the point and i_k is the intensity. Then, the point cloud is represented by the row vector P ∈ R^{1×4K}:

P = [p_1  p_2  · · ·  p_K].

We transform the LiDAR point cloud data into spherical coordinates before applying them to a 2D CNN.

A. SPHERICAL COORDINATE TRANSFORMATION
The spherical coordinate system is a coordinate system for 3D space in which the position of a point is specified by the radial distance (r), the azimuth angle (θ), and the elevation angle (ϕ). Note that the azimuth angle lies in the horizontal plane and the elevation angle in the vertical plane. A spherical coordinate system has been used in several studies to process ground LiDAR data, and the original raw data of a 3D LiDAR are naturally expressed in spherical coordinates (Fig. 1). In the LiDAR coordinate system {L}, the azimuth angle (θ) is in the horizontal (X-Y) plane, and the elevation angle (ϕ) is in the vertical (Y-Z) plane. The scanning resolution of the LiDAR in the horizontal direction determines the resolution of the azimuth angle (Δθ), and the number of LiDAR channels in the vertical direction determines the resolution of the elevation angle (Δϕ).
When a LiDAR data point (x_k, y_k, z_k, i_k) is projected onto the spherical coordinates (θ, ϕ), we define (x_{θϕ}, y_{θϕ}, z_{θϕ}, i_{θϕ}) as the values stored at that position. The spherical coordinates corresponding to (x_k, y_k, z_k) are obtained by

r_k = √(x_k² + y_k² + z_k²),
θ = arctan(y_k / x_k),
ϕ = arcsin(z_k / r_k).

If the (θ, ϕ) space is divided into an H × W grid, then the X-channel, Y-channel, Z-channel, and I-channel of the LiDAR data can be represented by the matrices S_X, S_Y, S_Z, and S_I ∈ R^{H×W}, whose (θ, ϕ) entries are x_{θϕ}, y_{θϕ}, z_{θϕ}, and i_{θϕ}, respectively. Fig. 2 shows the final input data of the LiDAR data projected in spherical coordinates; in total, four channel images are used for semantic segmentation.
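As a concrete sketch, the projection above can be implemented as follows. The vertical field-of-view bounds (fov_up, fov_down) and the last-point-wins scatter assignment are illustrative assumptions, not values given in the paper:

```python
import numpy as np

def project_to_spherical(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 4) array of LiDAR points (x, y, z, intensity) onto
    an H x W spherical grid, returning the channel images S_X, S_Y, S_Z, S_I.
    fov_up / fov_down (degrees) are assumed sensor parameters."""
    x, y, z, inten = points.T
    r = np.sqrt(x**2 + y**2 + z**2)              # radial distance
    theta = np.arctan2(y, x)                     # azimuth, in [-pi, pi]
    phi = np.arcsin(z / np.maximum(r, 1e-8))     # elevation
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    # map angles to pixel coordinates (u: width/azimuth, v: height/elevation)
    u = np.clip((0.5 * (1.0 - theta / np.pi) * W).astype(int), 0, W - 1)
    v = np.clip(((1.0 - (phi - fd) / (fu - fd)) * H).astype(int), 0, H - 1)
    S = np.zeros((4, H, W), dtype=np.float32)    # S_X, S_Y, S_Z, S_I
    S[:, v, u] = np.stack([x, y, z, inten], axis=0).astype(np.float32)
    return S
```

Points that fall into the same cell simply overwrite one another here; a real pipeline would typically keep the nearest point per cell.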

III. TRANSFORMABLE-DILATED CONVOLUTION
In this study, we propose a convolution that changes the kernel size to apply a receptive field that fits the object size. As seen in Fig. 3, the rotating 3D LiDAR has a rotation resolution of Δθ and radial data characteristics. Therefore, many LiDAR points are projected onto a nearby object and fewer points onto an object that is farther away, as confirmed in Fig. 4, where the LiDAR data are transformed to spherical coordinates. Thus, the receptive field required to recognize a particular object differs from object to object.
The spherical coordinate data using the LiDAR data can be used to measure the distance value, making it possible to approximate an object's size. The proposed TD Conv changes the size of the convolution kernel according to the distance using such features.
The number of computations increases if the kernel size is increased. Therefore, dilated convolution, a method of expanding the receptive field without increasing the number of computations, is used. As the dilation factor is adjusted  according to the distance, the kernel size is also consequently adjusted.

A. STANDARD CONVOLUTION & DILATED CONVOLUTION
In a 2D image of size H × W, the conventional convolution (a 3 × 3 kernel with a dilation of 1) at position (θ, ϕ) is calculated as

f_{θ,ϕ} = Σ_{i=−1}^{1} Σ_{j=−1}^{1} w_{i,j} · a_{θ+i, ϕ+j},

where a_{θ,ϕ} and f_{θ,ϕ} are the input and output features at position (θ, ϕ), respectively, w is the trained weight of the convolution, and i, j are the kernel coordinates. Let n be the dilation factor; then, the dilated convolution operation can be expressed as

f_{θ,ϕ} = Σ_{i=−1}^{1} Σ_{j=−1}^{1} w_{i,j} · a_{θ+n·i, ϕ+n·j}.

Fig. 5 shows the difference between the standard convolution and the dilated convolution, with dilation factors of 1 and 2, respectively. The number of weights, and thus of computations, is the same for the standard convolution and the 2-dilated convolution; however, the receptive field of the dilated convolution is wider. Therefore, the receptive field can be controlled effectively.
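The two formulas can be sketched as one single-channel NumPy routine; the zero "same" padding is an assumption for illustration:

```python
import numpy as np

def dilated_conv2d(a, w, n=1):
    """3 x 3 convolution with dilation factor n and zero 'same' padding:
    f[t, p] = sum over i, j in {-1, 0, 1} of w[i+1, j+1] * a[t + n*i, p + n*j].
    n = 1 recovers the standard convolution."""
    H, W = a.shape
    ap = np.pad(a, n)                    # zero padding of width n
    f = np.zeros((H, W))
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            f += w[i + 1, j + 1] * ap[n + i*n : n + i*n + H,
                                      n + j*n : n + j*n + W]
    return f
```

Both `n=1` and `n=2` use the same nine weights, but the dilation-2 call gathers context from a 5 × 5 neighborhood rather than 3 × 3, which is exactly the receptive-field widening described above.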

B. TRANSFORMABLE DILATED CONVOLUTION
The TD Conv method proposed in this paper is shown in Fig. 6. The order of the transformable convolution at position (θ, ϕ) is as follows: 1) Calculate the dilation factor from the distance data and create a dilation index map. 2) From the dilation index map, create a dilated conv kernel that fits the dilation factor at position (θ, ϕ). 3) Execute the convolution operation using the created dilated conv kernel.

The distance value d_{θϕ} of the distance map is calculated by

d_{θϕ} = √(x_{θϕ}² + y_{θϕ}² + z_{θϕ}²),

and the LiDAR distance data can be represented by the matrix S_D ∈ R^{H×W}. The dilation factor k_{θϕ} in the dilation index map is calculated from d_{θϕ} by

k_{θϕ} = max(1, ⌊a · d_{θϕ} + c⌋),   (17)

where c is the maximum dilation factor and a is the (negative) gradient relating distance to the maximum dilation factor; the minimum dilation factor is 1, and a floor function is applied. Equation (17) is plotted in Fig. 7. The dilation index data can be represented by the matrix S_K ∈ R^{H×W}. Let the proposed TD Conv function be called F; then it can be expressed as

f_{θϕ} = F(a_{θϕ}, k_{θϕ}),
where a_{θϕ} and k_{θϕ} are the input feature value and dilation factor of the proposed convolution at position (θ, ϕ), and f_{θϕ} is its output. Using (17), an example of the convolution kernel changing with distance when a is −0.05 and c is 5 is shown in Table 1: the dilation factor is 4 from 0 m to 10 m, 3 from 20 m to 30 m, 2 from 30 m to 40 m, and 1 from 50 m to 80 m.
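Under the reading of (17) used above, i.e., k = max(1, ⌊a·d + c⌋) (a reconstruction of the garbled equation, so treat the exact form as an assumption), the dilation index map is one vectorized step:

```python
import numpy as np

def dilation_index_map(S_D, a=-0.05, c=5):
    """Elementwise dilation factor k = max(1, floor(a * d + c)) over the
    distance map S_D; a is the (negative) slope and c the maximum factor."""
    return np.maximum(1, np.floor(a * S_D + c)).astype(int)
```

With a = −0.05 and c = 5, a point at 0 m gets the maximum factor 5, a point at 30 m gets 3, and anything beyond 80 m is clamped to the minimum factor 1.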
The dilation factor is calculated from the distance through (17), and a convolution kernel is generated according to the dilation factor. The proposed convolution thus creates a large kernel at near distances and a small kernel at far distances so that the receptive field can be adjusted to the object's size. Fig. 8 shows the proposed network structure using TD Conv. The inputs are S_X, S_Y, S_Z, S_I, i.e., the 4-channel images projected in spherical coordinates, and the distance data S_D. The size of the output data is 20 × 64 × 2048 because the output has to classify 20 classes.
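One straightforward (if not the fastest) way to realize TD Conv is to evaluate the dilated convolution at every factor from 1 to c and gather, at each position, the output matching the dilation index map. This sketch assumes a single channel and weights shared across dilation factors:

```python
import numpy as np

def dilated_conv2d(a, w, n):
    """3 x 3 convolution with dilation n and zero 'same' padding."""
    H, W = a.shape
    ap = np.pad(a, n)
    f = np.zeros((H, W))
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            f += w[i + 1, j + 1] * ap[n + i*n : n + i*n + H,
                                      n + j*n : n + j*n + W]
    return f

def td_conv(a, w, k_map, c=5):
    """Transformable dilated convolution sketch: at each (theta, phi)
    position, take the dilated-convolution output whose dilation factor
    equals the corresponding entry of the dilation index map k_map."""
    outs = np.stack([dilated_conv2d(a, w, n) for n in range(1, c + 1)])
    idx = np.clip(k_map, 1, c) - 1                 # 0-based selection
    H, W = a.shape
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    return outs[idx, rows, cols]
```

A production version would avoid computing all c outputs everywhere, but the selection semantics are the same.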

IV. NETWORK STRUCTURE
The overall network uses an encoder-decoder structure, and the encoder's features are added to the decoder through skip connections [31], [32]. This structure follows SqueezeSegV3, with the SAC module of the existing SqzV3 encoder replaced by the proposed TD block.
The encoder consists of TD blocks and downsampling. The TD block is a convolution module that applies TD Conv, and a 2 × 2 convolution with a horizontal stride of 2 is used for downsampling, reducing only the width. Only the width is reduced because the height of the converted input data is already small (64).
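The width-only downsampling can be sketched as a 2 × 2 convolution with stride (1, 2); the single-channel form and the bottom/right zero padding are assumptions for illustration:

```python
import numpy as np

def downsample_width(x, w):
    """2 x 2 convolution with stride 1 vertically and 2 horizontally:
    a (H, W) map becomes (H, W // 2). One zero row/column is padded at
    the bottom/right so every output position sees a full 2 x 2 window."""
    H, W = x.shape
    xp = np.pad(x, ((0, 1), (0, 1)))
    out = np.zeros((H, W // 2))
    for i in range(2):
        for j in range(2):
            out += w[i, j] * xp[i:i + H, j:j + W:2]
    return out
```

Applied to the 64 × 2048 input, this keeps the 64 rows intact and halves the 2048 columns at each encoder stage.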
In the decoder, a 3 × 3 transpose convolution with a horizontal stride of 2 is used for upsampling. Two residual blocks [30] are used for feature extraction, and a loss is added at each scale for multiscale accuracy; each loss head consists of a 3 × 3 conv and a softmax. The TD block applied to the proposed network is shown in Fig. 9, where c denotes the number of channels, s × s the height and width, and n (n = 3) the size of the convolution kernel. Its inputs are the feature, distance, and coordinate maps of the previous layer. This TD block is a modified version of the spatially adaptive convolution (SAC) module proposed for SqueezeSegV3. The SAC module elementwise multiplies features derived from the 3D coordinates with the previous layer's feature values to focus on 3D information. In addition, the TD block is designed to fit the receptive field to the object's size by applying a convolution that transforms according to the input features and the 3D distance information.
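The wiring of the TD block can be summarized as follows; the two function arguments stand in for the learned TD convolution and the coordinate-attention convolution and are placeholders, not the paper's exact layers:

```python
import numpy as np

def td_block(feat, coords, dist, td_conv_fn, coord_attention_fn):
    """TD block sketch (a modified SAC module): distance-adaptive features
    from TD Conv are elementwise multiplied by an attention map derived
    from the 3D coordinates, and a residual connection is added."""
    f = td_conv_fn(feat, dist)        # receptive field adapted by distance
    att = coord_attention_fn(coords)  # spatial attention from 3D coordinates
    return feat + f * att             # elementwise multiply + residual
```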
TD Conv is first applied to extract features according to distance, and the result is elementwise multiplied to focus on 3D information; a residual connection is then applied before the output. To train the proposed network, we use a multilayer cross-entropy (CE) loss [33], [34], calculated as

L = Σ_{i=1}^{5} CE(out_i, y_i),

where out_i is the output of stage i (i = 1, ..., 5 in Fig. 8) and y_i is the ground truth at the matching scale. For each stage, the output size is 1/8, 1/8, 1/4, 1/2, and 1 times the size of the input data.
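A NumPy sketch of the multiscale CE loss; treating the total as a plain sum of per-stage cross-entropies is an assumption consistent with the description above:

```python
import numpy as np

def multiscale_ce_loss(outputs, labels):
    """Sum of per-stage cross-entropy losses.
    outputs: list of (C, H_i, W_i) logit maps, one per stage;
    labels:  list of (H_i, W_i) integer label maps at matching scales."""
    total = 0.0
    for out, lab in zip(outputs, labels):
        shifted = out - out.max(axis=0, keepdims=True)   # stable log-softmax
        logp = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
        rows = np.arange(lab.shape[0])[:, None]
        cols = np.arange(lab.shape[1])[None, :]
        total += -logp[lab, rows, cols].mean()           # mean CE per stage
    return total
```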

V. EXPERIMENT
We performed the experiments on a computer with an Intel Core i9 10th Gen. CPU and two Nvidia GeForce RTX 3090 GPUs, running Ubuntu 16.04 and PyTorch. Input data of size 64 × 2048 (height by width) from SemanticKITTI were used to evaluate TD Conv. The labeled SemanticKITTI sequences 00-10 were used for training, with sequence 08 held out for validation, while the unlabeled sequences 11-21 were used for testing. Because the input data form a very wide image on the horizontal axis but a short image on the vertical axis, dilation was applied only along the horizontal axis.
The evaluation metrics were the mean intersection over union (mIoU) and the per-class IoU; mIoU is the mean of the IoU over all classes. We used stochastic gradient descent (SGD) as the optimizer with an initial learning rate of 0.001; a 1-epoch warm-up was followed by 150 epochs with a cyclic learning rate scheduler. Table 2 compares the proposed model with previously published methods that project onto 2D images [20], [21], [23], [24], [25], [26]. The proposed method performed better than the previous methods on larger objects, such as roads, parking areas, and sidewalks, because it obtains a receptive field suited to a large object. It also performed well on average-sized objects, such as people and vehicles, with no significant performance reduction.
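The mIoU metric used above computes IoU per class and averages; skipping classes absent from both prediction and ground truth is an implementation choice, not something the paper specifies:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = |pred==c AND gt==c| / |pred==c OR gt==c|,
    averaged over classes with a nonempty union."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```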
However, the method yielded relatively low performance on smaller objects or objects whose details need to be recognized, such as bicyclists, motorcyclists, and traffic signs. This is because detailed recognition becomes difficult when the convolution's receptive field is expanded. Fig. 10 shows the mIoU by distance for the projection methods, evaluated on the 8th sequence, which is the validation set. For comparison, we used a high-resolution network (HRNet) [1], a 2D semantic segmentation method, and SqueezeSegV3 [23], a previous LiDAR projection method. In the proposed method, the proposed convolution was applied to the existing SqueezeSegV3 (Fig. 8).
HRNet is a model with good performance in 2D image-based segmentation and consists of four stages. HRNet + TD Conv applies TD Conv additionally at the starting point of each stage of the existing HRNet; its input and output are the same as those of the proposed network. Fig. 10 shows that the smaller the distance, i.e., the closer the object is to the sensor, the greater the performance improvement. Thus, it can be assumed that the previous methods did not have a sufficient receptive field for nearby objects. HRNet and HRNet + TD Conv greatly improved the overall performance, especially at close distances. Table 3 shows the comparative results between SqzV3 [23] and the proposed method at different distances using the validation dataset. ''Other-ground'' and ''Motorcyclist'' were excluded from the comparison because their IoU was close to 0 for both methods. The comparison was carried out using the same method as in Table 2. From 0 to 20 m, the proposed method outperformed SqzV3 for most classes, indicating that the proposed method yields better results when the object is closer. However, at 20 m the performance was slightly lower for smaller objects such as bicycles and traffic signs.
From 20 m to 50 m, the proposed method showed relatively high IoUs for larger objects such as roads, sidewalks, and parking lots, and similar or slightly lower performance for smaller objects. Overall, the proposed method's mIoU was approximately 3% higher at distances up to 20 m and slightly higher at farther distances. These findings show that the proposed method trades slightly lower performance on smaller objects for higher performance on larger objects. Fig. 11 compares the segmentation results of the proposed method (SqueezeSegV3 + TD Conv) and the existing method (SqueezeSegV3). As indicated by the blue circle in Fig. 11(a) and Fig. 11(c), the existing method could not match the ground truth elements. Likewise, as indicated by the red circles in Fig. 11(a) and Fig. 11(b), the existing method could not distinguish the elements 'Road', 'Parking', and 'Sidewalk', which were close to each other and of similar shape. In contrast, the proposed method accurately determined and distinguished each element regardless of shape similarity and proximity to other elements. Table 4 compares the performance and time for different segmentation input types. The point-based method was represented by the PointNet-based KPConv, the voxel-based method by TORNADO-Net, and the range image-based methods were compared with the proposed method. Speed was measured as the average over the same 1,000 LiDAR scans.
In addition, Table 4 shows that the proposed method achieved the highest speed, 5 fps faster than the existing SqzV3. In the conventional SqzV3, converting the 7 × 7 kernel to the 3 × 3 TD Conv increased the speed by 30%. The total numbers of parameters of SqzV3 and the proposed method were 26 M and 21 M, respectively, making the proposed method approximately 20% lighter. The point and voxel methods achieved better performance than the proposed method; however, their speed was below 10 fps, while the LiDAR acquires data at 10 fps. A method running below 10 fps is therefore difficult to use in an autonomous vehicle.
Overall, the proposed method's segmentation of large objects, such as roads and buildings, closely matched the ground truth, with high intersection over union (IoU) for roads, sidewalks, and buildings, as seen in Table 2. Furthermore, the road, sidewalk, and parking areas were accurately located, consistent with the more accurate results for nearby objects shown in Fig. 10(b).

VI. CONCLUSION
In this paper, a new convolution method for LiDAR point cloud data projected into spherical coordinates was proposed. Since LiDAR distance data are radial in character, when the data are converted to spherical coordinates, nearby objects appear very large and distant objects appear very small. In segmentation and object detection, the receptive field required to recognize an object differs with the object's size. TD Conv was proposed to fit a suitable receptive field to each object.
The proposed method was evaluated on the SemanticKITTI benchmark for LiDAR segmentation, which confirmed that TD Conv performs better than previous projection-based methods. The distance-specific results (Fig. 10) further confirmed that the effect of the proposed method is greater at close distances. However, for distant or small objects, the performance was not significantly improved or was slightly worse.
In future work, research on far-distance attention will be conducted to improve the detection of small objects in spherical LiDAR coordinates. In addition, the proposed method will be applied to 3D object detection and its results examined.