Semantic Point Cloud-based Adaptive Multiple Object Detection and Tracking for Autonomous Vehicles

LiDAR-based Multiple Object Detection and Tracking (MODT) is one of the essential tasks in autonomous driving. Since MODT is directly related to the safety of an autonomous vehicle, it is critical to provide reliable information about the surrounding objects. For that reason, we propose a semantic point cloud-based adaptive MODT system for autonomous driving. Semantic point clouds have emerged with advances in deep learning-based Point Cloud Semantic Segmentation (PCSS), which assigns semantic information to each point of the LiDAR point cloud. This semantic information provides several advantages to the MODT system. First, points belonging to static objects can be filtered out: because the class information assigned to each point can be used directly, filtering is possible without any geometric modeling. Second, the class of an object can be inferred without a separate classification process, because the class information is provided by the semantic point cloud. Finally, the clustering and tracking modules can account for the unique dimensional and dynamic characteristics of each class. We verify our method against several existing algorithms using the Carla simulator and the KITTI dataset. In conclusion, the performance of the proposed algorithm is improved by about 176% on average compared to the existing algorithm.


I. INTRODUCTION
The Point Cloud Semantic Segmentation (PCSS) research field is growing rapidly with the development of deep learning technology. A PCSS network provides a Semantic Point Cloud (SPC), in which semantic information is assigned to the point cloud obtained from the LiDAR (Light Detection And Ranging) sensor. Semantic information includes the classification of points needed in autonomous driving, such as vehicle, pedestrian, cyclist, building, or terrain (Fig. 1). By providing three-dimensional and semantic information at once, the semantic point cloud helps understand the whole scene around the autonomous vehicle.
PCSS can be divided into offline PCSS and online PCSS according to inference time. Offline PCSS has higher accuracy than online PCSS, but its long inference time is a disadvantage. For that reason, offline PCSS is mainly used in applications that do not require real-time operation, such as map construction. Online PCSS performs semantic segmentation of the point cloud at a real-time level. Due to this real-time capability, online PCSS can be applied to various autonomous driving tasks such as object detection, SLAM, and trajectory planning. According to the SemanticKITTI benchmark, one of the popular benchmarks for PCSS, the 2D CNN-based online network with the highest performance has an mIoU of 59.9%, as shown in Table 1. The table shows that progress is active enough to produce reliable performance within a short period. Due to this real-time capability and reliable performance, PCSS can be applied to many applications that require perceiving the surrounding environment, such as object detection, recognition, and localization. This paper will focus on applying online PCSS to Multiple Object Detection and Tracking (MODT).

Multiple Object Detection and Tracking (MODT) is the task of detecting and tracking multiple objects around the ego-vehicle. Through MODT, the states of objects, such as position, attitude, and shape, can be estimated. The estimated object state is directly associated with autonomous driving safety, since this information helps predict future trajectories and generate collision-free maneuvers. Various perception sensors are used for MODT, such as cameras, LiDAR, and radar. Among them, LiDAR provides accurate 3D depth and geometric information in the form of point cloud data. Since LiDAR calculates range information using Time of Flight (ToF), measured as the time for the reflected light to return to the receiver, it is robust to illumination changes. Due to these advantages, LiDAR is emerging as an essential sensor for MODT.
There are three steps in performing LiDAR-based MODT: pre-processing, clustering, and tracking. In the first, pre-processing, step, the raw point cloud data is refined with various pre-processing algorithms, such as the ROI extraction filter and the ground removal filter. The ROI extraction filter extracts the points included in the region of interest. The ground removal filter removes ground points, since the ground does not need to be detected in most cases. These filters reduce computing resource loss by eliminating unnecessary points. However, since they use only geometric shape information about the target object, points not matching predefined geometric shapes are difficult to filter out. This problem can be solved if semantic information for each point, such as road or drivable area, is available.
The second step is clustering, which objectifies the point cloud data to provide detection information. Detection information includes the location and dimensions of surrounding objects. For clustering point clouds, there are various algorithms such as k-means clustering and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The k-means clustering algorithm requires a preset number of clusters; the DBSCAN algorithm, on the other hand, does not. For that reason, the DBSCAN algorithm is more suitable for the dynamic environment of autonomous driving. DBSCAN defines a set of points as a cluster if the number of points within a specific radius is greater than a minimum number of points. Therefore, the performance of DBSCAN depends on predefined parameters, namely the radius and the minimum number of points. However, clustering performance deteriorates when several objects of various classes are close together or overlap, because the suitable parameters differ according to the dimension characteristics of each class. This problem can be overcome using the semantic information of the semantic point cloud.
The last step is tracking, which estimates the dynamic states of objects based on the clustering information. The tracking step can be subdivided into three steps: data association, track management, and dynamic filtering. In the data association step, a track is associated with a measurement for updating the track state. Popular algorithms include GNN (Global Nearest Neighbor) and JPDA (Joint Probabilistic Data Association). The GNN algorithm associates a track with the nearest measurement. The JPDA algorithm is a statistical approach that is more robust to clutter than the GNN algorithm, since it is based on expected values. However, since both algorithms depend only on distance or uncertainty information, without class information, wrong associations between different classes can occur. This wrong-association problem can be overcome with the class information in the semantic point cloud.
The track management step manages tracks by initializing, deleting, and updating them. If a measurement is not associated with any track in the data association step, a new track is created and initialized with this measurement.
On the other hand, if a track is not associated with a measurement for more than a few consecutive updates, the track is deleted. The measurement associated with the track in the data association step is used to update the track state in the dynamic filtering step. In the dynamic filtering step, many kinds of dynamic filters, such as the Kalman Filter (KF), Extended Kalman Filter (EKF), Unscented Kalman Filter (UKF), and Interacting Multiple Model (IMM) filter, are utilized to update the state of the track. Among them, the IMM filter can consider multiple models at once based on probability and provides better tracking performance during model transitions. These dynamic filters have two steps: the prediction step and the measurement update step. In the measurement update step, the state is updated with the measurement data associated in the data association step. The prediction step predicts the state based on a motion model, which can be expressed as a mathematical equation. Existing algorithms depend on a preset motion model, which does not consider the unique dynamic characteristics of surrounding objects. However, a class-adaptive motion model can be adopted in the dynamic filtering step if class information is available.
LiDAR-based MODT can benefit from using the semantic information of semantic point clouds. First, in the pre-processing step, the points needed for MODT are extracted directly using the semantic information assigned to each point. Semantic information-based pre-processing makes it possible to extract points that cannot be modeled with geometric equations. Second, class-adaptive clustering can be achieved. When the clustering module uses semantic information and multi-class objects are close together, they can be clearly separated by class and clustered. In addition, a parameter adapted to the mutual distance characteristics of each object class can be applied. Finally, class-adaptive tracking can also be achieved using the semantic point cloud. Using class information in addition to distance and uncertainty information during data association prevents associations with other classes. Also, a class-adaptive motion model can be set in consideration of the unique dynamic characteristics of the object through its class information.
This paper proposes a novel semantic point cloud-based adaptive MODT method for autonomous vehicles. The overall system consists of two parts. The first part is semantic information-based filtering. This step transforms a raw point cloud into a semantic point cloud through an online PCSS network. Using the converted semantic point cloud, the points of on-road objects are divided into three semantic groups: pedestrian, cyclist, and car. The second part is class-adaptive MODT. Class-adaptive clustering provides 3D object detection (position and dimensions of the object) for each semantic group by considering the unique geometric shape of each object class. Using this detection information, class-adaptive tracking provides track information (position and velocity) while considering the unique dynamic characteristics of each object class. To evaluate the proposed algorithm, the performance of existing LiDAR-based MODT and our proposed algorithm are compared using the Carla simulator and the KITTI dataset.
The main contribution of our paper is the proposition of a new framework that applies the semantic point cloud to LiDAR-based MODT. The contributions are summarized as follows:
• Semantic point cloud-based pre-processing becomes simplified. Points can be processed using only the semantic information of the semantic point cloud, with no need for geometric modeling.
• Class-adaptive clustering can be achieved. Nearby multi-class objects can be clustered clearly since points carry semantic information. Moreover, class-adaptive clustering parameters improve clustering performance by considering unique dimension characteristics.
• Class-adaptive tracking can be achieved. A class-adaptive motion model, which considers the unique dynamic characteristics of each object, can be applied to estimate the state of the object.
The rest of this paper is organized as follows. In section II, previous studies are introduced. Next, the system architecture is introduced in section III. In sections IV and V, the proposed algorithm is explained in detail. Section VI describes the evaluation results, and we conclude the paper in the final section, VII.

II. PREVIOUS STUDIES

A. POINT CLOUD SEMANTIC SEGMENTATION
Point Cloud Semantic Segmentation (PCSS) can be divided into offline PCSS and online PCSS according to inference time. In this paper, we utilize an online PCSS network to apply to online MODT. There are various online PCSS frameworks, which can be subdivided into four categories in terms of input point cloud representation. The 3D CNN-based model uses the 3D point cloud as input, without converting it into other formats [1], [2]. The graph-based model uses graph structures to represent the point cloud. The 3D CNN and graph-based models have relatively higher overall performance than other types of models, but because of their long inference time, they are not often used as online PCSS networks [3]. The point-wise MLP-based model utilizes a multi-layer perceptron (MLP) for extracting features. In particular, RandLA-Net, which offers fast inference with a light network architecture, is used for both online PCSS and offline PCSS [4]. The 2D CNN-based model projects the point cloud into a 2D domain, providing fast inference time. SqueezeSegV2 and RangeNet++ project the point cloud into spherical coordinates to apply a 2D CNN model [5], [9]. PolarNet uses a polar-grid data representation of the point cloud [6]. 3D-MiniNet learns a 2D representation from the 3D point cloud [7]. SalsaNext utilizes a context module as an encoder, which replaces the ResNet encoder blocks, and a pixel-shuffle layer in the decoder. This model has the highest performance among 2D projection-based methods while also ensuring real-time inference [8]. In this paper, RangeNet++ and SalsaNext are utilized as the online PCSS models, since both have reliable accuracy and fast inference time.
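The 2D CNN-based methods above all rely on mapping each 3D point into a 2D range image via spherical coordinates. A minimal sketch of that projection is shown below; the field-of-view limits and image size are assumptions loosely matching a 64-beam sensor, not values from any particular network.

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud onto an H x W range image.

    fov_up / fov_down are the vertical field-of-view limits in degrees
    (assumed values, roughly matching a 64-beam LiDAR).
    """
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = fov_up_rad - fov_down_rad

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)            # range of each point
    yaw = np.arctan2(y, x)                        # azimuth angle
    pitch = np.arcsin(np.clip(z / r, -1.0, 1.0))  # elevation angle

    # Normalize the angles to [0, 1], then scale to pixel coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * W
    v = (1.0 - (pitch - fov_down_rad) / fov) * H
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v, u] = r
    return range_image, u, v
```

The resulting 2D image (with range, intensity, and coordinates as channels) is what the 2D CNN consumes; the per-pixel labels are then projected back onto the 3D points.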

B. MULTIPLE OBJECT DETECTION AND TRACKING

1) Object Detection
There are many algorithms for object detection, especially clustering. Clustering algorithms can be divided into four categories: partitioning-based, model-based, density-based, and ML-based. A partitioning-based method such as the k-means clustering algorithm constructs k partitions and then evaluates them by minimizing an error. Since the number of clusters must be specified in advance, this method is not suitable for dynamic situations such as autonomous driving. The model-based method is based on a probabilistic model that considers uncertainty. However, it has disadvantages such as computational complexity and clustering results that depend on the model. The density-based method clusters areas of higher density than the remainder of the data. A popular density-based algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). This algorithm clusters points using a radius and a minimum-number-of-points parameter: if the number of points within the radius is more than the minimum number of points, the points within the radius are considered the same cluster. This algorithm can be used easily with a point cloud and provides a polygonal cluster that preserves the shape of the object. However, the same parameters are applied to every object regardless of class, so class-specific characteristics cannot be considered. The last category is the ML-based method [13]. Most ML-based approaches provide not only position and shape information but also class information in the form of 3D bounding boxes. However, most 3D object detection datasets provide label information in the form of a 3D box, which loses the detailed shape of the object when it is approximated to the 3D box format. In addition, it is hard to consider static objects around the ego-vehicle, since most datasets do not provide label information for static objects. For that reason, our algorithm is built upon the base DBSCAN algorithm and utilizes the semantic point cloud to maintain detail and consider the overall environment.

2) Object Tracking
Tracking algorithms can be divided into end-to-end ML-based approaches and dynamic filtering-based approaches.
End-to-end ML-based approaches are developing fast and have reliable performance; however, they still have limitations [14]-[17]. They require a dataset for training the model and time to train it. If an object is not labeled, it is difficult to detect or track. In addition, the physical properties of an object are hard to consider directly with this type of method. For that reason, dynamic filtering-based tracking, which can directly consider physical movement characteristics, is still actively used in various situations. Kalman filter-based tracking algorithms can provide an optimal solution when the motion can be modeled as a linear function [18]-[20]. The Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF) can estimate nonlinear motion models; however, multiple motion models are hard to consider [21]-[23]. In contrast to the methods above, the Interacting Multiple Model (IMM) filter-based tracking method can consider multiple motion models simultaneously in a Bayesian framework [24], [25]. Most dynamic filtering-based approaches lack class information about objects. This makes the tracking problem more challenging, due to the difficulty of adapting to the dynamic characteristics of objects without class information. However, direct semantic information from the semantic point cloud can provide a solution for this type of method.

III. SYSTEM ARCHITECTURE
The overall system is composed of two parts, as shown in Fig. 2. The first part is semantic information-based pre-processing. This step can be subdivided into the online PCSS network (a) and semantic information-based pre-processing (b). Using the online PCSS network (a), the raw point cloud input is converted into a semantic point cloud. RangeNet++ and SalsaNext, which have a good balance of accuracy and real-time characteristics, are applied as the online PCSS. Through the semantic information-based pre-processing (b), on-road object points, including pedestrians, cars, and cyclists, are extracted. The second part is class-adaptive MODT (Multiple Object Detection and Tracking). This step can be subdivided into two parts: class-adaptive clustering (c) and class-adaptive tracking (d). As can be seen in Fig. 2(c), independent class-adaptive clustering modules are executed with each of the three point clouds as input: pedestrian SPC, cyclist SPC, and car SPC. Each class-adaptive clustering module comprises a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. The search radius and minimum-number-of-points parameters in the DBSCAN algorithm are adapted according to the classification of the input semantic point clouds. After that, clusters for each class are obtained as output: pedestrian clusters, cyclist clusters, and car clusters. Using the clusters for each class, independent class-adaptive tracking modules are performed (Fig. 2(d)). The class-adaptive tracking module is based on the IMM-UKF-JPDAF tracker, which considers the unique dynamic characteristics of objects using semantic information. We finally obtain class-adaptive MODT results in the form of bounding boxes for each class, including track ID, position, velocity, heading, and dimension.

IV. ONLINE POINT CLOUD SEMANTIC SEGMENTATION

A. ONLINE PCSS NETWORK
In this step, the raw point cloud is converted into a semantic point cloud through a deep learning network. A PCSS network for on-road MODT must satisfy the following conditions. First, real-time capability must be guaranteed. Second, segmentation performance for on-road objects must be reliable. In this paper, RangeNet++ and SalsaNext are adopted as PCSS networks that satisfy the above conditions.

RangeNet++ (Fig. 3) projects the point cloud into 2D spherical coordinates and applies a 2D CNN module [9]. The segmented results are reconstructed into a 3D point cloud, and a KNN (k-nearest neighbor) search is used as post-processing. With these processes, RangeNet++ achieves 52.2% mIoU with 12 fps inference time, as shown in Table 1. RangeNet++ has especially reliable performance for cars and roads, but low performance for pedestrians and motorcyclists.

SalsaNext (Fig. 4) is developed on the base of SalsaNet. SalsaNext utilizes a contextual module that gathers global context information through larger receptive fields. A pixel-shuffle layer is applied to improve computational efficiency, and central encoder-decoder dropout enables higher network performance. Due to these advantages, SalsaNext has the highest performance among 2D projection-based methods, as shown in Table 1. Additionally, SalsaNext has reliable segmentation performance for pedestrians and cyclists as well as cars.
Consequently, in this step, the point cloud is converted into a semantic point cloud by utilizing the above online PCSS networks. The semantic point cloud includes the labels of various static and dynamic objects in the form of RGB information. For example, in Fig. 1, the car is represented in blue, the pedestrian in red, the motorcycle in deep blue, and the motorcyclist in claret. Ultimately, this semantic information makes it possible to provide abundant environmental information about the surrounding objects to MODT.

B. SEMANTIC INFORMATION-BASED PRE-PROCESSING
The pre-processing step is the essential process of extracting the points needed for MODT. Without a pre-processing step, unnecessary computation is performed, wasting computing resources and decreasing system accuracy. In this step, the semantic point cloud resulting from PCSS is pre-processed to extract on-road object points. Since the semantic point cloud contains semantic information as RGB values, the points of on-road objects can be extracted using this RGB information. In this paper, we consider three on-road object classes: cars, cyclists (including motorcycles and motorcyclists), and pedestrians. The extracted points are divided into independent point clouds by class. As a result, three groups of point clouds corresponding to cars, cyclists, and pedestrians are obtained.
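The extraction above reduces to a per-point label mask. A minimal sketch follows, using integer class IDs instead of RGB values for brevity; the specific ID numbers are hypothetical placeholders, since the actual IDs depend on the label map of the PCSS network used.

```python
import numpy as np

# Hypothetical label IDs for illustration only; real values come from the
# PCSS network's label map (e.g., the SemanticKITTI label set).
CAR, MOTORCYCLE, MOTORCYCLIST, PEDESTRIAN = 10, 15, 32, 30
ON_ROAD_CLASSES = {
    "car": (CAR,),
    "cyclist": (MOTORCYCLE, MOTORCYCLIST),  # grouped as in the text
    "pedestrian": (PEDESTRIAN,),
}

def split_semantic_point_cloud(points, labels):
    """Split an (N, 3) semantic point cloud into per-class point clouds.

    points: (N, 3) xyz array; labels: (N,) per-point class IDs.
    Returns a dict mapping class name -> (M, 3) point array. Points of
    static classes simply never appear in any group, i.e. they are filtered
    out without any geometric modeling.
    """
    return {name: points[np.isin(labels, ids)]
            for name, ids in ON_ROAD_CLASSES.items()}
```

Each returned group is then fed to its own class-adaptive clustering module.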
There are advantages to semantic information-based pre-processing compared with existing algorithms. For example, a small sculpture and a child can be treated as the same kind of object if only dimension characteristics are considered. However, since a small sculpture is a static object, it does not need to be tracked; instead, it wastes computing resources and lowers tracking accuracy when it is close to a child. Semantic information is the key to solving these problems. First, semantic information-based pre-processing can extract points of any class, including those that cannot be modeled with geometric equations. Second, the overall computing resources of the system can be reduced by extracting only the points necessary for MODT.

V. CLASS-ADAPTIVE MODT

A. CLASS-ADAPTIVE CLUSTERING
The class-adaptive clustering step consists of independent class-based clustering modules for cars, cyclists, and pedestrians, as shown in Fig. 2(c). Each class-adaptive clustering module is based on the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. The DBSCAN algorithm is a density-based clustering algorithm, which clusters the high-density parts of the data (Fig. 5(a)). DBSCAN needs to be tuned with two parameters: minPts and ε. MinPts is the minimum number of points, and ε is the search radius around the core point. If there are more than minPts points within radius ε of the core point, they are recognized as a cluster. MinPts is related to the noise characteristics of the LiDAR: the noisier the point cloud provided by the LiDAR, the larger minPts must be. Similarly, if the resolution is high, a larger minPts value is set. The ε value needs to be set considering the mutual distance properties of objects. Depending on their class, objects have different minimum mutual distances; for example, large objects generally have large mutual distances. However, considering class-based characteristics is challenging, since the raw point cloud cannot provide the class of each point.
In the previous step, the car, cyclist, and pedestrian point clouds were obtained. Using these class-independent point clouds, class-adaptive clustering can be conducted by utilizing the class information (Fig. 5(b)). The class-adaptive parameters can be constructed based on the class information. There are two components to consider when constructing the DBSCAN parameters: the mutual distance (distance between objects) according to the dimension characteristics, and the LiDAR resolution characteristics. MinPts is set to the same value for every class, since it is related to the noise characteristics of the LiDAR. However, ε must be tuned considering the mutual distance and the LiDAR resolution characteristics: ε has to be larger than the horizontal and vertical resolution values and smaller than the minimum mutual distance. In the case of a car, the minimum mutual distance between objects is assumed to occur when stopping in front of a traffic light. For that reason, ε_car is set to 1.0 m, considering the average mutual distance of cars in that case. Similarly, the cyclist case also uses the mutual distance when stopping in front of a traffic light: ε_cyc is set to 0.7 m. In the pedestrian case, the stride length of a person is used as the criterion; assuming an average pedestrian height of 160 cm, ε_ped can be set to 0.6 m. Finally, clustering results for cars, pedestrians, and cyclists are obtained with these class-adaptive parameters.
There are several advantages to class-adaptive DBSCAN-based clustering. First, since each class module receives an independent class point cloud as input, there is no influence between points with different classifications. Therefore, different-class objects are not treated as the same object when they are close together or overlap. Second, clustering performance can be improved by applying parameters suited to the mutual distance characteristics of each object class.
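The per-class clustering can be sketched as below with a toy DBSCAN implementation. The ε values follow the text (1.0 m car, 0.7 m cyclist, 0.6 m pedestrian); the shared minPts value is an assumed placeholder that would be tuned to the sensor's noise characteristics.

```python
import numpy as np

CLASS_EPS = {"car": 1.0, "cyclist": 0.7, "pedestrian": 0.6}  # eps per class (m)
MIN_PTS = 4  # assumed value; shared across classes (LiDAR noise, not class)

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on an (N, d) array; returns labels (-1 = noise)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:  # grow the cluster through density-reachable points
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

def cluster_by_class(groups):
    """Run DBSCAN independently per semantic group with its own eps."""
    return {name: dbscan(pts, CLASS_EPS[name], MIN_PTS)
            for name, pts in groups.items() if len(pts) > 0}
```

Because each semantic group is clustered in isolation, a pedestrian point can never be merged into a neighboring car cluster regardless of how close the objects are.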

B. CLASS-ADAPTIVE TRACKING
Class-adaptive tracking is conducted with the results of the class-adaptive clustering, which contain the position and dimension information of the objects. This step consists of three independent modules: the car-adaptive, pedestrian-adaptive, and cyclist-adaptive tracking modules, as shown in Fig. 2(d). Each module is based on the Interacting Multiple Model-Unscented Kalman Filter-Joint Probabilistic Data Association Filter (IMM-UKF-JPDAF) tracking algorithm, which combines the IMM-UKF dynamic filter and the JPDA data association algorithm. The tracks are initialized, updated, or deleted through track management.

1) Data Association Step With JPDA Algorithm
A track is associated with a measurement to update the track state in the data association step. In this step, the Joint Probabilistic Data Association (JPDA) algorithm is utilized for data association. The JPDA algorithm calculates marginal probabilities for track updates by enumerating all possible joint events. Since the JPDA algorithm considers all feasible joint events, it provides reliable performance even in the presence of clutter, as in a complex urban environment. However, conventional data association algorithms, including JPDA, can hardly consider the class of a measurement due to the absence of class information. As shown in Fig. 6(a), the absence of class information can cause a track to be associated with measurements of other classes. In contrast, this problem is avoided in our method because the measurement information obtained from the semantic point cloud is divided by class and goes through the data association step independently, as shown in Fig. 6(b).
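To illustrate why per-class association prevents cross-class pairings, the sketch below runs a simplified gated nearest-neighbor association independently per class. This is a deliberately simple stand-in for JPDA (which additionally weights all feasible joint events); the gate value is an assumption.

```python
import numpy as np

def associate_nearest(tracks, measurements, gate=2.0):
    """Greedy nearest-neighbor association within a distance gate.

    tracks, measurements: (T, 2) and (M, 2) position arrays.
    Returns a list of (track_index, measurement_index) pairs.
    """
    pairs, used = [], set()
    for ti, t in enumerate(tracks):
        if len(measurements) == 0:
            break
        d = np.linalg.norm(measurements - t, axis=1)
        d[list(used)] = np.inf  # mask measurements already claimed
        mi = int(np.argmin(d))
        if np.isfinite(d[mi]) and d[mi] <= gate:
            pairs.append((ti, mi))
            used.add(mi)
    return pairs

def associate_by_class(tracks_by_class, meas_by_class, gate=2.0):
    """Run the association independently per class, so a pedestrian track
    can never be matched to a car measurement, however close it is."""
    empty = np.empty((0, 2))
    return {c: associate_nearest(trk, meas_by_class.get(c, empty), gate)
            for c, trk in tracks_by_class.items()}
```

In a pooled (class-blind) association, a car track could claim a nearby pedestrian measurement; splitting the measurement sets by class removes that failure mode by construction.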

2) IMM-UKF Dynamic Filtering
In the IMM-UKF dynamic filter, the IMM filter estimates an accurate and stable object state by selecting the model best suited to the object's movement from among multiple motion models. Each model consists of an independent filter, and three motion models are selected, based on the UKF so that nonlinear motion models can be considered: the constant velocity (CV), constant turn rate and velocity (CTRV), and random motion (RM) models. The state equation and measurement equation for model j (m_j) at sampling time k are as follows:

x_{j,k} = f_j(x_{j,k-1}, u_k) + w_{j,k}
z_k = h_j(x_{j,k}) + v_{j,k}

where f_j represents the system function of the target motion m_j to estimate, and h_j represents the measurement function. u_k is the input vector, and z_k is the measurement vector. w_{j,k} and v_{j,k} denote the process noise and observation noise, respectively, assumed to be zero-mean white Gaussian, with covariance matrices Q and R. The IMM filter integrates the estimated results of multiple UKF dynamic filters over the system model set M = (m_1, m_2, ..., m_n). The transition among multiple models in the IMM algorithm is controlled by a time-invariant transition matrix, which conforms to a Markov chain, where the matrix component π_ji is the mode transition probability from model j to model i:

π_ji = P(m_k = m_i | m_{k-1} = m_j)

Using the above transition probabilities, the mixing probabilities μ_{j|i,k} for each model are calculated as:

μ_{j|i,k-1} = π_ji μ_{j,k-1} / Σ_l π_li μ_{l,k-1}

Applying the mixing probabilities, the mixed estimates {x̂^{0i}_{k-1|k-1}, Σ^{0i}_{k-1|k-1}} can be calculated as:

x̂^{0i}_{k-1|k-1} = Σ_j μ_{j|i,k-1} x̂^{j}_{k-1|k-1}
Σ^{0i}_{k-1|k-1} = Σ_j μ_{j|i,k-1} (Σ^{j}_{k-1|k-1} + A A^T)

where A represents x̂^{j}_{k-1|k-1} − x̂^{0i}_{k-1|k-1}. The unscented Kalman filter predicts and updates the estimates and covariances for the i-th model with a nonlinear stochastic model. Consequently, the updated estimates x̂^{i}_{k|k} and covariances Σ^{i}_{k|k} are obtained. Finally, the overall estimate and covariance are given by:

x̂_{k|k} = Σ_i μ_{i,k} x̂^{i}_{k|k}
Σ_{k|k} = Σ_i μ_{i,k} (Σ^{i}_{k|k} + (x̂^{i}_{k|k} − x̂_{k|k})(x̂^{i}_{k|k} − x̂_{k|k})^T)
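The mixing and combination steps can be sketched directly from these equations. This is a generic IMM skeleton with the per-model UKF predict/update omitted; state dimension and model count are arbitrary.

```python
import numpy as np

def imm_mix(x, P, mu, Pi):
    """IMM mixing step: compute mixed initial estimates for each model.

    x:  (n, d)    per-model state estimates x^j_{k-1|k-1}
    P:  (n, d, d) per-model covariances
    mu: (n,)      model probabilities mu_{j,k-1}
    Pi: (n, n)    transition matrix, Pi[j, i] = pi_ji
    Returns mixed states x0 (n, d), covariances P0 (n, d, d),
    and the normalizers c (n,).
    """
    c = Pi.T @ mu                    # c_i = sum_j pi_ji mu_j
    mu_mix = (Pi * mu[:, None]) / c  # mu_{j|i} = pi_ji mu_j / c_i
    x0 = np.einsum("ji,jd->id", mu_mix, x)
    P0 = np.zeros_like(P)
    for i in range(len(mu)):
        for j in range(len(mu)):
            diff = (x[j] - x0[i])[:, None]  # A in the text
            P0[i] += mu_mix[j, i] * (P[j] + diff @ diff.T)
    return x0, P0, c

def imm_combine(x, P, mu):
    """Combine per-model posteriors into the overall IMM estimate."""
    xc = mu @ x
    Pc = np.zeros_like(P[0])
    for j in range(len(mu)):
        diff = (x[j] - xc)[:, None]
        Pc += mu[j] * (P[j] + diff @ diff.T)
    return xc, Pc
```

Between `imm_mix` and `imm_combine`, each model would run its own UKF predict/update on the mixed estimate and the model probabilities would be refreshed from the measurement likelihoods.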

3) Class-adaptive IMM-UKF Dynamic Filtering
Based on the class information of the detected object, three tracking modules are designed to consider the dynamic characteristics of each class. When designing each tracking module, two parameters need to be considered: the Q matrix and the transition matrix Π. The Q matrix is related to the unique dynamic properties of the object, especially its acceleration characteristics; for that reason, it must be tuned according to the class of the object. With the state

x_k = [x, y, v, θ]^T

where v is the velocity and θ is the yaw angle, the process noise covariance matrix Q is expressed as:

Q = σ² Γ_k Γ_k^T    (10)

where, following the standard piecewise-constant acceleration noise model, Γ_k = [½T² cos θ, ½T² sin θ, T, 0]^T maps a zero-mean acceleration disturbance into the state over the sampling interval T (11). In this equation, σ can be tuned per class considering the object's unique acceleration characteristics. Generally, the average acceleration of a car in an urban environment is between 1 and 1.5 m/s²; the average acceleration of a cyclist is 1.8 m/s², and that of a pedestrian is up to 0.5 m/s². By applying the acceleration characteristics of each class, the σ parameter in Eq. 10 constituting the Q matrix is set as follows.
σ_car = 1.5, σ_cyclist = 1.8, σ_pedestrian = 0.5    (12)

The transition matrix also needs to be set considering the class information, because the dynamic characteristics differ by class. However, since existing algorithms struggle to obtain the class of an object, it is challenging for them to apply a class-adaptive transition matrix. In contrast, since our method can get the class of the object, a class-adaptive transition matrix can be used. Our approach applies three motion models in the IMM-UKF dynamic filter for estimating on-road objects: the constant velocity model, the constant turn rate and velocity model, and the random motion model. Based on these models, the transition matrix is set as shown in Fig. 7; the entry in the first row, second column, for example, represents the probability of a transition from the CV model to the CTRV model. To consider the model transition characteristics of each class, the transition matrices for the car, cyclist, and pedestrian are set such that Π_car and Π_cyclist account for the non-holonomic constraints of the car and cyclist, with their third-row and third-column values set to zero. The class-adaptive tracking module offers various advantages in LiDAR-based MODT systems. First, class information provides reliable data association without disturbance from clusters with different class information. Second, the unique dynamic characteristics of each class can be considered: the dynamic filtering stage has parameters that need to be adjusted to account for the different dynamics, and these parameters can be modeled using the class information. As a result, class-adaptive tracking can be performed based on the semantic point cloud.
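A sketch of how these class-adaptive parameters might be organized in code is shown below. The σ values follow Eq. 12; the transition probabilities are illustrative assumptions, not values from the paper. Only the structure is taken from the text: for car and cyclist, the off-diagonal entries of the third (random motion) row and column are zero, so the non-holonomic classes never enter or leave the random-motion model.

```python
import numpy as np

# Per-class acceleration noise standard deviation (m/s^2), from Eq. 12.
SIGMA = {"car": 1.5, "cyclist": 1.8, "pedestrian": 0.5}

# Model order: CV, CTRV, RM. Probability values are assumptions.
TRANSITION = {
    "car": np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.0],
                     [0.0, 0.0, 1.0]]),
    "cyclist": np.array([[0.85, 0.15, 0.0],
                         [0.15, 0.85, 0.0],
                         [0.00, 0.00, 1.0]]),
    "pedestrian": np.array([[0.8, 0.1, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.1, 0.8]]),
}

def tracker_params(cls):
    """Return the class-adaptive (sigma, transition matrix) pair."""
    return SIGMA[cls], TRANSITION[cls]
```

Each per-class tracking module would then pass its `sigma` into the Q matrix of Eq. 10 and its transition matrix into the IMM mixing step.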

VI. VERIFICATION
To verify our method, tracking performance results from MODT are derived using the KITTI dataset and the Carla simulator. Since the KITTI dataset provides non-synthetic data, tracking performance can be verified in real-world scenarios. However, the KITTI raw dataset used in this paper does not provide ground-truth-labeled semantic point clouds, so the segmentation performance of the PCSS network can affect the overall tracking performance. For this reason, the Carla simulator is also used to evaluate our method. The Carla simulator provides a ground-truth-labeled semantic point cloud, so performance can be verified without being affected by PCSS performance.
This paper validates the class-adaptive modules by comparing the performance of our method with a classical base algorithm that does not use SPC. In addition, a Kalman filter-based object tracking algorithm is also compared, to show the appropriateness of the base algorithm chosen for multi-class object tracking. Three algorithms are compared with the proposed method: the IMM-UKF-JPDA algorithm, the KF-GNN algorithm, and the KF-GNN algorithm with SPC. Since two PCSS models are used in the KITTI dataset verification, two comparison groups are generated for each of the SPC-based algorithms (the proposed algorithm and the KF-GNN algorithm with SPC). These algorithms are evaluated using the evaluation tool from the KITTI dataset, which is based on the CLEAR MOT metrics, among the most widely used tracking metrics. However, since the tool was developed to evaluate 2D MOT, its 2D IoU is modified to 3D IoU to handle 3D MOT. Finally, using the Carla simulator and the KITTI dataset, we verify four algorithms, including our method.
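The 2D-to-3D IoU modification amounts to multiplying the ground-plane overlap by a vertical overlap term. The sketch below uses axis-aligned cuboids for simplicity; the actual evaluation handles rotated boxes, so this is an illustration of the idea, not the tool's implementation.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned cuboids given as (x1, y1, z1, x2, y2, z2).
    Simplified sketch: the KITTI tool handles rotated boxes, but the
    2D -> 3D extension is the same idea -- intersect along all three
    axes and divide by the union volume."""
    dx = min(a[3], b[3]) - max(a[0], b[0])
    dy = min(a[4], b[4]) - max(a[1], b[1])
    dz = min(a[5], b[5]) - max(a[2], b[2])
    if dx <= 0 or dy <= 0 or dz <= 0:
        return 0.0  # no overlap on some axis
    inter = dx * dy * dz
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (vol(a) + vol(b) - inter)
```

A track-measurement pair is counted as a match when this value exceeds the 0.25 threshold used in the paper.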

A. EVALUATION METRIC
To evaluate our method, the CLEAR MOT metrics are adopted. There are two representative metrics: Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP). MOTA indicates the overall tracking performance, and MOTP indicates localization precision. These are computed as:

MOTA = 1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t,

where m_t is the number of misses, fp_t the number of false positives, mme_t the number of mismatches, and g_t the number of ground-truth objects at time t, and

MOTP = Σ_{i,t} d_t^i / Σ_t c_t,

where Σ_{i,t} d_t^i is the total distance score over all matched track-measurement pairs and Σ_t c_t is the total number of matches made. In this paper, d_t^i is the 3D IoU of the tracked object, and the overlap between 3D cuboids is measured with a 0.25 IoU threshold. Furthermore, Mostly Tracked objects (MT), Mostly Lost objects (ML), and the total number of Identity Switches (IDS) are used for evaluating tracking performance. MT and ML are metrics of tracking quality: what percentage of each ground-truth track is tracked over its entire life span. The MT threshold is set to 80% and the ML threshold to 20%. IDS is the number of times a track's ID changes; this metric shows how consistently each track is maintained.
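The two formulas above can be written directly as code. This is a minimal sketch over per-frame tallies; the field names are illustrative assumptions, and since d is 3D IoU here, MOTP is an average overlap (higher is better).

```python
def clear_mot(frames):
    """Compute MOTA and MOTP from per-frame tallies. Each frame is a
    dict with misses m, false positives fp, mismatches mme, ground-truth
    count g, and (total_overlap, matches) for MOTP. Field names are
    illustrative assumptions."""
    m   = sum(f["m"]   for f in frames)
    fp  = sum(f["fp"]  for f in frames)
    mme = sum(f["mme"] for f in frames)
    g   = sum(f["g"]   for f in frames)
    mota = 1.0 - (m + fp + mme) / g

    total_d = sum(f["total_overlap"] for f in frames)  # sum of d_t^i
    c       = sum(f["matches"]       for f in frames)  # sum of c_t
    motp = total_d / c
    return mota, motp
```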

B. VERIFICATION WITH CARLA SIMULATOR 1) Environmental Setup
The Carla simulator provides a synthetic environment for autonomous driving simulation, including various vehicle models, buildings, pedestrians, street signs, etc. Moreover, it allows flexible sensor-suite setups and the generation of realistic urban environments. To verify our method, the LiDAR sensor is mounted on top of the ego vehicle's roof and configured with the same specifications as a Velodyne 32-channel LiDAR. The Carla simulator provides a ground-truth-labeled semantic point cloud, which is used to verify the system as shown in Fig. 8. Since a cyclist is labeled as a car plus a pedestrian (Fig. 9(a)), the point cloud corresponding to the cyclist is relabeled. As shown in Fig. 9(b), the cyclist point cloud can be distinguished by checking whether car and pedestrian point clouds exist in close proximity. Surrounding object information obtained from the Carla simulator, including position in ego-vehicle coordinates, dimensions, class, and unique ID, is used to generate the ground-truth data. However, since vehicles and cyclists are provided in the same category, the tracklet information is regenerated by additionally using the dimension information of each object. The Town 10 map in the Carla simulator is used, as it is the environment most similar to the real world. About 100 surrounding on-road objects of each class (cars, cyclists, and pedestrians) are spawned, and the whole scenario is about 50 seconds long. To verify tracking performance in this setup, the evaluation targets objects within 50 m of the ego vehicle.
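The proximity-based cyclist relabeling can be sketched as follows. The radius value and the label strings are illustrative assumptions; only the idea of relabeling car- and pedestrian-labeled points that lie close together follows the text.

```python
import numpy as np

def relabel_cyclists(points, labels, radius=1.0):
    """Relabel as 'cyclist' any car-labeled point that has a
    pedestrian-labeled point within `radius` metres, together with
    those nearby pedestrian points. `radius` and the label names are
    illustrative assumptions."""
    labels = labels.copy()
    veh = np.where(labels == "car")[0]
    ped = np.where(labels == "pedestrian")[0]
    for i in veh:
        # distance from this vehicle point to every pedestrian point
        d = np.linalg.norm(points[ped] - points[i], axis=1)
        near = ped[d < radius]
        if near.size:
            labels[i] = "cyclist"
            labels[near] = "cyclist"
    return labels
```

A production version would operate on clusters rather than individual points, but the criterion is the same.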

2) Verification Results
As shown in Tables 2-4, our method achieves the best performance on all metrics in all classes. These results are averaged over five runs of the same scenario. In particular, there are large improvements over the IMM-UKF-JPDA algorithm selected as the base algorithm, demonstrating the effect of the class-adaptive modules. MOTA improved by 16.16% for cars, 16.87% for cyclists, and 226% for pedestrians; MOTP improved by 3.33% for cars, 3.73% for cyclists, and 14.95% for pedestrians. Pedestrians show the most significant improvement in tracking performance. As shown in Fig. 11(a), pedestrians are affected by nearby static objects, so the base algorithm shows relatively low performance. However, as shown in Fig. 11(b), by extracting only the target point cloud through semantic filtering, the problems of mis-clustering and mis-association with nearby static objects are solved. The smallest number of IDS in our method implies a reduction of mis-associations with objects of different classes and an improvement of tracking performance through the class-adaptive modules. The KF-GNN comparison also shows performance improvements on most metrics compared with the variant not using SPC. These results show that semantic point cloud-based class-adaptive MODT modules can improve the performance of existing LiDAR-based MODT algorithms. Fig. 10 shows the objects corresponding to the tracked cars and cyclists; the polygon inside each box represents the result of the clustering module, and the unique ID and classification information are marked on the labels of the tracked objects. This shows that the proposed method can classify objects and perform class-adaptive MODT based on the semantic point cloud. Furthermore, surrounding static objects that do not require tracking are not tracked.

C. VERIFICATION WITH KITTI DATASET 1) Environmental Setup
The KITTI dataset is used to validate our method in the real world. It contains various sensor data such as camera, GPS, and IMU, and notably includes a high-resolution LiDAR, the Velodyne HDL-64E. However, the KITTI tracking dataset provides 2D bounding boxes covering only the area visible to the camera in front of the ego vehicle. Since the ground-truth labels are provided on the 2D image plane, it is hard to evaluate tracking performance in 3D coordinates. For this reason, we use the KITTI raw dataset, which contains 3D bounding box tracking information. Still, ground-truth-labeled semantic point clouds are not provided by the KITTI dataset, so we use two PCSS models (RangeNet++ and SalsaNext) to verify tracking performance as a function of PCSS performance.

2) Verification Results
Tracking validation is performed using the KITTI raw dataset. However, if the PCSS network cannot segment the point cloud of the target object, that dataset cannot be evaluated; such datasets are therefore excluded from the evaluation. The KITTI raw datasets used are 5, 51, 59, 84, and 91 for cars; 14 and 91 for cyclists; and 1, 5, 13, 17, 18, 48, 57, and 59 for pedestrians. The result reported for each metric is the average over the datasets used. As shown in Tables 5-7, the proposed method is significantly improved compared to the base algorithm.
In particular, the improvement rate of the average MOTA is higher than in the Carla verification case. Because real-world sensor data is noisier than synthetic data, the base algorithms are vulnerable to ambient static objects and environmental noise. This can be addressed by extracting on-road object data through the semantic point cloud. Given these improvements, we find that the semantic point cloud-based class-adaptive MODT system is effective for existing LiDAR-based MODT. Two PCSS models are used in this verification: RangeNet++ and SalsaNext. As shown in Table 1, SalsaNext has higher segmentation performance for cars, cyclists, and pedestrians. Accordingly, our method with SalsaNext also shows better MOTA, MT, ML, and IDS than the RangeNet++-based method, as shown in Tables 5-7. In particular, since the segmentation performance of the two networks for pedestrians and cyclists differs significantly, the difference in tracking performance is also large. However, MOTP is lower than with the RangeNet++-based method: although SalsaNext's overall segmentation performance is high, its imprecise segmentation of object edges can degrade tracking precision. Nevertheless, MOTA is significantly improved compared to RangeNet++, indicating that the classification performance of the PCSS network affects the tracking performance of the proposed algorithm. Our algorithm sometimes has a slightly higher number of IDS or a lower MOTP than the base algorithm or the KF-GNN algorithm; however, those algorithms track a clearly smaller number of objects. Consequently, the proposed algorithm is verified to improve on existing LiDAR-based MODT algorithms.
In addition, the applicability of our system to real-time use is verified by measuring the execution time of the proposed method. The PCSS networks SalsaNext and RangeNet++ run at 20 fps and 10 fps, respectively. With these speeds, the overall system runs at about 12 fps with SalsaNext and about 7 fps with RangeNet++. Since the frame rate of the LiDAR is about 10 Hz, the RangeNet++-based system may be challenging to use in real-time applications. However, the SalsaNext-based method, running above 10 fps, is fast enough for real-time use. Accordingly, our method based on SalsaNext is verified to be applicable to real-time applications with reliable performance and speed.
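The relationship between PCSS speed and overall system speed follows from serial pipeline latency: total latency is the sum of the per-stage latencies. The sketch below reproduces the reported figures under the assumption of roughly 33 ms for the remaining clustering and tracking stages; that figure is an illustrative assumption, not a measurement from the paper.

```python
def pipeline_fps(pcss_fps, modt_latency_s):
    """End-to-end frame rate of a serial pipeline: the PCSS stage
    contributes 1/pcss_fps seconds per frame, and the remaining MODT
    stages contribute modt_latency_s (an assumed figure)."""
    return 1.0 / (1.0 / pcss_fps + modt_latency_s)

# 20 fps PCSS (SalsaNext) -> ~12 fps overall; 10 fps (RangeNet++) -> ~7.5 fps
salsanext_rate = pipeline_fps(20.0, 0.033)
rangenet_rate = pipeline_fps(10.0, 0.033)
```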
As shown in Fig. 12, the proposed method can detect and track objects corresponding to cars, pedestrians, and cyclists. Object classification information is also obtained, and since only target objects are tracked, static objects are not tracked. In addition, pedestrians are tracked without being affected by nearby static objects. However, there are some false positives: non-target objects that are tracked as target objects. As shown in Fig. 13, segmentation errors of the PCSS network lead to the creation of false-positive tracks.

VII. CONCLUSION
In this paper, a class-adaptive MODT framework based on a semantic point cloud was proposed. First, semantic information-based filtering was introduced to extract the target point cloud. Through this step, it was possible to filter out unmodeled objects by utilizing the semantic information assigned to each point. Second, class-adaptive clustering and tracking modules were designed taking into account the unique dimensional and dynamic characteristics of each class. In the class-adaptive clustering module, each class-based module was constructed by considering the mutual distance of each object. The class-adaptive tracking module was designed by setting the appropriate Q matrix and transition matrix in consideration of the dynamic characteristics. Finally, the entire system was validated using the Carla simulator, which provides a ground-truth-labeled SPC, and the KITTI dataset for real-world validation. As a result, there was a significant performance improvement compared with several existing algorithms. Moreover, to analyze the effect of PCSS network performance on this system, a performance comparison across PCSS networks was performed. To sum up:
• Semantic information-based filtering: through the semantic point cloud, the point clouds of unmodeled objects could be filtered simply. Points corresponding to a specific object could be extracted, which helps improve system accuracy and reduce computational resources in MODT.
• Semantic point cloud-based class-adaptive MODT: based on the semantic information, we designed class-adaptive clustering and tracking modules that consider mutual distance and acceleration characteristics. Consequently, the average accuracy improved by about 86.34% on the Carla simulator and 267% on the KITTI dataset.
Although the method proposed in this paper improves on the existing classical methods, it still could not record very high values in terms of accuracy or precision. In addition, since the system is highly dependent on PCSS network performance, a study considering the classification uncertainty of the PCSS network will be required. As future work, we plan to apply state-of-the-art online PCSS networks and various MODT algorithms in the proposed system to improve overall performance. Specifically, machine learning-based detection or tracking algorithms will be utilized to improve MODT performance. Additionally, by considering the uncertainty of the PCSS network, we will reduce the dependence on PCSS performance and develop a more robust MODT algorithm.