MD3D: Mixture-Density-Based 3D Object Detection in Point Clouds

The design factors of anchor boxes, such as shape, placement, and target assignment policy, greatly influence the performance and latency of 3D object detectors. Unlike image-based 2D anchors, 3D anchors must be placed in a 3D space and determined differently for each class of different sizes. This imposes a significant burden on the design complexity. To tackle this issue, various studies have been conducted on how to set the anchor form. However, for practical reasons, anchor-based methods select the anchor design by compromising between performance and latency. Consequently, only objects that are similar in shape and size to an anchor can obtain high accuracy. In this paper, we propose Mixture-Density-based 3D Object Detection (MD3D) in point clouds, which predicts the distribution of 3D bounding boxes using a Gaussian Mixture Model (GMM). With an anchor-free detection head, MD3D requires few hand-crafted design factors and eliminates the inefficiency of separating the regression channel for each class, thus offering both latency and memory benefits. MD3D is designed to utilize various types of feature encoding; therefore, it can be applied flexibly by replacing only the detection head of existing detectors. Experimental results on the KITTI and Waymo open datasets show that the proposed method outperforms its counterparts that are based on the conventional anchor-based detection head in overall performance, latency, and memory. The code is publicly available at https://github.com/sky77764/MD3D


I. INTRODUCTION
Industrial demand for autonomous vehicles and robotics technology has been increasing rapidly. With this rising demand, various sensors, such as monocular cameras, stereo cameras, light detection and ranging (LiDAR), and solid-state LiDAR, have been developed to capture the world's 3D spatial information as data. LiDAR enables more accurate and sophisticated distance measurements than other sensors; hence, it has been widely used for the development of 3D object detection methods.
Raw point clouds obtained by LiDAR sensors have a noisy, sparse representation with an imbalanced sampling problem, which causes many occluded surfaces to be without any points.
The associate editor coordinating the review of this manuscript and approving it for publication was J. Jun Cheng.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Existing 3D object detectors have mostly adopted anchor-based detection methods. In this study, we define an anchor, or anchor box, as one of a set of predefined 3D boxes for each object class, determined by the scale and aspect ratio of the class. Anchors are tiled across the scene (gray boxes in Fig. 1 (a)), and anchor-based detectors detect target objects by predicting offsets from nearby anchors for each class. However, anchor-based detectors suffer from a serious drawback in that they must predefine many anchor-related hyper-parameters: 1) anchor size, 2) anchor direction, 3) stride that determines anchor placement, and 4) target assignment policy. Each of these factors strongly influences the performance and latency of the detectors, and they must be defined separately for each class. Existing anchor-based methods usually set their default anchor size as the average of all ground truth (GT) bounding boxes with 0 and 90 degrees rotations. Unifying the anchor size as a representative value, i.e., the average size of objects in the training data, is a reasonable compromise in the trade-off between speed and performance. However, objects with significant deviations from the default anchor size inevitably tend to be overlooked. Another hyper-parameter that biases a detector is the stride between neighboring anchors. The spatial dimensions of the last feature map determine the anchor stride. For example, in the KITTI dataset [22], a typical compact detector has a 200 × 176 feature map; the anchor-to-anchor interval is about 0.4 m, which is relatively narrow for large objects such as Car. In contrast, for small objects, such as Pedestrian and Cyclist, it is too wide to cover all GTs.
Accordingly, the IoU value between an anchor and a GT varies significantly for each object class, and hence the foreground threshold is inevitably set differently for each class. In addition, calculating the IoU in a 3D space can result in extremely low IoU values, leading to multiple placements of anchors on the z-axis (vertical direction). To mitigate this problem, conventional methods ignore the z-axis by using the Bird's Eye View (BEV) 2D IoU. Fig. 1 shows in detail the problems caused by hand-crafted factors in the anchor-based detection methods. Notably, the average number of assigned anchors for a Car with an extreme size (XS in (c)) is much smaller than that for an average-sized Car. Consequently, outlier objects with extreme sizes show low recognition rates compared to objects within the normal range because of an insufficient number of assigned anchors. Moreover, regardless of the GT box size, the number of foreground anchors for Car is overwhelmingly larger than those for the other two classes. This is because the strides of the anchors cannot be assigned differently for each class. A relatively large stride for Pedestrian and Cyclist is likely to leave objects of these classes without any matching anchor box. Therefore, an unsuitable anchor box with a low IoU value, which is nevertheless the highest among the available anchor boxes, is forcibly assigned to avoid a zero assignment. This indicates that the inferior detection performance for Pedestrian and Cyclist is caused not by the number of GT boxes or other biases in the training data but by the inherent structural limitations of the anchor-based detection methods.
Our proposed Mixture Density network for 3D Object Detector (MD3D) is a method of estimating the distribution of 3D bounding boxes in point clouds with a Gaussian Mixture Model (GMM), which is free from the problems experienced by anchor-based detectors mentioned above. Another significant merit of the MD3D is that it is free from the discrepancy between classification and regression loss. Most 3D object detectors use the focal loss [23] for a classification loss to adjust their weights according to the estimation accuracy, allowing them to learn well about data with fewer samples. By contrast, regression loss treats the anchors assigned as the foreground equally. Therefore, with typical regression loss, it is inevitable that classes whose GT box shapes are concentrated close to the mean, i.e., Cars in the KITTI dataset, have superior performance. Our MD3D estimates the 3D bounding box in a distribution form throughout the scenes without any process of assigning the GT during regression learning and without distinguishing classes or box sizes. Thus, it can reduce heuristic design factors and cover a wider variety of data samples. The contributions of this study are as follows.
• Among point-cloud-based 3D object detectors, we are the first to propose an anchor-free detection method that estimates the density of bounding boxes and no longer requires a heuristic ground truth assignment.
• Our proposed MD3D is applicable to any type of point cloud feature encoding method, which enables it to be plugged into existing detectors easily.
• MD3D shows superior performance and latency compared to the existing detection heads and facilitates learning by minimizing hand-crafted design factors.

II. RELATED WORK
A. 3D OBJECT DETECTION IN POINT CLOUDS
As point clouds provide accurate geometric scene information, point cloud-based methods have achieved high performance in 3D object detection. However, the inherent properties of irregular and sparse point clouds impose difficulties in data processing. Therefore, various forms of point cloud representations have been proposed. SECOND [7] groups the point clouds into voxels and utilizes spatially sparse convolution, thereby reducing the computational burden of the 3D convolution. PointPillars [15] encodes point clouds into stacked pillars and operates all processes with only 2D convolutions, removing the bottleneck of 3D convolutions. PointRCNN [5] directly learns representations from raw point clouds using PointNet++ [3], [4] as a backbone network and generates bounding box proposals efficiently by taking advantage of point segmentation, which provides a powerful clue for 3D object detection. VoteNet [24] also uses PointNet++ as a backbone and detects objects using a deep Hough voting method. PV-RCNN [12] utilizes both voxel-based and point-based operations to encode multi-scale features and provide accurate location information efficiently. A graph method is adopted in [25] to detect objects. The latest point-cloud-based methods compensate for the limitations of the existing detectors and significantly improve the accuracy and latency. Focal sparse convolution [26] predicts the importance of features in performing sparse convolution and selectively computes high-importance features. IA-SSD [27] reduces the computational overhead of raw point-based detectors by using a learnable downsampling strategy. SST [28] improves the detection accuracy by introducing a single-stride backbone network that utilizes transformer blocks. Although the latest works have improved many aspects of the existing detectors, most have focused on improving the backbone structure.
Most anchor-free 3D object detectors [17], [18], [19], [20], [21] use a classification method based on heatmap estimation, which is primarily adopted in 2D object detection [29], [30], [31]; hence, they can only be used for 2D-projected features. In addition, the heatmap-based heads still have many hand-crafted design factors, such as the Gaussian radius of the heatmap and the foreground assignment policy for regression. Because the proposed method does not require a GT assignment policy for regression, the design factors can be significantly reduced. It also places no restriction on the input features, so it can be flexibly applied to various types of point cloud features.

B. MIXTURE DENSITY NETWORKS IN COMPUTER VISION
Originally, MDN [32] was proposed to predict a continuous quantity under uncertainty. The MDN has recently attracted considerable attention, especially in object detection tasks because capturing uncertainties and coping with mislocalization have become critical issues. He et al. [33] measured the uncertainties of bounding boxes to deal with challenging cases, such as occlusion, while Feng et al. [34] extended it to LiDAR 3D vehicle detection. The MDN was also utilized to model the multi-modal nature of object detection and human pose estimation [35], and to address active learning for object detection [36].
We address the problem of complex anchor design that restricts both performance and latency in 3D object detection and apply MDN to overcome this limitation. This study is inspired by MDOD [37], which reformulated the 2D object detection task as a density estimation problem and reduced the complex processing and heuristics in training. We extend this work to the 3D object detection task and improve the existing detectors in a plug-and-play manner with a flexible detection head that is compatible with any representation of point clouds. Unlike images, point clouds have various feature forms (BEV, FV, voxel, point, etc.). The proposed MD3D can be flexibly applied to these different types of features and easily replace the detection heads of existing detectors.

A. MODELING POINT-CLOUD-BASED 3D OBJECT DETECTION WITH MIXTURE DENSITY NETWORK
In point cloud-based 3D object detection, the input point cloud can be expressed as $L \in \mathbb{R}^{N \times 4}$ (3D coordinates and reflectance), and the position, size, and direction of the objects can be expressed as 3D bounding boxes $B \in \mathbb{R}^{N_{gt} \times 7}$. Here, $N$ represents the number of points in the scene, and $N_{gt}$ is the number of GT boxes. To regress $B$ from the input $L$, we estimate the conditional probability distribution $p(B \mid L)$.
A mixture density network (MDN) [32] is a neural network whose target is to learn a probability density function (pdf). We apply the MDN to point cloud-based 3D object detection to predict the distribution of multiple bounding boxes for a given scene (point cloud), and estimate the target 3D bounding box $B$ for the input point cloud $L$ as a mixture model. We use the conventional GMM as the target pdf, which can be expressed as:
$$p(B \mid L) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(B; \mu_k, \Sigma_k),$$
where $K$ is the number of mixture components, which is determined by the spatial resolution of the BEV feature or the number of point features $N$, and $\phi_k$ is the mixing coefficient. For the efficiency of the model, we assume that each element of $\mu_k \in \mathbb{R}^7$ is independent and the covariance matrix is diagonal, that is, $\Sigma_k = \mathrm{diag}(\sigma_k^2)$ with $\sigma_k^2 \in \mathbb{R}^7$.
$B$ is composed of the center position, box dimension, and yaw angle, so $B_{origin} = \{x_c, y_c, z_c, l, w, h, \theta\}$. We encode $B_{origin}$ as $B_{corner} = \{C_{flt}, C_{brb}, w\} \in \mathbb{R}^7$, which consists of the front-left-top corner $C_{flt} = \{x, y, z\}_{flt}$, the back-right-bottom corner $C_{brb} = \{x, y, z\}_{brb}$, and the width $w$. Among the various ways to encode the bounding box $B$, encoding it with two opposite corners and the width can result in a more accurate regression. This can be attributed to the nature of the point clouds obtained by LiDAR, in which the points are not in the center of an object but are concentrated in one corner. The corner on the far side without points can be easily regressed using peripheral point features owing to the symmetry of the target object. As part of the post-processing, we decode $B_{corner}$ back to $B_{origin}$.
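The corner encoding and its inverse can be illustrated with a minimal sketch. The axis convention here is an assumption that the text does not specify: x points forward, y points left, and the yaw angle is measured counter-clockwise in the BEV plane. The function names `encode_corner` and `decode_corner` are hypothetical, and the released code may encode the box differently.

```python
import math

def encode_corner(box):
    """Map (xc, yc, zc, l, w, h, theta) to (x_flt, y_flt, z_flt, x_brb, y_brb, z_brb, w)."""
    xc, yc, zc, l, w, h, theta = box
    c, s = math.cos(theta), math.sin(theta)
    # Offset from the center to the front-left corner: rotate (l/2, w/2) by theta.
    dx = 0.5 * (l * c - w * s)
    dy = 0.5 * (l * s + w * c)
    return (xc + dx, yc + dy, zc + 0.5 * h,
            xc - dx, yc - dy, zc - 0.5 * h,
            w)

def decode_corner(enc):
    """Invert encode_corner under the same (assumed) axis convention."""
    x1, y1, z1, x2, y2, z2, w = enc
    # The two corners are diagonally opposite, so their midpoint is the center.
    xc, yc, zc = 0.5 * (x1 + x2), 0.5 * (y1 + y2), 0.5 * (z1 + z2)
    h = z1 - z2
    # The BEV diagonal has squared length l^2 + w^2.
    ddx, ddy = x1 - x2, y1 - y2
    l = math.sqrt(max(ddx * ddx + ddy * ddy - w * w, 0.0))
    # The diagonal direction equals theta + atan2(w, l).
    theta = math.atan2(ddy, ddx) - math.atan2(w, l)
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap to (-pi, pi]
    return (xc, yc, zc, l, w, h, theta)
```

Because the yaw angle is recovered from the diagonal direction and the width, the network never has to regress a periodic quantity directly.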
Existing anchor-based regression methods learn several B's separately in L, where each anchor's design and matching algorithm become critical elements in training. However, because our method learns by representing the distribution of multiple B's as one mixture model with the conditional distribution p(B|L), unnecessary heuristic design can be eliminated.

B. NETWORK ARCHITECTURE
The MD3D consists of a regression branch that predicts three mixture parameters $\phi_k$, $\mu_k$, and $\sigma_k^2$ for $k \in [K]$, and a classification branch that predicts the class probability $p$. The backbone of the existing 3D object detectors, which encodes the feature of a point cloud, remains unchanged. We apply the MD3D to the two most commonly used forms of head features: BEV-type features and point-type features. Their structures are shown in Fig. 2; MD3D can be applied to either form and is compatible with many different methods of encoding point cloud features.
The MD3D for BEV features has a structure similar to that of MDOD [37], an MDN-based 2D object detector. The BEV feature has the shape of $H \times W \times C$, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively. Accordingly, the number of mixture components $K$ becomes $H \times W$, and the mixing coefficient $\phi \in \mathbb{R}^{H \times W \times 1}$ is forced to satisfy $\sum_k \phi_k = 1$ by applying softmax to the feature output. As shown in Fig. 3 (a), $\mu \in \mathbb{R}^{H \times W \times 7}$ does not predict $B_{corner}$ directly but predicts the offsets from the center coordinates of each feature, $M_{xy}$. That is, the x and y entries of both corners are obtained by adding the predicted offsets to $M_{xy}$, whereas for z and w, we use the raw output rather than an offset. $\sigma^2 \in \mathbb{R}^{H \times W \times 7}$ predicts values greater than zero using softplus activation. $p \in \mathbb{R}^{H \times W \times N_c}$ predicts the classification probability for each class using sigmoid activation, where $N_c$ denotes the number of classes, which is set to 3 (Car, Pedestrian, Cyclist) in our experiments.
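The activations of the BEV head described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the released implementation: the function name `bev_head_outputs`, the channel order of the seven box entries (x, y, z of the front-left-top corner, x, y, z of the back-right-bottom corner, then w), the mapping of the W axis to x and the H axis to y, and the default ranges are all assumptions.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def bev_head_outputs(raw_phi, raw_mu, raw_var, raw_cls,
                     x_range=(0.0, 70.4), y_range=(-40.0, 40.0)):
    """Turn raw head outputs into GMM parameters for an H x W BEV feature map.

    raw_phi: (H, W, 1), raw_mu: (H, W, 7), raw_var: (H, W, 7), raw_cls: (H, W, Nc).
    """
    H, W = raw_phi.shape[:2]

    # Mixing coefficients: softmax over all K = H * W components.
    flat = raw_phi.reshape(-1)
    flat = np.exp(flat - flat.max())
    phi = (flat / flat.sum()).reshape(H, W, 1)

    # Cell-center coordinates M_xy (assumed: W spans x, H spans y).
    xs = x_range[0] + (np.arange(W) + 0.5) * (x_range[1] - x_range[0]) / W
    ys = y_range[0] + (np.arange(H) + 0.5) * (y_range[1] - y_range[0]) / H
    mx, my = np.meshgrid(xs, ys)  # each of shape (H, W)

    # x/y entries are offsets from M_xy; z and w entries are used as raw outputs.
    mu = raw_mu.astype(float)  # astype returns a copy, so the input is untouched
    mu[..., 0] += mx
    mu[..., 3] += mx
    mu[..., 1] += my
    mu[..., 4] += my

    var = softplus(raw_var)                 # variances must be positive
    cls = 1.0 / (1.0 + np.exp(-raw_cls))    # per-class sigmoid probabilities
    return phi, mu, var, cls
```

Note that a single set of seven regression channels serves all classes, which is what removes the per-class, per-rotation duplication of anchor-based heads.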
Existing anchor-based regression methods use anchors defined differently per class, and generally they use anchors in the two directions of 0 and 90 degrees. Therefore, the number of output boxes is H × W × N c × 2, which is N c × 2 times higher than that of our MD3D. Consequently, MD3D has advantages in terms of the number of parameters, inference time, and post-processing time.
The MD3D for the point feature has some minor modifications from that of the BEV feature because the input shape is slightly different. Because the point feature has the form of $N \times C$, where $N$ is the number of points in a scene, the number of mixture components becomes $K = N$. Accordingly, the outputs become $\phi \in \mathbb{R}^{N \times 1}$, $\mu \in \mathbb{R}^{N \times 7}$, $\sigma^2 \in \mathbb{R}^{N \times 7}$, and $p \in \mathbb{R}^{N \times N_c}$. Furthermore, as shown in Fig. 3 (b), the reference point $M_{xyz}$ becomes the original (x, y, z) coordinates of the point, and $\mu$ predicts an offset from $M_{xyz}$, except for w.
At inference time, because the values of $\mu$ are highly likely to be close to the local maxima of the predicted GMM, we use $\mu$ of each mixture component as an independent output box. To improve the inference speed, $\sigma^2$ is not used and $\phi$ is used to filter out unnecessary boxes. In addition, the mixing coefficient $\phi$ is very low at locations where no object exists, as in the example in Fig. 2; hence, many output boxes can be filtered out. Then, using non-maximum suppression (NMS), boxes in which $p$ is the local maximum are finally extracted.
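The pre-NMS filtering step can be sketched as follows. The function name `select_boxes` and the values of `phi_thresh` and `max_boxes` are hypothetical, since the text does not state how the filtering threshold is chosen; the NMS step itself is omitted because it requires a rotated-box IoU routine.

```python
import numpy as np

def select_boxes(phi, mu, cls, phi_thresh=1e-3, max_boxes=512):
    """Keep mixture components with a non-negligible mixing coefficient.

    phi: (K,) or (K, 1), mu: (K, 7), cls: (K, Nc).
    Returns boxes, scores, and labels sorted by descending class score.
    phi_thresh and max_boxes are illustrative values, not taken from the paper.
    """
    phi = np.asarray(phi).reshape(-1)
    mu = np.asarray(mu).reshape(-1, 7)
    cls = np.asarray(cls).reshape(phi.shape[0], -1)

    keep = np.flatnonzero(phi > phi_thresh)      # drop components in empty space
    scores = cls[keep].max(axis=1)               # best class probability per box
    order = np.argsort(-scores)[:max_boxes]      # highest scores first
    keep = keep[order]
    return mu[keep], cls[keep].max(axis=1), cls[keep].argmax(axis=1)
```

The surviving boxes would then be passed to NMS, so the cost of NMS scales with the number of components that pass the filter rather than with K.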

C. LOSS FUNCTION
$L_{MDN}$, the regression loss, is used to learn the GMM parameters $\phi$, $\mu$, and $\sigma^2$ by minimizing the negative log-likelihood of the GT bounding boxes under the predicted mixture. The loss is computed over the $N_{gt}$ GT bounding boxes in the scene.
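A negative log-likelihood of this kind can be sketched in NumPy as shown here. The sketch assumes a diagonal-covariance GMM and averages over the GT boxes, i.e., $-\frac{1}{N_{gt}} \sum_{i} \log \sum_{k} \phi_k \, \mathcal{N}(b_i; \mu_k, \sigma_k^2)$; the normalization by $N_{gt}$ and the function name `mdn_nll` are assumptions, not details confirmed by the text.

```python
import numpy as np

def mdn_nll(phi, mu, var, gt_boxes):
    """Average negative log-likelihood of GT boxes under a diagonal GMM.

    phi: (K,) mixing coefficients (positive, summing to 1),
    mu: (K, 7) means, var: (K, 7) variances, gt_boxes: (N_gt, 7).
    """
    diff = gt_boxes[:, None, :] - mu[None, :, :]                   # (N_gt, K, 7)
    # Log-density of each GT box under each component (sum over the 7 dims).
    log_comp = -0.5 * (np.log(2.0 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=-1)
    log_mix = np.log(phi)[None, :] + log_comp                      # (N_gt, K)
    # Log-sum-exp over components for numerical stability.
    m = log_mix.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return -log_p.mean()
```

No GT-to-component assignment appears anywhere in this computation: every component contributes to every GT box, weighted by its own likelihood, which is the property that removes the matching heuristics of anchor-based regression.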
For the classification loss, we use the commonly used focal loss [23], $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the predicted probability of the target class. Among the boxes predicted in the regression branch, when the 3D IoU of a box with a GT box exceeds 0.5, we assign it to the foreground; otherwise, we assign it to the background. We use $\alpha_t = 0.25$ and $\gamma = 2$. The loss of the MD3D head is the sum of the MDN loss and the focal loss, balanced by a weight $\beta$. For one-stage detectors, we use the MD3D loss as the final loss, and for two-stage detectors, we replace the region proposal network (RPN) loss with the MD3D loss because the MD3D head is utilized in the RPN. We use $\beta = 500$.
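For reference, the binary focal loss of [23] with the stated values $\alpha_t = 0.25$ and $\gamma = 2$ can be written as a short NumPy sketch. The function name `focal_loss` and the clipping constant `eps` are illustrative choices; how the per-box losses are reduced and weighted against the MDN loss is not specified here.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Element-wise binary focal loss.

    p: predicted foreground probabilities, y: 0/1 labels of the same shape.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)                 # probability of the true label
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)     # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

The factor $(1 - p_t)^{\gamma}$ shrinks the loss of well-classified samples, so the many easy background locations contribute little to the gradient.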

IV. EXPERIMENT
A. DATASETS
We evaluated the proposed method on the KITTI dataset [22], one of the most popular datasets for 3D object detection for autonomous driving. It consists of 7,481 training samples and 7,518 testing samples, where the training samples are generally divided into a train split with 3,712 samples and a val split with 3,769 samples. Because the KITTI dataset contains only 90-degree annotation, we clipped the scenes to (0, 70.4) m, (−40, 40) m, and (−3, 1) m for the X, Y, and Z axis ranges. We also experimented on the large-scale Waymo Open dataset [38] to verify whether MD3D improves the performance regardless of the data size. The Waymo dataset includes 798 training sequences with approximately 160k samples and 202 validation sequences with 40k samples. Because of limited resources, we trained the models with 20% of the samples, taken at regular intervals from each sequence, using a total of 32k training samples. The Waymo dataset contains a complete 360-degree annotation, and we clipped the scenes to (−75.2, 75.2) m, (−75.2, 75.2) m, and (−2, 4) m for the X, Y, and Z axis ranges. We primarily focused on outdoor scene datasets whose target objects are occlusion-free in the BEV.

B. EXPERIMENT SETTINGS
We conducted the experiments with the same settings as the existing 3D object detectors, except that the detection head was replaced with MD3D. Most of the configurations are from OpenPCDet [39], one of the most commonly used codebases for 3D object detection. The detailed network structures of each detector are shown in Tables 7 and 8. The performance of the baseline detectors may differ slightly from the reported values owing to differences from the settings of the original papers. As MD3D is plugged into the existing detectors, the only modification in the MD3D experimental setting is the detection head for one-stage detectors and the RPN head for two-stage detectors.

C. RESULTS ON KITTI DATASET
As shown in Table 1, we conducted experiments on the KITTI-val dataset to compare the performance and speed of the baselines (three anchor-based detectors and two anchor-free detectors). We mark '+MD3D' when the proposed method, MD3D, is applied. We calculated the average precision (AP) by creating a precision-recall curve while changing the confidence threshold and averaging the precision values at 11 recall points. The IoU thresholds for 3D AP were set to 0.7, 0.5, and 0.5 for Car, Pedestrian, and Cyclist, respectively, and mAP is the mean 3D AP score for all classes. Latency was measured as the total inference time, including post-processing, with a batch size of 1 on a Titan RTX GPU.
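The 11-point AP computation can be sketched as follows. The sketch assumes the PASCAL VOC style interpolation, in which the precision at each recall point r in {0, 0.1, ..., 1.0} is the maximum precision over all operating points whose recall is at least r; this interpolation rule and the function name `ap_11_point` are assumptions about the evaluation protocol.

```python
import numpy as np

def ap_11_point(recall, precision):
    """11-point interpolated average precision.

    recall, precision: arrays of matching length, one entry per confidence threshold.
    """
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    total = 0.0
    for i in range(11):
        r = i / 10.0
        mask = recall >= r
        # Interpolated precision: best precision at any recall >= r (0 if unreachable).
        total += precision[mask].max() if mask.any() else 0.0
    return total / 11.0
```

For example, a detector that reaches only 50% recall at perfect precision scores 6/11, because the five recall points above 0.5 contribute zero.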
MD3D improves the performance of all anchor-based detectors for most classes and difficulty levels. In the case of SECOND, a significant improvement was achieved in all settings, especially in the Pedestrian and Cyclist classes, whereas the gain for PointPillars is smaller. This difference in performance gain arises from the difference in the feature dimension size. Unlike PointPillars, which uses 248 × 216 features, SECOND uses 200 × 176 features. In other words, PointPillars has a smaller anchor stride than SECOND, so an excessive number of its anchor boxes are already assigned as foreground even for the small GTs of Pedestrian and Cyclist, which enables sufficient learning for both classes at the cost of latency. As a result, applying MD3D does not yield a significant performance improvement. However, with a larger anchor stride, SECOND assigns an average of only 1.4 and 1.6 anchors to Pedestrian and Cyclist, respectively, so these classes are not trained sufficiently. Therefore, because MD3D learns a single GMM regardless of the size of the object, SECOND + MD3D significantly improves the performance for Pedestrian and Cyclist compared to SECOND. The performance improvement of PV-RCNN is marginal because MD3D is applied only to the RPN of the first stage. In the RPN, the recall value increases with the number of proposal boxes, regardless of the precision. However, as can be seen in Fig. 7, MD3D is more effective at removing false positives than the existing head, which matters little for an RPN; therefore, the performance improvement of the RPN head is limited.
The MD3D also shows superior performance and latency compared to anchor-free detection heads. Compared to CenterPoint [17], which uses a heatmap-based detection head and regresses the bounding box with center coordinates, MD3D significantly improves the performance for all classes and reduces latency. This performance improvement is attributed to MD3D's effective learning for small classes, such as Pedestrian and Cyclist, in addition to the change of the box encoding scheme from center to corner. The Gaussian radius of the heatmap depends on the size of the GT box; therefore, small objects are not trained as sufficiently as large ones. For another anchor-free detector, PointRCNN [5], which is a two-stage detector that utilizes point features, we replace only the regression branch with MD3D while leaving the classification branch that performs foreground segmentation in the RPN as it is. With this modification, the mAP increased slightly, and the latency was significantly reduced. This shows that our MDN-based corner regression offers an advantage over PointRCNN's bin-based residual regression. The latency of our method is significantly reduced because unnecessary boxes are removed using $\phi$ before the NMS. For both the training and inference phases, we keep the top 512 proposals for refinement in the stage-2 sub-network.

D. RESULTS ON WAYMO OPEN DATASET
For the Waymo dataset, we report the AP and the average precision weighted by heading (APH) of SECOND [7] with and without the MD3D head. The AP was calculated by averaging the precision values at 11 recall points, identically to that of the KITTI dataset. APH is calculated similarly to AP but weights each true positive by its heading accuracy. We evaluated the models at two difficulty levels: LEVEL 1 includes GT boxes with at least five inside points, and LEVEL 2 includes GTs with at least one inside point. As shown in Table 2, the proposed MD3D improved the baseline at all levels and for all classes. Note that the MD3D leads to a significant gain in the Pedestrian and Cyclist classes, whose GTs are relatively small. This verifies the advantages of the MD3D predicting bounding boxes in a probabilistic and anchor-free manner.

E. ANALYZING RECALL BY OBJECT SIZE
As shown in Fig. 1, anchor-based detectors have an insufficient number of anchors assigned to the foreground for extremely small objects. We measured recall by object size to verify that the proposed method, which does not use anchors, can mitigate this inherent limitation. We used SECOND as the base model, and considered bounding boxes before NMS to focus on the regression results. As shown in Fig. 4, there is an improvement for XS-sized GT boxes in the Car class, where the lack of assigned anchors has harmed the performance of the existing detector. The improvement is insignificant for Car boxes of other sizes because the base model has already assigned an excessive number of anchors to the foreground. In addition, in the Pedestrian and Cyclist classes, the recall of GTs with a size close to the average is already high enough for the base model, so the increase is small; however, for extremely small sizes, the increase is significant. Therefore, the proposed MD3D detects small objects more reliably than anchor-based detection heads.
[Figure: box encoding methods. $B_{origin}$ consists of the center coordinate $(x_c, y_c, z_c)$, the dimension $(l, w, h)$, and the yaw angle $\theta$. $B_{center}$ consists of the center coordinate $(x_c, y_c, z_c)$, the front-center coordinate $(x_{fc}, y_{fc})$, and the dimension $(w, h)$. We use $B_{corner}$, consisting of the front-left-top corner $C_{flt} = (x_{flt}, y_{flt}, z_{flt})$, the back-right-bottom corner $C_{brb} = (x_{brb}, y_{brb}, z_{brb})$, and the width $w$.]

F. ABLATION STUDY
1) BOX ENCODING METHODS
We conducted an ablation study of the box encoding methods shown in Fig. 5 on the KITTI validation dataset with models applying MD3D to PointPillars. To better demonstrate the performance of each box encoding method regarding the headings of predicted boxes, we used the average orientation similarity (AOS) metric, which assesses the cosine similarity between the estimated and GT heading orientations, along with 3D AP. The results are presented in Table 3. First, using the GT box in its original form, $B_{origin} = \{x_c, y_c, z_c, l, w, h, \theta\}$, results in a very low AOS AP and thus a low 3D AP. This is because $\theta$ is discontinuous: yaw angles of 0 and 2π describe the same box, but the loss treats them differently. Therefore, we experimented with $B_{sincos}$, which changes $\theta$ to a continuous value, $(\sin(\theta), \cos(\theta)) \in \mathbb{R}^2$, to avoid this ambiguity. Both AOS and 3D AP increase somewhat. Still, because $\sin(\theta)$ and $\cos(\theta)$ are mutually dependent and periodic, $B_{sincos}$ is not entirely appropriate for our GMM modeling, leaving room for improvement. Therefore, we devised a novel method of finding the front-center coordinate value of the box, as shown in Fig. 5 (b), to predict the box without $\theta$. This $B_{center} = \{x_c, y_c, z_c, x_{fc}, y_{fc}, w, h\}$ shows a notable improvement over the previous two approaches. In addition, another variant, $B_{corner}$, which predicts two corners with $w$, yields the highest performance. This is because it is easy to localize the corner of an object owing to the characteristics of the point clouds obtained by LiDAR.
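The yaw discontinuity can be made concrete with a small numerical illustration. Two headings just below 2π and just above 0 describe nearly the same box, yet a squared error on the raw angle is large, whereas the error on the (sin θ, cos θ) pair is small. The squared-error measure and the helper names are chosen only for illustration and are not the regression loss used in the paper.

```python
import math

def sq_err_raw(theta_a, theta_b):
    # Squared error on the raw yaw angle.
    return (theta_a - theta_b) ** 2

def sq_err_sincos(theta_a, theta_b):
    # Squared error on the continuous (sin, cos) representation.
    return ((math.sin(theta_a) - math.sin(theta_b)) ** 2
            + (math.cos(theta_a) - math.cos(theta_b)) ** 2)

theta_pred = 0.01                 # slightly past 0
theta_gt = 2.0 * math.pi - 0.01   # slightly before 2*pi: almost the same heading

raw_error = sq_err_raw(theta_pred, theta_gt)
sincos_error = sq_err_sincos(theta_pred, theta_gt)
```

Here `raw_error` is close to (2π)², while `sincos_error` is on the order of 1e-4, which reflects the true angular gap of 0.02 rad.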

2) PROBABILITY DISTRIBUTIONS
In MD3D, it is essential to choose a proper probability density function that fits the data characteristics of the input point clouds and the output 3D bounding boxes, because it substantially impacts the model's performance. We consider the Laplace, Cauchy, and Gaussian distributions, which are symmetric and have the same number of parameters. We applied them to the PointPillars and SECOND baselines and compared them on the KITTI-val dataset. As shown in Fig. 6, the shapes of the three distributions differ in peak height and tail length. The point clouds have sparse representations, implying that the area occupied by actual points is very small compared with the entire space when voxelized. As a result, MD3D has a considerable number of mixture components, because it requires an output for the entire feature space. Therefore, the Gaussian distribution, which has the shortest tail, is the most suitable for MD3D because it can effectively suppress the probability of boxes with high uncertainty arising from unnecessary empty spaces. Table 4 also shows that the Gaussian distribution was the most effective for both models.
Table 1 shows that MD3D yields a significant performance improvement for detectors with smaller input feature dimensions and therefore fewer anchors. To verify the effect of the number of anchors and the number of mixture components $K$, we set SECOND as the base model and compared the performance by scaling the horizontal and vertical dimensions of the input feature by factors of 1/4, 1/2, and 2. As shown in Table 5, anchors are placed separately in two directions, 0 and 90 degrees, for each class. Thereby, anchor-based detectors output six times more boxes than $K$, even for features of the same size. The anchor-based detector with more anchors achieves higher performance by obtaining better foreground GT assignments, whereas MD3D with an unnecessarily large $K$ tends to learn poorly and achieve lower performance.
However, when the two are compared at a small feature dimension of 50 × 44, SECOND achieves a very low mAP (50.10) despite predicting six times more boxes than MD3D (64.86). SECOND needs to predict 200 × 176 × 6 boxes, 96 times more boxes than MD3D, to achieve similar performance (67.28). Therefore, the performance of MD3D is maximized when used in compact detectors.
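The box-count ratios quoted above follow directly from the output shapes and can be checked with a few lines of arithmetic. The helper names are hypothetical; the counts assume three classes and two anchor rotations per class, as stated in the text.

```python
def num_boxes_anchor(h, w, num_classes=3, num_rotations=2):
    # Anchor-based head: one box per cell, class, and anchor rotation.
    return h * w * num_classes * num_rotations

def num_boxes_md3d(h, w):
    # MD3D head: one mixture component (box) per cell, shared by all classes.
    return h * w

same_size_ratio = num_boxes_anchor(50, 44) / num_boxes_md3d(50, 44)
matched_accuracy_ratio = num_boxes_anchor(200, 176) / num_boxes_md3d(50, 44)
```

`same_size_ratio` evaluates to 6 and `matched_accuracy_ratio` to 96, matching the factors reported for the 50 × 44 and 200 × 176 feature maps.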

4) COVARIANCE MATRIX
We modeled the point cloud-based 3D object detection with the multivariate GMM using only the diagonal elements of a covariance matrix rather than a full matrix. Training with a full covariance matrix means that a detector learns the correlation of the elements of B corner ∈ R 7 , whereas the diagonal matrix does not. Table 6 presents the results of applying these two methods to PointPillars. In the case of Pedestrian, whose intraclass correlation is high because the objects share similar shapes, the full covariance model achieves higher performance. Except for Pedestrian, the models using only the diagonal matrix outperformed those using the entire matrix. Therefore, we decided to use only the diagonal elements of the covariance matrix because it achieves a slightly better mAP and uses half the number of parameters.

G. DISCUSSION
MD3D showed superior performance and latency regardless of the backbone network types and the use of anchors (Tables 1 and 2). It is especially effective for one-stage detectors, such as SECOND and CenterPoint, which output relatively small feature maps (Table 5). The reason for their dramatic increase in performance is the higher recall than that of existing heads for small objects, such as Pedestrian and Cyclist (Fig. 4). However, because MD3D has an advantage in terms of recall rather than precision (Fig. 7), the performance improvement for two-stage detectors, such as PV-RCNN and PointRCNN, is marginal. In addition, MD3D has advantages regarding the number of parameters and latency because the box prediction channels are not separated by class. However, because of the unified channel across classes, the performance on datasets with many classes may be limited, which we will attempt to overcome in future work.

V. CONCLUSION
Most of the existing point cloud-based 3D object detectors apply a specific target assignment policy to the GT boxes to regress 3D bounding boxes. Because this training method needs to optimize many hand-crafted design factors, it takes a significant amount of effort to utilize and places many restrictions on the network structure. In this paper, we proposed MD3D, which reformulates the regression of 3D bounding boxes in point clouds as a density estimation problem. The MD3D is easy to use and can be applied to various types of feature encoding methods without considering the target assignment policy and network structure. Experiments on the KITTI and Waymo datasets show that the proposed method outperforms conventional methods in terms of performance, speed, ease of use, and flexibility. Although we only considered point clouds as inputs, the MD3D can be easily applied to various other types of inputs. Furthermore, we expect that MD3D can be utilized for multi-modal inputs by fusing the mixture density outputs.