Refining the ONCE Benchmark With Hyperparameter Tuning

In response to the growing demand for 3D object detection in applications such as autonomous driving, robotics, and augmented reality, this work focuses on the evaluation of semi-supervised learning approaches for point cloud data. The point cloud representation provides reliable and consistent observations regardless of lighting conditions, thanks to advances in LiDAR sensors. Data annotation is of paramount importance in the context of LiDAR applications, and automating 3D data annotation with semi-supervised methods is a pivotal challenge that promises to reduce the associated workload and facilitate the emergence of cost-effective LiDAR solutions. Nevertheless, the task of semi-supervised learning in the context of unordered point cloud data remains formidable due to the inherent sparsity and incomplete shapes that hinder the generation of accurate pseudo-labels. In this study, we consider these challenges by posing the question: “To what extent does unlabelled data contribute to the enhancement of model performance?” We show that improvements from previous semi-supervised methods may not be as profound as previously thought. Our results suggest that simple grid search hyperparameter tuning applied to a supervised model can lead to state-of-the-art performance on the ONCE dataset, while the contribution of unlabelled data appears to be comparatively less exceptional.


Introduction
In an era where interaction with 3D environments is increasingly necessary, there is a corresponding demand for refined deep learning technologies that cater to 3D object detection. Key domains such as autonomous driving, robotics, and augmented reality are just a few applications of this emerging need. Amidst various representations, our research focuses on the point cloud representation, primarily due to the unique benefits of the range sensor (LiDAR). Known for its consistent observations, LiDAR is unaffected by external parameters such as lighting conditions or the time of day, and exhibits resilience under various weather conditions [1]. In light of the anticipated rise of wearable devices equipped with compact LiDAR scanners [2], industry specialists anticipate a growing dependence on LiDAR [3].
The pressing challenge on the horizon relates to the demand for 3D data annotation, highlighting the increased importance of automated self-labeling data platform solutions within current and future LiDAR applications. Our research addresses this challenge by focusing on semi-supervised learning as a strategic approach to significantly reduce the 3D annotation workload (see [4] for more details). However, it is imperative to acknowledge that the field of semi-supervised learning presents its own formidable challenges, the most prominent of which are the creation of datasets and the establishment of validation methodologies.
The fundamental premise of semi-supervised approaches is that the labeled and unlabelled data have identical distributions. Nevertheless, practice often deviates from this ideal scenario: the labeled data is too scarce to represent the entirety of the general population and is instead skewed towards specific modes. In the context of autonomous driving data, factors such as prevailing weather conditions or specific geographical locations may be disproportionately concentrated in a single setting. Hence, the utilization of datasets that can adequately account for these different characteristics becomes mandatory when validating semi-supervised learning methods.
We argue that the recently introduced ONCE dataset is the most appropriate option for fulfilling this requirement, serving as the prime benchmark for the comparative evaluation of semi-supervised 3D object detection models. Specifically, the current state-of-the-art approach, Proficient Teacher, relies predominantly on the ONCE dataset [5] to support its claims of superiority over alternative methods. Nonetheless, it is worth mentioning that the authors of that study utilize training configurations and benchmarking results of previous methods provided within the ONCE data toolkit. Our empirical analysis reveals that these training parameters deviate significantly from optimality and that a properly tuned model trained exclusively on labeled data significantly outperforms Proficient Teacher [6].
The reason for this gap is that most semi-supervised learning methods rely on pseudo-labeling. A supervised model must be pretrained on labeled data to acquire initial pseudo-labels. As a result, the effectiveness of semi-supervised learning depends on the quality of the pretrained model. Note that the models pretrained according to the ONCE benchmark tend to underfit, compromising the legitimacy of comparisons with semi-supervised methods. We argue that a fair assessment of semi-supervised techniques, particularly their effectiveness in learning from unlabelled data, can only be achieved with an adequately fitted pretrained model. Without this essential criterion, during semi-supervised training models would merely undergo further training on labeled data, and the approach that hinders such training least would appear superior to one that relies more heavily on unlabelled data.
Within this study, we aim to address the question: "How can we optimize the supervised pretraining process to provide a fairer comparison of semi-supervised methods?" Our research results in proper training hyperparameters for the SECOND [7] and CenterPoint [8] models, culminating in a significant improvement in the quality of supervised pretraining and, consequently, semi-supervised training. We present empirical evidence indicating that the difference between supervised and semi-supervised models is less significant than previously believed, implying sufficient scope for improvement within the domain of semi-supervised methods.
In summary, our contributions to this study encompass the following key aspects:
1. Enhancement of the SECOND model: We have meticulously identified the optimal training parameters for the SECOND model, spanning pretraining, semi-supervised learning (SSL) training, and postprocessing. This rigorous optimization process has led to a notable refinement of the benchmark results associated with this model.
2. Advancements in the CenterPoint model: For the CenterPoint model, our contributions extend to substantial improvements in the outcomes of the pretraining phase. Additionally, we have calculated metrics for semi-supervised methods, specifically Mean Teacher [9] and Proficient Teacher [6], that had not previously been published.

Related works

Two-dimensional object detection
In recent years, significant strides have been made in the field of object detection, a crucial component of computer vision. This advancement is largely credited to the development and enhancement of deep learning models. Despite common challenges faced by computer vision algorithms, such as object scale variation, occlusion, changes in lighting conditions, and the presence of previously unseen object categories, these algorithms are approaching human-like performance in certain scenarios. They primarily fall into two categories based on their architecture design: single-stage detectors and two-stage detectors. A series of publications [10,11,12] introduced the key principles of the two-stage detection framework [13,14,15,16,17,18], which gradually replaced traditional methods with neural networks. Initially, a region proposal network is used to detect regions of interest (RoIs) that are likely to contain an object. These proposals are then refined in the second stage by predicting positional residuals for the detected RoIs. Conversely, one-stage detectors (YOLO [19,20,21], RetinaNet [22], SSD [23]) directly predict bounding boxes and categories without refinement, offering faster execution but lower detection quality.
Further divisions in these categories include anchor-based and anchor-free methods. Anchor-based detectors like Faster R-CNN [12] or YOLO [19] predict positional offsets and scaling for anchors, pre-defined bounding boxes associated with different regions of the image. Anchor-free detectors [24,25,26], on the other hand, remove the need for pre-set anchor boxes and directly predict object boundaries from features such as keypoints [27], center points [28,29], or extreme points [30,31]. Despite the early success and extensive literature base of anchor-based methods, anchor-free methods offer straightforwardness and flexibility and eliminate the need for tuning anchor-related hyperparameters. Thus, they are gaining interest in the research community.

Three-dimensional object detection
The way 3D object detection methods categorize and predict bounding boxes mirrors that of the more mature 2D object detection field. However, the main differences lie in the backbone architectures and their extraction of information from sparse 3D data. These approaches can be broadly classified as voxel-based, point-based, and hybrid, which is a combination of the two.
Voxel-based methods [32,33,34,35,36,37,38,39,40] convert point clouds into regular voxel grids that are processed using 3D convolutions. This introduces a structure absent in raw point clouds, simplifies processing, and allows the application of 2D convolutional neural network architectures. Nonetheless, the direct application of voxelization and 3D convolutions as proposed in VoxelNet [41] results in considerable computational demands due to the cubic complexity of voxelization, as well as the loss of fine-grained details and quantization artifacts. Furthermore, most voxels generated from typical point clouds are empty, making these computations largely unnecessary. To address this, SECOND [7] proposed sparse convolutions to avoid processing empty voxels. PointPillars [42] simplifies voxel representations to pillar representations, or BEV maps, to maintain high efficiency and competitive performance. SA-SSD introduces an auxiliary network to exploit point-wise supervision.
Point-based methods [43,44,45] operate directly on raw point clouds to extract pointwise features. PointNet [46] introduced this framework, followed by its successor PointNet++ [47], to process points directly without losing structural information due to data representation. Frustum-PointNet [48] uses 2D detection to constrain space with the frustum corresponding to the 2D bounding box. PointRCNN [49] ideologically follows Faster R-CNN but generates proposals directly from points. VoteNet [50] uses voting to predict proposal centers from point clusters and mitigate the occlusion problem. These methods, however, can experience computational inefficiencies and limited learning capacity due to their reliance on operations such as kNN search or k-d tree construction, because raw point clouds lack the structural information needed for neural networks to perform computations.
Hybrid methods [51,52] strive to combine the strengths of voxel-based and point-based approaches while ensuring efficiency. STD [53] employs PointNet++ to identify the most relevant points in the sparse representation and converts them to a dense representation processed with convolutions. PV-RCNN [54] refines proposals using pointwise features. HVPR [55] incorporates both voxel-wise and pointwise features into a single 3D representation with a memory module.

Semi-supervised learning
Semi-supervised learning leverages the potential of unlabeled data to enhance classification accuracy. Since images offer ample meaningful information about the underlying data distribution even without labels, the two primary approaches in semi-supervised learning, self-training and consistency regularization (each with multiple algorithms developed within it), both smooth the feature manifold [56] by stabilizing predictions and increasing confidence for unlabeled data. This process assists models in learning robust decision boundaries, improving performance, generalization, and overall quality without the need for costly manual annotation.
Self-training [9,57,58,59,60,61] begins with initial training on labeled data, followed by iterative generation of high-quality pseudo-labels for unlabeled data and training on the expanded labeled dataset. Improving the performance of these methods is associated with maximizing the valuable information obtained from pseudo-labels and minimizing the mislabeling produced by an overfitted model. Consistency regularization [62,63] maintains consistent predictions for the input under various perturbations, mainly implemented with image augmentations. Research on these methods focuses on finding the optimal loss that provides consistency and on ways to acquire perturbations.
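The self-training loop described above can be sketched schematically. The following Python fragment is illustrative only: the `fit`/`predict` interface, the `ToyModel` class, and the confidence threshold are placeholders, not any real detector API.

```python
def self_training_round(model, labeled, unlabeled, confidence_threshold=0.9):
    """One round of self-training (schematic and framework-agnostic)."""
    model.fit(labeled)  # initial supervised training on labeled data
    pseudo = []
    for x in unlabeled:
        label, confidence = model.predict(x)
        if confidence >= confidence_threshold:  # keep only confident pseudo-labels
            pseudo.append((x, label))
    model.fit(labeled + pseudo)  # retrain on the expanded dataset
    return pseudo

class ToyModel:
    """Stand-in model: 'confident' on positive inputs only."""
    def fit(self, data):
        self.train_set = list(data)
    def predict(self, x):
        return "car", (0.95 if x > 0 else 0.5)

model = ToyModel()
pseudo = self_training_round(model, labeled=[(1, "car")], unlabeled=[2, -3])
print(pseudo)  # only the confident prediction becomes a pseudo-label
```

The threshold embodies the trade-off mentioned above: a high value minimizes mislabeling at the cost of discarding potentially valuable pseudo-labels.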

Semi-supervised learning for object detection
Applying semi-supervised methods to more complex tasks such as object detection presents challenges due to the complex label assignment associated with the multi-task predictions of object detectors. While a few consistency-based methods [64,65] enforcing consistency between augmented views have been proposed for semi-supervised 2D object detection, self-training approaches are currently more prevalent. The teacher-student framework-based approach was initially proposed in STAC [66], where the teacher, pretrained on labeled data, predicts pseudo-labels for weakly augmented views of unlabeled data, and the student learns to predict these pseudo-labels for strongly augmented views of the same data.

Models and Hyperparameters
In this section, we briefly describe the SECOND and CenterPoint models used for 3D object detection and explain which training hyperparameters we choose to tune and why they are essential.

The overview of the SECOND [7] detector. The detection process commences with an initial raw point cloud as the input data. Subsequently, this point cloud undergoes a conversion into voxel-based features and coordinates. The transformed data then traverses two layers of VFE (voxel feature encoding) followed by a linear layer. A sparse convolutional neural network (CNN) is subsequently applied to further process the encoded features. Finally, the detection process is completed by a Region Proposal Network (RPN) responsible for generating the final detections.
We follow [67] and [6], the most prominent works in semi-supervised 3D object detection, and conduct experiments with the SECOND and CenterPoint detectors. These models may not be the latest or most sophisticated, yet their simplicity permits isolating and highlighting the effects of semi-supervised learning. Additionally, they support real-time operation, a critical requirement for the majority of applications.
SECOND (Sparsely Embedded Convolutional Detection) is primarily designed for autonomous driving and robotics use. Its architecture consists of three core components: a Voxel Feature Extractor, a sparse convolutional middle layer, and a region proposal network. The Voxel Feature Extractor (VFE) transforms raw point cloud data into a structured representation by dividing it into fixed-size 3D grids called "voxels." SECOND extracts meaningful features containing spatial and semantic information within each voxel. These features are then combined to produce a feature map. Sparse 3D convolutions, which are significantly faster and require less memory than their dense counterparts, are used to process the feature map. Finally, the RPN uses the feature map generated by the convolutional layers to produce 3D object proposals by predicting their 3D bounding boxes and associated objectness scores. After the RPN proposes objects, an essential post-processing step refines the detected objects: non-maximum suppression (NMS) is used to eliminate redundant or significantly overlapping proposals, leaving only the most confident non-intersecting objects.
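The voxelization step can be sketched in a few lines. This is a toy Python version for illustration only; the actual SECOND pipeline implements this with optimized GPU kernels, and the function name and default cap on points per voxel are assumptions, not the detector's real settings.

```python
import numpy as np

def voxelize(points, voxel_size, grid_range, max_points_per_voxel=32):
    """Group raw LiDAR points into fixed-size voxels (illustrative sketch).

    points: (N, 3) xyz coordinates.
    voxel_size: (3,) edge lengths of one voxel.
    grid_range: (xmin, ymin, zmin) origin of the grid.
    Returns a dict mapping integer voxel coordinates to lists of points.
    """
    coords = np.floor((points - np.asarray(grid_range)) /
                      np.asarray(voxel_size)).astype(int)
    voxels = {}
    for pt, c in zip(points, coords):
        bucket = voxels.setdefault(tuple(c), [])
        if len(bucket) < max_points_per_voxel:  # cap the points kept per voxel
            bucket.append(pt)
    return voxels

pts = np.array([[0.2, 0.3, 0.1], [0.4, 0.1, 0.2], [1.5, 1.5, 0.3]])
vox = voxelize(pts, voxel_size=(1.0, 1.0, 1.0), grid_range=(0, 0, 0))
print(len(vox))  # two occupied voxels; empty voxels are simply absent
```

Note that empty voxels never appear in the output at all, which is precisely the sparsity that SECOND's sparse convolutions exploit.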
CenterPoint. The core of this architecture (see Fig. 3) is the center heatmap head. It recognizes probable object centers within the point cloud for each class independently. Once centers are identified, CenterPoint predicts the 3D bounding boxes for each object by regressing its dimensions and orientation with respect to the centers. Subsequently, in the second stage, boxes are refined by processing features interpolated at the center positions with an MLP. This approach simplifies the process, making it computationally efficient. Similar to SECOND, postprocessing includes NMS and score thresholding.
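To make the heatmap head concrete, a toy version of center extraction might look as follows. This is a sketch under simplifying assumptions: CenterPoint performs the equivalent local-maximum selection with a max-pooling trick on GPU, and the function name, grid values, and score threshold here are illustrative.

```python
import numpy as np

def extract_centers(heatmap, score_threshold=0.3):
    """Pick local maxima of a per-class center heatmap (toy sketch).

    heatmap: (H, W) array of center confidences for one class.
    Returns (row, col, score) tuples for cells that are the maximum of
    their 3x3 neighbourhood and exceed the score threshold.
    """
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)  # guard the borders
    peaks = []
    for r in range(H):
        for c in range(W):
            window = padded[r:r + 3, c:c + 3]  # 3x3 neighbourhood of (r, c)
            if heatmap[r, c] >= score_threshold and heatmap[r, c] == window.max():
                peaks.append((r, c, float(heatmap[r, c])))
    return peaks

hm = np.zeros((5, 5))
hm[1, 1] = 0.9   # one strong center
hm[3, 4] = 0.5   # one weaker center at the border
hm[1, 2] = 0.4   # shoulder of the first peak, not a local maximum
print(extract_centers(hm))
```

Each surviving peak then seeds the regression of box dimensions and orientation described above.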
We perform a straightforward grid search operation to maximize the quality of training and inference for these models. We investigate the influence of the following components:
• Learning rate is typically one of the first hyperparameters to be tuned. It sets the size of the steps taken during the optimization process when updating the model's parameters based on the gradient of the loss function. It influences the neural network's speed of convergence, regulates numerical stability, and affects the model's generalization ability.
• Number of epochs determines how long the model is trained. It also influences computational resources and should be considered alongside other hyperparameters like learning rate and batch size.
• NMS threshold is a post-processing parameter that determines the merging or suppression of overlapping or closely spaced 3D bounding boxes. It specifies the maximum allowed overlap between two detections required for them to be considered separate and corresponds to the 3D IoU metric. The selection of the non-maximum suppression (NMS) threshold significantly affects the precision and recall of the object detection system. Lowering the threshold suppresses more boxes, which boosts precision but may compromise recall by discarding valid detections, whereas raising it retains more detections, which might enhance recall but could lead to more false positives.
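The suppression procedure governed by this threshold can be sketched as follows. For brevity this toy version uses axis-aligned 2D (BEV) boxes and plain IoU; the benchmark's actual NMS operates on rotated 3D boxes, and the function name and sample boxes are illustrative.

```python
import numpy as np

def nms_bev(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression on axis-aligned BEV boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]  # process boxes from most confident
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop candidates whose overlap with the kept box exceeds the threshold.
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 2, 2], [0.1, 0.1, 2.1, 2.1], [5, 5, 7, 7]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms_bev(boxes, scores, iou_threshold=0.5))  # the near-duplicate box is suppressed
```

With a very low threshold (such as the benchmark's original 0.01), almost any overlap triggers suppression; raising it, as in our tuned setup, keeps overlapping detections that a low threshold would discard.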
The overview of the CenterPoint [8] detector. Initially, a conventional 3D backbone extracts map-view feature representations from LiDAR point cloud data. Subsequently, a specialized 2D convolutional neural network integrated into the detection head identifies object centers and performs a regression to determine the complete 3D bounding boxes based on these center features. The predicted bounding box information is then used to locate and extract point-based features at the 3D centers of each face of the estimated 3D bounding box. These extracted features are then fed into a multilayer perceptron to predict an IoU-based confidence score and refine the bounding box regression.

Results
This section presents a comprehensive analysis of various semi-supervised learning methods using the ONCE dataset, demonstrating that our hyperparameter setup achieves state-of-the-art performance among existing semi-supervised methods for the SECOND and CenterPoint models. Firstly, we provide a detailed description of the ONCE dataset and highlight its unique features in Section 4.1. Secondly, we outline our experiments' training methodology and configurations in Section 4.2. Finally, we compare the performance of different semi-supervised methods using the mean Average Precision (mAP) metric in Section 4.3.

ONCE dataset
The ONCE [5] dataset is a large-scale autonomous driving dataset consisting of 1 million LiDAR point cloud samples corresponding to 144 hours of driving across different cities in China. The ONCE dataset stands out among open-source autonomous driving datasets due to its larger size and greater diversity in weather and traffic conditions. This diversity is essential for training and evaluating autonomous driving systems, as it enables the development of more robust and versatile models. The detection task focuses on five foreground object classes: Car, Bus, Truck, Pedestrian, and Cyclist. However, during evaluation, the classes Car, Bus, and Truck are combined into a single "Vehicle" class.
The ONCE dataset is designed to evaluate semi-supervised and self-supervised learning approaches for 3D object detection. It contains 581 sequences: 20 labeled and 561 unlabeled. The labeled data is divided into training, validation, and test splits, organized as follows:
• Training split: 6 sequences (5k scenes) captured on sunny days
• Validation split: 4 sequences (3k scenes) with diverse weather conditions: 1 sunny day, 1 rainy day, 1 sunny night, and 1 rainy night
• Testing split: 10 sequences (8k scenes) covering various weather conditions: 3 sunny days, 3 rainy days, 2 sunny nights, and 2 rainy nights
Both downtown and suburban areas are covered in each split. The training split has a slight domain shift compared to the validation/testing splits to encourage better generalizability of the proposed methods.
The remaining 560 sequences are kept as unlabeled data for research on leveraging large-scale unlabeled data. These unlabeled scenes are divided into three subsets: Small, Medium, and Large.
• Small: 70 sequences (100k scenes)
• Medium: 321 sequences (500k scenes)
• Large: 560 sequences (about 1M scenes)
Small is a subset of Medium, which is a subset of Large. Small and Medium are created by selecting specific roads in time order rather than uniformly sampling from all scenes.
The performance of 3D object detectors is evaluated using mean Average Precision (mAP) [69] over all classes, based on 3D Intersection over Union (IoU) thresholds of 0.7, 0.3, and 0.5 for the Vehicle, Pedestrian, and Cyclist classes, respectively. Additionally, the detector's performance is assessed over three different detection ranges: 0-30m, 30-50m, and 50m-inf.
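The class-specific thresholds and the final averaging step can be made explicit with a small sketch. The helper names are illustrative; the per-class AP values in the usage example are made-up numbers, not results from the paper.

```python
# Class-specific 3D IoU thresholds used by the ONCE evaluation protocol.
IOU_THRESHOLDS = {"Vehicle": 0.7, "Pedestrian": 0.3, "Cyclist": 0.5}

def is_true_positive(iou, cls):
    """A detection matches a ground-truth box of the same class only if
    their 3D IoU reaches the class-specific threshold."""
    return iou >= IOU_THRESHOLDS[cls]

def mean_ap(ap_per_class):
    """mAP is the unweighted mean of the per-class Average Precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# A box with 0.6 IoU counts as correct for a Cyclist but not for a Vehicle.
print(is_true_positive(0.6, "Cyclist"), is_true_positive(0.6, "Vehicle"))
```

The asymmetry in thresholds reflects box size: a 0.7 IoU requirement for pedestrians would be far stricter in absolute terms than the same requirement for vehicles.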

Training setup
To ensure the reproducibility of our experiments, we use the codebase of the official ONCE benchmark.

Table 3: Evaluation results on the ONCE validation split for SECOND trained on different splits of unlabeled data (Small, Medium, Large). Models marked with * are initialized with SECOND*, while others are initialized with SECOND. Higher values indicate better results (best metrics are highlighted in bold, and second best are underlined). Our hyperparameter setup exhibits state-of-the-art performance in terms of mAP for every split.
First, we trained the SECOND and CenterPoint models from scratch using the hyperparameter configurations provided with the ONCE benchmark, which are the same for both models. Namely, training was performed on the training split for 80 epochs with a batch size of 32 and a maximum learning rate of 0.003 under the One Cycle learning rate policy [70]. The NMS threshold during inference was likewise identical for both models and was set to 0.01.
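The One Cycle policy ramps the learning rate up to its maximum and then anneals it back down over training. The schematic version below uses an illustrative warm-up fraction and divisor; these are assumptions for the sketch, not the exact settings of the benchmark's scheduler.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.4, div_factor=10.0):
    """Schematic One Cycle schedule: linear warm-up to max_lr for the first
    pct_start fraction of training, then cosine annealing back down."""
    warmup_steps = int(total_steps * pct_start)
    initial_lr = max_lr / div_factor
    if step < warmup_steps:
        # Linear ramp from initial_lr up to max_lr.
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # Cosine anneal from max_lr back down to initial_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return initial_lr + (max_lr - initial_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# With max_lr = 0.003 (the benchmark's maximum learning rate), the schedule
# starts low, peaks at 0.003, and decays towards the starting value.
print(one_cycle_lr(0, 100, 0.003), one_cycle_lr(40, 100, 0.003))
```

The "maximum learning rate" quoted in the configuration refers to the peak of this cycle, not a constant rate held throughout training.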
After that, we discovered that the proposed hyperparameters were suboptimal and the models were underfitted. Consequently, a grid search over batch size, learning rate, number of epochs, and NMS threshold was conducted to identify parameters that yielded significantly better results. The considered values and corresponding metrics are exhibited in Table 1 and Table 2 for SECOND and CenterPoint, respectively. With this procedure, we identified the optimal hyperparameters: batch size 128, learning rate 0.006, 1000 epochs, and an NMS threshold of 0.65 for the SECOND detector, and batch size 128, learning rate 0.003, 1000 epochs, and an NMS threshold of 0.25 for CenterPoint. We used the original hyperparameters of the ONCE benchmark for every semi-supervised learning approach.
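The grid search itself is a simple exhaustive loop over all hyperparameter combinations. The sketch below is schematic: in practice each call to `train_and_eval` corresponds to a complete training run of the detector followed by evaluation on the validation split, and the candidate values shown are only in the spirit of the search, not the full grids from Tables 1 and 2.

```python
from itertools import product

def grid_search(train_and_eval, grid):
    """Exhaustive grid search: train and evaluate every combination and
    return the configuration with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_eval(cfg)  # one full training run per combination
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Illustrative candidate values contrasting the benchmark defaults with
# the tuned settings reported above.
grid = {
    "batch_size": [32, 128],
    "learning_rate": [0.003, 0.006],
    "epochs": [80, 1000],
    "nms_threshold": [0.01, 0.65],
}
```

The cost grows multiplicatively with each added hyperparameter, which is why we restricted the search to the four parameters listed above.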

Model comparison
The ONCE benchmark codebase provides the implementation of three image-based semi-supervised methods: Pseudo Label [71], Mean Teacher [9], and Noisy Student [57], as well as two semi-supervised methods for point cloud detection: SESS [72] (designed for indoor datasets) and 3DIoUMatch [67] (both indoor and outdoor). We perform semi-supervised learning for pretrained detectors (with original and tuned hyperparameters) using the Mean Teacher and Proficient Teacher approaches, since they show the best quality. The other results are borrowed from [5]. Detection performance is evaluated using mAP on the validation split. The models trained on the Small, Medium, and Large unlabeled subsets are considered separately. By comparing the performance of these methods over different amounts of unlabeled data, we can gain insight into their effectiveness and scalability in the context of semi-supervised learning for 3D object detection.
The results in Tables 3 and 4 show that hyperparameter optimization significantly improves model performance both with and without semi-supervised techniques. Models marked with "*" are trained with our hyperparameters, and the others with the original ones. In particular, the pretrained SECOND* model, which uses our hyperparameter setup and has not been exposed to any unlabeled data, outperforms the original Proficient Teacher trained with all labeled and unlabeled data. In addition, the gap between supervised pretraining and semi-supervised learning is not as large as expected. On the Small subset, it is 3.45 mAP and 5.83 mAP (for Mean Teacher and Proficient Teacher, respectively) for SECOND, but only 1.09 mAP and 3.84 mAP for SECOND*. This shows that some of the performance gains reported in previous works are not the result of utilizing unlabeled data but of prolonged training with labeled data. However, there is still some performance improvement, which validates these approaches. Moreover, Proficient Teacher, specifically designed for 3D object detection, steadily surpasses the general-purpose Mean Teacher. This demonstrates that research aimed at specializing semi-supervised methods for different tasks is fruitful and encourages further exploration.

Conclusion
In this study, we have addressed the increasing demand for advanced deep learning technologies in 3D object detection, crucial in domains like autonomous driving, robotics, and augmented reality. Focusing on the LiDAR-based point cloud representation for its reliability, we have considered reducing the 3D data annotation workload through semi-supervised learning. We have highlighted the importance of diverse datasets and identified the ONCE dataset as a key benchmark for evaluating semi-supervised 3D object detection models. Our findings demonstrate the critical role of a well-fitted pretrained model in the success of semi-supervised learning. Our primary focus has been optimizing supervised pretraining. We have refined the SECOND and CenterPoint models through hyperparameter optimization, improving benchmark results, achieving state-of-the-art performance for these models, and documenting metrics for semi-supervised methods.
• Batch size determines the number of training examples used in each iteration of the optimization process, impacting both the training dynamics and the computational requirements. Smaller batch sizes introduce more randomness into the optimization process, allowing the model to escape local minima and explore a wider range of solutions. However, using very small batch sizes may also lead to noisy gradient estimates and unstable training.