PillarNeSt: Embracing Backbone Scaling and Pretraining for Pillar-based 3D Object Detection

This paper shows the effectiveness of 2D backbone scaling and pretraining for pillar-based 3D object detectors. Pillar-based methods mainly employ a randomly initialized 2D convolutional neural network (ConvNet) for feature extraction and fail to enjoy the benefits of backbone scaling and pretraining in the image domain. To show the scaling-up capacity on point clouds, we introduce dense ConvNets pretrained on large-scale image datasets (e.g., ImageNet) as the 2D backbones of pillar-based detectors. The ConvNets are adaptively designed based on the model size, according to the specific characteristics of point clouds such as sparsity and irregularity. Equipped with the pretrained ConvNets, our proposed pillar-based detector, termed PillarNeSt, outperforms the existing 3D object detectors by a large margin on the nuScenes and Argoverse2 datasets. Our code shall be released upon acceptance.


I. INTRODUCTION
Point cloud 3D object detection, a crucial task for autonomous driving and robotics, aims to identify and locate objects surrounding the ego agent. While great advances have been achieved in this field, there remains a significant gap in performance and efficiency, especially when processing raw point cloud data. The primary approach to address this challenge is to transform point clouds into structured data, such as voxel-based and pillar-based representations. Voxel-based methods [9], [44], [48], [52] produce 3D voxels and employ 3D convolutions for feature extraction. Pillar-based approaches [24], [25], [35], [51] first transform the point clouds into a pseudo-image representation and then employ a 2D backbone for feature extraction.
In contrast, 2D backbones in the image domain greatly benefit from scaling and pretraining on large-scale datasets. We may thus ask two unresolved questions: (i) is it possible to observe a scaling-up phenomenon similar to that in the 2D image domain? (ii) how can the image knowledge extracted by pretraining be transferred to point clouds? For the first question, the biggest obstacle is the lack of a reasonable and scalable backbone design for point clouds. For the second, it is difficult to pretrain image backbones that match the special backbone designs of pillar-based detectors.
In this paper, we aim to answer the two questions above and explore the effectiveness of backbone scaling and pretraining for pillar-based 3D object detectors. Since pillar-based methods transform point clouds into a pseudo-image representation, we mainly refer to the well-known ConvNeXt [29] from the image domain as our basic network for dense feature extraction. To verify the effectiveness of backbone scaling, we first construct a strong baseline based on CenterPoint-Pillars [48] with some modifications to the pillar encoder, detection head, and loss function. Then, we modify the architecture of ConvNeXt to adapt to the characteristics of point clouds. Specifically, we introduce several backbone design rules according to these characteristics. For example, we introduce large kernels (e.g., 7×7 convolution layers) to increase the receptive field, and stack more blocks in early stages to refine fine-grained point cloud features. After that, for backbone scaling, we provide a series of ConvNet models at different scales, namely PillarNeSt-Tiny, -Small, -Base, and -Large, to achieve a trade-off between performance and inference speed. Finally, for backbone pretraining, we utilize ConvNeXts pretrained on large-scale image datasets (e.g., ImageNet [12]) as the weight initialization for the 2D backbone. Considering the large difference between the original ConvNeXt and our modified version, we introduce a weight initialization adaptation from two perspectives: the stage view and the micro view.
With the specific design, scaling, and pretraining of the 2D dense backbone, our proposed PillarNeSt series shows performance scaling up as the model size increases, as shown in Figure 1. Moreover, we observe that convergence is facilitated and performance is improved when using the pretrained 2D backbone, compared to the common random initialization. Our proposed PillarNeSt outperforms the existing 3D object detectors by a large margin on the nuScenes and Argoverse2 datasets. Without any test-time augmentation, PillarNeSt-L achieves 66.9% mAP and 71.6% NDS on the nuScenes test set. PillarNeSt-B achieves 35.9% mAP and 28.6% CDS on the Argoverse2 val set. We hope our observations can bring some insights into integrating scaled and pretrained 2D backbones for 3D object detection.

II. RELATED WORK
A. Grid-based 3D Object Detection
Grid-based 3D object detection has gained significant interest in recent years, especially in applications like autonomous driving [4], [40], [41] and robotics. Grid-based splitting methods offer a structured way to handle the inherent sparsity and irregularity of point clouds. These methods are typically categorized into voxel-based and pillar-based approaches.
Voxel-based methods. One of the earliest and most influential works in this domain is VoxelNet [52], which splits point clouds into 3D voxels and employs 3D convolutions for feature extraction. While VoxelNet sets the stage, it also introduces some challenges, such as empty voxels in outdoor environments. Subsequent methods, like [13], [44], [48], address these challenges by introducing 3D sparse convolutions, which not only enhance detection accuracy but also improve computational efficiency. [9], [30] greatly improve the performance of voxel-based methods by optimizing the backbone.
Pillar-based methods. PointPillars [24] shifts the view from 3D voxelization to 2D pillars, focusing on the ground plane. By combining 2D voxelization with a point-based feature extractor, PointPillars efficiently leverages 2D convolutions, making it particularly suitable for embedded systems with limited computational resources. Other works, such as [8], [33], [46], have further refined pillar-based detection by introducing feature projection and multi-scale aggregation strategies. Historically, pillar-based approaches trail voxel-based methods in terms of performance. Recently, [25], [35], [51] have introduced more advanced backbones, bridging the performance gap with voxel-based methods.

B. Backbone Pretraining and Scaling
Backbone pretraining and scaling [17], [22], [28], [37] have shown great success in 2D perception tasks. By pretraining on large-scale datasets like ImageNet [12], 2D ConvNets can capture general knowledge from images that can be transferred to downstream tasks. Model scaling, which increases the depth, width, and resolution of the network, has been a crucial strategy for improving performance on various image recognition tasks. A notable example of backbone scaling and pretraining is EfficientNet [37], which leverages a compound scaling method to uniformly scale all dimensions of the network.
In the 3D domain, methods like [27], [32] utilize local reconstruction techniques for pretraining point cloud models. The pretrained transformer networks provide a robust initialization, which is beneficial for downstream tasks. P2P [42] utilizes pretrained image models for 3D analysis. By transferring knowledge from the 2D image domain to 3D point clouds, P2P bridges the gap between the two modalities, potentially benefiting from the vast amount of labeled data available for 2D tasks.

III. METHOD
The architecture of our PillarNeSt is summarized in Figure 2. Similar to typical pillar-based detectors, PillarNeSt comprises: (i) a pillar encoder to transform raw point clouds into a pseudo-image representation, (ii) a 2D backbone to extract features, (iii) a neck for fusing multi-scale features, and (iv) a 3D detection head for predicting 3D objects. PillarNeSt mainly focuses on the 2D backbone part. We first introduce a strong baseline (Section III-A) with some modifications to the pillar encoder, detection head, and loss function. Then we describe the backbone design (Section III-B), scaling (Section III-C), and pretraining (Section III-D) strategies. Finally, we discuss the main differences from existing works (Section III-E).

A. Strong Baseline
We choose CenterPoint-Pillars [48] as our vanilla baseline, where PointPillars [24] is employed as the backbone. Our strong baseline mainly introduces improvements to the pillar encoder, detection head, and loss functions.
Pillar encoder. In the original pillar encoder, Multi-Layer Perceptrons (MLPs) are used to extract features from point clouds, followed by max pooling to extract pillar features. Relying solely on max pooling leads to information loss. In this paper, we simultaneously employ max pooling and average pooling to preserve more information. Moreover, we also introduce the height offset of points, relative to the geometric center, as an input to compensate for the information loss along the z axis.
Detection head. Previous works [23], [49] have identified that a low correlation between classification and localization is one of the reasons for poor detection. Inspired by them, we integrate an IoU branch to predict the IoU score. During inference, we use the IoU score to reweight the classification score, as illustrated in Eq. 1:

Ŝ = S^(1−α) · IoU^α,  (1)

where S is the classification score and α is the rectification factor. The overall loss combines the classification, regression, and IoU losses, L = λ1·L_cls + λ2·L_reg + λ3·L_IoU, where λ1, λ2, λ3 are the corresponding weight coefficients.
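The pillar encoder modifications above can be sketched as follows. This is a minimal NumPy sketch of our own (function and array names are illustrative, not the paper's code): per-point features of one pillar are aggregated with both max and average pooling, and each point carries a height offset relative to the pillar's geometric center.

```python
import numpy as np

def encode_pillar(points: np.ndarray) -> np.ndarray:
    """Aggregate the per-point features of one pillar into a single vector.

    points: (N, C) array of per-point features; column 2 is assumed to be z.
    Returns the concatenation of max-pooled and average-pooled features,
    preserving more information than max pooling alone.
    """
    z_center = points[:, 2].mean()               # geometric center along z
    dz = (points[:, 2] - z_center)[:, None]      # height-offset input
    feats = np.concatenate([points, dz], axis=1)  # (N, C + 1)
    max_pool = feats.max(axis=0)
    avg_pool = feats.mean(axis=0)
    return np.concatenate([max_pool, avg_pool])  # (2 * (C + 1),)
```

In the actual encoder the features would additionally pass through MLPs before being scattered into the pseudo-image; the sketch only illustrates the dual pooling and the height-offset input.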
Training strategy. During training, we use the copy-paste data augmentation from [48]. Following [39], we introduce the fade strategy to disable this augmentation for the final several epochs.

B. Backbone Design
Large kernel matters. Some previous works [25], [35], [51] attempt to improve the representation by employing CNNs with sparse or dense convolution layers. These CNN backbones tend to employ 3 × 3 convolution layers for feature extraction.
As mentioned above, pillar-based methods apply the pillar encoder to transform the point cloud into a 2D representation (pseudo image). Processing such a 2D representation is quite similar to the image-domain case. The seminal work of [31] introduces the concept of the Effective Receptive Field (ERF) and points out that the ERF is not significantly improved by increasing the number of layers. Some recent works [15], [29] argue that a large ERF can be effectively achieved by employing convolutions with larger kernels. Moreover, [9], [18], [30] also point out that a larger receptive field contributes to enhancing the capability of point cloud detectors. To balance efficiency and performance, we adopt the ConvNeXt block [29] with a 7×7 depth-wise convolution layer.
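A block of the kind described above can be sketched in PyTorch as follows. This is our own minimal sketch of a ConvNeXt-style block with a 7×7 depth-wise convolution; the normalization choice (BatchNorm) and the 4× expansion ratio are assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7x7 depth-wise conv (large receptive field) followed by two 1x1
    point-wise convs, with a residual connection."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # groups=dim makes the 7x7 conv depth-wise; padding=3 keeps resolution
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.BatchNorm2d(dim)          # assumption: BN for simplicity
        self.pw1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expansion, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))
```

Because the depth-wise 7×7 conv processes each channel independently and the 1×1 convs carry the channel mixing, the large kernel adds little compute relative to a dense 7×7 convolution.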
Removing downsampling in the first stage. Considering the information redundancy of images, image backbones usually perform downsampling operations, e.g., convolution layers with stride 2, to reduce the overall computational cost. Raw point cloud data, however, is sparse and irregular while containing rich detailed information. These properties pose a challenge: premature or excessive down-sampling may lead to the loss of critical information [35], [38]. This underscores the importance of design strategies that meticulously maintain these details while efficiently handling point clouds. Given these considerations, our backbone design removes the stem and refrains from down-sampling in the first-stage block. This strategic choice preserves the original resolution of the input features.
More blocks in early stages. Backbone designs in the image domain [17], [22], [29], [36] tend to employ more blocks in later stages to extract high-level features for classification. For example, ResNet-101 [22] for image classification uses 23 residual blocks in the 3rd stage while the 1st stage uses 3 blocks for feature extraction. Since point clouds are sparse and irregular, only after multiple convolutions can empty pillar features be filled by features diffused from occupied pillars. Different from the practice in the image domain, more convolution layers should therefore be stacked in the early stages. In our practice, we stack more blocks in the early stages to fully refine fine-grained point cloud features. Our extensive experiments suggest that increasing the number of blocks in early stages yields superior gains compared to adding more blocks in later stages.
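The diffusion argument above can be illustrated numerically. This toy 1D sketch of our own shows that a single occupied pillar spreads its feature by kernel_size // 2 cells per convolution layer, so several early layers are needed before sparse features cover their neighborhoods.

```python
import numpy as np

# One occupied pillar in a row of 21 empty ones.
feat = np.zeros(21)
feat[10] = 1.0

# Each 7-tap convolution diffuses the feature 3 cells further per side.
kernel = np.ones(7)
for _ in range(2):
    feat = np.convolve(feat, kernel, mode="same")

# After 2 layers, the single occupied cell at index 10 covers cells 4..16.
support = np.flatnonzero(feat)
print(support.min(), support.max())
```

With only one block per stage, most early-stage cells would still carry no information by the time the map is downsampled, which matches the observed benefit of stacking more early blocks.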
More depth stage. Through our analysis of point cloud scenarios, we observe a large variation in object sizes. For example, when the pillar size is set to 0.2m, the maximum perceivable range of one feature point is 1.6m after 8× downsampling. However, many objects in real-world scenarios exceed this limited perceivable range, meaning that feature points after 8× downsampling cannot fully perceive large objects. To tackle this issue, PillarNeXt [25] employs a complex ASPP [5] module to achieve large receptive fields. Different from it, we adopt a simpler way and add one more stage (named stage-5) on top of stage-4, which contains only one or two ConvNeXt blocks. The block number of stage-5 can be scaled up based on the model size. The output of the added stage-5 serves as one of the multi-scale inputs of the neck network.
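The range arithmetic above is simple to verify: the metric extent of one feature-map cell is the pillar size times the cumulative downsampling stride (a sketch of our own; the helper name is illustrative).

```python
def cell_extent_m(pillar_size_m: float, stride: int) -> float:
    """Metric extent covered by one feature-map cell after downsampling."""
    return pillar_size_m * stride

# With 0.2 m pillars, an 8x-downsampled cell spans only 1.6 m,
# while the extra stage-5 (16x) doubles that to 3.2 m.
print(cell_extent_m(0.2, 8), cell_extent_m(0.2, 16))
```

This is why a bus or trailer several meters long cannot be covered by a single 8× feature cell, while the added 16× stage halves the number of cells an object spans.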

C. Backbone Scaling
In this section, we design a series of 2D backbones, ranging from the lightweight PillarNeSt-Tiny to the high-performance PillarNeSt-Large, for different demands. Building upon the design principles in Section III-B, we propose our PillarNeSt-Tiny, -Small, -Base, and -Large models. As shown in Table I, all model versions share a similar architecture. Each model consists of five stages, with stage-1 keeping the feature map size without downsampling. For stage-2, stage-3, stage-4, and stage-5, the features are downsampled by a factor of two. Each ConvNeXt block contains a depth-wise convolution layer with a kernel size of 7, followed by two 1 × 1 convolution layers. The architecture hyper-parameters of these model variants are listed in Table I, where C_in is the input channel number of the first stage. The model size and theoretical computational complexity (FLOPs) are also listed in Table I.

D. Backbone Pre-training
PillarNeSt aims to integrate the knowledge from backbone pretraining on large-scale image datasets, e.g., ImageNet [12]. Since our backbone design is based on the ConvNeXt [29] architecture, we wonder whether it is possible to utilize a ConvNeXt pre-trained on ImageNet for initialization. However, as mentioned above, the block number of each stage and the channel numbers in PillarNeSt differ from the original ConvNeXt design. To resolve this problem, we introduce a weight initialization adaptation from two perspectives: the stage view and the micro view. (i) From the stage view, we simply copy the weights from the pretrained ConvNeXt model for stage-1∼4, while our added last stage (stage-5) is randomly initialized. For stage-1∼4, if the block number is smaller than that of ConvNeXt, we only copy the parameters of the corresponding blocks according to the block identity. (ii) From the micro view, we replicate the parameters from the pretrained model for the first C_in channels (or blocks), and the parameters of the remaining channels (or blocks) are randomly initialized.
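The micro-view adaptation can be sketched as follows (our own illustration; the function name and the random-init scale are assumptions): the leading channels that exist in both tensors are copied from the pretrained weight, and the rest are randomly initialized.

```python
import numpy as np

def adapt_weight(pretrained: np.ndarray, target_shape: tuple,
                 rng: np.random.Generator) -> np.ndarray:
    """Copy the overlapping leading channels of a pretrained tensor into a
    (possibly larger) target tensor; remaining entries stay randomly
    initialized."""
    out = 0.02 * rng.standard_normal(target_shape)
    overlap = tuple(slice(0, min(p, t))
                    for p, t in zip(pretrained.shape, target_shape))
    out[overlap] = pretrained[overlap]
    return out
```

The stage view works analogously at block granularity: blocks present in both models copy their weights by block index, and extra blocks (e.g., the whole stage-5) stay randomly initialized.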

E. Discussion
Our work is somewhat similar to PillarNet [35], FastPillars [51], and PillarNeXt [25] from the perspective of backbone scaling. However, there are many differences between our PillarNeSt and them. PillarNet simply employs randomly initialized VGG [36], ResNet-18 [22], and ResNet-34 as the 2D backbone. It fails to scale the backbone from a unified architecture and to enjoy the benefit of image pretraining. FastPillars applies a structural re-parameterization technique [16] to VGG and ResNet-34 for fast inference. PillarNeXt is the closest work to ours. However, it does not exhaustively design the backbones according to the characteristics of point clouds, lacking the large kernel and more blocks in early stages. Our PillarNeSt aims to explore the effectiveness of both backbone scaling and pretraining for pillar-based 3D object detection. With its specific design for point clouds, it shows that detection performance improves with the scaling-up of the 2D backbone. Moreover, it is the first work to show that point cloud based detectors can directly benefit from 2D image pretraining.
IV. EXPERIMENT
nuScenes Dataset. nuScenes [3] is a sizable and frequently used dataset. It consists of multi-modal data obtained from 1000 scenes, including RGB images from six surround-view cameras, radar points from five different radars, and point cloud data from one LiDAR. It is split into 700/150/150 scenes for training/validation/testing. There are eleven categories and 1.4 million annotated 3D bounding boxes in total. We assess performance using the mean Average Precision (mAP) and the nuScenes Detection Score (NDS). The NDS offers a comprehensive measure by encompassing various aspects of detection capability.
Argoverse2 Dataset. Argoverse2 (AV2) [43] is a large-scale dataset for long-range perception. It contains 1000 scenes in total: 700 for training, 150 for validation, and 150 for testing. AV2 also adopts a composite metric called the Composite Detection Score (CDS) in addition to the commonly used mAP. Similar to the NDS of nuScenes, CDS takes both mAP and other localization errors into account. The perception range in AV2 reaches 200 meters (total area 400m × 400m), which is much larger than nuScenes. We conduct experiments on the AV2 dataset to validate the long-range performance of the proposed method.
Implementation Details. PillarNeSt models are trained using the Adam optimizer and the one-cycle learning rate schedule within the mmdetection3d [11] framework. All ablation experiments are performed on 8 V100 GPUs, while the other experiments are conducted on 8 A100 GPUs. λ1, λ2, λ3 are set to 1.0, 1.0, and 0.25, respectively. Furthermore, during training, we use the ground-truth copy-paste data augmentation from [48] and disable it for the final 5 epochs, following [39].
For nuScenes, the one-cycle learning rate policy with an initial learning rate of 1e-3 is used. We use an AdamW optimizer with a weight decay of 0.01 and momentum ranging from 0.85 to 0.95. In the horizontal plane, we set the point cloud range to [-54m, 54m], and in the vertical direction to [-5m, 3m]. For PillarNeSt, we set the pillar size to [0.15m, 0.15m, 8m]. Moreover, following CenterPoint, we employ class-agnostic NMS as post-processing during inference, with the score threshold set to 0.2 and the rectification factor α set to 0.5 for the ten classes. For AV2, we use an Adam optimizer with 0.01 weight decay and a base learning rate of 1e-4. The point cloud range is set to [-153.6m, 153.6m] for both the X-axis and Y-axis, and [-5m, 5m] for the Z-axis. The post-processing is the same as for nuScenes. Note that the results are evaluated in the range [-200m, 200m] in the horizontal plane.
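The rectification factor α above enters the score reweighting at inference. A minimal sketch of our own, assuming the common IoU-rectification form Ŝ = S^(1−α) · IoU^α used by prior IoU-aware detection heads (the clipping of negative predicted IoU is also our assumption):

```python
def rectify_score(cls_score: float, iou_score: float,
                  alpha: float = 0.5) -> float:
    """Reweight the classification score with the predicted IoU score."""
    iou_score = max(iou_score, 0.0)  # predicted IoU may be slightly negative
    return cls_score ** (1.0 - alpha) * iou_score ** alpha

# e.g. a box with cls_score 0.64 and predicted IoU 0.81:
print(rectify_score(0.64, 0.81))
```

With α = 0.5 the rectified score is the geometric mean of the classification and IoU scores, so well-localized boxes are promoted before the class-agnostic NMS.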

A. Overall Results
1) Strong Baseline: In Section III-A, we improve the CenterPoint-Pillars baseline to a strong version; Table V reports the results. As shown in Table II, our PillarNeSt-Large model achieves 70.4% NDS and 64.3% mAP, outperforming all previous methods. Specifically, our PillarNeSt-Large model surpasses the previously advanced pillar-based method, PillarNeXt-B, by +1.6% NDS and +1.8% mAP. Notably, our model also exceeds all voxel-based methods, such as LargeKernel3D [9] and LinK [30]. We also showcase our results on the nuScenes test set. As shown in Table III, our approach still achieves state-of-the-art (SOTA) performance compared to all point cloud 3D object detection methods. Specifically, our Base model achieves an NDS of 71.3% and an mAP of 65.6%, while our Large model achieves 71.6% NDS and 66.9% mAP.
3) Comparison with State-of-the-Art on Argoverse2: We also evaluate the proposed method on the val set of the long-range Argoverse2 dataset in Table IV. Due to the huge cost caused by the long range, we only report results using the PillarNeSt-Base backbone. PillarNeSt-Base outperforms the existing LiDAR-based detection methods [10] by +5.4 mAP and +5.6 CDS on the average metrics. The per-class evaluation results are also listed, showing that PillarNeSt has a great advantage in detecting larger objects (i.e., Vehicle, Bus, V-Trailer, etc.). These improvements benefit from the large receptive field of PillarNeSt.

B. Ablation Study
In this section, we explore the effectiveness of our backbone design principles and scaling rules on the nuScenes val set. For our baseline, the block number of the four stages is set to [1, 1, 1, 1] and the channel number of the first stage is 48.
Pre-training of backbone. For PillarNeSt, we utilize the 2D convolutional blocks pre-trained on ImageNet for initialization. Here, we compare its performance with the randomly initialized counterpart.
Input channels of backbone. For pillar-based methods, the input channel quantity of the 2D backbone is equivalent to the number of output channels from the pillar encoder. Table VIII presents an ablation study on the effect of varying the input channels of the 2D backbone. It indicates that models with more backbone input channels yield much better performance. However, increasing the number of input channels introduces much more computational cost. These results enable us to scale up the backbone size with respect to the number of input channels.

V. CONCLUSION
In this paper, we present PillarNeSt, a series of pillar-based 3D object detectors, by exploring 2D backbone scaling and pretraining. Based on the ConvNeXt [29] block, we clearly propose the design rules and scaling principles for the pillar-based 2D backbone. The marked effectiveness of backbone scaling and pretraining exhibited in this paper provides a promising direction for future work on backbone design. The core of our approach revolves around improvements in the backbone; hence, future modifications to the neck and head parts have the potential to further boost the performance of point cloud 3D detectors.
Limitation: Though we present the effectiveness of scaling and pretraining of the pillar-based backbone, our method also has some limitations. As shown in Table VII, removing downsampling in the early stage greatly improves performance while introducing much more computational cost. How to handle high-resolution pseudo-image representations at low computational cost is worth exploring in the near future. Moreover, generative pretraining strategies, such as masked image modeling [2], [21], can be further employed to reconstruct the 2D pseudo image from the pillar encoder for high-quality representations, producing good backbone initialization for point cloud 3D object detection.

Fig. 2. The overall architecture of PillarNeSt (PillarNeSt-Base). Our model takes raw point clouds as input and employs a pillar encoder, comprised of MLPs and max & avg pooling, to transform the 3D data into a pseudo-image representation. It is then fed into a well-designed 2D dense ConvNet, pretrained on a large-scale image dataset. Multi-scale features of the last three stages are injected into the neck to produce fused features, and center heads are applied for 3D box regression.

Fig. 4. Impact of varying block quantities in the four stages of the backbone on the final results. Each subfigure corresponds to a different stage of the backbone. Subfigures (a) ∼ (d) show the effect of changing the number of blocks in Stages 1 ∼ 4, respectively. As can be seen, different stages show varying sensitivity to the number of blocks utilized.

TABLE I
DETAILED ARCHITECTURES FOR PILLARNEST SERIES BACKBONES.

TABLE III
STATE-OF-THE-ART COMPARISONS ON THE NUSCENES TEST SET. WE SHOW NDS AND MAP FOR EACH CLASS. ABBREVIATIONS: CONSTRUCTION VEHICLE (CV), PEDESTRIAN (PED), MOTORCYCLE (MOTOR), BICYCLE (BC), AND TRAFFIC CONE (TC). ALL MODELS DO NOT USE ANY TTA OR MODEL ENSEMBLE.

TABLE IV
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE ARGOVERSE2 VALIDATION SPLIT. †: PROVIDED BY AUTHORS OF THE AV2 DATASET. ‡: REIMPLEMENTED BY US. *: REIMPLEMENTED BY FSD [19]. §: THE PROPOSED METHOD WITH SMALLER PILLARS. THE AVERAGE RESULTS CONSIDER ALL CATEGORIES.

TABLE V
STRONG BASELINE ON THE NUSCENES VAL SET. †: REPRODUCED RESULTS BASED ON THE OFFICIAL CODEBASE.

TABLE VI
ABLATION STUDY OF ADDING MORE DEPTH STAGE. THE SECOND ROW ADDS ONE MORE STAGE WITH 2× DOWNSAMPLING.

As shown in Table V, our strong baseline achieves 54.5% mAP and 63.5% NDS, compared to the original baseline with 49.7% mAP and 60.1% NDS. It should be noted that the great improvement on mAOE mainly comes from the introduced IoU branch in the CenterPoint head, while the fade strategy contributes a lot to improving the mAP and NDS metrics.
2) Comparison with State-of-the-Art on nuScenes: For a fair comparison, we evaluate all our PillarNeSt models against previously published LiDAR-only methods on the nuScenes validation and test sets.

TABLE VII
ABLATION STUDY OF STAGE-1 DOWNSAMPLING. THE EXPERIMENT IN THE SECOND ROW USES DOWNSAMPLING IN STAGE-1.

The experiments show that increasing the block number of the early stages leads to great improvements in performance, while for the later stages, increasing the number of blocks brings almost no improvement.
More depth stage. In Section III-B, we analyze the necessity of adding more depth stages to cover the perception range of large objects. This analysis is validated by the results presented in Table VI: adding one 16× stage, including down-sampling and a ConvNet block, improves performance by 2.0% mAP and 1.3% NDS, while only introducing marginal computational cost.
Removing downsampling in the first stage. As mentioned previously, premature down-sampling may lead to the loss of critical information. To verify the effect of removing downsampling in the early stage, we conduct an experiment that keeps the downsampling operation in the first stage (see Table VII). Removing downsampling improves performance by 7.0% mAP and 5.4% NDS, though a large computational cost is introduced.