Enhancing Grid-Based 3D Object Detection in Autonomous Driving With Improved Dimensionality Reduction

Point cloud object detection is a pivotal technology in autonomous driving and robotics. Currently, the majority of cutting-edge point cloud detectors utilize Bird’s Eye View (BEV) for detection, as it allows them to take advantage of well-explored 2D detection techniques. Nevertheless, dimensionality reduction of features from 3D space to BEV space unavoidably leads to information loss, and there is a lack of research on this issue. Existing methods typically obtain BEV features by collapsing voxel or point features along the height dimension via a pooling operation or convolution, resulting in a significant decrease in geometric information. To tackle this problem, we present a new point cloud backbone network for grid-based object detection, MDRNet, which is based on adaptive dimensionality reduction and multi-level spatial residual strategies. In MDRNet, the Spatial-aware Dimensionality Reduction (SDR) is designed to dynamically concentrate on the essential components of the object during 3D-to-BEV transformation. Moreover, the Multi-level Spatial Residuals (MSR) strategy is proposed to effectively fuse multi-level spatial information in BEV feature maps. Our MDRNet can be employed on any existing grid-based object detector, resulting in a remarkable improvement in performance. Numerous experiments conducted on nuScenes, KITTI and DAIR-V have shown that MDRNet surpasses existing SOTA approaches. In particular, on the nuScenes dataset, we attained an impressive 7.2% mAP and 5.0% NDS enhancement compared with CenterPoint.


I. INTRODUCTION
3D object detection has a broad range of uses in the fields of autonomous driving, driver assistance systems, intelligent transportation and robotics, with the aim of locating and classifying objects in a 3D space. In comparison to 2D object detection, which merely offers pixel-level object positions and sizes in the image plane, 3D object detection enables autonomous vehicles to perceive object properties in the real world, including 3D locations, absolute dimensions, and orientation. Given the importance of precise object localization and recognition in autonomous driving systems for the safety of passengers and pedestrians, developing high-performance methods for 3D object detection is essential, and remains a formidable undertaking in the field of computer vision.
Considerable effort has been expended on utilizing neural networks to process LiDAR point cloud data for 3D perception. A major challenge in this task is the learning of various object properties from sparse and unstructured point clouds. To address this issue, many approaches [4], [5] have adopted networks composed of multilayer perceptrons (e.g. PointNet-like [24], [25] networks) and point grouping operations to extract 3D object properties, leading to remarkable progress. However, the point sampling and clustering processes involved in these methods are computationally intensive, which makes them impractical for large-scale autonomous driving scenes. To achieve efficiency, most current advanced approaches propose to utilize the grid-based representation and execute 3D object detection in the BEV space. Grid-based methods necessitate a voxelization process to transform unstructured point clouds into 3D voxels or 2D pillars, and can thus be separated into voxel-based [1], [2], [7], [13], [14], [15], [16], [20], [21], [23] and pillar-based [3], [8], [9], [10], [18], [22] approaches. After voxelization, the voxel-based approaches employ a 3D voxel backbone to encode voxel features, which are then flattened along the vertical axis to acquire 2D BEV features. Most methods [7], [13], [14], [16] maintain the core architecture of SECOND [2] and then introduce novel detection heads based on it. Alternatively, some approaches [15], [20], [21], [23] substitute the base modules in the SECOND [2] backbone network with novel modules (e.g. transformer modules and large kernel convolution modules). Pillar-based methods [3], [8], [9], [10], [18], [22] directly obtain 2D BEV features from 3D point clouds through Multilayer Perceptrons (MLPs) and a pooling operation in the voxelization process, followed by a 2D convolutional network for feature extraction. The transformation of point clouds into bird's-eye view (BEV) features can cause information loss. To address this issue, some approaches [9], [18] have suggested utilizing a dedicated network for the extraction of features from point clouds before the transformation. Nevertheless, no existing approach concentrates on the operations employed for transforming 3D voxel or point features into BEV features.
Transforming 3D voxel or point features into BEV features is a process of reducing the dimensionality of features, i.e., feature dimensionality reduction. This process brings efficiency, yet it unavoidably results in a significant decrease in geometric information. Current grid-based approaches employ pooling operations or convolution with static kernels for feature dimensionality reduction, resulting in a substantial loss of geometric information in BEV features, thereby diminishing the precision of object detection, particularly for categories with diverse sizes and structures that necessitate adaptive feature extraction. Consequently, grid-based methods are not able to precisely identify and localize objects with complex and diverse geometries, such as people, motorcycles, and bicycles. This is mainly attributed to two factors.
First, the use of fixed pooling operations and static convolution kernels in the 3D-to-BEV dimensionality reduction process fails to capture spatial information adaptively according to the object structures, resulting in difficulty in recognizing objects in the BEV space. Second, by downsampling the voxels, the network tends to concentrate on hierarchical semantic information, while losing the sparse geometric information which is essential for recognition and localization.
To address these issues, we propose an innovative backbone network, MDRNet, for grid-based 3D object detection, which is capable of adaptively incorporating 3D geometric information into the BEV space. The network is a dual-branch structure comprising a lightweight voxel branch and a BEV branch. To improve the capacity of BEV features in retaining 3D geometric information, we introduce two new modules, namely Spatial-aware Dimensionality Reduction (SDR) and Multi-level Spatial Residuals (MSR). Specifically, SDR estimates the spatial distribution of significant features and dynamically aggregates voxel features along the height dimension, thus enabling the dimensionality reduction process to retain essential geometric information adaptively. MSR, on the other hand, fuses voxel features and BEV features at each stage to bolster the multi-level 3D geometric information of the BEV branch, thereby minimizing the loss of sparse structural information. Concretely, the initial features of the BEV branch are obtained from the voxel features via SDR, and at each subsequent stage, the voxel features are combined with the BEV features through MSR, as depicted in Fig. 1(c).
The proposed backbone can be an easy replacement for the backbone network of existing grid-based point cloud object detectors. To validate the efficacy of MDRNet, we employed it in existing 3D object detection frameworks [7], [13]. The results show a dramatic enhancement in performance on nuScenes [26], KITTI [27] and DAIR-V [28] without incurring additional time costs.
To summarize, we provide critical insights into the proposed 3D detection method:
• We design a universal backbone, named MDRNet, which can readily be used with any grid-based point cloud detector to enrich the features obtained from dimensionality reduction. The backbone estimates the spatial distribution for dynamic feature aggregation and enables multi-level 3D-BEV connections without incurring an additional computational burden.
• We propose two novel modules: Spatial-aware Dimensionality Reduction (SDR) and Multi-level Spatial Residuals (MSR). With the combination of the two modules, the geometric information of point clouds can be successfully retained and aggregated during the dimensionality reduction process, which significantly boosts the 3D detection performance.
• Numerous experiments have shown that MDRNet surpasses multiple solid baselines and attains SOTA results on point cloud object detection tasks. In particular, on the nuScenes dataset, our method achieves a 7.2% mAP and 5.0% NDS improvement over CenterPoint [13].

II. RELATED WORK
Grid-based 3D object detection approaches have gained immense popularity because of their effectiveness and precision. In the following, we briefly review the two predominant grid-based paradigms: voxel-based and pillar-based. Furthermore, we examine the dimensionality reduction operations employed by these methods, as depicted in Table 1.

A. VOXEL-BASED METHODS
The voxel-based approaches [1], [2], [7], [13], [15], [16], [20], [21] initially split point clouds into 3D voxels through voxelization, owing to the unstructured nature of point clouds and the varying point density. To be more specific, the voxelization process is a maximal or average pooling operation of the points in each voxel, thus producing voxel features. VoxelNet [1] conducted one of the pioneering studies of end-to-end 3D detection; it initially employs a stacked set of Multi-Layer Perceptrons (MLPs) to encode the correlation between points and voxels as point-wise features, then further utilizes voxelization and 3D convolutional layers to obtain BEV features. In VoxelNet, however, each convolutional middle layer applies 3D convolution with high computational cost, which makes it challenging to use in real-time applications. SECOND [2] introduces 3D sparse convolution for acceleration and performance improvement. It extracts voxel features using a backbone network composed of 3D sparse convolutions, and then concatenates the voxel features along the height dimension, followed by 2D convolution layers to obtain dense BEV features. On the basis of SECOND [2], CenterPoint [13] proposes to utilize the 2D detector CenterNet [31] to process BEV features, thereby efficiently achieving 3D object detection by regressing the object's height, three-dimensional size and yaw angle in the BEV space. Building upon SECOND [2], PV-RCNN [7] introduces the Voxel Set Abstraction module to sample multi-scale voxel features in the proposal boxes, thus forming a two-stage detector for further refinement. VISTA [20] carries out feature dimensionality reduction of voxel features in two directions to generate BEV features and range view (RV) features, and then employs a dedicated multi-view transformer to merge features from distinct perspectives. FocalsConv [21] extends the submanifold sparse convolution by learning the spatial density of voxels, estimating the probability that a voxel is empty and thereby making the voxels of foreground objects more concentrated. Nevertheless, as depicted in Table 1, these methods [2], [7], [13], [14], [15], [16], [20], [21], [23] all employ a height compression operation (i.e., flattening features along the height dimension) and static 2D convolution to realize the reduction from 3D to the 2D BEV space, and thus cannot flexibly retain the geometric information of the scene. In contrast, our proposed method is capable of adaptively capturing the geometric and multi-scale information of objects, and efficiently encoding it into the BEV features.

B. PILLAR-BASED METHODS
Compared with the voxel-based approaches, the pillar-based approaches [3], [8], [9], [17], [19], [22] aim to reduce the time consumption during inference. These methods adjust the grid height to be equivalent to the height of the 3D space during point cloud voxelization, thereby directly transforming the point cloud from its 3D shape to a 2D form in the BEV space. PointPillars [3] is the pioneering method that adopts the pillar representation; it employs PointNets [24] to encode point features and subsequently applies a max/mean pooling operation to transform the point features into a pseudo-image in the bird's eye view, thus requiring only 2D convolution. Pillar-OD [8] initially utilizes point grouping to generate features for the cylindrical view and the BEV view, which are then scattered to each point, and subsequently obtains BEV features through pooling operations. InfoFocus [10] adds a second-stage attention network to PointPillars [3] for fine-grained proposal refinement. MuRF-Net [9] introduces dilated operations in the voxelization process to acquire BEV features with varying receptive fields, followed by channel-wise attention for fusion. CVFNet [18] first projects the point clouds onto the range view to extract point-wise features through a 2D convolutional network and then voxelizes them to obtain BEV features, thereby enabling the 3D detector to capture information from various perspectives. PillarNet [22], a modified version of CenterPoint-pillar [13], introduces the 2D sparse convolution of a ResNet18 structure into the backbone for BEV feature extraction. Experiments have demonstrated that, after sufficient 2D sparse convolutional extraction, a pillar-based network can attain a level of accuracy comparable to that of voxel-based approaches. Nevertheless, pillar-based methods struggle to break the performance bottleneck of 3D object detection, as they rely on pooling operations for 3D-to-BEV dimensionality reduction. The straightforward pooling operation cannot preserve the geometric information of the point cloud on the BEV feature maps, so a great deal of 3D geometric information is lost. To help the BEV features capture the 3D geometric information of point clouds, we propose Spatial-aware Dimensionality Reduction and Multi-level Spatial Residuals, thus enabling BEV features to contain more abundant point cloud geometric information and multi-scale 3D information.

III. PROPOSED METHOD
A. FRAMEWORK OVERVIEW
The overall framework of MDRNet is shown in Fig.1(c), built upon the proposed Spatial-aware Dimensionality Reduction (SDR) and Multi-level Spatial Residuals (MSR). The SDR module is designed to adaptively focus on the geometric structure of the point cloud and dynamically retain essential spatial information that is beneficial for detection. The MSR strategy preserves multi-scale geometric information in the BEV space, thus minimizing the information loss due to the dimensionality reduction process. Different from the voxel-based (Fig.1(a)) and pillar-based (Fig.1(b)) backbone networks, we use a dual-branch network structure, including a BEV branch (pillar branch) and a voxel branch. Similar to VoxelNet [1], we first assign the input point clouds into small voxel grids of the same size in the 3D space. For the points located in a voxel, we compute the positional encodings as the coordinate differences between the 3D points and the center of the voxel. These positional encodings are then combined with the point coordinates and fed into an MLP. Subsequently, a max pooling operation is conducted on the features within each voxel to generate the initial voxel features. Second, we feed the initial voxel features into the SDR module for adaptive feature dimensionality reduction to generate the initial BEV features. Meanwhile, the initial voxel features are processed through a simplified 3D sparse convolutional network to obtain multi-scale voxel features. Next, we feed the initial BEV features together with the multi-scale voxel features into the MSR module for feature fusion, thus enabling the 2D BEV features to retain geometric information at various scales. Finally, the output BEV features of the MSR module are fed into a region proposal network for 3D object detection.
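For concreteness, the following minimal PyTorch sketch illustrates the initial voxel feature encoding just described (positional encodings relative to the voxel center, a shared MLP, and max pooling within each voxel). It assumes points have already been grouped into dense per-voxel tensors; the class name `VoxelFeatureEncoder` and the tensor layout are our own illustrative choices, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class VoxelFeatureEncoder(nn.Module):
    """Minimal sketch of the initial voxel encoding described above.

    Assumes points are already grouped into voxels, giving a dense tensor
    of shape (num_voxels V, max_points P, 3) plus a boolean validity mask,
    and that every voxel contains at least one valid point.
    """

    def __init__(self, out_channels: int = 16):
        super().__init__()
        # Input per point: xyz (3) + positional encoding w.r.t. voxel center (3).
        self.mlp = nn.Sequential(
            nn.Linear(6, out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, points: torch.Tensor, mask: torch.Tensor,
                voxel_centers: torch.Tensor) -> torch.Tensor:
        # Positional encoding: offset of each point from its voxel center.
        pos_enc = points - voxel_centers.unsqueeze(1)           # (V, P, 3)
        feats = self.mlp(torch.cat([points, pos_enc], dim=-1))  # (V, P, C)
        # Mask out padded slots before max pooling over the points in a voxel.
        feats = feats.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return feats.max(dim=1).values                          # (V, C)
```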

B. DIMENSIONALITY REDUCTION
In this section, we first review the previous feature reduction operations in III-B1. We then describe our proposed Spatial-aware Dimensionality Reduction (SDR). For simplicity, we denote the Z-axis as the height dimension.

1) PRELIMINARIES
In the BEV representation of point clouds, there is no need to consider object occlusion, and an efficient 2D convolutional network can be leveraged. Existing voxel- or pillar-based methods project 3D features to a bird's-eye view through dimensionality reduction. Pillar-based methods [3], [22] directly pool the points along the height dimension to obtain the BEV representation, as displayed in Fig.2(a). The voxel-based methods mainly use the SECOND [2] architecture to extract 3D features and concatenate these features along the Z-dimension, followed by 2D convolutions to reduce the feature dimensionality, as shown in Fig.2(b). Clearly, the scheme in Fig.2(b) computes many empty voxels, which can be replaced by a 3D sparse convolution with the kernel size equal to the Z-dimensional range, as in Fig.2(c). Given an input feature $x^{3D}_{i,j,k}$ in the spatial space, the above dimensionality reduction processes can be expressed uniformly as

$$x^{BEV}_{i,j} = \sum_{k=1}^{Z_{i,j}} w_{i,j,k} \, x^{3D}_{i,j,k}, \tag{1}$$

where $i, j, k$ are the coordinates along the X, Y, Z axes, $Z_{i,j}$ is the number of features whose X, Y coordinates equal $\langle i, j \rangle$, $w_{i,j,k}$ is the weight of each feature, and $x^{BEV}_{i,j}$ is the output BEV feature at position $\langle i, j \rangle$. When using mean pooling for dimensionality reduction, $w_{i,j,k}$ equals $1/Z_{i,j}$. When using max pooling, $w_{i,j,k}$ is a binary weight that is 1 only when $x^{3D}_{i,j,k}$ is the maximum along the Z-axis. When using convolution, $w_{i,j,k}$ is a static learnable parameter optimized during training, and its value is fixed once training is over.
All three dimensionality reduction operations cause a loss of spatial information. (i) Mean pooling simply averages the spatial features at the same $\langle i, j \rangle$ position without using the semantic information. (ii) Max pooling only retains the maximum feature value along the Z-axis, discarding much relevant information. (iii) The convolution operation assigns different weights to the features along the Z-axis, but the weights are identical for different objects. Due to the large variation in the geometric structure of different objects, using fixed weights without adaptive adjustment according to the object's geometric information obscures valuable information and limits the representation capability for point cloud data. Since 3D object detection should focus on different locations of different instances, feature encoding using the same weights for every instance along the Z-axis further reduces the object perception capability of the 3D object detector.
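To make the unified weighting view of Eq. (1) concrete, the sketch below expresses the three reduction operations on a dense voxel grid in PyTorch. Real detectors operate on sparse tensors, so this is a simplified illustration; the function and class names are ours.

```python
import torch
import torch.nn as nn

# Sketch of Eq. (1) on a dense voxel grid x3d of shape (B, C, Z, Y, X);
# the weighting logic is the same as in the sparse setting.

def reduce_mean(x3d: torch.Tensor) -> torch.Tensor:
    # w_{i,j,k} = 1 / Z_{i,j}: uniform weights along the height axis.
    return x3d.mean(dim=2)                      # (B, C, Y, X)

def reduce_max(x3d: torch.Tensor) -> torch.Tensor:
    # w_{i,j,k} is 1 only at the maximum along Z, and 0 elsewhere.
    return x3d.max(dim=2).values                # (B, C, Y, X)

class ConvReduce(nn.Module):
    # w_{i,j,k} is a static learnable kernel, fixed after training;
    # realized here as a 3D convolution spanning the full Z range.
    def __init__(self, channels: int, z_size: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=(z_size, 1, 1))

    def forward(self, x3d: torch.Tensor) -> torch.Tensor:
        return self.conv(x3d).squeeze(2)        # (B, C, Y, X)
```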

2) SPATIAL-AWARE DIMENSIONALITY REDUCTION (SDR)
Considering that categories like pedestrians, motorcycles, and bicycles have more complex geometries, we argue that not all parts of an object are of equal significance to point cloud object detection; a dynamic focus on the easily detectable parts of objects should be considered. In this subsection, we present a novel dimensionality reduction structure called Spatial-aware Dimensionality Reduction (SDR) that dynamically focuses on the easily detectable parts of the object and effectively preserves the geometric information. Compared to the previously described approaches, SDR explores the spatial correlation distribution and geometric semantic features along the Z-axis. Learning the spatial distribution enables the network to focus on the easily detectable geometric structure of the objects and to preserve the 3D information well during dimensionality reduction.
The proposed SDR is illustrated in Fig.2(d). We denote $\{w_{i,j,k} \mid k \in Z_{i,j}\}$ in Eq. (1) as the spatial correlation distribution of position $(i, j)$ along the Z-axis, which represents the geometric semantic representation capability of the voxel features. Due to the various geometric structures of objects, $w_{i,j,k}$ should be dynamically generated according to the geometric semantics of objects. Therefore, we first use a submanifold sparse convolution $F(\cdot)$ with a kernel size of three to encode the geometric information of the neighborhood of each voxel, with the output represented as $w(x^{3D}_{i,j,k}) = F(U(x^{3D}_{i,j,k}))$, where $U$ denotes the set of neighbors within a distance of 3. Then, a normalization function is used to transform $w$ into a probability distribution along the Z-axis, for which we consider several forms:

$$w^{ReLU}_{i,j,k} = \mathrm{ReLU}(w_{i,j,k}), \qquad w^{Sigmoid}_{i,j,k} = \mathrm{Sigmoid}(w_{i,j,k}), \qquad w^{Softmax}_{i,j,k} = \frac{\exp(w_{i,j,k})}{\sum_{k' \in Z_{i,j}} \exp(w_{i,j,k'})}.$$

In our ablation studies, we find that $w^{Softmax}$ achieves superior performance over the others. The proposed SDR is spatial-aware and instance-aware, and fully preserves the spatial geometric properties of the point cloud.
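The following is a minimal, dense-tensor sketch of SDR with the softmax normalization. A standard Conv3d stands in for the submanifold sparse convolution so the example stays self-contained; the class name and tensor layout are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDR(nn.Module):
    """Dense-tensor sketch of Spatial-aware Dimensionality Reduction.

    A dense Conv3d (kernel size 3) stands in for the submanifold sparse
    convolution used in the paper; empty voxels are simply zeros here.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Encodes neighborhood geometry and predicts one logit per voxel.
        self.weight_net = nn.Conv3d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x3d: torch.Tensor) -> torch.Tensor:
        # x3d: (B, C, Z, Y, X) dense voxel features.
        logits = self.weight_net(x3d)           # (B, 1, Z, Y, X)
        # Softmax along Z turns the logits into the spatial correlation
        # distribution w_{i,j,k} (the best-performing variant in Table 7).
        w = F.softmax(logits, dim=2)
        # Eq. (1): weighted sum along the height dimension.
        return (w * x3d).sum(dim=2)             # (B, C, Y, X)
```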
FIGURE 3. The ground-truth bounding boxes are shown in green, and the predicted bounding boxes in blue. Compared to CenterPoint [13], our approach dynamically focuses on the distinguishable regions of objects, making it easier to detect objects with complex geometries, such as bicycles. The third column visualizes the spatial distribution learned by SDR; the closer the color is to red, the more the network focuses on that area.

C. MULTI-LEVEL SPATIAL RESIDUALS (MSR)
With the dimensionality reduction operator, we propose a novel strategy named Multi-level Spatial Residuals (MSR) to preserve more geometric information at different resolutions, as shown in Fig. 1(c). At each level of resolution, we project the 3D voxel features onto the 2D BEV space and add them element-wise to the previous BEV features. In this manner, the 2D feature map is able to retain more 3D geometric information. The cross-dimensional connection between the 3D voxel features of stage $l+1$ and the BEV features of stage $l$ is defined by

$$x^{BEV}_{l+1} = \mathcal{F}\left(D(x^{3D}_{l+1}, W),\; x^{BEV}_{l}\right), \tag{2}$$

where $x^{3D}$ and $x^{BEV}$ are the 3D voxel features and 2D BEV features, respectively. $D(x^{3D}, W)$ can be any 3D-to-2D mapping function, such as the previously mentioned max/mean pooling, convolutional dimensionality reduction, or our SDR. The operation $\mathcal{F}$ fuses the dimensionality-reduced 3D features with the BEV features from the preceding stage.
To simplify and optimize the process, $\mathcal{F}$ is instantiated here as element-wise addition. The design of MSR retains multi-level 3D geometric information of the point cloud to a greater extent. In contrast to pillar-based (Fig.1(b)) and voxel-based (Fig.1(a)) methods, where the 3D information is retained only at the initial or final resolution, MSR is able to obtain 3D semantics at different receptive fields. The results in Table 8 indicate that MSR brings a significant improvement to existing 3D detectors without causing additional time consumption.
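A hedged sketch of one such cross-dimensional connection is shown below, reusing the SDR sketch from the previous subsection as the mapping $D$ and element-wise addition as $\mathcal{F}$; as before, it operates on dense tensors and all names are illustrative.

```python
import torch
import torch.nn as nn

class MSRFusion(nn.Module):
    """Sketch of one Multi-level Spatial Residual connection, Eq. (2).

    Reuses the SDR sketch above as D(x3d, W); any 3D-to-2D mapping
    (max/mean pooling, conv reduction) could be substituted.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = SDR(channels)  # D: 3D-to-2D mapping

    def forward(self, x3d: torch.Tensor, x_bev: torch.Tensor) -> torch.Tensor:
        # x3d:   (B, C, Z, Y, X) voxel features of stage l+1
        # x_bev: (B, C, Y, X) BEV features of stage l, already at the
        #        same Y-X resolution as x3d
        return self.reduce(x3d) + x_bev
```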

1) COMPARISON TO THE VOXEL SET ABSTRACTION MODULE (VSA) IN PV-RCNN [7]
First, the MSR module retains multi-scale geometric information without complex grouping and sampling operations, whereas VSA requires them. Second, VSA can only be applied to two-stage detectors, while MSR is applicable to both one-stage and two-stage detectors. In addition, when applied to PV-RCNN [7], MSR improves the accuracy of the proposals, while VSA does not. Therefore, MSR can improve the recall of proposals and the accuracy of the second-stage detection, thus significantly enhancing the performance of two-stage detectors [7], [16]. As shown in Table 4 and Table 5, PV-RCNN [7] achieves better performance using MDRNet as the backbone network.

D. ARCHITECTURE DETAILS
Similar to the previous backbone networks in 3D object detectors [2], [7], [13], [16], [21], our proposed backbone network consists of four stages. Different from the voxel-based (Fig.1(a)) and pillar-based (Fig.1(b)) backbone networks, we use a dual-branch network structure, including a voxel branch and a BEV branch (pillar branch), as presented in Fig.1(c). At the last layer of each stage, we perform element-wise addition to fuse the dimensionality-reduced geometric features with the BEV feature map. Considering the efficiency of the network, each stage of the voxel branch contains only one submanifold sparse convolution with a kernel size of 3 and one sparse convolution for down-sampling. The BEV branch consists of residual blocks [29], with {1, 2, 2, 2} blocks in the four stages, respectively. For SDR to better extract the geometric information, we change the initial sparse convolution kernel from 3 × 3 × 3 to 5 × 5 × 1. For the multi-modal version, we simply project the point cloud onto the image planes to obtain aligned image features extracted from the DLA34 [30] of a pretrained CenterNet [31], [32], and then fuse the LiDAR and image features by point-wise concatenation before feeding them into the proposed MDRNet. We validate the proposed MDRNet on existing SOTA 3D detectors [7], [13] by directly replacing the backbone network. For PV-RCNN [7], we directly replace its backbone with our MDRNet while keeping its VSA module.
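To tie the pieces together, the skeleton below mirrors the four-stage dual-branch data flow under the same dense-tensor simplifications as the earlier sketches (constant channel width, plain convolutions standing in for the sparse and residual blocks). It illustrates the data flow only, not the exact implementation.

```python
import torch
import torch.nn as nn

class MDRNetSketch(nn.Module):
    """Illustrative skeleton of the dual-branch backbone."""

    def __init__(self, c: int = 32, num_blocks=(1, 2, 2, 2)):
        super().__init__()
        self.sdr = SDR(c)  # produces the initial BEV features
        self.voxel_stages = nn.ModuleList()
        self.bev_stages = nn.ModuleList()
        self.msr = nn.ModuleList()
        for n in num_blocks:
            # Voxel branch: one 3x3x3 conv plus one stride-2 conv per stage
            # (dense stand-ins for the submanifold / sparse convolutions).
            self.voxel_stages.append(nn.Sequential(
                nn.Conv3d(c, c, 3, padding=1),
                nn.Conv3d(c, c, 3, stride=2, padding=1),
            ))
            # BEV branch: {1, 2, 2, 2} blocks per stage, first one strided
            # so the Y-X resolution tracks the voxel branch.
            self.bev_stages.append(nn.Sequential(*[
                nn.Conv2d(c, c, 3, stride=2 if i == 0 else 1, padding=1)
                for i in range(n)
            ]))
            self.msr.append(MSRFusion(c))

    def forward(self, x3d: torch.Tensor) -> torch.Tensor:
        # x3d: (B, C, Z, Y, X) initial voxel features
        x_bev = self.sdr(x3d)
        for vstage, bstage, fuse in zip(self.voxel_stages,
                                        self.bev_stages, self.msr):
            x3d = vstage(x3d)
            x_bev = fuse(x3d, bstage(x_bev))  # MSR at every stage
        return x_bev
```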

IV. EXPERIMENTS
A. DATASET AND TECHNICAL DETAILS
• nuScenes Dataset. The nuScenes [26] dataset is a large-scale autonomous driving dataset for 3D perception tasks, collected with a synchronized 32-beam LiDAR, 5 radars and 6 cameras providing full 360° coverage. It contains 1,000 driving sequences, of which 700 are for training, 150 for validation and 150 for testing. The 3D bounding box annotations of the nuScenes detection task include 10 object categories with a long-tailed distribution. For evaluation, the official metrics are the mean Average Precision (mAP) and the nuScenes detection score (NDS). Following previous work, 10 LiDAR scans are accumulated as network input and the results are reported using the official evaluation protocol.
• KITTI Dataset. The KITTI [27] dataset encompasses a total of 14,999 samples, with 7,481 for training and 7,518 for testing. Following previous works [2], [7], [21], [33], we split the annotated data into a train set of 3,712 samples and a val. set of 3,769 samples. The annotations include three categories (car, pedestrian and cyclist) that are split into three difficulty levels (Easy, Moderate and Hard). We evaluate models on the val. set using the 3D Average Precision (AP_3D) metric, calculated with 40 recall positions (R40). The performance of models is ranked based on the Moderate difficulty samples.
• DAIR-V Dataset. The DAIR-V [28] dataset is the vehicle-side set of the cooperative perception dataset DAIR-V2X [28], containing 9,322 training samples and 5,963 validation samples. We evaluate the performance of models using the same metrics as those of the KITTI [27] dataset.
• Implementation Details. Our work is built upon public projects [13], [22] as well as the open-sourced OpenPCDet [7], [21]. The training schedules are the same as those in previous works [13], [21], [22]. For the nuScenes dataset, models are trained with a batch size of 16 for 20 epochs on 4 V100 GPUs. The Adam optimizer is adopted with the one-cycle learning rate strategy and a momentum range of 0.85 to 0.95. The maximum learning rate is 1e-3 and the weight decay is 0.01. Following conventional settings, the Z-axis detection range is set to [−5m, 3m]. The X-axis and Y-axis detection ranges are set to [−51.2m, 51.2m] and [−54m, 54m] when the voxel sizes are 0.1m × 0.1m × 0.2m and 0.075m × 0.075m × 0.2m, respectively. For the KITTI [27] and DAIR-V [28] datasets, all networks are trained with a batch size of 4 for 80 epochs. The Adam optimizer is adopted with a weight decay of 1e-2 and a momentum of 0.9. The learning rate is set to 1e-2 and reduced using the cosine annealing strategy. The point cloud ranges of the X, Y and Z axes are clipped to [0m, 70.4m], [−40m, 40m] and [−3m, 1m], respectively. The initial voxel size is 0.05m × 0.05m × 0.1m. Following previous methods [13], [21], [22], data augmentations including random flipping, global scaling, global rotation and ground-truth sampling [2] are used to improve the accuracy of the 3D detectors. For the ground-truth sampling in the multi-modal setting, as in [21] and [34], we copy the corresponding 2D objects in bounding boxes onto images based on the objects' center distance. For the models used to submit results to the nuScenes test server, GT sampling is deactivated in the last four epochs, as done by [21], [23], and [34].
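As a rough illustration of the nuScenes schedule above, the snippet below wires up PyTorch's Adam optimizer and built-in one-cycle scheduler with the stated hyperparameters. The referenced projects use their own training loops and one-cycle implementations, so this is only an approximation; `steps_per_epoch` is a placeholder that depends on dataset size and batch size.

```python
import torch

# Approximate nuScenes schedule: Adam, one-cycle LR, max LR 1e-3,
# weight decay 0.01, momentum (beta1) cycled between 0.85 and 0.95.
model = MDRNetSketch(c=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

steps_per_epoch = 1000  # placeholder; depends on dataset / batch size
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,              # maximum learning rate from the text
    epochs=20,
    steps_per_epoch=steps_per_epoch,
    base_momentum=0.85,       # for Adam, OneCycleLR cycles beta1
    max_momentum=0.95,
)
# In the training loop: optimizer.step(); scheduler.step() per batch.
```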

B. OVERALL RESULTS
On nuScenes, we implement MDRNet in CenterPoint [13] and evaluate it against other SOTA methods through the online test server, including PointPillars [3], 3DSSD [5], HotSpotNet [11], CVCNET [12], PillarNet [22], VISTA [20], FocalsConv [21], PointPainting [35], FusionPainting [38], MVP [39] and PointAugmenting [34]. Table 2 presents the results obtained from the nuScenes online test server. MDRNet dramatically improves CenterPoint [13] to 65.2% mAP and 70.5% NDS. Moreover, without any test-time augmentation, the AP for the motorcycle and bicycle categories rises to 73.1% and 45.2%, respectively, increases of 19.4% and 16.5%. This is because the geometry of motorcycles and bicycles is more complex than that of other categories, and MDRNet retains 3D geometric information more effectively than other methods. Furthermore, the AP of other categories also grows considerably, with construction vehicles increasing by 8.2%, traffic cones by 6.2%, trailers by 5.7%, etc.
For the multi-modal setting, we employ PointAugmenting [34] as our baseline, which is built upon CenterPoint [13]. As illustrated in Table 3, MDRNet-F with a simple fusion mechanism outperforms other methods with complex fusion strategies. This is attributable to the fact that SDR and MSR reduce the loss of semantic and geometric information during dimensionality reduction. With test-time augmentations [13], MDRNet-F † further achieves 69.8% mAP and 73.5% NDS. Compared to FocalsConv [21], which also improves upon CenterPoint [13], the proposed MDRNet performs better in both LiDAR-only and multi-modal settings.

C. ABLATION STUDIES
• Improvements. We evaluate the impact of our method on existing SOTA detectors on the KITTI [27] val. set, DAIR-V [28] val. set and nuScenes [26] val. set, respectively. On KITTI [27], we take PV-RCNN [7] and CasA-PV [14] as strong baselines. Compared to PV-RCNN [7], Table 4 shows that our method achieves an appreciable improvement in the pedestrian category, boosting the AP_3D from 54.49% to 60.06%. Compared to CasA-PV [14], our approach yields significantly better performance. The proposed SDR and MSR modules substantially improve the AP_3D of the pedestrian and cyclist categories, by 6.49% and 2.03% respectively. This demonstrates that our proposed backbone network is capable of effectively capturing the geometric information of objects with complex geometric structures, thus enhancing the accuracy of 3D detection. As presented in Table 5, the comparison results on the DAIR-V [28] validation set reveal that our method significantly enhances the performance of the pedestrian and cyclist categories. Table 6 presents the comparison results on the nuScenes [26] val. set. Clearly, the proposed method significantly improves the performance of CenterPoint [13]. Notably, MDRNet using 10 cm voxels outperforms CenterPoint using 7.5 cm voxels. This is because SDR is capable of dynamically preserving the spatial attributes of the point clouds, thereby minimizing the impact of voxel size on the network.
• Dimensionality reduction operations. We conduct ablation experiments on the nuScenes [26] val. set to explore the design of dimensionality reduction operations for grid-based 3D detectors. The ablations consist of two parts: various dimensionality reduction operations and the forms of the spatial correlation distribution (i.e., $w^{ReLU}$, $w^{Sigmoid}$ and $w^{Softmax}$). Table 7 indicates that SDR-Softmax achieves the best performance among the four different dimensionality reduction operations, which we attribute to the fact that Z-axis feature aggregation using spatial correlation distribution estimation preserves more 3D geometric information.
• Ablations of SDR and MSR. We perform ablations using the same experimental setup for the SDR module and for the stages at which MSR is used. In Table 8, the first row displays the results of using a simple pillar backbone network with pooling voxelization. The second row shows the results of adding SDR to obtain the first-stage BEV features, and the following rows show the results of adding MSR at different stages. As the number of stages used increases, the performance is enhanced until all stages are involved. It can be noticed that Multi-level Spatial Residuals help strengthen the performance of the BEV branch, with the mAP and NDS metrics improving by +2.25% and +1.44%, respectively.

D. RUNTIME ANALYSIS
In Table 9, a comparison of the runtimes of MDRNet and the baseline method CenterPoint [13] is conducted to evaluate efficiency. The experiments are conducted on an RTX 3090 GPU and all methods use a grid size of 0.075m × 0.075m × 0.2m. On nuScenes [26], the inference speed of CenterPoint using MDRNet as the backbone is almost the same as that of the original CenterPoint, yet the performance shows a substantial improvement. This result demonstrates that our proposed SDR (Sec. III-B2) and MSR (Sec. III-C) can boost the performance of existing 3D detectors without increasing the inference time.

Fig.3 presents the visualization of an example from the nuScenes [26] val. set. The first column displays the GT boxes, and the second and third columns present the results for CenterPoint without and with our MDRNet, respectively. Compared to the original CenterPoint [13], MDRNet adaptively concentrates on the essential elements of objects, effectively enhancing the perception capability. In Fig.3(c), the points inside the bounding boxes are colored according to the predicted spatial distribution; the closer the color is to red, the more valuable the network considers the area. More visualization results from a bird's eye view are presented in Fig.4, where the bounding boxes in red indicate the ground truth and the bounding boxes in blue indicate the predictions. The first row presents the qualitative results of CenterPoint [13], indicating a higher miss rate in the truck category (as evidenced in the first and second columns), as well as lower detection accuracy for distant objects and the pedestrian category (as shown in the third and fourth columns). The second row presents the qualitative results of our method, which has superior detection performance on both the truck and pedestrian categories, and also detects distant objects more effectively. It is evident that our MDRNet can remarkably enhance the performance of 3D object detection.

V. CONCLUSION
In this paper, we design a universal point cloud backbone network called MDRNet, which can be used with any grid-based point cloud object detector to enrich 3D geometric information. First, in order to capture more geometric information, we estimate the spatial distribution of the voxelized point clouds and then perform adaptive feature dimensionality reduction along the Z-axis, thereby largely preserving the spatial information from the point clouds in the BEV space. Additionally, we introduce a multi-level spatial residuals strategy to fuse multi-scale voxel features with BEV features, thus allowing BEV features to access multi-scale spatial information. Experiments on the nuScenes [26], KITTI [27] and DAIR-V [28] datasets validate that our MDRNet dramatically improves the precision of 3D object detection at a similar inference time compared to the baseline methods. The proposed backbone is novel in two modules: Spatial-aware Dimensionality Reduction (SDR) and Multi-level Spatial Residuals (MSR). SDR performs adaptive feature aggregation along the height dimension by dynamically focusing on the valuable parts of objects, and MSR enriches the information of BEV features through multi-level 3D-BEV connections.
For the first time, we explore the effect of different dimensionality reduction operations on grid-based point cloud object detectors. Extensive experiments show that our MDRNet achieves top-notch results on nuScenes [26], KITTI [27] and DAIR-V [28]. In addition, we offer a fusion-based variant of MDRNet by simply performing point-wise fusion. Experiments on nuScenes demonstrate that MDRNet with simple point-wise fusion is superior to other baseline approaches with complex fusion strategies. However, we did not delve into which multi-modal fusion strategy is most suitable for MDRNet. In future work, we will investigate the issue of multi-modal fusion and integrate more fusion strategies, such as BEVFusion [40], [41], into MDRNet-F.