3D Object Detection With Multi-Frame RGB-Lidar Feature Alignment

Single-frame 3D detection is a well-studied vision problem with dedicated benchmarks and a large body of work, and this knowledge has translated to a myriad of real-world applications. However, frame-by-frame detection suffers from inconsistencies between independent frames, such as flickering bounding box shapes and occasional misdetections. Safety-critical applications may not tolerate these inconsistencies. For example, automated driving systems require robust and temporally consistent detection output for planning: a vehicle's 3D bounding box shape should not change dramatically across independent frames. Against this backdrop, we propose a multi-frame RGB-lidar feature alignment strategy to refine 3D detection outputs and increase their temporal consistency. Our main contribution is aligning and aggregating object-level features from multiple past frames to improve 3D detection quality in the inference frame. First, a Frustum PointNet architecture extracts a frustum-cropped point cloud from RGB and lidar data for each object frame by frame. After tracking, the multi-frame frustum features of each unique object are fused through a Gated Recurrent Unit (GRU) to obtain a refined 3D box shape and orientation. The proposed method improves 3D detection performance on the KITTI tracking dataset by more than 4% for all classes compared to the vanilla Frustum PointNet baseline. We also conduct extensive ablation studies to show the efficacy of our hyperparameter selections. Code is available at https://github.com/emecercelik/Multi-frame-3D-detection.git.


I. INTRODUCTION
Perception is an important step for autonomous systems to plan their actions safely, especially in dynamic environments. For example, the dynamic nature of traffic makes it vital for automated vehicles to obtain high-quality 3D detections in order to react to changes at the appropriate time. Therefore, 3D object detection has gained increasing importance in the intelligent vehicle domain. Recently, large-scale benchmarks [1]-[3] with annotated lidar, RGB camera, and radar data have been introduced to compare different approaches.
State-of-the-art 3D object detectors rely on single-frame data [14]-[17]. However, a single-frame lidar point cloud can contain occluded or partially observed objects. These occlusions are especially prevalent in dense, urban cityscapes. In addition, the sparsity of lidar point clouds increases with distance [18]. This sparsity causes inconsistent single-frame detection performance. Against this backdrop, we propose a multi-frame approach to compensate for the lack of information by using object features already obtained in previous frames (Fig. 1).
The multi-frame approach has been studied intensively for 2D image-based problems. There have been attempts to fuse object-level features [19]-[24] and scene-level features [25]-[29] to improve 2D detection quality. Action recognition studies also rely on multi-frame information to reach more reliable decisions [30], [31], since a scene in a single frame can relate to multiple actions when the corresponding scenes in prior frames are not observed. Multi-frame processing has also been studied in the context of 3D object tracking.
FIGURE 1. Single-frame 3D detectors rely only on the point cloud available in one frame. Therefore, the quality of 3D detections decreases for far away or occluded objects, which reflect only a small number of points. We propose fusing object-level features over multiple frames to compensate for the missing information in the current frame. In this way, we utilize features of high-quality detections from the previous frames for poor-quality features in the most recent scene.
3D object tracking methods also operate on successive frames to track objects [37], [38]. It has been shown that using object-level features further improves the tracking quality [39]. However, processing multi-frame data for 3D object detection remained limited until the release of large-scale datasets [1], [2]. Recently, some studies have considered scene-level temporal fusion [40]-[42]; however, these methods suffer from feature alignment problems between two successive frames due to the movement of objects.
In this study, we propose an object-level temporal fusion method to improve 3D object detection results. We extend the Frustum PointNet architecture [43] with our temporal fusion module, which fuses features of the same object from successive frames to obtain a more representative feature. The Frustum PointNet generates object-level features to predict 3D bounding boxes from segmented frustum points. We fuse the object-level feature of an object in the current frame with its features from previous frames to predict a more accurate 3D bounding box. We run our method on the KITTI Multi-object Tracking Benchmark dataset since it contains sequential data, unlike the KITTI 3D object detection dataset. The validations are done for the car, pedestrian, and cyclist classes. We compare our method with the vanilla Frustum PointNet baseline as well as with state-of-the-art object detectors that apply RGB-lidar fusion. Compared to the Frustum PointNet baseline, we reach 7%, 4%, and 6% improvements on the car, pedestrian, and cyclist classes at the moderate difficulty level, respectively.
We list our main contributions below:
• A novel multi-frame 3D object detection strategy to increase the consistency of bounding box predictions by fusing object-level features from multiple frames.
• An object-level RNN-based temporal RGB-Lidar fusion approach for 3D object detection. We investigate the temporal fusion strategies for the RNN and provide our results through extensive ablation studies.
• Experimental validation on a commonly used benchmark. We provide our 3D detection results on the KITTI object tracking benchmark.

II. RELATED WORK
A. SINGLE-FRAME RGB-LIDAR FUSION FOR 3D OBJECT DETECTION
Multi-modal perception systems benefit from redundancy for alleviating sensor failure problems. RGB-lidar fusion is commonly used for multi-modal 3D object detection and follows two main strategies: feature-level and high-level information fusion. The common approach among feature-level methods is the fusion of lidar bird's eye view (BEV) representations with RGB image features [10], [11]. Additionally, [7] and [13] combine RGB image features and 2D segmentation outcomes with lidar features using calibration information. Cross-modality fusion further improves the 3D detection quality [12]. As a high-level fusion method, 2D bounding boxes obtained from RGB images are used to extract frustums of objects to reduce the search space in the entire point cloud [8], [43]. Similarly, [9] uses 2D semantic segmentation results to filter out background points in the point cloud.
Our temporal fusion method is based on the Frustum PointNet [43] architecture, which applies high-level RGB and lidar frustum fusion. In this way, we obtain object-specific features from the frustums of 2D bounding boxes. Single-frame methods are limited to the data of the current frame, which makes them prone to noisy data. In contrast, our multi-frame approach can propagate features from previous frames to the current frame, which helps obtain a wider view, richer features, and, as a result, more stable 3D detections.

B. 2D VIDEO OBJECT DETECTION AND TRACKING
Multi-frame processing has been studied more extensively for 2D object detection and tracking tasks than for their 3D counterparts. Even though single-frame 2D object detection methods provide good results [44], [45], they can miss previously-detected objects in certain frames of a video. Aggregating object-level features over multiple frames alleviates this problem. References [19] and [34] combine 2D regions into 3D tube-like volumes across multiple frames, [20] and [21] leverage attention mechanisms across multi-frame objects, [22] measures similarities and concatenates similar regions in the video, and [23] and [24] refine object-level features using RNNs. On the other hand, several approaches aggregate features of the entire scene, for example using convolutional LSTMs [25] and LSTMs [28], attention mechanisms [26], [29], and flow fields [27]. Similarly, for the 2D tracking problem, [36] processes object-bounding tubes with 3D convolutions, [35] utilizes graph and motion features, and [46] and [47] use transformer architectures to track objects.
Even though object-level feature aggregation has been frequently and successfully applied to 2D video object detection, it had not been explored for the 3D object detection problem. In this study, we propose a multi-frame object-level feature propagation method to obtain better 3D bounding boxes from successive lidar point clouds in time.
In voxel-based approaches, an object can occupy multiple voxels, which makes it difficult to obtain and propagate object-level features. Similarly, point-based methods that rely only on lidar point clouds sample keypoints, multiple of which can also represent a single object. Instead, we utilize and extend Frustum PointNet [43], which extracts features from the region of the object. Thus, object-specific features are obtained without an additional computation and merging process.
2) 3D MULTI-OBJECT TRACKING
3D multi-object trackers also utilize 3D bounding box and appearance information from multiple successive frames. Reference [38] tracks the 3D bounding boxes of objects using a 3D Kalman filter. Reference [37] learns a similarity map on 2D appearance features from two frames. Reference [57] additionally predicts 2D and 3D center offsets to match objects in successive frames. Reference [58] associates objects by learning geometry- and appearance-based costs for the 3D bounding boxes of two adjacent frames. Reference [59] utilizes LSTMs on monocular video frames. Additionally, [60] makes use of lidar appearance features, and [39] employs 2D and 3D appearance and motion features of 3D bounding boxes detected in multiple time-steps.
Similar to the 3D tracking methods, we fuse 3D appearance features of objects from multiple frames, but to predict better 3D bounding boxes instead of associating objects. Thus, the more representative multi-frame object features can result in better 3D bounding box predictions compared to using single-frame features.

3) MULTI-FRAME 3D OBJECT DETECTION
Multi-frame 3D object detection gained attention after the availability of datasets that provide sequences of data [1], [2]. Even though the KITTI multi-object tracking dataset [3] provides sequences of lidar scans, it has only been considered for the tracking task. The methods proposed for processing multiple frames have so far focused on scene-level aggregation. [61] utilizes motion maps from two successive frames. [62] uses 3D CNNs on multi-frame BEV features, whereas [63] and [40] apply convolutional LSTMs. Similarly, [41] proposes a custom-designed LSTM with spatial feature alignment between frames. In addition, non-local attention mechanisms [64] as well as transformer architectures [42], [65] have been utilized to build the spatio-temporal relation between multiple frames.
These methods show the usefulness of multi-frame approaches for improving the quality of 3D detections. However, scene-level feature aggregation requires spatial alignment of features between successive frames. Different from these approaches, we utilize object-level features from multiple frames to obtain stronger object representations without the need for additional spatial alignment. Our previous work [66] also includes object-level temporal features on lidar-RGB fusion for 3D object detection, but tracking and the extent of temporal associations were not considered there. Here, we extend the previous study by considering tracking of objects, ablating the multi-frame sequence length, comparing with other 3D detectors, and providing a large number of additional ablations.

III. PROPOSED METHOD
Making use of the data from previously processed frames extends the ego-vehicle's field of view and compensates for occluded objects. With this idea, we aim to fuse an object's features across successive frames to obtain a more informative object-specific feature and, finally, its 3D bounding box, as shown in Fig. 2. Therefore, we extend the Frustum PointNet [43] v1 model, from which we can obtain object-specific features.
Our aim is to predict the 3D bounding box $b_i^t$ of each object in the current frame with a 3D detector that is capable of processing the multi-frame sequence $S_t$. We represent $b_i^t$ with the width, height, and length, the 3D center, and the orientation of the 3D bounding box. The whole pipeline is summarized in Fig. 2.

B. 2D DETECTION
Frustum PointNet extracts a subset of the point cloud inside the frustum of a 2D bounding box $o_i^t$. Therefore, we require a 2D detector $g$ to obtain a set of 2D bounding boxes, given as $O_t = g(I_t)$. Frustum PointNet utilizes a custom-designed detector for high recall, which is not included in the original Frustum PointNet repository; we provide our own 2D detectors in our repository. The 2D bounding boxes are used to reduce the search space in the entire point cloud. In addition, high recall is required so that objects are not missed before 3D detection.

C. 2D TRACKING
We need track IDs of the 2D bounding boxes to associate features of the same object across multiple scenes. Therefore, we utilize a 2D tracker, which takes the 2D bounding boxes of two successive frames, $O_t$ and $O_{t-1}$, together with the unique track IDs $U_{t-1}$, and outputs $U_t$, as seen at the top of Fig. 2. Since the tracking problem was not considered in the original Frustum PointNet [43] study, we employ our own 2D trackers to match objects in successive frames. As explained in subsection IV-D, we tested the SORT [32] and Deep SORT [33] 2D trackers on top of the 2D detections for assigning 2D bounding boxes to each other in successive frames.
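To make the association step concrete, the sketch below propagates track IDs between two frames using IoU-based matching. It is only an illustration under simplified assumptions: SORT additionally predicts box motion with a Kalman filter and solves the assignment with the Hungarian algorithm, and Deep SORT adds appearance features; the greedy matching and the function names here are our own.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, prev_ids, cur_boxes, iou_thr=0.3, next_id=0):
    """Greedily propagate track IDs U_{t-1} of O_{t-1} to the boxes O_t."""
    cur_ids = [-1] * len(cur_boxes)
    used = set()
    for j, cb in enumerate(cur_boxes):
        best_i, best_iou = -1, iou_thr
        for i, pb in enumerate(prev_boxes):
            if i in used:
                continue
            iou = iou_2d(pb, cb)
            if iou > best_iou:
                best_i, best_iou = i, iou
        if best_i >= 0:
            cur_ids[j] = prev_ids[best_i]   # matched: keep the old track ID
            used.add(best_i)
        else:
            cur_ids[j] = next_id            # unmatched: start a new track
            next_id += 1
    return cur_ids, next_id
```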

D. SINGLE-FRAME FRUSTUM-LEVEL RGB-LIDAR FUSION
Frustum PointNet is a single-frame 3D detector that takes RGB images and lidar point clouds as inputs and fuses them by extracting the lidar points ($P_i$) inside the frustum of each object's 2D bounding box $o_i$. The point set is reduced to $P_i = \{p_1, \dots, p_m\}$, where $m$ is the number of points sampled randomly in the frustum. However, the quality of the object-level features and of the 3D detection depends on the sampling strategy, which is random in the original study [43]. We follow the same procedure.
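The frustum extraction and the random sampling can be sketched as follows. This is an assumption-laden illustration rather than the exact preprocessing of the Frustum PointNet code: the projection matrix name, the camera coordinate frame, and the fallback for empty frustums are placeholders.

```python
import numpy as np

def frustum_points(points_xyz, P, box2d, m=1024):
    """Keep lidar points whose image projection falls inside a 2D box,
    then randomly sample m of them (with replacement if fewer exist).

    points_xyz: (N, 3) points, assumed already in the camera frame.
    P:          (3, 4) camera projection matrix (assumed given by calibration).
    box2d:      [x1, y1, x2, y2] in pixel coordinates.
    """
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])
    proj = (P @ pts_h.T).T                                   # (N, 3)
    uv = proj[:, :2] / np.maximum(proj[:, 2:3], 1e-9)        # pixel coords
    x1, y1, x2, y2 = box2d
    in_box = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
             (uv[:, 1] >= y1) & (uv[:, 1] <= y2) & (proj[:, 2] > 0)
    frustum = points_xyz[in_box]
    if frustum.shape[0] == 0:
        return np.zeros((m, 3))                              # illustrative fallback
    idx = np.random.choice(frustum.shape[0], m,
                           replace=frustum.shape[0] < m)     # random sampling to m points
    return frustum[idx]
```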

E. OBJECT-LEVEL SINGLE-FRAME 3D DETECTION
Frustum PointNet [43] consists of three main parts. The frustum proposal part extracts the frustum lidar points, a subset of the point cloud $P_{O_t} = \{P_1, P_2, \dots, P_l\}$, using the 2D bounding boxes $O_t$. This discards the irrelevant lidar points outside of the frustums. However, there are still points inside a frustum that do not belong to the object of interest, which would cause a noisy representation.
The 3D instance segmentation PointNet aims to minimize the number of irrelevant frustum points. It removes points with low objectness scores from the frustum point set $P_i^t$, yielding a masked subset $R_i^t = \{p_1, \dots, p_n\}$, as shown in Fig. 2, where $n$ is the number of points after the masking operation and satisfies $n < m$.
The amodal 3D box estimation module consists of a T-Net and an amodal 3D box estimation PointNet. The T-Net takes $R_i^t$ and estimates center residuals, which are used to translate $R_i^t$ to a point set $\tilde{R}_i^t$ in the new coordinate system. The amodal 3D box estimation PointNet takes $\tilde{R}_i^t$ and outputs the global feature $F_i^t$ of the object $o_i^t$ using multilayer perceptrons (MLPs) and a max-pool operation. $F_i^t$ is further processed by fully-connected layers to predict the 3D box parameters $b_i^t$.
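As a rough illustration of how a single object-level feature emerges from the masked points, the sketch below applies a shared per-point MLP followed by a max-pool. The layer widths (64, 128, 512) and the omission of batch normalization and the one-hot class vector are simplifications, not the exact Frustum PointNet layers.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same MLP to every point: (n, c_in) -> (n, c_out)."""
    x = points
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)           # per-point FC + ReLU
    return x

def object_global_feature(masked_points, weights, biases):
    """Max-pool per-point features into one object-level vector F_i^t."""
    per_point = shared_mlp(masked_points, weights, biases)
    return per_point.max(axis=0)                 # (c_out,)

# Toy usage with random weights: 64 masked points with xyz coordinates.
rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))
sizes = [(3, 64), (64, 128), (128, 512)]
Ws = [rng.normal(scale=0.1, size=s) for s in sizes]
bs = [np.zeros(s[1]) for s in sizes]
F = object_global_feature(pts, Ws, bs)           # 512-dim global feature
```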

F. TEMPORALLY FUSING OBJECT-LEVEL FEATURES FOR REFINING 3D DETECTION
The global feature $F_i$ represents the position, orientation, and shape of the object $o_i$ in an abstract way. However, $o_i$ may be only partially observed by the lidar sensor at a given time-step, which causes dramatic changes in $F_i$. Therefore, the detection quality depends on the sampled $\tilde{R}_i$ and on the occlusion state of $o_i$. To alleviate this problem, we propose to use an object's features from successive frames to compensate for the loss of information and to obtain a richer representation. Our multi-frame feature alignment module $f$ fuses global features of the same object from multiple frames, given as $\{F_i^{t-\tau+1}, \dots, F_i^t\}$, to obtain a more representative feature of the object, denoted $z_i^t$, where $\tau$ is the number of fused frames, as depicted at the bottom of Fig. 2.
We realize $f$ with gated recurrent units (GRUs) to fuse object-level features over time. The resulting feature vector from the GRU cells is further processed with a fully-connected layer. We expect the temporal feature vector $z_i^t$ to provide a better representation of the object, which is shown by the experiments in Section V. The T-Net shown in Fig. 2 predicts the center residuals for the bounding box of $o_i^t$. Adding this information to the object feature vector $F_i^t$ helps the network understand the center shifts of the bounding box. Therefore, we concatenate the center residuals from the T-Net with $F_i^t$ to improve the representation quality. This choice is also evaluated in the ablations described in subsection IV-D.
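A minimal, framework-free sketch of this fusion is given below, assuming the object feature is concatenated with the 3-dimensional T-Net center residual before entering the GRU (the "w/ Cent" variant). The cell is a textbook GRU written in NumPy rather than the TensorFlow layer used in our code, and all dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell operating on one feature vector per time-step."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = lambda *shape: rng.normal(scale=0.1, size=shape)
        self.Wz, self.Uz, self.bz = s(in_dim, hid_dim), s(hid_dim, hid_dim), np.zeros(hid_dim)
        self.Wr, self.Ur, self.br = s(in_dim, hid_dim), s(hid_dim, hid_dim), np.zeros(hid_dim)
        self.Wh, self.Uh, self.bh = s(in_dim, hid_dim), s(hid_dim, hid_dim), np.zeros(hid_dim)

    def step(self, x, h):
        z = sigmoid(x @ self.Wz + h @ self.Uz + self.bz)   # update gate
        r = sigmoid(x @ self.Wr + h @ self.Ur + self.br)   # reset gate
        h_tilde = np.tanh(x @ self.Wh + (r * h) @ self.Uh + self.bh)
        return (1.0 - z) * h + z * h_tilde

def temporal_fusion(feature_sequence, center_residuals, cell, W_fc, b_fc):
    """Fuse {F_i^{t-tau+1}, ..., F_i^t}, each concatenated with its T-Net
    center residual, into the temporal feature z_i^t."""
    h = np.zeros(cell.bz.shape[0])
    for F, c in zip(feature_sequence, center_residuals):
        x = np.concatenate([F, c])               # object feature + center residual
        h = cell.step(x, h)
    return np.maximum(h @ W_fc + b_fc, 0.0)      # FC layer after the GRU

# Example: fuse tau = 3 frames of 512-dim features with 3-dim center residuals.
rng = np.random.default_rng(1)
cell = GRUCell(in_dim=515, hid_dim=256)
feats = [rng.normal(size=512) for _ in range(3)]
cents = [rng.normal(size=3) for _ in range(3)]
W_fc, b_fc = rng.normal(scale=0.1, size=(256, 256)), np.zeros(256)
z = temporal_fusion(feats, cents, cell, W_fc, b_fc)   # temporal feature z_i^t
```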

G. MULTI-FRAME FEATURE ALIGNMENT STRATEGIES
The obtained temporal feature $z_i^t$ can be used with different multi-frame feature alignment strategies, shown in Fig. 3, for predicting 3D bounding boxes. The straightforward approach is to feed $z_i^t$ directly to the FCs to predict the 3D box parameters. We name this strategy one branch (OB), seen in Fig. 3-a. Even though $z_i^t$ is expected to contain a richer representation, it is still beneficial to keep the object's current feature vector $F_i^t$, which relates more to the current position. Therefore, we combine the temporal feature $z_i^t$ with the object feature $F_i^t$ at time-step $t$ through a mean operation, which is named two branch (TB) in Fig. 3-b. In this way, the feature also retains awareness of the current state's predictions. We realize the mean operation by adding the two feature vectors and dividing by 2.
Due to the objects' movement, only the objects' shape remains the same between frames. As a third alignment option, we use $z_i^t$ to predict the shape parameters of the object bounding box and $F_i^t$ to predict the rest of the parameters. To construct the final output, the parameters are concatenated. In addition, using the temporal feature $z_i^t$ for the shape prediction acts as an auxiliary loss for the preceding shared layers of $F_i^t$ and $z_i^t$. This also ensures a more representative object feature vector $F_i^t$ from the MLPs. This strategy is called Ours (seen in Fig. 3-c) throughout the experiments and results sections. None of the three strategies requires extra learnable parameters, and the shapes of the FCs are the same. As the ablation results in subsection V-B3 indicate, the Ours version provides considerably better results than the other two strategies.
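The three wirings can be sketched as follows. The head matrices and their shapes are placeholders: in the actual model the shape parameters and the remaining parameters come from slices of the same output layer, so no extra learnable parameters are introduced; the separate matrices below are only for readability, and z_i^t and F_i^t are assumed to have the same dimension.

```python
import numpy as np

def predict_box(z_t, F_t, heads, strategy="ours"):
    """Three ways of wiring the temporal feature z_i^t and the current
    object feature F_i^t into the box-parameter heads (Fig. 3).
    `heads` holds (W, b) pairs with illustrative dimensions."""
    (W_all, b_all), (W_shape, b_shape), (W_rest, b_rest) = heads
    if strategy == "ob":                     # one branch: temporal feature only
        return z_t @ W_all + b_all
    if strategy == "tb":                     # two branch: mean of z_i^t and F_i^t
        fused = (z_t + F_t) / 2.0
        return fused @ W_all + b_all
    # "ours": z_i^t predicts the box shape, F_i^t the remaining parameters
    shape = z_t @ W_shape + b_shape          # width, height, length terms
    rest = F_t @ W_rest + b_rest             # center and heading terms
    return np.concatenate([shape, rest])
```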

IV. EXPERIMENTS
A. KITTI TRACKING DATASET
In this study, we use the KITTI Multi-object Tracking Benchmark dataset [3], which provides sequential RGB images and lidar scans. We split the 21 drives of the dataset into training and validation sets. The validation split consists of drives 11, 15, 16, and 18, and the rest of the drives are used for training. There are 6264 frames in the training set and 1239 frames in the validation set. We tried to keep the ratio of object instances in the training and validation splits similar. In addition, 92% and 96.3% of the objects are visible in more than 10 frames in the training and validation sets, respectively. We chose the validation split to be challenging. As can be seen from Fig. 4, there are scenes from urban and suburban areas. We also observe permanent occlusions caused by parked cars, pedestrians crossing in front of vehicles, traffic jams, and overtaking vehicles. In addition, we report the number of points inside object 3D bounding boxes in Table 1. At all difficulty levels defined for KITTI, the mean number of points inside objects decreases considerably with distance. Also, comparing the mean number of points between the training and validation splits, the validation split appears to be more challenging since its objects contain fewer points on average. The dashes for the easy difficulty level indicate that there is no object in that distance bin.
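For reproducibility, the split described above can be written down directly (drive indices refer to the 21 training drives of the KITTI tracking benchmark):

```python
# Validation drives listed above; the remaining drives form the training set.
VAL_DRIVES = [11, 15, 16, 18]
TRAIN_DRIVES = [d for d in range(21) if d not in VAL_DRIVES]
```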

B. METRICS
We utilize the average precision (AP) metric to measure the performance of the network, as recommended in the KITTI 3D object detection benchmark. The AP is calculated over 40 recall points, as stated in [67]. The IoU threshold for the AP calculation is 0.7 for the car class and 0.5 for the pedestrian and cyclist classes.
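A minimal sketch of this interpolated AP over 40 recall points is shown below. The function name is ours, and the matching of detections to ground truth at the stated IoU thresholds is assumed to have already produced the precision/recall curve.

```python
import numpy as np

def ap_40(recalls, precisions):
    """Interpolated AP over 40 equally spaced recall points (1/40 ... 1.0),
    following [67]: at each recall threshold take the maximum precision
    achieved at an equal or higher recall."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(1.0 / 40, 1.0, 40):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 40.0
```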

C. LOSS FUNCTION
We utilize the original multi-task loss function of Frustum PointNet [43], which consists of mask, center, heading classification, size classification, heading regression, size regression, T-Net center regression, and corner losses, denoted $L_{mask}$, $L_{center}$, $L_{h-c}$, $L_{s-c}$, $L_{h-r}$, $L_{s-r}$, $L_{T-c}$, and $L_{corner}$, respectively. The classification losses are realized with the softmax loss and the regression losses with the smooth-L1 loss. We add a cosine distance loss between the object-level features of an object in the current frame ($F_i^t$) and in the preceding frame ($F_i^{t-1}$) to the multi-task loss of the Frustum PointNet. This loss encourages the network to form a feature consistent with that of the previous frame, even though it is formed from a different set of points. The cosine distance loss is shown in Eq. 1, where $\upsilon$ and $\omega$ are the feature vectors between which the distance is measured.
The multi-task loss function is constructed as shown in Eq. 2, where α, β, and γ are the weights of the 3D box losses, the corner loss, and the cosine distance loss, respectively.
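Since the equations themselves do not survive in this text, the following is a plausible reconstruction of Eq. 1 and Eq. 2 from the description above; the cosine distance takes its standard form, and the exact grouping of the α-weighted terms is an assumption based on the wording rather than a verbatim copy of the published formulas.

```latex
% Eq. 1: cosine distance between feature vectors \upsilon and \omega
\begin{equation}
  L_{cos}(\upsilon, \omega) = 1 - \frac{\upsilon \cdot \omega}{\lVert \upsilon \rVert \, \lVert \omega \rVert}
\end{equation}
% Eq. 2: assumed grouping of the weighted multi-task loss terms
\begin{equation}
  L = L_{mask}
    + \alpha \left( L_{center} + L_{h\text{-}c} + L_{h\text{-}r} + L_{s\text{-}c} + L_{s\text{-}r} + L_{T\text{-}c} \right)
    + \beta \, L_{corner}
    + \gamma \, L_{cos}\!\left(F_i^t, F_i^{t-1}\right)
\end{equation}
```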

D. EXPERIMENTS & ABLATIONS
We compare the 3D AP results of multi-frame object-level temporal fusion with the Frustum PointNet baseline and with state-of-the-art 3D detectors on the KITTI tracking dataset for the car, pedestrian, and cyclist classes. In addition, we conduct ablation studies to investigate the validity of the results.
Our method requires 2D bounding boxes and unique track IDs for these boxes. For the state-of-the-art comparison, we utilize perturbed ground-truth 2D detections and ground-truth track IDs, both for the baseline Frustum PointNet and for our method. We conduct an ablation study to show the method's performance with predicted track IDs obtained by the SORT [32] and Deep SORT [33] 2D tracking methods. We also study the effects of different sequence lengths (τ); τ = 1 means using a fully-connected layer instead of GRUs to keep a similar depth. The third ablation discusses the results of the feature alignment strategies explained in III-G as well as the results of training with the cosine distance loss on the proposed strategies. As a fourth ablation, we compare GRU-based fusion with LSTM-based fusion and with a simple convolution-based temporal fusion; in the convolution-based fusion, we use two convolutional layers instead of a recurrent layer to obtain the temporal feature vector ($z_i^t$). Our fifth ablation compares results according to the depth of the features to be aggregated temporally. The Frustum PointNet max-pools the point features in its amodal 3D box estimation PointNet and concatenates the max-pooled features with the k-length class vector; this is called the global feature. Two FC layers follow the global feature. We call the output of the first FC layer fc1 and use it as the temporal feature vector ($z_i^t$) in all our experiments. In this ablation, we also take the global features as $z_i^t$ and compare them with the fc1 features. Finally, we study the extent of the features by temporally fusing scene-level features and compare the scene-level fusion with our object-level fusion results.
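As an illustration of the convolution-based alternative mentioned above, the sketch below stacks the per-frame object features and applies two 1D convolutions over the time axis. The kernel size, the valid padding, and taking the last time-step as z_i^t are our own assumptions made for the sake of a runnable example.

```python
import numpy as np

def conv_temporal_fusion(feature_sequence, kernels, biases):
    """Conv-based alternative to the GRU: 1D convolutions over the
    time axis of the stacked object features (tau, d)."""
    x = np.stack(feature_sequence)                   # (tau, d)
    for K, b in zip(kernels, biases):                # K: (k, c_in, c_out)
        k = K.shape[0]
        out = np.zeros((x.shape[0] - k + 1, K.shape[2]))
        for t in range(out.shape[0]):
            out[t] = np.einsum('kc,kco->o', x[t:t + k], K) + b
        x = np.maximum(out, 0.0)                     # ReLU
    return x[-1]                                     # temporal feature z_i^t

# Toy usage: tau = 3 frames of 512-dim features, two conv layers of width 128.
rng = np.random.default_rng(2)
feats = [rng.normal(size=512) for _ in range(3)]
Ks = [rng.normal(scale=0.1, size=(2, 512, 128)),
      rng.normal(scale=0.1, size=(2, 128, 128))]
bs = [np.zeros(128), np.zeros(128)]
z = conv_temporal_fusion(feats, Ks, bs)              # (128,) temporal feature
```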

E. IMPLEMENTATION DETAILS
The object-specific global feature in the amodal 3D box estimation PointNet is processed with two FC layers and an output layer that predicts the box parameters. We also utilize the same training hyperparameters as the Frustum PointNet. The number of points in a frustum is fixed at 1024, sampled randomly. We train each parameter set at least 5 times and take the best-performing checkpoint for evaluation. All training and validation steps took place in a Docker container running Python 3.6 and TensorFlow 1.15, on a single Nvidia RTX 2080 GPU with an Intel Xeon E3-1225 v5 3.30GHz CPU.
FIGURE 6. Qualitative comparison of 3D detection results between the baseline Frustum PointNet [43] and Ours with temporal fusion on the cyclist class (Green: Ground-truth, Red: Detection). We omit the ground-truth boxes in the zoomed-in crops for simplicity. Even though the Frustum PointNet was able to localize the cyclist correctly at time-step t, it misses the object in the upcoming frames. However, Ours with the temporal fusion can localize the object and keep the bounding box correctly in the successive time-steps.

V. RESULTS & DISCUSSION
In this section, we show and discuss the efficacy of the proposed object-level temporal feature fusion for the 3D object detection task. We first share our quantitative and qualitative results and then provide the ablation results described in IV-D.

A. COMPARISON OF 3D DETECTION PERFORMANCE
We compare our method with other 3D detection architectures and with the baseline Frustum PointNet [43] on the KITTI tracking validation set. As seen from Table 2, our method outperforms the compared detectors at the moderate difficulty level for the pedestrian and cyclist classes. In this table, we also provide results for Ours (w/ Cent), which utilizes object-level features extended with the center prediction from the T-Net, as explained in III-F. Adding the centers improves the 3D AP for the cyclist class. Cyclist is the least-represented class, and we think that the extra information helps the network learn how to localize cyclist objects. For the car class, we also outperform the baseline at the moderate difficulty level, as given in Table 4.
Our qualitative results indicate how our method outperforms the Frustum PointNet baseline for far-away or occluded objects, which reflect a small number of points compared to closer objects. In Figure 5, Frustum PointNet misses the previously-detected pedestrians in the following frames. In this case, two pedestrians approach each other, which causes the detector to miss the farther-away pedestrian. However, our method keeps the localization of the 3D box correct. Figures 6 and 7 show that our method detects the far-away cyclist and car objects, respectively, quite accurately in all frames. However, Frustum PointNet without temporal fusion suffers from localization problems and misses the previously-detected objects in the successive frames.

B. ABLATION STUDIES
1) COMPARISON FOR 2D TRACKING PERFORMANCE
The 3D detection performance of our method highly depends on the accuracy of the 2D tracking used to match object-level features in time. Therefore, we also evaluate our method using the SORT [32] and Deep SORT [33] 2D trackers. We train our method with the ground-truth track IDs and evaluate on the KITTI validation set with predicted track IDs. Results are given in Table 3. Compared to the ground-truth tracking, the AP values decrease for all classes and difficulty levels, which suggests that the trained network would also require fine-tuning with the predicted track IDs.
FIGURE 7. Qualitative comparison of 3D detection results between the baseline Frustum PointNet [43] and Ours with temporal fusion on the car class (Green: Ground-truth, Red: Detection). The baseline detects the object in the first frame, but misses the object afterwards. However, our temporal fusion model can detect and keep the detected box in the successive frames as well. For simplicity, we show only the detected boxes (red) in the crops t, t+1, and t+2. At t+3, the baseline cannot predict the orientation correctly, whereas our method (Ours) predicts the heading consistently.

2) COMPARISON OF SEQUENCE LENGTH
We evaluate our results with different sequence lengths, as seen in Table 5. The best results are obtained with τ = 3 for the pedestrian and car classes, whereas the best score for the cyclist class is obtained with τ = 8. Comparing the τ > 1 results with the τ = 1 result shows that the improvement originates from the temporal fusion rather than from the additional depth of the architecture compared to the Frustum PointNet baseline.

3) COMPARISON OF THE FEATURE ALIGNMENT STRATEGIES
We also validate our choice of placement of the temporal fusion module in the architecture in Table 6. The best results are mostly obtained with the Ours version. In this strategy, training the network with the temporal features acts as an auxiliary loss for the prediction of the center and the orientation, since the feature in the current frame is shared. This helps the shared layers learn more representative features.
In Table 7, we also provide training results using the cosine loss on the object-level features from two subsequent frames. Except for the car class with Ours, training with the cosine loss improves the results for all fusion strategies.

4) COMPARISON OF TEMPORAL FUSION TYPE
GRU is our main choice for multi-frame fusion. We also separately test convolutional layers and LSTM layers instead of GRUs, as explained in IV-D. As shown in Table 8, GRU-based feature aggregation outperforms the convolution-based method for all classes at the moderate difficulty level. However, LSTM provides better results for all difficulty levels of the cyclist class, even though GRU performs better than LSTM for the car and pedestrian classes.

5) COMPARISON OF FEATURE DEPTH
We also evaluate the performance of our multi-frame alignment method according to the depth of the features. As explained in IV-D, we use the global features as the object-specific feature vectors instead of the output of the first FC layer, fc1. The global features have a larger dimension than the fc1 features and represent lower-level features of the objects compared to the fc1 features. As the results in Table 9 indicate, fc1 performs better for the car class, whereas the global features provide better results for the pedestrian and cyclist classes. The pedestrian and cyclist classes are mostly represented with a smaller number of points compared to cars. We think that lower-level features might be required when aggregating features over multiple frames for such objects; hence, these two classes are detected better using the global features for the multi-frame alignment.

6) COMPARISON OF FEATURE EXTENT
Our multi-frame alignment method makes use of object-level features to improve 3D object detection quality. We also compare our method with scene-level multi-frame feature alignment, in which the temporal features are obtained using all of the points in the frustum instead of the object-level segmentation; thus, the features represent a larger field. The results are given in Table 10. Compared to our object-level fusion, the detection accuracy decreases drastically for all classes. This result also indicates the importance of the spatial alignment of features in scene-level feature fusion.

VI. CONCLUSION
This study introduced a multi-frame RGB-lidar fusion framework to decrease 3D object detection inconsistency across multiple frames. The proposed method achieves this by aggregating object-level features of the same object from multiple frames to improve 3D object detection quality.
Experimental validation shows that our approach increases the performance of already existing networks. Extending Frustum PointNet with the proposed temporal feature aggregation strategy improved the 3D detection performance by 6.5%, 4%, and 6% for the car, pedestrian, and cyclist classes, respectively.
Future work can extend the proposed multi-frame detection framework with additional modalities such as radar. The robustness of other multi-modal fusion systems can be increased with the proposed temporal aggregation idea.