Multiframe CenterNet Heatmap ROI Aggregation for Real-Time Video Object Detection

Although two-stage video object detectors cannot perform detection in real time, their accuracy is usually higher than that of one-stage video object detectors. One essential reason is that two-stage detectors can easily use feature information from adjacent frames to augment key-frame features. How to extract and exploit temporal features in the video stream for one-stage detectors needs further exploration. CenterNet is an anchor-free one-stage object detector that regresses bounding boxes from heatmap peaks. We propose to use the detected peaks and the regressed boxes that encompass them to determine heatmap ROIs, which serve as the extracted object heatmap features. A new relation module is designed to evaluate the similarity of heatmap ROI features and output relation features that effectively augment the heatmap ROI features. In a video sequence, the heatmap ROIs of multiple adjacent frames are aggregated into the heatmap ROIs of the key frame. Compared to CenterNet and other CenterNet-based video object detectors, our method achieves improved online real-time performance on the ImageNet VID dataset with 78.8% mAP at 36 FPS.


I. INTRODUCTION
With the rapid advance of convolutional neural networks in recent years, many CNN-based models have been proposed to solve image detection tasks and achieve high accuracy on still images. Video detection is needed in many scenarios such as autonomous driving, intelligent video surveillance, and industrial inspection. Video detection must deal with fast-moving objects that cause motion blur, objects occluded in certain frames, and objects in rare poses. In addition, video detection is generally expected to operate in real time, at more than 25 frames per second.
Previous studies [4], [5] based on two-stage models use temporal information to augment image features and improve accuracy, but the two-stage detection model structure is more complex than the one-stage model and cannot meet the demand of real-time detection. On the other hand, a one-stage network has no RPN (Region Proposal Network) as in R-CNN, so it is difficult for it to separate object information from background information in the feature map, and therefore also difficult to use temporal information to aggregate and enhance the feature map. In this paper, we propose a multi-frame CenterNet-based temporal heatmap ROI (Region of Interest) aggregation module to improve the performance of video object detection. CenterNet is an anchor-free one-stage object detector. It selects peaks on the heatmap as object center points and regresses the object size at each peak. It then decodes the center point, size (width and height), and offset information back into the original image to detect the object. In light of the effectiveness of the RPN in two-stage object detectors, we propose the heatmap ROI, a region of interest on the heatmap, for extracting heatmap features in CenterNet. Each selected heatmap ROI corresponds to a possible object heatmap region. We then design a multi-frame heatmap ROI aggregation module that extracts heatmap ROIs in multiple adjacent video frames and aggregates them onto the heatmap of the key frame to enhance the heatmap features of candidate objects in the key frame. The rest of the paper is organized as follows. Section 2 reviews related work on object detectors and video object detectors. Section 3 presents our proposed method. Section 4 describes the details of our implementation and demonstrates our experimental results.

(The associate editor coordinating the review of this manuscript and approving it for publication was Byung-Gyu Kim.)

II. RELATED WORK
A. OBJECT DETECTION FROM IMAGES
Current high-accuracy object detectors are basically designed on Convolutional Neural Networks (CNNs) [11], [13], [18], [19]. According to whether or not an ROI feature extraction stage exists, they are divided into one-stage and two-stage object detectors. Two-stage object detectors such as Fast(er) R-CNN [8], [17] and Cascade R-CNN [3] extract ROIs via an RPN module in the first stage, and classify and further refine the localization of the object in the second stage. Detectors that classify and localize directly on the feature map are considered one-stage object detectors, such as the YOLO series [2], [16] and RetinaNet [15]. In general, two-stage object detectors are more accurate but relatively slow, although some work [14], [24] has shown that the accuracy of one-stage object detectors can be comparable to that of two-stage ones.
Based on how bounding boxes are mapped to feature maps, detectors are also divided into anchor-based and anchor-free detectors. Anchor-based detectors require a set of a priori anchors and regress the difference between the real bounding box and the anchor [8]. An anchor-based detector usually generates many anchors and then performs regression or filtering on them, which increases computational complexity at the cost of real-time performance. Among anchor-free detectors, CornerNet [14] detects the two corners of the object bounding box, and CenterNet [24] detects its center point. The coordinates and attributes of the object itself are used when encoding the label. Anchor-free detectors can therefore outperform anchor-based detectors in detection speed.

B. VIDEO DETECTION
Unlike object detection in still images, objects in video detection tasks can change their appearance, aspect ratio, shape, and other attributes during motion. Other issues can also appear, such as motion blur and occlusion by other objects. It is therefore challenging for a video object detector to maintain the temporal consistency of an object so that it can still be detected across multiple frames.
Video carries more information in the time domain than still images, and methods [1], [4], [5], [7], [9], [12], [20], [22], [23], [25], [26] that use this temporal information to improve detection on the current frame have been explored. FGFA [25] and MANet [20] use the optical flow predicted by FlowNet [6] to propagate feature information between frames. STSN [1], STMN [22], and other methods aggregate multi-frame features directly. RDN [5], based on Relation Network, learns the relationships between multi-frame candidate boxes and uses self-attention to pass relation features from other frames to the key frame. Most multi-frame feature aggregation methods are based on two-stage models and are difficult to use in real-time detection. A one-stage model has no RPN structure and cannot extract object features from the feature map; the relation features of ROI features computed by fully connected layers and relation modules in those models cannot be used to augment the original feature map. In this paper, we design a relation module that computes the relation features of two ROI features directly on the heatmap. The work most relevant to ours is [23], which propagates the heatmap of the previous frame to enhance the detection results of the upcoming frame. In our work, we propose the concept of the heatmap ROI based on CenterNet to extract potential object features from specific heatmap regions instead of the whole heatmap of a frame, and further aggregate heatmap ROIs across multiple frames to boost detection results.

III. PROPOSED METHOD
A. BACKGROUND: CenterNet
CenterNet is an anchor-free one-stage object detector with a relatively simple and intuitive network structure. Given an input RGB image of width W and height H, it produces a keypoint heatmap Ŷ ∈ [0, 1]^(W/R × H/R × C), where R is the output stride and C is the number of classes. A prediction Ŷ_{x,y,c} = 1 corresponds to the detected center of an object of class c at position (x, y), while Ŷ_{x,y,c} = 0 corresponds to background. Let object k with class c_k have bounding box (x_1^(k), y_1^(k), x_2^(k), y_2^(k)); its center point is p_k = ((x_1^(k) + x_2^(k))/2, (y_1^(k) + y_2^(k))/2). CenterNet is thus a heatmap-based detector, since it uses the heatmap Ŷ to predict the center points of objects. The CenterNet network flow is shown in Fig. 1. The input image is turned into a high-resolution feature map by a fully-convolutional encoder-decoder backbone network, which can be chosen from alternatives such as ResNet with DCNv2 or an hourglass network. The extracted feature map can be refined by an up-convolutional network and deep layer aggregation (DLA). The refined feature map is then processed by three separate heads that output the heatmap Ŷ, the offset map Ô, and the size map Ŝ. Peak detection is performed on the heatmap Ŷ to obtain the predicted coordinates of object center points. For a peak at position (x̂_i, ŷ_i), (ŵ_i, ĥ_i) = Ŝ_{x̂_i, ŷ_i} is the predicted width and height of the object bounding box, and with the predicted offset (δx̂_i, δŷ_i) = Ô_{x̂_i, ŷ_i} the decoded box is represented as

(x̂_i + δx̂_i − ŵ_i/2, ŷ_i + δŷ_i − ĥ_i/2, x̂_i + δx̂_i + ŵ_i/2, ŷ_i + δŷ_i + ĥ_i/2).    (1)
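To make the decoding step above concrete, the following is a minimal NumPy sketch of how heatmap peaks, the offset map Ô, and the size map Ŝ combine into boxes. The function name, the top-k selection by raw score, and the (H, W, C) layout are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def decode_centernet(heatmap, offsets, sizes, stride=4, k=10):
    """Sketch of CenterNet decoding: pick the top-k heatmap scores and
    combine them with the regressed offset and size maps to form boxes
    in input-image coordinates. Assumed shapes: heatmap (H, W, C),
    offsets (H, W, 2), sizes (H, W, 2)."""
    H, W, C = heatmap.shape
    flat = heatmap.reshape(-1)
    top = np.argsort(flat)[::-1][:k]       # indices of the k highest scores
    boxes = []
    for idx in top:
        c = idx % C                        # recover class and (x, y)
        y, x = divmod(idx // C, W)
        dx, dy = offsets[y, x]             # sub-pixel offset, i.e. Ô
        w, h = sizes[y, x]                 # regressed size, i.e. Ŝ
        cx, cy = (x + dx) * stride, (y + dy) * stride
        boxes.append((cx - w / 2, cy - h / 2,
                      cx + w / 2, cy + h / 2, flat[idx], c))
    return boxes
```

The box formula in the loop mirrors (1): center plus offset, expanded by half the regressed width and height, then scaled by the output stride R.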

B. HEATMAP ROI EXTRACTION
CenterNet performs well as a one-stage object detector on still images. When detecting video, however, problems such as occluded object center points, motion blur, and rare poses may be encountered. Feature aggregation can augment the feature-map information of the key frame with feature maps extracted from nearby frames to improve the detection accuracy of the key frame. Feature aggregation is generally used in two-stage object detectors, which extract object features and remove background information with an RPN. CenterNet, however, has no RPN structure. In light of the effectiveness of the RPN, we propose to extract object feature information with heatmap ROIs. Each selected heatmap ROI corresponds to a possible object heatmap region. Multiple heatmap ROIs of adjacent frames are then aggregated onto the key frame to enhance the heatmap features of candidate objects and hence improve detection accuracy.
Video object detection requires detecting all video frames {N_t}_{t=1}^T. We set the key frame for detection to N_k. Since we use the VID dataset (30 categories), the heatmap of N_k is H_k, a tensor of shape (128, 128, 30) representing 30 heatmaps of width 128 and height 128 corresponding to the 30 categories. Peak detection is performed on H_k, and 10 peak points are selected in each per-category heatmap using 7 × 7 max pooling. We denote the peak points of each heatmap as the set {P_i}_{i=1}^I, I = 10. The five categories with the highest peaks are obtained by summing and sorting the 50 peak values and recording the category index values. Such peak detection is performed for all N_k, and the five object categories with the highest number of occurrences are used for heatmap ROI aggregation. This design improves the model's running speed while preserving the enhanced accuracy of multi-object detection. With the position information of {P_i}_{i=1}^I, the width and height (ŵ_i, ĥ_i) = Ŝ_{x̂_i, ŷ_i} can be obtained, and the bounding box of the heatmap ROI is calculated as

m_i = (x̂_i − ŵ_i/2, ŷ_i − ĥ_i/2, x̂_i + ŵ_i/2, ŷ_i + ĥ_i/2).    (2)

As shown in Fig. 2, a heatmap ROI is determined by the bounding box obtained by (2). To reduce the computational complexity of the following steps, we use NMS (Non-Maximum Suppression) to select salient heatmap ROIs. Each selected heatmap ROI corresponds to a potential object heatmap region.
VOLUME 10, 2022
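The ROI-selection step above can be sketched as a standard greedy NMS over the peak-centered boxes. This is a generic illustration of the filtering, with an assumed `(score, x, y, w, h)` peak format; the paper's own implementation details may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def select_heatmap_rois(peaks, iou_thr=0.3):
    """Greedy NMS over candidate heatmap ROIs.
    `peaks` is a list of (score, x, y, w, h) on the heatmap grid;
    each ROI box is centered on its peak with the regressed size,
    as in Eq. (2)."""
    boxes = sorted(((s, (x - w / 2, y - h / 2, x + w / 2, y + h / 2))
                    for s, x, y, w, h in peaks), reverse=True)
    kept = []
    for s, b in boxes:
        if all(iou(b, kb) <= iou_thr for _, kb in kept):
            kept.append((s, b))
    return kept
```

With the paper's IoU threshold of 0.3, two peaks whose regressed boxes largely overlap collapse into the single higher-scoring ROI.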

C. RELATION MODULE
In two-stage network models such as [4], [5], the ROI features go through fully connected layers and a relation module to obtain the relation features. Because heatmap ROIs have different sizes, the feature matrices need to be unified before the relation can be computed. We use (3) to align the k-th heatmap ROI of the j-th adjacent frame under the same category. We take the short edge δô_i of the height and width (δx̂_i, δŷ_i) of m_i^t as the downsampling scale. Following ROIAlign [10], we downsample the features of the adjacent heatmap ROI so that h_RA ∈ [0, 1]^(δô_i × δô_i). Using δô_i as the ROIAlign scale avoids zero-value padding when the long edge is larger than the adjacent heatmap ROI, and effectively reduces the amount of computation.
For a single heatmap ROI, we use (4) to calculate its relation heatmap with an adjacent heatmap ROI via dot product,
where h^{t_i,c,j_k}_Rel ∈ [0, 1]^(δx̂_i × δŷ_i), t_i denotes the i-th heatmap ROI of key frame t, c denotes the category, and j_k denotes the k-th heatmap ROI of the j-th adjacent frame.
Each feature point of the key-frame ROI feature h^{t,c}(m_i^t, M^t) is dotted with each feature point of h_RA. The geometric meaning of the dot product is the similarity of two vectors: the more similar the vectors, the larger the dot product. A larger value in h^{t_i,c,j_k}_Rel ∈ [0, 1]^(δx̂_i × δŷ_i) indicates that the two ROI features are more similar, and the weight with which they augment the key-frame heatmap ROI is correspondingly larger. Therefore, h^{t_i,c,j_k}_Rel is viewed as a self-attention relation heatmap feature.
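A minimal sketch of this alignment-then-similarity step follows. The average-pooling resize is a crude stand-in for ROIAlign, and the per-point product is one plausible reading of the dot-product relation in (4) for single-channel heatmap features; both simplifications are our assumptions.

```python
import numpy as np

def avg_resize(feat, out_h, out_w):
    """Crude stand-in for ROIAlign: average-pool an (h, w) feature
    patch onto an (out_h, out_w) grid."""
    h, w = feat.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feat[ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)].mean()
    return out

def relation_heatmap(key_roi, adj_roi):
    """One reading of Eq. (4): align the adjacent ROI onto the key
    ROI's grid, then take the per-point product as a similarity map.
    Since both inputs lie in [0, 1], the result also lies in [0, 1]."""
    aligned = avg_resize(adj_roi, *key_roi.shape)
    return key_roi * aligned
```

Similar regions (both near 1) produce large relation values, matching the intuition that more similar ROI features should contribute a larger augmentation weight.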

D. HEATMAP ROI AGGREGATION
As shown in Fig. 3, adjacent heatmap ROIs are further aggregated to enhance the heatmap ROI of the key frame. This operation boosts object detection with augmented object heatmap features.
After calculating the relation heatmap features, to increase the nonlinearity of the aggregation module we apply a 1 × 1 convolution, Batch Normalization, and ReLU to h^{t_i,c,j_k}_Rel, yielding h^{t_i,c,j_k}_R. The aggregated heatmap ROI on the key frame can then be represented as

h*(m_i^t, M^t) = h^{t,c}(m_i^t, M^t) + α Σ_{j,k} h^{t_i,c,j_k}_R,    (5)

where h*(m_i^t, M^t) denotes the final augmented heatmap ROI and α weights the relation features. The augmented heatmap is expected to contain more object feature information than the original heatmap h^{t,c}(m_i^t, M^t) and to effectively boost the object detection results. To address the detection of multiple objects and small targets, we augment multiple heatmap ROIs across multiple classes. As shown in Fig. 4, we establish long-term heatmap ROI feature augmentation during both training and testing.
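The aggregation itself reduces to an alpha-weighted residual sum, which can be sketched as below. The additive form follows (5); the clipping back to [0, 1] is our assumption so the result remains a valid heatmap, and the 1 × 1 conv + BN + ReLU applied to each relation feature beforehand is omitted here.

```python
import numpy as np

def aggregate_heatmap_roi(key_roi, relation_feats, alpha=0.1):
    """Sketch of the aggregation step in Eq. (5): the key-frame
    heatmap ROI is augmented by an alpha-weighted sum of the relation
    heatmap features computed against each adjacent-frame ROI.
    Clipping to [0, 1] is an assumption, not stated in the paper."""
    augmented = key_roi + alpha * sum(relation_feats)
    return np.clip(augmented, 0.0, 1.0)
```

This makes the role of α visible: it linearly scales how much the adjacent frames' evidence raises the key frame's response, which is why overly large α can push Ŷ too close to 1, as discussed in the ablation on α.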

IV. EXPERIMENTS
A. DATASET
We trained and tested our method on the ImageNet VID dataset, a large dataset for video detection tasks. It covers 30 of ImageNet's 200 detection categories; the training set consists of 3,862 videos and the validation set contains 555 videos. We evaluated our method on the validation set using mean average precision (mAP) as the accuracy metric.

B. IMPLEMENTATION DETAILS 1) CENTERNET
In order to compare accuracy and real-time performance with two-stage feature aggregation algorithms, we select ResNet-101 as our backbone network. Following the same structure as the original CenterNet paper, [21] adds three deconvolution layers to the output of the residual network to obtain a clearer feature map; [24] sets the channels of the three deconvolution layers to 256, 128, and 64, and adds a 3 × 3 deformable convolution layer before each deconvolution layer to enhance the generalization ability of the feature map.

2) HEATMAP ROI EXTRACTION
Our heatmap ROI extraction improves the peak-extraction method of [24] by incorporating NMS. The object features in a heatmap are concentrated in one area, so redundant heatmap ROIs are generated when peak points are close to each other. We extract the peaks of each category independently on the heatmap, testing whether the value at a point is the maximum of the 7 × 7 window centered at that point, and keep the top 10 values. The peaks of the 30 categories are then compared, and the peak points of the top five categories with the highest peaks are retained. NMS with IoU = 0.3 is performed on the bounding boxes of the generated heatmap ROIs to select the essential heatmap ROIs with significant object features.
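The per-category peak rule and the top-five-category filter can be sketched as follows. The brute-force window scan stands in for the 7 × 7 max-pooling trick (a real implementation would use pooling for speed), and the (C, H, W) layout and function name are assumptions for illustration.

```python
import numpy as np

def extract_peaks(heatmap, kernel=7, topk=10, top_classes=5):
    """Sketch of the peak-extraction rule: a point is a peak when it
    equals the maximum of the (kernel x kernel) window centered on it;
    the topk peaks are kept per category, and only the top_classes
    categories with the largest peak sums go on to ROI aggregation.
    Assumes a (C, H, W) heatmap."""
    C, H, W = heatmap.shape
    r = kernel // 2
    peaks = {}
    for c in range(C):
        scores = []
        for y in range(H):
            for x in range(W):
                win = heatmap[c, max(0, y - r):y + r + 1,
                              max(0, x - r):x + r + 1]
                if heatmap[c, y, x] >= win.max():
                    scores.append((heatmap[c, y, x], x, y))
        scores.sort(reverse=True)
        peaks[c] = scores[:topk]
    # rank categories by the sum of their retained peak values
    ranked = sorted(peaks, key=lambda c: -sum(s for s, _, _ in peaks[c]))
    return {c: peaks[c] for c in ranked[:top_classes]}
```

Summing the 10 retained peak values per category and keeping the top five categories matches the selection described above.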

3) HEATMAP ROI AGGREGATION
Sampling adjacent frames from too small a range causes feature redundancy: relation features within a small window are highly similar, making it difficult to aggregate other poses or shapes of the object, and network robustness decreases. We take τ = 50 in the training process, sampling adjacent frames within 50 frames before and after the key frame. In the inference process we reduce computation by filtering heatmap ROIs, but a large span of adjacent frames still occupies memory, so we reduce the sampling range by setting τ = 12. The heatmap ROIs of four adjacent frames are used to enhance the key-frame heatmap. A comparison of heatmap ROI aggregation with different numbers of frames is given in the experimental section below.
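The frame-sampling scheme can be sketched as below. Drawing the support frames uniformly from the ±τ window is our assumption; the paper only specifies the window sizes (τ = 50 for training, τ = 12 for inference) and the number of support frames (four).

```python
import random

def sample_adjacent_frames(key_idx, num_frames, tau, n_samples=4, seed=None):
    """Sketch of adjacent-frame sampling: draw n_samples support frames
    from the window [key_idx - tau, key_idx + tau], clipped to the video
    length and excluding the key frame itself. Uniform sampling is an
    assumption; the paper uses tau=50 in training and tau=12 at inference."""
    rng = random.Random(seed)
    window = [i for i in range(max(0, key_idx - tau),
                               min(num_frames, key_idx + tau + 1))
              if i != key_idx]
    return rng.sample(window, min(n_samples, len(window)))
```

For example, `sample_adjacent_frames(100, 300, tau=12)` draws four support frames from frames 88-112, excluding frame 100.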

4) TRAINING AND INFERENCE
Following common protocols in [5], [20], [23], [25], we train our model on the intersection of the ImageNet VID and DET datasets (30 classes). The input image is resized to 512 × 512 pixels; the resolution of the output heatmap of CenterNet increases with the size of the input image, and detection accuracy increases with it. The network model was trained on an RTX 3090 GPU. We used Adam as the optimizer with a batch size of 16 and trained for 100 epochs. We used the PyTorch scheduler ReduceLROnPlateau to adaptively adjust the learning rate, with an initial learning rate of 1e-4. In the inference phase, we use the full validation set for our experiments.
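To clarify the adaptive schedule used above, here is a toy re-implementation of the ReduceLROnPlateau policy in plain Python: when the monitored loss stops improving for a number of evaluations, the learning rate is cut. The `factor` and `patience` values are illustrative defaults, not taken from the paper.

```python
class PlateauLR:
    """Toy mimic of PyTorch's ReduceLROnPlateau: multiply the learning
    rate by `factor` once the monitored loss has failed to improve for
    more than `patience` consecutive steps. Hyperparameters here are
    illustrative, not the paper's settings (apart from lr=1e-4)."""

    def __init__(self, lr=1e-4, factor=0.1, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad = float("inf"), 0

    def step(self, loss):
        if loss < self.best:          # improvement: reset the counter
            self.best, self.bad = loss, 0
        else:                         # stagnation: count, then reduce
            self.bad += 1
            if self.bad > self.patience:
                self.lr *= self.factor
                self.bad = 0
        return self.lr
```

In training code this logic is simply `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)` with `scheduler.step(val_loss)` called after each validation pass.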

C. MAIN RESULTS
Table 1 compares our model with the state of the art. Compared with the CenterNet baseline, our model is 5 ms slower in runtime but 5% better in accuracy. CenterNet HP propagates the heatmap along the time series and predicts object center points by weighting the propagated heatmap with the heatmap of the key frame. In contrast, we extract object heatmap features along the time series and aggregate the object heatmap features of multiple frames onto the key frame; our method improves accuracy over CenterNet HP by 2.1%.
In comparison with two-stage multi-frame aggregation video detectors, we discard the RPN filtering and regression of the two-stage models and instead use CenterNet's heatmap centers and size information to extract object features from multiple frames of the video stream and aggregate them onto the key frame. The vast majority of two-stage models cannot achieve real-time video detection at 25 FPS; our model runs at 36 FPS, at least a 4× speedup over the two-stage video detectors.
The accuracy of most video detectors can be further improved by post-processing. We use Seq-NMS [9], which uses temporal information to re-score candidate boxes and avoids the incorrect selections that plain NMS can make. As shown in Table 2, our model achieves 80.5% mAP at 33 FPS with ResNet-101.

D. ABLATION STUDY
In order to examine the impact of the different modules in our model, we performed a number of comparative experiments to investigate how changes in these modules affect the final performance.

TABLE 1.
Performance comparison between our model and state-of-the-art video object detection models on the ImageNet VID validation set. All models use ResNet-101 as the backbone network. All accuracies are without post-processing; accuracies and runtimes are taken from the corresponding papers, and - indicates data not given in the source paper.

TABLE 2.
Performance comparison between our model and state-of-the-art video object detection models with post-processing methods. -indicates data not given in the source paper.

1) SELECTION OF HEATMAP ROI
The CenterNet paper uses a 3 × 3 max-pooling operation to extract peaks greater than or equal to their 8 neighbors on each category's heatmap and keeps the top 100 peaks. Table 3 compares the impact of different peak-selection schemes on model performance.
We first tried the original CenterNet peak-extraction method and generated the corresponding heatmap ROIs from those peaks. Although there is distance between the peaks, the heatmap ROIs overlap extensively, and only a single-digit number of heatmap ROIs remain after NMS filtering. Reducing the retained peaks to 10 points loses almost no accuracy and cuts runtime by 0.4 ms. We achieve the best accuracy of 78.8% when using 7 × 7 max pooling for peak extraction; continuing to expand the pooling size does not improve performance. We also tried global max pooling for peak extraction, but the peaks are then too concentrated, producing heatmap ROIs with too much overlap, which is not conducive to detecting small objects and multiple objects.
Considering that when multiple classes of objects appear in a video, aggregating heatmaps of only the highest-peaked class does not enhance detection of the other classes, we perform heatmap ROI aggregation for the top N classes with the highest peaks. As shown in Table 4, the value of N affects both the accuracy and the real-time performance of the model.
We first designed the model to aggregate heatmap ROIs for a single category, selecting the highest-peaked category during training and inference. In practice, however, multiple categories of objects exist in a video stream, and the recognition results of the other categories are not enhanced. Setting the number of aggregation categories N to 3 improves accuracy by 0.7% and increases runtime by 0.3 ms. Increasing N further, the maximum accuracy is achieved at N = 5.

2) PARAMETER α
We investigated the effect of the weight parameter α in (5), as shown in Table 5. As a hyperparameter, α determines the proportion of the relation features that enhances the key-frame heatmap ROI. α has a linear effect on the heatmap features, and the accuracy of the model reaches its maximum at α = 0.1; continuing to increase α leads to a decrease in accuracy. A possible reason is that the feature values within the heatmap ROI become too high, pushing the prediction Ŷ_{x,y,c} too close to 1 and increasing the loss.

FIGURE 5.
Examples of qualitative results in the ablation experiments. The four images are taken from four discrete frames of a video. (a) With CenterNet as the video detector, three frames fail to detect small objects and the detection accuracy of large objects is low. (b) After single-category heatmap ROI aggregation, the small object is lost in only one frame, and the detection accuracy of large objects improves by 0.14 on average. (c) After five-category heatmap ROI aggregation, the detection accuracy of large objects is essentially unchanged, but that of small objects improves by 0.21 on average.

3) NUMBER OF FRAMES AGGREGATED
As shown in Table 6, we conducted ablation experiments with different numbers of aggregated frames. Accuracy grows with the number of frames aggregated, and the best trade-off is achieved at four frames. When aggregating five frames, real-time performance deteriorates while accuracy hardly grows.

V. CONCLUSION
In this paper, we propose a multi-frame heatmap ROI aggregation module for CenterNet to improve the accuracy of real-time video object detection. In our method, peak detection and the regressed box size (width and height) are used to obtain heatmap ROIs, each of which corresponds to a potential object heatmap region. Multiple heatmap ROIs of adjacent frames are then aggregated onto the key frame to enhance the heatmap features of candidate objects. Experiments on the ImageNet VID dataset validate the effectiveness of the proposed method. Compared to CenterNet Heatmap Propagation [23], our method achieves a 2.1% mAP improvement with almost no increase in frame processing time.