Improvement of Non-Maximum Suppression in RGB-D Object Detection

The non-maximum suppression (NMS) algorithm is widely used in the post-processing stage of object detection. However, because its constraint condition is simple, NMS cannot effectively eliminate missed and false detections. To address the poor detection performance of the traditional NMS algorithm in dense scenes with highly overlapping objects, we design an RGB-D object detection network based on the YOLO v3 framework that fuses RGB and depth information level by level in the middle layers (metaphase fusion), and we propose an improved NMS algorithm that incorporates depth characteristics. For highly overlapping detection boxes, the depth of the object in each box is used to determine whether the boxes contain the same object. The average depth of the pixels inside each detection box is calculated as a penalty term, which is added to the detection box score to obtain a new constraint condition for non-maximum suppression. Experimental results on the NYU Depth V2 dataset show that the mean average precision (mAP) of the proposed Depth Fusion NMS algorithm is 0.8%, 0.5% and 0.3% higher than those of the Greedy-NMS, Soft NMS-L and Soft NMS-G methods, respectively. The comparison shows that our method not only detects more overlapping objects but also achieves better object localization accuracy.


I. INTRODUCTION
Object detection is an important research direction in the field of computer vision. It can be understood as a visual algorithm that gives the computer a human-like recognition ability: from an image obtained by a sensor, it identifies the object categories in a scene and obtains their locations. In recent years, with the rapid development of deep learning and neural network technology, research on object detection has led to breakthroughs in security monitoring, automatic driving, human-computer interaction and other areas [1]. Object detection algorithms based on convolutional neural networks can be divided into three steps [2]: feature learning and object extraction, object classification and location regression, and non-maximum suppression to select the optimal detection boxes. Non-maximum suppression (NMS), used in the last step, was first proposed for edge detection and was then applied to object detection, face recognition and other fields [3], [4]. NMS is an important method for the post-processing step of a detection model. Current studies mainly focus on feature learning, feature extraction and classification, and there has been little improvement in non-maximum suppression algorithms [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Jingchang Huang.
With the popularity of consumer-level depth sensors (such as Kinect), the depth information of objects in a scene can be obtained easily, which greatly promotes the application of RGB-D images in related fields such as object detection. The gray-scale value of each pixel in a depth image represents the distance from the camera to the corresponding point in the RGB image. References [6]-[9] and other papers have shown that adding one-dimensional depth information to an RGB network can effectively reduce the impact of illumination changes and other factors on object detection results, which improves the accuracy and recall rate of the detection model. However, RGB-D object detection methods based on convolutional neural networks (CNNs) mostly study the fusion of RGB and depth features and the network structure. The traditional NMS algorithm is still used in the post-processing stage to select the optimal detection boxes by comparing prediction scores and by comparing the IoU value of overlapping detection boxes with a given threshold T. However, T is usually chosen empirically, which is likely to cause instability in the detection accuracy. In view of these problems, this paper improves the NMS algorithm for RGB-D object detection: it adjusts the detection box score by using the depth characteristics of different objects and obtains the optimal detection box for each object, thereby effectively reducing the false and missed detection rates of the detection model. We applied the improved NMS algorithm in the popular detection framework YOLO v3 [10], trained and tested the network model on the challenging RGB-D dataset NYU Depth V2 [11], and obtained a high mean average precision (mAP).

II. RELATED WORK
The commonly used non-maximum suppression algorithm follows a greedy strategy and uses only the overlap area between boxes for suppression. To improve accuracy in the post-processing stage of object detection, some researchers have made corresponding improvements to the NMS algorithm. In 2015, reference [5] combined the scale ratio, the detection score ratio and the peripheral window information in an NMS algorithm based on ACF (aggregate channel features). This significantly improved the accuracy of the algorithm but also increased the time consumption, and the improvement targets pedestrian detection only, so it lacks generality.
In 2016, to address the problem that the traditional NMS constraint condition is too simple to eliminate overlapping detections efficiently, Zhang et al. [12] proposed an improved, simplified non-maximum suppression algorithm. It adds the "completed covered detection suppression" and "PASCAL VOC overlap criterion" constraints, which calculate the coverage ratio of the intersection area to the selected detection bounding box and the overlap ratio of the combined area, respectively. The experimental results show that the improved method can reduce the error and improve the detection performance, but it still involves threshold selection and misses small objects.
An improved NMS method was proposed in reference [13] in 2017. An NMS loss term, computed from the NMS localization error, is added to the loss function of the network. The NMS loss is similar to the classification loss, and the NMS error can be continuously reduced by back propagation during network training. Although the detection accuracy can be improved in this way, the NMS loss increases the training time of the network and makes the network parameters redundant, which works against a lightweight detection model.
In 2018, Qiu et al. [14] determined that the performance of the NMS algorithm is substantially affected by highly overlapping objects and that its localization accuracy depends only on the highest-scoring detection. Therefore, they proposed an accurate NMS method that gradually merges highly overlapping detection boxes in an iterative manner, taking advantage of Regression-NMS [15] and Soft NMS while eliminating their disadvantages. The experimental results show that this method can not only detect more overlapping objects but also achieve better object localization accuracy. In the same year, Zhao et al. proposed an improved NMS algorithm in reference [2]. First, a proportional penalty factor is calculated for each detection box according to its IoU with the preselected detection box. Then, the confidence score of each detection box is multiplied by its penalty factor, so that the scores are reduced one by one. Finally, after several iterations, the detection boxes whose scores are lower than the threshold are removed. Experiments showed that the improved NMS algorithm can effectively preserve true object detection boxes and remove false positive detection boxes, thus reducing the missed and false detection rates of the NMS algorithm. Both of these algorithms improve the detection accuracy in an iterative manner, but the iterative process increases the amount of computation and the running time, and it still cannot solve the problem of missed detections among dense, highly overlapping objects.
Although the traditional NMS algorithm is used in the post-processing stages of popular object detection algorithms such as SSD [16], Faster R-CNN [17] and YOLO v3 [10] and achieves good performance, it is still an obviously flawed greedy algorithm. This paper aims to improve the NMS algorithm in a double-channel RGB-D convolutional neural network by using object depth characteristics, effectively reducing the localization error of the detection box and decreasing the missed detection rate for dense, highly overlapping objects, thereby improving the accuracy of the detection model.

III. ALGORITHM DESIGN
In this section, we introduce the principle of the traditional NMS algorithm and explain our improved NMS algorithm process in detail.

A. TRADITIONAL NMS ALGORITHM
Non-maximum suppression can be understood as a local maximum search, which has very important applications in the field of computer vision [18]. In object detection, the NMS algorithm is used to extract the prediction box with the highest score. Features are extracted from sliding windows, and after the classifier assigns a class, each detection box receives a score; however, sliding windows yield many detection boxes that contain or largely intersect other boxes. NMS is then needed to keep the prediction box with the highest score in each neighborhood (the box most likely to contain an object) and to suppress the lower-scoring prediction boxes.
VOLUME 7, 2019
The process of the non-maximum suppression algorithm is shown in Fig. 1. The principle of NMS is not complicated: it mainly involves calculating the IoU of each pair of overlapping detection boxes and comparing it with the threshold T to determine the final detection boxes. IoU (intersection-over-union) is the ratio of the intersection area to the union area of two detection boxes:

$$\mathrm{IoU}(BB_i, BB_j) = \frac{\mathrm{area}(BB_i \cap BB_j)}{\mathrm{area}(BB_i \cup BB_j)} \tag{1}$$

where $BB_i$ and $BB_j$ are two different detection boxes and area indicates the detection box area.
Given the set B of all detection boxes and their corresponding confidence scores S, first select the detection box M with the highest score, remove it from B and add it to the final detection result D. Then calculate the IoU of M with each remaining detection box in B, and remove from B every box whose IoU exceeds the threshold T. Repeat this process until B is empty. The specific steps are as follows:
1) Sort the scores of all the detection boxes, then select the highest score and its corresponding box;
2) Scan the remaining detection boxes; if the overlap (IoU) of a box with the current highest-scoring box is larger than the threshold T, delete that box;
3) Continue to select the detection box with the highest score from the unprocessed detection boxes and repeat the above process.
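The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments; the corner-format box representation and the function names `iou` and `greedy_nms` are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is assumed


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, as in (1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float], t: float) -> List[int]:
    """Return the indices of the boxes kept by greedy NMS, highest score first."""
    # Step 1: sort the candidate indices by descending score (the set B).
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []  # the final detection result D
    while remaining:
        m = remaining.pop(0)  # box M with the current highest score
        kept.append(m)
        # Step 2: delete every box whose IoU with M is larger than T.
        remaining = [i for i in remaining if iou(boxes[m], boxes[i]) <= t]
        # Step 3: the loop repeats with the next highest-scoring survivor.
    return kept
```

With two boxes whose IoU is about 0.68, a threshold of 0.5 suppresses the lower-scoring one, whereas a threshold of 0.7 keeps both, which illustrates how sensitive the result is to the choice of T.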

B. DEPTH FUSION NMS ALGORITHM
The non-maximum suppression algorithm is used in the post-processing stage of object detection and plays an important role in ensuring the accuracy of detection box localization. However, the traditional NMS has two obvious defects. First, the selection of the optimal detection box depends only on the prediction score, which lacks robustness. Second, two objects that are close together will not both be detected, as shown in Fig. 2. To solve these problems, we propose an NMS post-processing method that uses the depth characteristics of RGB-D images. The goal is to reduce the missed detection rate and improve the localization accuracy by introducing depth fusion terms. When using RGB-D images for object detection, we take YOLO v3 and Darknet-53 as the basic framework and the network structure of the convolutional neural network, respectively. Inspired by the RGB-D network with level-by-level feature fusion proposed in [19], we design a double-channel network structure that extracts RGB and depth features separately in the early stage and then integrates the depth features into the branch of each feature scale in the middle of the RGB network before prediction and classification. Finally, in the post-processing stage, we apply the improved NMS method based on depth fusion. The overall network model structure is shown in Fig. 3.
For the feature fusion of RGB-D images, the most convenient method is to use the depth image as the fourth channel of the RGB image, combining the two types of images or feature maps into a four-channel input that is fed to the convolutional neural network for feature extraction and object prediction. Another method is to extract the RGB and depth features simultaneously in two networks and to merge the features of the two modes in the fully connected layer. These two methods are common network structures for RGB-D object detection, but such splicing can only learn a simple linear combination of RGB and depth information and cannot effectively explore the deeper correlation between the two modes. Therefore, the improvement in detection performance after fusion is not obvious. In this paper, we propose an improved two-channel network structure with level-by-level feature fusion. The correlation between the RGB and depth modes is learned from the semantic feature expression of the middle layers. We use the RGB channel as the main network; the depth network features are merged with the three scale feature layers of the main network, and the merged features are sent to the network branches of different scales for RGB-D object detection.
The fusion strategy learns the correlation between the RGB and depth modes by sharing weights, but the semantic information contained in the input feature maps $X_{RGB}$ and $Y_{Depth}$ is not completely equivalent. To fuse the two features more accurately, we use the "concatenate" feature fusion mode of the DenseNet network [20] to combine the two kinds of information. The "concatenate" operation merges the features extracted by multiple convolution kernels or the information of output layers. Fusion here means merging along the feature channel dimension, which increases the number of features describing the image itself and benefits the final classification. When merging channels, we use the detection accuracies of the individual RGB and depth networks to determine the weights of the two modes.
Assume that the inputs of the RGB and depth channels are $x_1, x_2, \cdots, x_n$ and $y_1, y_2, \cdots, y_n$, respectively. The output of the combined channel is given in (2), where $\alpha$ and $\beta$ are the fusion weights of the RGB and depth features, respectively; $W_r$ and $W_d$ are the weights obtained by training the two corresponding networks; $ACC_{rgb}$ and $ACC_{depth}$ are the accuracies of the detection results on the RGB and depth images, respectively; and $Z_{ri}$ and $Z_{di}$ are the outputs of the $i$th neuron of the RGB and depth networks, respectively.
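To make the idea of accuracy-weighted concatenation concrete, the following sketch shows one plausible reading of it. The normalization of the weights (each stream's share of the summed accuracy) and the function names `fusion_weights` and `weighted_concat` are assumptions made for illustration; they are not taken from the equations of this paper.

```python
from typing import List, Tuple


def fusion_weights(acc_rgb: float, acc_depth: float) -> Tuple[float, float]:
    """Assumed normalization: alpha and beta are each stream's share of the
    summed detection accuracy, so that alpha + beta = 1."""
    total = acc_rgb + acc_depth
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return acc_rgb / total, acc_depth / total


def weighted_concat(z_rgb: List[float], z_depth: List[float],
                    alpha: float, beta: float) -> List[float]:
    """Scale the outputs of each stream, then concatenate them along the
    channel axis (RGB outputs first, depth outputs second)."""
    return [alpha * z for z in z_rgb] + [beta * z for z in z_depth]
```

Under this reading, the stream that detects more accurately on its own contributes proportionally more to the fused feature, while concatenation keeps the two sets of channels separate for the later layers to combine.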
In the Depth Fusion NMS module of Fig. 3, we first compare the IoU value U of two overlapping detection boxes with the threshold T. If U < T, the detection box is retained. If U ≥ T, the depth values of the center pixels of the two detection boxes in the depth image are compared. If there is a significant difference, the two boxes contain two different objects, and both detection boxes should be preserved. If there is no significant difference, the two boxes contain the same object. In that case, we compare the scores S that fuse the depth information, and the box with the higher score is taken as the optimal detection box. The score S is given in (5), where $Score_i$ is the score of the $i$th detection box, $D_i$ is the average gray value of the pixels in the $i$th detection box and represents its average depth, and M and N are the width and height, respectively, of the detection box. We take the depth value of the center pixel of a detection box as an approximate estimate of the depth of the object in the box. If the center pixel depth values of two detection boxes are similar (their difference is less than the empirical value), then the two boxes are considered to detect the same object. The smaller the average pixel depth, the larger the proportion of the foreground object in the detection box and the more accurate the localization. Therefore, the optimal box is determined by combining the detection box score and the average pixel depth (as shown in (5)). The pseudocode for this process is described in Table 1.
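The decision rule described above can be sketched as follows. This is an illustrative reading of the prose, not a reproduction of Table 1. In particular, `fused_score` is a hypothetical stand-in (the detection score minus a scaled average depth) that only preserves the stated preference for boxes with a smaller average depth; the integer pixel box format, the parameter `lam` and all function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixels; x2, y2 exclusive (assumed)
Depth = Sequence[Sequence[float]]  # depth[y][x]: gray value of the depth image


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_depth(depth: Depth, box: Box) -> float:
    """Depth value at the center pixel of the box."""
    return depth[(box[1] + box[3]) // 2][(box[0] + box[2]) // 2]


def mean_depth(depth: Depth, box: Box) -> float:
    """Average gray value D_i over the pixels inside the box."""
    x1, y1, x2, y2 = box
    total = sum(sum(row[x1:x2]) for row in depth[y1:y2])
    return total / ((x2 - x1) * (y2 - y1))


def fused_score(score: float, depth: Depth, box: Box, lam: float = 1.0) -> float:
    """Hypothetical stand-in for the fused score S: a smaller average depth
    gives a higher value. This is NOT the formula of the paper."""
    return score - lam * mean_depth(depth, box) / 255.0


def depth_fusion_nms(boxes: Sequence[Box], scores: Sequence[float], depth: Depth,
                     t: float = 0.6, eps: float = 3.0, lam: float = 1.0) -> List[int]:
    """Return the indices of the kept boxes under the depth-aware rule."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        suppressed = False
        for pos, k in enumerate(kept):
            if iou(boxes[i], boxes[k]) < t:
                continue  # U < T: low overlap, no conflict with box k
            gap = abs(center_depth(depth, boxes[i]) - center_depth(depth, boxes[k]))
            if gap >= eps:
                continue  # U >= T but depths differ: treated as two objects
            # Same object: the box with the higher fused score survives.
            if fused_score(scores[i], depth, boxes[i], lam) > \
                    fused_score(scores[k], depth, boxes[k], lam):
                kept[pos] = i
            suppressed = True
            break
        if not suppressed:
            kept.append(i)
    return kept
```

For two boxes with an IoU of about 0.71, the sketch keeps both when their center depths differ by more than `eps` and only one when the depths are similar, which is the behavior that distinguishes this rule from greedy NMS.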

IV. EXPERIMENT AND ANALYSIS

A. MODEL TRAINING PROCESS
We used the NYU Depth V2 RGB-D dataset to train the network and to test the performance of the improved algorithm. It is a challenging indoor scene classification database [11] (Fig. 4). Model training, feature fusion, and object detection and recognition were implemented in Python 3.5 and run with GPU acceleration under CUDA 9.0. The specific configuration of the experimental environment is shown in Table 2. We select 1250 RGB-D images of different indoor scenes with completed depth from the NYU Depth V2 dataset, covering 14 categories (tv, chair, desk, whiteboard, door, trash can, people, blackboard, cabinet, lamp, sofa, bed, phone and toilet); 1000 images are used for training and 250 for testing. The experiment uses batch normalization, a batch of 64 images per iteration, and 30200 iterations. In the training stage, we use stochastic gradient descent with a momentum term of 0.9. The initial learning rate is 0.001, and the decay coefficient is set to 0.0005. To better observe the training process and evaluate the model performance, we track the loss, the intersection-over-union (IoU) and the recall rate during training, which are visualized in Fig. 5. Since the number of iterations is large, the curves are downsampled so that the training behavior can be observed clearly. In Fig. 5, (a) and (b) show the average IoU and recall curves with sampling rates of 0.20% and 0.25%, respectively, over the whole training process. Both curves rise with fluctuations during training; the IoU curve finally stabilizes at approximately 0.85, and the recall curve eventually stabilizes at 0.94. Fig. 5 (c) shows the loss curve of the first 500 batches. The loss value drops rapidly during the first 100 training batches, then changes extremely slowly, and finally stabilizes at 0.28.
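As an illustration of the optimizer settings quoted above, the following sketch performs one parameter update of stochastic gradient descent with momentum. It assumes that the "decay coefficient" denotes L2 weight decay, which the text does not state explicitly, and the function name `sgd_momentum_step` is hypothetical.

```python
from typing import List, Tuple


def sgd_momentum_step(w: List[float], grad: List[float], velocity: List[float],
                      lr: float = 0.001, momentum: float = 0.9,
                      decay: float = 0.0005) -> Tuple[List[float], List[float]]:
    """Return the updated (weights, velocity) after one SGD-with-momentum step."""
    new_w: List[float] = []
    new_v: List[float] = []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + decay * wi         # gradient plus the L2 weight-decay term (assumed)
        v = momentum * vi - lr * g  # velocity accumulates a decayed sum of past steps
        new_v.append(v)
        new_w.append(wi + v)
    return new_w, new_v
```

With a constant gradient, the velocity grows in magnitude over successive steps, which is how the momentum term speeds up progress along a consistent descent direction.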

B. QUALITATIVE ANALYSIS OF THE EXPERIMENTAL RESULTS
Using the Depth Fusion NMS algorithm proposed in this paper, we run detection on the 250 test images from the NYU Depth V2 dataset with the trained fusion network and compare the results with those obtained using the traditional NMS algorithm, as shown in Fig. 6. The (a) rows show the detection results of the traditional NMS post-processing, and the (b) rows show the improved post-processing results based on Depth Fusion NMS.
The experiment sets the IoU threshold T to 0.6 and the depth error empirical value ε to 3. It can be seen from Fig. 6 that when two objects in the scene overlap heavily, the traditional NMS algorithm has difficulty retaining boxes for both objects, whereas the Depth Fusion NMS algorithm can distinguish two adjacent objects with different depths. This result indicates that the improved NMS algorithm proposed in this paper can effectively increase the recall rate of the detection model and improve the localization accuracy of the system.

C. QUANTITATIVE COMPARISON OF THE NMS ALGORITHMS
To further verify the performance of the Depth Fusion NMS algorithm, we compare it with three algorithms on the NYU Depth V2 dataset: Greedy-NMS, Soft NMS-L [21] and Soft NMS-G [21]. In addition, we compare the performance of the four algorithms in the RGB, depth and RGB-D networks. The IoU threshold T is set to 0.6, and the parameter σ in the Soft NMS-G algorithm is set to 0.3. We calculate the average precision (AP/%) of each of the fourteen categories and the mean average precision (mAP/%). We also compare the average detection time of the different algorithms with the different networks. The results are shown in Table 3.
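For reference, the two Soft NMS baselines differ from Greedy-NMS in that they decay the score of an overlapping box instead of deleting it. The sketch below follows the commonly published linear and Gaussian rescoring rules of Soft-NMS; the function names are hypothetical, and we assume these standard forms correspond to the Soft NMS-L and Soft NMS-G variants of [21].

```python
import math


def soft_nms_linear(score: float, iou: float, t: float) -> float:
    """Linear decay: scale the score by (1 - IoU) once the IoU reaches T."""
    return score * (1.0 - iou) if iou >= t else score


def soft_nms_gaussian(score: float, iou: float, sigma: float) -> float:
    """Gaussian decay: scale the score by exp(-IoU^2 / sigma) for any IoU."""
    return score * math.exp(-(iou * iou) / sigma)
```

With T = 0.6 and σ = 0.3 as in the experiment, a box with a score of 0.8 and an IoU of 0.75 with the selected box drops to 0.2 under the linear rule, while the Gaussian rule lowers scores smoothly even below the threshold.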
As seen from Table 3, the Depth Fusion NMS algorithm achieves the highest AP in most categories, and its mAP is 0.8%, 0.5%, and 0.3% higher than those of the Greedy-NMS, Soft NMS-L, and Soft NMS-G algorithms, respectively. In addition, the RGB-D network is significantly more accurate for object detection than the individual RGB and depth networks. The average detection time for one image with the Depth Fusion NMS algorithm is 0.436 s. Because the improved NMS algorithm mainly aims to increase the recall rate of objects with high overlap and has little effect on objects that are far apart, the overall performance improvement of the detection model does not seem obvious. To show the recall of the Depth Fusion NMS algorithm more intuitively, we selected several sets of scenes with dense objects and compared the detection results of the four NMS algorithms. As shown in Fig. 7, our method has an obvious effect on dense objects with high overlap (such as chairs and desks): it not only reduces the missed detection rate but also achieves more accurate object localization by using the average depth in the detection box.
In addition, we tested the effect of different ways of fusing RGB-D information on the results. Fig. 8 shows two alternative fusion modes. Fig. 8 (a) shows prophase fusion, in which the RGB and depth images are merged into a four-channel image at the data input stage for feature extraction; Fig. 8 (b) shows later fusion, in which the features of the two modes are extracted by two separate convolutional neural networks and then fused in the final fully connected layer. The three RGB-D fusion models (prophase, later and metaphase fusion) were tested using the Depth Fusion NMS algorithm proposed in this paper. Table 4 shows the mAP (%) over all classes, where "metaphase fusion" is the mid-stage, level-by-level fusion strategy proposed in this paper. We found that the metaphase RGB-D fusion strategy combined with the Depth Fusion NMS algorithm provides better detection performance than the other schemes.
To quantitatively evaluate the detection performance of all the methods under different IoU thresholds, we varied the threshold from 0.3 to 0.9. We obtained the mAP values of the different methods at each IoU threshold and drew a line graph, as shown in Fig. 9. Overall, the larger the IoU threshold, the lower the mAP, mainly because more overlapping boxes are not filtered out. When the IoU threshold is low, the performance difference among the four NMS algorithms is small, but the difference becomes obvious as the threshold increases beyond 0.6, and the mAP falls more steeply when the IoU threshold exceeds 0.7. The results show that the Depth Fusion NMS algorithm proposed in this paper has better object detection performance under larger IoU thresholds.

V. CONCLUSION
The post-processing stage is an indispensable step in current popular object detection methods. As a classic post-processing method, NMS does not sufficiently eliminate missed and false detections because of its single constraint condition and improper IoU threshold selection. In this paper, based on the advantages of depth images in RGB-D object detection, we designed an improved NMS algorithm based on depth fusion, which adds a discrimination condition for objects based on the depth information. The experimental results on the NYU Depth V2 dataset show that, compared with Greedy-NMS, Soft NMS-L and Soft NMS-G, the proposed algorithm significantly improves the detection of dense objects with high overlap at higher IoU thresholds. It can effectively reduce the missed and false detection rates, thereby improving the accuracy of the RGB-D object detection model.
Like the traditional non-maximum suppression algorithm, the Depth Fusion NMS algorithm also faces the problem of IoU threshold selection, and it is difficult to avoid missed detections of highly overlapping objects with similar depths. Therefore, we will continue to research how to simplify the IoU threshold-setting process of the NMS algorithm and how to reduce missed detections of objects at similar depths.