ReFPN-FCOS: One-stage Object Detection for Feature Learning and Accurate Localization

One-stage object detectors are simple and efficient; however, their simplistic structures prevent them from extracting sufficient object features. In addition, the classification score does not reflect the actual localization quality of a candidate box, so using the classification score alone as the box position score in the non-maximum suppression (NMS) stage is inaccurate. These two shortcomings degrade detection accuracy. In this paper, a novel feature pyramid architecture named refined feature pyramid network (ReFPN) is introduced to obtain better object features. ReFPN designs a refined module in parallel with the feature pyramid network (FPN) to extract the semantic features of objects; the extracted features are then used to optimize the FPN features by summation. In addition, we design a refined center-ness (RCenter-ness) branch that predicts the position score of each point on the feature map to improve localization accuracy. The predicted position score is multiplied by the classification score to obtain a final position score that correlates more strongly with localization accuracy. This final position score is fed into the subsequent NMS, which improves localization accuracy. The proposed method is named ReFPN-FCOS. Extensive experiments on the COCO2017 dataset demonstrate the effectiveness of ReFPN-FCOS in improving both classification and localization accuracy. Its average precision is 1.1% and 1.3% higher than that of FCOS when using ResNet50 and ResNet101 as the backbone, respectively. Code download link: https://github.com/xjl-le/mmdete


I. INTRODUCTION
In recent years, artificial intelligence has remained a hot topic. Supported by deep learning theory, object detection has developed rapidly, and a large number of researchers have proposed excellent detection models by analyzing the problems in object detection. These models fall into two categories: multi-stage object detectors [1,2,3] and one-stage object detectors [4,5,6,7]. Multi-stage object detectors apply detection networks to the objects multiple times. In this process, a large number of invalid prediction boxes are eliminated, and the prediction boxes of positive samples are refined repeatedly. Multi-stage object detectors thus obtain more detailed classification and localization of objects, so most current multi-stage object detectors outperform one-stage object detectors in detection accuracy. However, multi-stage object detectors such as Faster R-CNN [8], Cascade R-CNN [9] and Hybrid Task Cascade (HTC) [10] are typically too large: they require high-performance devices and have long detection times, so they cannot meet current real-time detection requirements. One-stage object detectors can be implemented in two ways: anchor-based methods [11,12] and anchor-free methods [13,14,15,16]. Current one-stage object detectors include SSD [17], YOLOv1, v2, v3 [18,19,20] and FCOS [21], which directly use a fully convolutional network (FCN) to classify and locate objects. The main advantages of one-stage object detectors are light weight and fast speed, which avoid the long detection times of multi-stage object detectors.
Although the detection accuracies of one-stage object detectors are lower than those of multi-stage object detectors, one-stage detectors are more widely used in real-time object detection. Therefore, it is significant to construct a high-performance one-stage object detector. To explore the gap between one-stage and multi-stage object detectors, we tested and compared the current state-of-the-art one-stage detector FCOS and the multi-stage detector Cascade R-CNN. As shown in Figure 1, Cascade R-CNN detects more objects, especially small objects, and its detection of occluded objects is also better than that of FCOS. Thus one-stage object detectors have substantial room for improvement in detecting small and occluded objects.
In order to improve the accuracy of one-stage object detectors, two approaches are employed: 1) obtaining more representative object features; 2) obtaining more accurate position scores for the predicted boxes. Most object detectors use the feature pyramid network (FPN) [22] structure to optimize the features of different levels. FPN distributes objects of different sizes to different feature layers for detection. As the network deepens, the semantic information of the feature maps becomes richer, while the shallow features retain richer location information. To give the lower-level features more semantic information, FPN proposed a top-down structure that transmits feature maps from high levels to low levels, enhancing the semantic information of the shallow-level feature maps. FPN alleviated the difficulty of multi-scale detection and obtained better object features.
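As a reference for the discussion below, the FPN top-down pathway can be sketched in a few lines of PyTorch. This is a minimal illustration with randomly initialized weights and dummy inputs, not the detector's actual code; the channel numbers follow the C3-C5 setting used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal FPN top-down pathway (illustration only)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs reduce C3/C4/C5 (512/1024/2048 ch) to 256 ch
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs smooth the merged maps into P3/P4/P5
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):                    # feats = [C3, C4, C5]
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # top-down: upsample the higher level and add element-wise
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [sm(lv) for sm, lv in zip(self.smooth, laterals)]

c3 = torch.randn(1, 512, 80, 80)
c4 = torch.randn(1, 1024, 40, 40)
c5 = torch.randn(1, 2048, 20, 20)
p3, p4, p5 = TinyFPN()([c3, c4, c5])
```

Note how every level is squeezed to 256 channels by the 1x1 lateral convolutions; this is exactly the step the next paragraph criticizes.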
However, the structure of FPN is not optimal; it has two deficiencies. 1) FPN uses 1x1 convolution to reduce the number of channels in each feature map. Figure 2 shows the number of feature channels in each layer of the backbone: the channel numbers of feature maps C3, C4 and C5 are 512, 1024 and 2048, respectively, but all of them are uniformly reduced to 256 after the 1x1 convolution, so the features in these layers lose a lot of useful information. 2) The top-down structure of FPN pays more attention to the features of adjacent layers, so there is an obstacle to propagating features from high levels to low levels. Thus FPN-based detectors cannot obtain optimal results for small objects using the low-level features.
A practical and effective feature pyramid structure, the refined feature pyramid network (ReFPN), is designed in this paper to alleviate the deficiencies of FPN. ReFPN is mainly composed of two modules.
Module 1: First, the features of different layers of the backbone are convolved with 3x3 convolution kernels to obtain feature maps of different sizes. Then linear interpolation scales the high-level features up to the size of the lowest-level feature. Finally, the features are added element-wise to obtain the refined feature. Compared with the 1x1 convolution of FPN, the 3x3 convolution adopted here effectively increases the receptive field of the network, so this module obtains additional semantic features from the different feature layers, and these features are different from those of FPN.
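Module 1 can be sketched as follows. This is a minimal PyTorch illustration with randomly initialized convolutions and dummy backbone features; the real implementation lives in the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy backbone features C3-C5 with the channel numbers from the paper
c3 = torch.randn(1, 512, 80, 80)
c4 = torch.randn(1, 1024, 40, 40)
c5 = torch.randn(1, 2048, 20, 20)

# 3x3 convs (instead of FPN's 1x1) map every level to 256 channels
convs = nn.ModuleList(
    nn.Conv2d(ch, 256, 3, padding=1) for ch in (512, 1024, 2048))
m3, m4, m5 = (conv(x) for conv, x in zip(convs, (c3, c4, c5)))

# Upsample M4/M5 to the size of the lowest level M3, then sum element-wise
size = m3.shape[-2:]
refined = (m3
           + F.interpolate(m4, size=size, mode="bilinear", align_corners=False)
           + F.interpolate(m5, size=size, mode="bilinear", align_corners=False))
```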
Module 2: To integrate multi-level features and obtain rich semantic features, the refined feature is resized to the sizes of the multi-level features with max-pooling. The obtained features are then used to strengthen the original FPN features. In this way, the feature maps of the objects are transmitted among multiple feature layers, avoiding the obstacle to feature transmission. Through the above processing, ReFPN can effectively avoid the deficiencies of FPN and obtain better object features than FPN.
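Module 2 can be sketched similarly. The FPN outputs P3-P5 and the refined feature are replaced here by random tensors of the appropriate shapes, so this is only an illustration of the resize-and-add step.

```python
import torch
import torch.nn.functional as F

# Stand-ins: the refined feature (at the P3 size) and the FPN outputs
refined = torch.randn(1, 256, 80, 80)
p3 = torch.randn(1, 256, 80, 80)
p4 = torch.randn(1, 256, 40, 40)
p5 = torch.randn(1, 256, 20, 20)

def strengthen(refined, p):
    if refined.shape[-2:] == p.shape[-2:]:
        return refined + p        # R3: same size, direct element-wise sum
    # shrink the refined feature to this level's size with max-pooling
    return F.adaptive_max_pool2d(refined, p.shape[-2:]) + p

r3, r4, r5 = (strengthen(refined, p) for p in (p3, p4, p5))
```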
Object detection is mainly composed of two tasks: classification and localization. The classification task identifies the objects, and it tends to rely on the semantic information in the network: the richer the semantic information, the more accurate the object recognition. The localization task determines the positions of the objects and uses appropriate boxes to locate them. The positioning information relies more on the context information in the network: the richer the context information, the more accurate the positioning. From the above analysis, the information needed for classification and localization is different, so the correlation between classification and localization is low. However, during the inference stage, most current object detectors still use classification scores as the position scores in the non-maximum suppression (NMS) [23] stage, which degrades the accuracy of the model. To solve this problem, this paper designs the refined center-ness (RCenter-ness) branch, which takes the object context information as input to predict the position score of the prediction box.
Our main contributions are summarized as follows: 1) We review the FPN structure as well as the correlation between classification and location confidence in detail, and point out their shortcomings. Then we propose a novel one-stage detector, ReFPN-FCOS, as a practical and efficient framework that combines two novel modules: ReFPN and the RCenter-ness branch. 2) ReFPN is mainly used to fuse high-level and low-level object features and to enhance the semantic information of low-level features. 3) In the RCenter-ness branch, the predicted object position scoring branch is parallel to the regression branch, and the two branches share parameters. In order to obtain better object position information, the regression result is taken as an input of the object position scoring. 4) A hyperparameter is used to control and adjust the contribution of the positioning score.
ReFPN-FCOS is evaluated on MS COCO with various backbones and obtains significant improvements over state-of-the-art one-stage detectors. The average precision of ReFPN-FCOS with ResNet50 is 37.7%, higher than the 36.6% of the original FCOS; with ResNet101 it is 41.8%, higher than the 40.5% of the original FCOS.

II. RELATED WORK

A. OPTIMAL FPN FEATURE
At present, both one-stage and two-stage object detectors have achieved outstanding results. In order to obtain better object features, object detectors use modules that aggregate the features of different layers to generate distinctive feature representations and enhance detection performance. FPN spreads high-level semantic information to the low-level feature maps through a top-down structure, enriching the feature information of each layer and improving detector performance. Hu et al. [24] used a ResNet50 [25] with FPN as the backbone for feature extraction, achieving a learning-based image detection method for printed circuit board defect detection.
FPN is not the optimal structure, so many object detectors improved on it by aggregating multi-scale features. PANet [26] pointed out that low-level features are of great help to instance segmentation, but low-level features need to pass through a large number of convolutional layers to reach the high levels, which hinders the transmission of low-level location information. In order to shorten the feature transmission path, PANet added a bottom-up structure on top of FPN: features were first transmitted from the high levels to the low levels, and the resulting features were then transmitted from the low levels back to the high levels through the bottom-up structure. Object features were also extracted by fusing multi-layer convolutional feature maps [27]. Libra R-CNN [28] aggregated the features of all layers of the feature pyramid, then used the aggregated features to refine the FPN features. NAS-FPN [29] first used neural architecture search to find the optimal FPN structure. AugFPN [30] designed residual feature augmentation to enhance high-level features. Mask Refined R-CNN [31] improved the precision of Mask R-CNN [32] for instance segmentation by combining multi-scale information. Since fusion can improve detector performance, an analysis of the correlation of fused features showed that fusing low-correlation features can avoid redundancy and achieve more obvious improvement [33]. Guo et al. [34] designed a balanced attention FPN by fusing the multi-level feature maps of FPN to balance the original features. Li et al. [35] designed a dense path aggregation feature pyramid network to obtain rich semantic and context features.
In addition, some advanced and latest works were developed to enhance the robustness of object features, such as rotation-invariant property. Wu et al. [36] developed ORSIm detector to extract the rotation-invariant channel features in the frequency domain and the original spatial channel features. Wu et al. [37] proposed a multi-source active fine-tuning vehicle detection (Ms-AFt) framework to obtain vehicle feature. Xu et al. [38] used rotation-invariant features and channel features to construct more robust feature representation.
All of these works have different degrees of optimization for the feature maps. Unlike them, this paper designs a novel structure ReFPN, it takes into account both information loss caused by the reduction of feature channels and information transmission hindrance from high level to low level. Thus ReFPN can obtain better object feature to improve the accuracy.

B. FUSION OF CLASSIFICATION AND LOCATION
In recent years, many scholars have pointed out that there is a big difference between classification and localization, so it is very important to design a reasonable detection head or optimization strategy for an object detector. Double-Head R-CNN [39] found that using an FC-head and a Conv-head to predict classification and regression, respectively, achieves better results. IoU-Net [40] designed an Intersection-over-Union (IoU) prediction head that predicts the IoU of each candidate box; the predicted IoU guides NMS to better remove repeated prediction boxes. IoU-aware single-stage object detection [41] designed an IoU prediction branch that shares a detection head with the regression branch to predict the IoU between the proposal box and the ground-truth box. During inference, the predicted IoU is multiplied by the classification confidence to obtain the final position score used in NMS. Such fusion also requires a reasonable optimization strategy [42,43]. FCOS proposed the center-ness, which is used as the position score of the prediction box: during inference, the predicted center-ness weights the classification scores to obtain the final position scores. In this paper, the RCenter-ness branch is proposed, which is similar to the center-ness of FCOS. However, RCenter-ness uses the context information and the regression result as inputs to obtain the position score. This modification brings an obvious improvement in accuracy.

C. ANCHOR-FREE DETECTION
In recent years, anchor-free detection methods have attracted more and more attention; they avoid the hyperparameter selection of anchor-based methods and reduce the number of prediction boxes. FoveaBox [44] took the points in the object center area as positive samples and designed a regression strategy that maps the location of feature points to the corresponding prediction box. PolarMask [45] predicted 36 distances from each point on the feature map to the object contour, and proposed Polar center-ness for inhibiting low-quality positive samples. OYOLOv2_FTD [46] considered the entire image area for strong object detection and used a loss function to enhance effective learning. Sparse-YOLO [47] proposed a hardware/software (HW/SW) co-design methodology targeting CPU+FPGA-based heterogeneous platforms to achieve real-time object detection. Some anchor-free methods are based on key-point detection. Hei et al. [48] designed the corner pooling module, which predicts two key points of the object, the upper-left corner and the lower-right corner, and then uses predicted embedding vectors to pair the two corners. Zhou et al. [49] modeled an object as a single point, the center point of the ground-truth box, and used the network to learn the length and width of the object. In order to obtain a better heat map of key-point outputs, a Gaussian kernel function was applied to map the object key points to the heat map. These models have satisfactory detection performance. Therefore, ReFPN-FCOS is implemented in an anchor-free way. It takes the ground-truth boxes of the objects as prior information to generate prediction boxes, so the number of prediction boxes is smaller than those of anchor-based methods.

III. METHODOLOGY
The ReFPN-FCOS framework is shown in Figure 2. ReFPN designs a refined module based on FPN to fuse the features of the multi-layer feature maps; this module is in parallel with the FPN structure and optimizes the original FPN feature maps. In addition, we modify the center-ness branch so that it is parallel with the regression branch and shares the same input. The branch is used to predict the position scores of points in the ground-truth box. ReFPN-FCOS alleviates the problem of feature loss and improves localization accuracy. The details of each module are specified as follows.

A. REFPN
Inspired by AugFPN, we first revisit the deficiencies of FPN. As the network deepens, the number of channels in each feature map is twice that of the previous layer. In order to reduce the number of network parameters, FPN used 1x1 convolution to reduce the channel numbers of all feature maps of the backbone network to 256, which hindered the network from extracting sufficient object features. Because the channel number is greatly reduced, a large amount of semantic information is lost in the deep feature maps; meanwhile, the channel number of the shallow features is already small, so after the 1x1 convolution a large amount of information is lost there as well. At the same time, FPN used the top-down structure to transfer the deep features to the shallow features, so as to increase the semantic information of the lower features. However, the top-down structure of FPN pays more attention to the information of adjacent feature maps, which hinders top-down feature transmission and decreases the accuracy on small objects.
In order to solve these two problems, we propose ReFPN that has two main functions: 1) solve the problem of feature loss caused by the decrease of channel number in deep feature maps; 2) increase the semantic information of shallow feature maps and improve the detection accuracy of small objects. As shown in Figure 2, ReFPN is an additional branch on the basis of FPN, which makes up for the deficiency of FPN and facilitates the transfer of features at all levels.
At present, most object detectors adopt FPN to extract object features, and objects of different sizes are dispersed to different feature layers of FPN for detection. The deep feature maps are used to detect large objects, while the shallow feature maps are used to detect small objects. As shown in Figure 2, the feature map of the backbone network is Cl, where l represents the layer index of the feature map. For simplicity, the lowest and highest feature maps are denoted as lmin and lmax, respectively. The backbone network only retains {C3, C4, C5} as effective feature maps to reduce the number of network parameters and improve detection speed. As shown in Figure 3, in order to make up for the information loss caused by the reduction of channel number, the refined module performs 3x3 convolution on C3, C4 and C5, respectively, obtaining the corresponding feature maps M3, M4 and M5, each with 256 channels. Then linear interpolation increases the sizes of M4 and M5 to the same size as M3. Finally, the three feature maps are summed to obtain the refined feature. The refined feature is then down-sampled with max-pooling to the size of P4, and the obtained feature map is fused with P4 element-wise to obtain R4; R5 is obtained in the same way. Since P3 has the same size and channel number as the refined feature, R3 integrates the refined feature and P3 directly, without any extra processing, to reduce parameters. These operations address the two shortcomings of FPN. R6 is obtained by convolving R5 with a 3x3 convolution kernel with stride 2, and R7 is obtained by convolving R6 with a 3x3 convolution kernel with stride 2, so R6 and R7 can capture additional useful information.
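The extra levels R6 and R7 described above can be sketched as follows (random weights and a dummy R5; illustration only):

```python
import torch
import torch.nn as nn

# R6/R7 are produced from R5 by stride-2 3x3 convolutions
r5 = torch.randn(1, 256, 20, 20)
conv6 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
conv7 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
r6 = conv6(r5)   # spatial size halves: 20x20 -> 10x10
r7 = conv7(r6)   # halves again: 10x10 -> 5x5
```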

B. RCENTER-NESS
We first review FCOS for object detection. FCOS calculated the distances between the points on the feature map of each layer and the four sides of the ground-truth box of each object, and then distributed the objects to different feature maps for detection according to the pre-set distance ranges.
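The distance computation FCOS uses for its regression targets can be illustrated with a hypothetical helper (not FCOS's actual code): for a feature-map point inside a ground-truth box (x1, y1, x2, y2), the target is the 4D vector of distances to the four sides.

```python
# Hypothetical helper: distances (l, t, r, b) from a point to the four
# sides of a ground-truth box given as (x1, y1, x2, y2).
def fcos_target(px, py, box):
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

# point (50, 40) inside box (10, 20, 110, 90)
print(fcos_target(50, 40, (10, 20, 110, 90)))  # (40, 20, 60, 50)
```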
In order to reduce the number of anchor boxes, FCOS only took the points within the scope of the ground-truth box as positive samples, and only one bounding box was predicted from each point. In experiments, the positive samples far from the object center produced many low-quality predicted bounding boxes that seriously hampered detection accuracy, so FCOS designed the center-ness branch to predict the position scores of the prediction boxes. However, the center-ness branch was parallel to the classification branch and shared a detection head with it, which prevented the network from accurately predicting the position scores of the points in the ground-truth box; the predicted center-ness was therefore sub-optimal. We modify the center-ness branch and update it to RCenter-ness. As shown in Figure 4, RCenter-ness has two types, I and II. Both types achieve satisfactory accuracy, but RCenter-ness_II is better than RCenter-ness_I, as shown in the ablation experiments of Section IV. D. Thus RCenter-ness_II is selected as the network structure of RCenter-ness in ReFPN-FCOS.
RCenter-ness branch predicts the position score of the positive sample point. The closer the positive sample is to the object center, the higher the position score is, and vice versa.
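For reference, the center-ness target defined by FCOS encodes exactly this behavior: it equals 1 at the box center and decays toward 0 near the borders, computed as sqrt((min(l,r)/max(l,r)) · (min(t,b)/max(t,b))). A small illustration:

```python
import math

# FCOS center-ness target: 1 at the box center, decaying toward the borders
def centerness(l, t, r, b):
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(50, 30, 50, 30))            # exact center -> 1.0
print(round(centerness(10, 30, 90, 30), 3))  # far off-center -> 0.333
```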
As shown in Figure 2, the single-layer convolution branch RCenter-ness is parallel to the regression branch and shares the positioning feature map with it. To keep the branch efficient, the RCenter-ness prediction head uses only a single 3x3 convolution to obtain a position score feature map of size [N, 1, W, H]. To better predict the position score of each point within the ground-truth box, we also take the regression distances as an input of RCenter-ness. As shown in Figure 5, the regression distances are the distances between the points and the four sides of the ground-truth box of each object; these distances are convolved with a 1x1 kernel to obtain another position score feature map of the same size. RCenterness is the position score predicted by the network. In the training stage, we take RCenterness as the weight of the regression loss, which effectively allows the gradient from the RCenter-ness branch to be back-propagated to the regression head. The prediction boxes closer to the ground-truth center area get higher weights, which effectively suppresses the low-quality prediction boxes far from the object center. In the inference stage, ReFPN-FCOS produces a large number of prediction boxes, so we use NMS to eliminate the repeated ones. Traditional NMS uses the classification score as the position score of the prediction box, which is unreasonable. Therefore, we use the product of the classification score and RCenterness as the position score:

Sdet = Score × RCenterness^Ɛ  (4)

The parameter Ɛ controls the contribution of RCenterness. Sdet is used in the subsequent NMS. The ablation study in Section IV. D will determine the value of Ɛ.
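Eq. (4) can be illustrated numerically (a minimal sketch; the function name is hypothetical): a confident classification on a poorly localized box ends up with a lower final score than a slightly less confident but well-localized one.

```python
# Final position score used in NMS, per Eq. (4); names are illustrative.
def final_score(cls_score, rcenterness, eps):
    return cls_score * (rcenterness ** eps)

print(round(final_score(0.9, 0.2, 1.0), 2))  # 0.18: confident but badly placed
print(round(final_score(0.7, 0.9, 1.0), 2))  # 0.63: slightly less confident,
                                             # well localized -> ranked higher
```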

IV. EXPERIMENTS

A. DATASET AND EVALUATION METRICS
All our experiments are carried out on the MS COCO dataset, which contains a large number of complex daily scenes and covers common objects in daily life. MS COCO is a universally recognized dataset for object detection. In the result tables, the symbol * represents results that we re-implement on mmdetection, and the symbol # represents results given by mmdetection. Schedule represents the number of iterations during training: 1x and 2x respectively represent training the model for 12 and 24 epochs on the training set. MS denotes multi-scale training. We use AP (average AP at IoUs from 0.5 to 0.95 with an interval of 0.05), AP50 (AP at IoU=0.5) and AP75 (AP at IoU=0.75). We also use APS, APM and APL to measure the detector's average precision on small, medium and large objects.

B. TRAINING AND TESTING DETAILS
ReFPN-FCOS is modified based on FCOS. During training, as in FCOS, we use Focal Loss and IoU loss as the loss functions for classification and regression, respectively. For fair comparison, all experiments are implemented with PyTorch and mmdetection [50]. No artificial changes are made to the FCOS configuration file, and all related configurations are the same as in FCOS. However, since we only have two GPUs, the batch size is set to 10 (5 images per GPU) for ResNet50, with the learning rate set to 0.01 according to the linear scaling rule; the batch size is set to 6 (3 images per GPU) for ResNet101, with the learning rate set to 0.005. For the main results, all the modules are evaluated on COCO. The converged models provided by mmdetection are evaluated as the baselines. All of our experiments strictly follow the standard procedure, without any additional tricks.

C. MAIN RESULTS

To further illustrate the effectiveness of our approach, we compare recall rates. As shown in Table 2, ReFPN-FCOS significantly improves the recall rate of the proposal boxes with the same backbone. The performances for ARmax=1 are improved by 0.3% and 0.7% compared with the baselines, and the performances for ARmax=10 are improved by 0.7% and 1.1%. In addition, the ARmax=100 values of ReFPN-FCOS are 0.7% and 1.1% higher than those of the baselines. This shows that ReFPN-FCOS enhances the average recall of proposals. We find that the improvements on ResNet101 are more remarkable than those on ResNet50, because a larger backbone can better fit the objectives.

D. ABLATION EXPERIMENTS
ReFPN-FCOS designs two modules on the basis of FCOS. In order to analyze the importance of the two modules, experiments are conducted on each module. The experimental results are shown in Table 3. ReFPN and RCenter-ness improve the original FCOS to different degrees. Compared with the original FCOS, ReFPN increases AP by 0.4%, which is not a big improvement. However, ReFPN yields 1.1% higher APS than the original FCOS. This result shows that ReFPN extracts more representative features of small objects by aggregating high- and low-level features. The RCenter-ness branch brings 0.5% higher AP than the FCOS baseline, with improvements of different degrees on large, medium and small objects. Thus the RCenter-ness branch can obtain better position scores, indicating that the context feature of the object is more suitable for predicting the position information of the object box. From the above data, the two modules designed in this paper are effective. After combining the two modules, the average precision of ReFPN-FCOS is 1.1% higher than that of the FCOS baseline, indicating that the two modules of ReFPN-FCOS are complementary and jointly improve detector performance. Ablation experiments on RCenter-ness are shown in Figure 4. We design two implementation schemes for the RCenter-ness branch. In Figure 4(a), RCenter-ness_I uses the context feature as input, while in Figure 4(b) RCenter-ness_II uses the context feature and the regression result as inputs. The main difference between the two designs is whether to add the regression result. The experimental results of the two types are shown in Table 4: RCenter-ness_II obtains better performance, improving AP by 0.3% compared with RCenter-ness_I. The result indicates that adding the regression result better promotes the detector's prediction of position scores.
In addition, for medium and large objects, the performance of RCenter-ness_II is better than that of RCenter-ness_I. However, the regression distances of small objects are too small, so the advantage of RCenter-ness_II is not remarkable for small objects. Ablation experiments on the relation between RCenterness and the classification score are shown in Table 5. Influenced by YOLOv1, most current detection algorithms with an IoU branch or center-ness branch generally use the product of the classification score and the center-ness or IoU as the final position score of the prediction box in the inference stage, and then use it to remove redundant boxes in NMS. This strategy is also adopted in this work for fair comparison. In order to verify the influence of RCenterness, the hyperparameter Ɛ is set to different values. As Ɛ increases from zero, the average precision improves. This is because when Ɛ is small, the weight RCenterness^Ɛ becomes close to 1, so a large number of low-quality predicted bounding boxes are retained. When Ɛ is in [0.6, 1], the average precision does not change. The result indicates that RCenter-ness can accurately predict the object position score.
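The effect of small Ɛ can be seen with a quick calculation: for a low-quality box with, say, RCenterness = 0.2, the weight RCenterness^Ɛ stays close to 1 when Ɛ is small, so the box is barely down-weighted in NMS.

```python
# RCenterness = 0.2 models a low-quality box far from the object center
low_quality = 0.2
for eps in (0.0, 0.2, 0.6, 1.0):
    print(eps, round(low_quality ** eps, 3))
# weights: 1.0, 0.725, 0.381, 0.2 -- small eps barely suppresses the box
```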
Visual detection results are shown in Figure 6. To further evaluate the performance of ReFPN-FCOS and its generalization ability, we visualize images detected by ReFPN-FCOS and FCOS. The images include multiple scenes and are randomly selected from the Internet. The backgrounds of these images are complex, there is a lot of occlusion, and small objects take up a large proportion. We also use a photograph of elks to examine false detections. As shown in Figure 6, the first and second columns show the detection results of ReFPN-FCOS and FCOS, respectively. In the first and third rows, FCOS obviously misses many objects, while ReFPN-FCOS detects more objects and handles severely occluded objects well. However, in both ReFPN-FCOS and FCOS, the confidences of small objects are very low due to inadequate small-object features, so feature alignment could alleviate this shortage. In the second row, FCOS misses the persons in the bottom right corner of the image but identifies the sheep; ReFPN-FCOS identifies the person in the bottom right corner but mistakenly detects the sheep there as a cow. The images in the fourth row are particularly interesting: we use the network to identify a category that does not exist in the COCO dataset. ReFPN-FCOS has fewer false detections than FCOS, and its positioning is more accurate. Through the above analysis, the detection performance of ReFPN-FCOS is better than that of FCOS in complex scenes; furthermore, ReFPN-FCOS has more accurate positioning and higher confidences, and can detect more small and occluded objects.

V. CONCLUSIONS
In this paper, we review one-stage and two-stage detection algorithms and find that the potential of current one-stage detection is not fully exploited, mainly for two reasons. The first is insufficient feature extraction. The second is that using the classification score as the positioning score of the prediction box in the inference stage is unreasonable. Thus a more efficient detection algorithm, ReFPN-FCOS, is proposed. ReFPN-FCOS employs two novel modules, namely ReFPN and RCenter-ness, to alleviate the two aforementioned problems, and therefore achieves substantial improvement. The object features extracted by ReFPN can still be further optimized, and there is still a gap between the predicted score and the real position score. In the future, we will use feature alignment to get better object features and explore a better way to predict object position scores.