Multi-level Refinement Feature Pyramid Network for scale imbalance object detection

Object detection is challenging because of the diversity of object scales. Modern object detectors generally use a feature pyramid to learn multi-scale representations for better results. However, current feature pyramids handle scale imbalance poorly, because they integrate semantic information across different scales inefficiently. In this paper, we reformulate feature pyramid construction as a feature reconfiguration process. We propose a novel detection network, the Multi-level Refinement Feature Pyramid Network (MRFPN), which combines high-level features (i.e., semantic information), middle-level features, and low-level features (i.e., boundary information) in a highly nonlinear yet efficient manner. In particular, we propose a novel contextual features module (chained parallel pooling) that consists of global attention and lightweight local reconfigurations and efficiently gathers task-oriented contextual features across different scales and spatial locations. To evaluate the proposed model, we designed and trained a robust end-to-end single-stage detector, MRFDet, by integrating MRFPN into a conventional SSD model; it achieves better detection performance than most recent single-stage object detectors. In particular, MRFDet achieves an AP of 45.2 on MS-COCO and improves mAP by 4.5% on VOC compared to conventional SSD. We are releasing the source code of MRFDet to facilitate further research.


I. INTRODUCTION
Object detection becomes more challenging as the scale of object instances varies [1][2][3]. To the best of our knowledge, two strategies have been devised to address this challenge. The first strategy detects objects on an image pyramid (i.e., a series of resized copies of the input image) [1]. Because of its computational complexity and memory requirements, this solution can only be exploited during testing, and it dramatically reduces the efficiency of the detector. The second strategy detects objects on a feature pyramid built from the input image [3,4]. It can be exploited in both the training and testing phases because of its low memory requirements and computational cost. Furthermore, the feature pyramid module can be effectively incorporated into deep networks to create an end-to-end solution. Although object detectors based on pyramidal construction [3][4][5][6] yield promising results, they still have limitations, because the feature pyramid is generated from the intrinsic multi-scale pyramidal architecture of a backbone that was designed for the object classification task. For example, STDN [7] applies pooling and scale-transfer operations to the final block of DenseNet to create a feature pyramid. Feature Pyramid Network (FPN) [3] fuses shallow- and deep-layer features in a top-down manner to construct the feature pyramid. SSD [4] builds a feature pyramid from the last two layers of the backbone network and four additional prediction layers obtained by stride-2 convolutions. MLFPN [8] builds a multi-level feature pyramid to detect multi-scale objects using multiple layers of the base network.
These feature pyramid models have two main limitations. First, feature maps generated from single-level layers of a backbone designed for classification are not sufficiently descriptive for the object detection task. Second, a multi-level feature pyramid can produce more descriptive feature maps, but it adds significant computational complexity. Low-level features extracted from shallow layers are not very descriptive but help with object localization, whereas high-level features extracted from deeper layers are useful for classification. High-level features suit objects with complex appearance, while low-level features suit objects with simple appearance. In practice, objects of similar size can differ greatly in appearance; for example, a distant person has a more complicated appearance than a traffic light of similar size. Thus, each feature map in a pyramid built from single-level information, which serves a specific range of object sizes, can yield sub-optimal results.
The intuition behind this work is to use information from the middle layers, which is expected to describe mid-level representations of object parts while retaining spatial information, along with shallow and deep features. Middle-level features complement both the low-level features of the initial convolutional layers (which encode basic visual geometry such as edges, circles, lines, and corners) and the high-level features that encode category-level evidence for object detection. Features of all levels are advantageous for object recognition: higher-level features are used to classify the object, while low-level features help to localize it. The most effective way to use middle-level features for object detection is still an open question.
This work aims to build a more powerful multi-level feature pyramid for recognizing object instances of various scales with less computational effort, avoiding the constraints of the existing methods described above. As depicted in Fig. 2, we first merge the multi-level features extracted from several layers of a backbone network such as VGG-16 with the base features, and feed them into a block of alternating residual convolution units (RCUs) to obtain more representative multi-scale, multi-level features. We then gather the feature maps of the same scale to build the final feature pyramid. Finally, the constructed feature pyramid is passed through the contextual features module to capture contextual information from a large image region. Each feature map in the resulting feature pyramid therefore contains information from multiple levels. For this reason, we call the proposed pyramid-building network MRFPN (Multi-level Refinement Feature Pyramid Network).
In this paper, we design and train a practical end-to-end single-stage detector to assess the proposed Multi-level Refinement Feature Pyramid Network. We call the model MRFDet (Multi-level Refinement Feature Detector), as it integrates the multi-level, multi-scale feature network (MRFPN) with the architecture of SSD [4]. MRFDet achieves an Average Precision of 45.2 on MS-COCO [9], outperforming one-stage detectors, and an improvement of 8.5% on the PASCAL VOC07/12 benchmark datasets. The main contributions of this work are summarized as follows:
1) We propose a multi-level refinement feature pyramid network (MRFPN) for object detection with less computational complexity. It exploits features from multiple levels and recursively refines the shallow features to generate middle-level and deeper feature maps.
2) For the first time, chained parallel pooling is used during the construction of the feature pyramid to add robustness and to capture contextual information from a large image region, followed by prediction layers for object detection. For this purpose, the features are efficiently pooled with several window sizes and merged with learnable weights and residual connections.
3) These design choices result in effective training and strong recognition performance even when the input images are not high-resolution, further improving the trade-off between accuracy and speed.
4) With qualitative and quantitative observations, we show that MRFPN yields a significant improvement over conventional SSD and M2Det. MRFDet achieves state-of-the-art performance on both PASCAL VOC 07/12 and MS-COCO.

II. RELATED WORK
The sliding-window paradigm has a long and rich history, beginning with the use of convolutional networks to recognize handwritten digits. The introduction of the boosted object detector [10], integral channel features [11], and HOG [12] led to more effective methods for face detection and pedestrian detection. Sliding-window approaches dominated classic computer vision until the resurgence of deep learning. In this section, we discuss the differences and similarities between our model and previous work in detail.

A. ANCHOR-BASED OBJECT DETECTION
Anchor-based object detection frameworks fall into two groups: two-stage, proposal-based approaches and one-stage, proposal-free approaches. The two-stage approach is the dominant paradigm in present-day object detection. Selective Search [13] pioneered the two-stage design. In the first stage, a sparse set of candidate proposals is generated that should contain all objects while filtering out most negative locations. In the second stage, the proposals are classified into foreground classes and background. R-CNN [14] achieved a significant gain in accuracy by replacing the second-stage classifier with a convolutional network. R-CNN has improved in speed and accuracy over the years [2,15], in part by using learned object proposals [16][17][18]. The Region Proposal Network (RPN) integrates proposal generation with the second-stage classifier into a single convolutional network, forming Faster R-CNN [17]. Various works have been proposed to enhance its performance, including architecture redesign and reform [3,6,19,20,21], attention and context mechanisms [22], modified training strategies and loss functions [20,23], and feature fusion and enhancement [24,25]. In two-stage detectors, proposals are predicted using anchors as regression references and classification candidates. Such models achieve the highest accuracy but are usually slow. Owing to their high computational efficiency, one-stage anchor-based detectors have attracted much attention. OverFeat [26] was the first modern one-stage detector based on deep learning. SSD [4,27] directly predicts anchor box offsets and categories by spreading anchor boxes over multi-scale layers within a ConvNet.
Recently, many works have been proposed to boost its performance in different aspects, such as multi-scale feature fusion [27,28], training from scratch [29], new loss functions [5], anchor matching and refinement [30,31], and feature enrichment [32,33]. One-stage detectors use anchors as reference boxes for the final predictions.

B. ANCHOR-LESS EXPLORATIONS
The best-known anchor-less detector is arguably YOLOv1 [34], which takes a 448 × 448 input image and outputs a 7 × 7 grid of cells for box prediction. YOLOv1 suffers from low recall because it predicts each bounding box from a single point [35]. As a result, anchors are used to ensure high recall in YOLO9000 [35] and YOLOv3 [36]. Because of difficulties in detecting objects at multiple scales, some of these detectors were considered unsuitable for generic object detection [37]. DenseBox [38], an anchor-less detector, uses an image pyramid to detect objects at multiple scales, which takes several seconds to process one image. RepPoints [39] uses deformable convolution to create more precise features and represents an object as a set of sample points. FSAF [40] adopts the anchor-free paradigm and selects the best feature level to train each instance. FCOS [37] uses a per-pixel prediction strategy and relies on a centerness map to suppress poor-quality detections. CenterNet [41] represents each object through the features at its center point. CornerNet [42] uses the Associative Embedding technique [43] to detect the bounding box of an object as a pair of keypoints. Cascaded and central pooling is used to improve recall and precision in CornerNet. FoveaBox [44] proposes a technique in which the final class probability is predicted directly by assigning objects to multiple adjacent pyramid levels.

C. FEATURE PYRAMID NETWORK/ MULTI-LEVEL FEATURE PYRAMID
Effectively representing multi-scale features has always been a main hurdle to improving detection accuracy in object recognition. Most previous detection approaches make predictions from a pyramidal feature hierarchy extracted from the backbone network. As far as we know, two main strategies have been used to deal with scale variance. The first strategy is the featurized image pyramid (i.e., the input image at various sizes and resolutions), which is used to produce multi-scale semantic features. Predictions are made separately on these features and then combined into the final prediction. Features from multi-scale images significantly improve recognition accuracy and localization precision compared to single-scale image features, as in [20] and SNIP [1]. Despite the gain in performance, featurized image pyramids have not gained popularity in the A.I. community and are not practical for real-time applications because of their drastically high time and memory requirements. To remedy this problem, SNIP [1] uses featurized image pyramids only during the testing phase. In contrast, other models such as Fast R-CNN [15] and Faster R-CNN [17] do not use this strategy by default. The second strategy generates a feature pyramid for object detection, i.e., it extracts features from multiple layers of the network using a single-scale image. This method is considerably more cost-effective than the image pyramid approach in terms of computation and memory, and it can be used in real-time applications in both the training and testing phases. In addition, it is flexible and fits easily into state-of-the-art detectors based on deep neural networks. As one of the pioneering works, Feature Pyramid Network (FPN) [3] introduces a top-down pathway and lateral connections to generate a feature pyramid that balances accuracy and speed.
Following this idea, PANet [45] adds an extra bottom-up path aggregation network on top of FPN; STDN [7] exploits cross-scale features with a scale-transfer module; M2Det [46] proposes a U-shaped module to fuse multi-scale features; and G-FRNet [47] uses gate units to control the information flow across features. Most recently, NAS-FPN [48] uses neural architecture search to design the feature network topology automatically. However, the search in NAS-FPN requires thousands of GPU hours and yields an irregular feature network that is difficult to interpret. EfficientDet [49] proposes a weighted bidirectional feature network with a customized compound scaling method for multi-scale feature fusion. Libra R-CNN [50] proposes a balanced design made of simple components, i.e., a balanced feature pyramid, balanced L1 loss, and IoU-balanced sampling, to address the imbalance in the training process. The different flavors of feature pyramid are shown in Fig. 1. In contrast, the purpose of this work is to understand whether single-stage detectors can match or exceed the precision of two-stage detectors at similar or faster speeds.
The proposed network works in three steps. 1) The RCUs create a set of multi-scale features that enrich the semantic information in the base feature and the feature maps of the backbone layers. 2) The FSM assembles these features into a multi-level feature pyramid using a scale-wise feature concatenation operation. 3) The CFM captures background contextual information from a large area of the image; it pools the features with multiple window sizes and fuses them using learnable weights. Finally, as in SSD, the prediction layers produce dense bounding boxes and category scores from the learned features, followed by a non-maximum suppression (NMS) operation to obtain the final prediction.

A. MULTI-LEVEL REFINEMENT FEATURE PYRAMID NETWORK (MRFPN)
As mentioned above, this scheme generates multi-level feature maps in order to detect objects at different scales. It builds a multi-level feature pyramid by merging low-level features with medium- and high-level features. It has three basic blocks, i.e., RCU, FSM, and CFM, as shown in Fig. 2. First, VGG-16 [51] generates base features that carry multi-level semantic information for MRFPN. The first block of MRFPN comprises a stack of RCUs that uses the base feature and the feature maps of four backbone (VGG-16) layers (i.e., Conv1_3 to Conv4_3) to generate low-, medium-, and high-level feature maps of different scales. In the first RCU, the base feature and the feature maps of Conv1_3 and Conv2_3 are used to generate low-level features. In the second RCU, the output of the first RCU is combined with the feature map of the Conv3_3 layer to generate medium-level features. Similarly, high-level features are generated from the output of the second RCU and the feature map of the Conv4_3 layer. Note that the first RCU has no prior knowledge from any other RCU and therefore learns only from the base features ($X_{bf}$).

RCU
Here, $X_{bf}$ denotes the base feature, $x_i^l$ denotes the features with the $i$-th scale in the $l$-th RCU, $L$ denotes the number of RCUs, $RC_l$ denotes the processing of the $l$-th RCU, and $X_{Conv(i+1)\_3}$ denotes the feature map of layer Conv(i+1)_3 of the backbone network. The feature standardization module is the second block of the feature pyramid network. It uses a scale-wise concatenation operation to up-sample and aggregate the multi-scale features. Finally, chained pooling is used to capture contextual background information from a large image region. The entire network is trained end-to-end.

B. MRFDet
The architecture of MRFDet is illustrated in Fig. 2. The VGG-16 backbone network produces base features that the RCU stack exploits to produce multi-scale feature maps. The multi-level feature fusion block concatenates the feature maps scale-wise to generate the feature pyramid. The prediction layers and the NMS operation are applied to the MLFP as in SSD, except that chained parallel pooling is applied at the beginning of the prediction layers block. Our architecture is flexible and requires less computational effort, which makes it easier to adapt to real-time applications.

1) RESIDUAL CONVOLUTIONAL UNIT (RCU)
The residual convolutional unit is the first module of MRFPN. It uses a set of adaptive convolutions to refine the base features and the feature maps of the backbone network and generates multi-level features. The RCU uses an optimized version of the basic ResNet [19] block without batch normalization layers. The detailed architecture of the RCU is shown in Fig. 3. It has two branches: the top branch comprises two Conv layers with 1 × 1 and 3 × 3 filters and stride 2, with non-linearity added through pooling and ReLU (R) operations, and the other branch has one 1 × 1 Conv layer, as shown in Fig. 2(a). The refined feature maps of the backbone layers, i.e., $X_{Conv1\_3}$ and $X_{Conv2\_3}$, are merged with the output of the top branch of the RCU to construct the low-level features, as described in Eq. (2). The medium-level and deep-level features are built from $X_{Conv3\_3}$ and $X_{Conv4\_3}$, respectively, together with the output of the previous RCU block, as described in Eqs. (3) and (4). The filter of the corresponding backbone layer determines the filter size for each input path. Moreover, an additional 1 × 1 Conv layer is used to improve the learning ability and to keep the features smooth [52]. In this way, the multi-scale features of the current level are generated.
$$X_L = R(\mathrm{Conv}_{1\times1}(X_{Conv1\_3})) + R(\mathrm{Conv}_{3\times3}(X_{Conv2\_3})) + \mathrm{Conv}_{3\times3}(\mathrm{pool}(\mathrm{Conv}_{1\times1}(R(X_{bf})))) \tag{2}$$
$$X_M = R(\mathrm{Conv}_{3\times3}(X_{Conv3\_3})) + \mathrm{Conv}_{3\times3}(\mathrm{pool}(\mathrm{Conv}_{1\times1}(R(X_L)))) \tag{3}$$
$$X_H = R(\mathrm{Conv}_{3\times3}(X_{Conv4\_3})) + \mathrm{Conv}_{3\times3}(\mathrm{pool}(\mathrm{Conv}_{1\times1}(R(X_M)))) \tag{4}$$

2) FEATURES STANDARDIZATION MODULE
The second block of MRFPN is the feature standardization module (FSM), which aggregates the refined features generated by the RCUs into a multi-level feature pyramid, as shown in Fig. 2(b) and Fig. 4. Initially, features with equivalent scales are concatenated along the channel dimension. The resulting features can be represented as $X = (X_1, X_2, X_3, \ldots, X_i)$, where

3) CONTEXTUAL FEATURES MODULE
The multi-level feature pyramid is fed to the contextual features module (CFM) to generate more robust multi-level contextual features, as shown in Figs  An alternate architecture of the chained parallel pooling block is shown in Fig. 5(b). It modifies the architecture shown in Fig. 5(a) by interchanging the positions of the pooling layer and the convolution layer in the parallel pooling block. The convolution layer then learns from the input features and weighs their importance before they are fed to the pooling layer. In our observation, this variant sometimes performs slightly better than the original architecture on some datasets.
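The chained parallel pooling idea can be illustrated with a minimal Python sketch: a single-channel feature map is pooled with several window sizes in parallel, the pooled maps are merged with scalar weights, a residual connection adds the input back, and several such blocks are chained. The window sizes (3, 5, 7), the fixed weight values, the choice of stride-1 max pooling, and all function names are assumptions made for this sketch only; in the network the weights would be learned, and the convolution layers of Fig. 5 are omitted here.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window (clipped at the borders).

    The output has the same height and width as the input.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def parallel_pooling_block(fmap, windows, weights):
    """Pool with several window sizes, merge with weights, add the residual."""
    h, w = len(fmap), len(fmap[0])
    pooled = [max_pool_same(fmap, k) for k in windows]
    return [[fmap[i][j] + sum(wt * p[i][j] for wt, p in zip(weights, pooled))
             for j in range(w)]
            for i in range(h)]


def chained_parallel_pooling(fmap, windows=(3, 5, 7),
                             weights=(0.5, 0.3, 0.2), num_blocks=4):
    """Chain several parallel pooling blocks (assumed defaults, see lead-in)."""
    out = fmap
    for _ in range(num_blocks):
        out = parallel_pooling_block(out, windows, weights)
    return out


if __name__ == "__main__":
    toy = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
    for row in chained_parallel_pooling(toy, windows=(3,), weights=(1.0,),
                                        num_blocks=1):
        print(row)
```

Because every block keeps the spatial size, blocks can be chained freely; each additional block enlarges the image region that contributes to a given location.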

4) OBJECTIVE LOSS FUNCTION
To handle multiple object categories, the MRFDet training objective is derived from the multi-box objective [16,54]. Let $x_{ij}^{p} \in \{1, 0\}$ be an indicator for matching the $i$-th default box to the $j$-th ground-truth box of class $p$. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):
$$L(x, c, p_{box}, g_{box}) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, p_{box}, g_{box})\right) \tag{5}$$
where $N$ is the number of matched default boxes; the loss is zero if $N = 0$. The localization loss is the smooth L1 loss [55] between the parameters of the predicted box ($p_{box}$) and the ground-truth box ($g_{box}$). Similar to Faster R-CNN [17], we regress offsets for the center $(cx, cy)$ of the default bounding box ($b_{box}$) and for its width ($w$) and height ($h$).
The weight term α is set to 1 by cross-validation.
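A minimal Python sketch of the objective in Eq. (5) is given below for illustration. The offset encoding follows the usual SSD / Faster R-CNN convention (center offsets normalized by the default box size, log-ratios for width and height), and the confidence term is assumed to be a softmax cross-entropy; both are assumptions of this sketch, as are the function names and the simplification that only matched default boxes contribute to the confidence loss.

```python
import math


def smooth_l1(x):
    """Smooth L1 loss for a single residual x."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_offsets(gt_box, default_box):
    """Regression targets for boxes given as (cx, cy, w, h) (assumed encoding)."""
    g_cx, g_cy, g_w, g_h = gt_box
    d_cx, d_cy, d_w, d_h = default_box
    return ((g_cx - d_cx) / d_w,
            (g_cy - d_cy) / d_h,
            math.log(g_w / d_w),
            math.log(g_h / d_h))


def localization_loss(pred_offsets, gt_box, default_box):
    """Smooth L1 between predicted offsets and encoded ground-truth offsets."""
    target = encode_offsets(gt_box, default_box)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))


def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy (assumed confidence loss)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def detection_loss(matches, alpha=1.0):
    """Eq. (5): (L_conf + alpha * L_loc) / N over the matched default boxes.

    matches: list of (pred_offsets, gt_box, default_box, class_logits, label).
    Returns 0.0 when there is no match (N = 0).
    """
    n = len(matches)
    if n == 0:
        return 0.0
    l_loc = sum(localization_loss(p, g, d) for p, g, d, _, _ in matches)
    l_conf = sum(softmax_cross_entropy(lg, lb) for _, _, _, lg, lb in matches)
    return (l_conf + alpha * l_loc) / n


if __name__ == "__main__":
    match = ((0.1, 0.0, 0.0, 0.0), (5.0, 5.0, 2.0, 2.0),
             (5.0, 5.0, 2.0, 2.0), [2.0, 0.5], 0)
    print(detection_loss([match], alpha=1.0))
```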

5) NETWORK CONFIGURATION
We assemble MRFDet with the VGG-16 backbone. The backbone is pre-trained on the ImageNet 2012 dataset [56] and is used to train the entire network. The default configuration of MRFPN contains three RCUs, each with two branches except the first RCU. The first branch has two Conv layers with 1 × 1 and 3 × 3 filters and stride 2, with non-linearity incorporated through ReLU and pooling operations, so it produces multi-scale features.
The other branch has only a 1 × 1 Conv layer with a ReLU function. To reduce the number of parameters and facilitate network training on the GPU, we use a Conv filter size of less than 1024. We use the same input sizes as conventional SSD, RefineDet, and RetinaNet, i.e., 300 and 512.
The last stage of MRFPN is the contextual features module, which comprises parallel pooling units that form a chain. The output of the contextual features module is fed to the original SSD prediction layers. We place six anchors with a total of three aspect ratios at each pixel of the pyramidal features. A probability score of 0.05 is then used as the threshold to filter out most low-scoring anchors. For more accurate boxes, post-processing is performed using soft non-maximum suppression (soft-NMS) with a linear kernel [57].
Lowering the threshold to 0.01 can yield better detection results; however, it significantly slows down inference, so we do not adopt it, in the interest of practical value. The resulting model is fine-tuned with SGD using an initial learning rate of $10^{-3}$, a momentum of 0.9, a weight decay of 0.0005, and a batch size of 32. The learning-rate and weight-decay policies differ slightly for each dataset. The complete training and testing code has been developed in TensorFlow.
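The post-processing step described above (score filtering followed by soft-NMS with a linear kernel) can be sketched in plain Python as follows. This is an illustrative sketch rather than the released implementation: the IoU threshold of 0.5, the box format (x1, y1, x2, y2), and the function names are assumptions; only the score threshold of 0.05 and the linear kernel come from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.05):
    """Soft-NMS with a linear kernel.

    Instead of discarding a box that overlaps the current best box by at
    least iou_thresh (an assumed value), its score is multiplied by
    (1 - IoU). Boxes whose score falls below score_thresh are dropped.
    Returns a list of (box, score) in the order they were selected.
    """
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][0])
        best_score, best_box = candidates.pop(best)
        kept.append((best_box, best_score))
        rescored = []
        for s, b in candidates:
            overlap = iou(best_box, b)
            if overlap >= iou_thresh:
                s *= 1.0 - overlap  # linear decay of the score
            if s >= score_thresh:
                rescored.append((s, b))
        candidates = rescored
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (0, 2, 10, 12), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    for box, score in soft_nms_linear(boxes, scores):
        print(box, round(score, 3))
```

A lower `score_thresh` keeps more candidates alive through the loop, which is why the run time grows when the threshold is reduced.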

C. EXPERIMENTAL RESULTS
To demonstrate the effectiveness of our approach, we carry out comprehensive experiments on two benchmark datasets for generic object detection, MS-COCO and PASCAL VOC 07/12. PASCAL VOC07 and VOC12 include 20 categories in 9,963 and 22,531 images, respectively. On the PASCAL dataset, we compare with Fast R-CNN and Faster R-CNN [17]. The pre-trained VGG-16 backbone is used to fine-tune the model. Localization and class confidence scores are predicted using the MRFF block, the pooling block, and the prediction layers (i.e., Conv8_2, Conv9_2, Conv10_2, and Conv11_2). We set a learning rate of 0.0001 for the initial 5k iterations and then keep training for the next 30k iterations with learning rates of $10^{-4}$ and $10^{-5}$. We use COCO-

1) IMPLEMENTATION DETAIL
To analyze the performance of the proposed MRFDet, training starts with a warm-up strategy of 5k epochs with a learning rate of 0.0001; the learning rate is then gradually decreased to $10^{-4}$ and $10^{-5}$ at 15k epochs, and training ends at 30k epochs. MRFDet is developed on the TensorFlow platform, with an input size of 300 × 300 and a batch size of 32. Experiments are carried out on an NVIDIA GeForce RTX 2060 with CUDA 10.1 and cuDNN 7.5.0 and a memory data rate of 14.00 Gbps. Training MRFDet with VGG-16 takes four days for an input size of 300 × 300 and one week for 512 × 512.

MS-COCO:
In Table 1 [31] acquires the benefits of one- and two-stage detectors. CornerNet [42] uses keypoint regression to detect objects and borrows the advantages of Hourglass [59] and focal loss [5], and thus reaches an AP of 42.1. In contrast, our proposed MRFDet is based on the original SSD regression method with multi-scale, multi-level features and reaches 45.2 AP, which is higher than any other single-stage detector. Because the models use different strategies and tools, they are compared by the speed of single-scale inference. Furthermore, the performance gain of MRFDet is not entirely due to the depth of the model or the number of parameters. Compared with modern one- and two-stage detectors, CornerNet with Hourglass has 201M parameters and Mask R-CNN [6] (ResNet-101-32x8d-FPN [60]) has 205M parameters, whereas our model has 146M parameters.
PASCAL VOC2007/2012:
The test images from PASCAL VOC 07/12 are used to compare the performance of our proposed model with the most advanced one- and two-stage detectors, as shown in Table 2. The standard configuration of MRFDet contains three RCUs with a feature standardization module and a contextual features module, as shown in Fig. 2. Conventional SSD prediction layers are used to compute the confidence scores and localization. The "Xavier" method [61] is used to initialize the parameters of all prediction layers. Using the L2 normalization technique [62], the feature norm at each location in the feature map is scaled to 20, and the scale is learned during backpropagation. Our model is trained on PASCAL VOC 07/12 and further fine-tuned on MS-COCO trainval35k to achieve better results. To assess the class confidence scores and locations correctly, we cannot use the accuracy metric of image classification, because in object detection each image contains multiple objects of multiple categories.
On the PASCAL VOC 07/12 test set, mAP is used as the evaluation index for accuracy, and frames per second (FPS) is used as the evaluation index for real-time detection. Precision and recall for each class are used to draw the P-R curve. The formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{mAP} = \frac{\sum_{i=1}^{N} AP(i)}{N} \tag{12}$$
where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, $AP(i)$ is the average precision of class $i$, and $N$ is the number of classes. FPS is defined as the number of images a detector processes in one second; a rate above 24 FPS is considered smooth. As is clear from Table 2, the low-resolution model MRFDet-300 already performs better than Fast R-CNN. When the input size is increased to 512 × 512 for further training, the model is even more accurate and outperforms Faster R-CNN by 2.7% mAP. If we train MRFDet-300 with additional data (i.e., 07+12), it is already 2.1% better than SSD and Faster R-CNN, and MRFDet-512 is 3.6% better. We obtain our best result, 84.6% mAP, after fine-tuning the model on MS-COCO trainval35k. MRFDet is particularly sensitive to the size of bounding boxes and, owing to the multi-level feature pyramid with several scales, offers significantly better performance on small objects, because the MLFP carries semantic information for these small objects that helps to detect them.
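For illustration, the evaluation measures discussed above can be sketched in Python as follows. The sketch assumes that the detections of one class are already sorted by descending confidence and flagged as true or false positives, that the class has at least one ground-truth object, and that AP is computed with all-point interpolation of the P-R curve; the interpolation variant used in our experiments is not restated here, and the function names are illustrative only.

```python
def precision_recall_curve(is_true_positive, num_ground_truth):
    """Cumulative precision and recall over detections sorted by confidence.

    is_true_positive: list of booleans, one per detection (True = TP).
    num_ground_truth: number of ground-truth objects of the class (> 0).
    """
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the P-R curve with all-point interpolation (assumed variant)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    p, r = precision_recall_curve([True, False, True], num_ground_truth=2)
    print(p, r, average_precision(p, r))
```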

3) ABLATION STUDIES ON MS-COCO
In this section, we examine the effect of each module of MRFDet on detection performance. We try three different designs of the RCU in MRFPN. First, a simple RCU design improves the AP by up to three points, as shown in the third section of Table 3. Adding a conv layer with a 3 × 3 filter to the first branch of each RCU raises the improvement to 3.1 points. Finally, adding a conv layer to the second branch of the RCU delivers the best result, 30.9 AP. Although additional conv layers can improve detection accuracy, redundantly using the base feature in each RCU increases the number of parameters. For this reason, features from different layers of the backbone network are used to construct the low-, medium-, and high-level features, while the necessary location information is obtained through the embedded base feature. The Feature Standardization Module (FSM) improves all scoring measures, as shown in the seventh column of Table 3. Next, we analyze the effect of the proposed contextual features module on recognition performance. Each additional parallel pooling block improves performance significantly, and we observe the best results with four consecutive blocks. A stronger backbone also provides a noticeable AP gain; for example, using ResNet-101 [19] instead of VGG-16 yields an AP gain of 2%, as shown in Table 3.

5) SPEED
When comparing the inference speed of MRFDet with recent models, we find that the reduced version of VGG-16 [51] (without F.C. layers) speeds up base feature extraction. The inference time per image is the sum of the CNN time and the NMS time over 1000 images, divided by 1000, with the batch size set to 1. Specifically, we assemble MRFDet with the reduced VGG-16 and propose a fast version with input size 320 × 320 and a standard, more accurate version with input size 512 × 512. Taking advantage of the proposed MRFPN framework and the one-stage design, MRFDet significantly improves the speed-accuracy curve compared to other advanced methods. MRFDet achieves precise detection results at high speed with the help of TensorFlow optimizations. The speeds of SSD321-ResNet101, SSD513-ResNet101, M2Det-VGG16, RefineDet512-ResNet101, RefineDet320-ResNet101, and CornerNet are tested on our device for a fair comparison. We conclude that MRFDet performs far better in terms of accuracy and efficiency.
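The timing protocol described above (batch size 1, CNN time plus NMS time summed over the test images and divided by their number) can be sketched as follows. The callables `run_cnn` and `run_nms` are hypothetical placeholders for the network forward pass and the post-processing step; they are not part of any library, and the stand-in workload in the usage example is arbitrary.

```python
import time


def average_inference_time(images, run_cnn, run_nms):
    """Average per-image inference time in seconds at batch size 1.

    images:  iterable of inputs, processed one at a time.
    run_cnn: hypothetical callable for the network forward pass.
    run_nms: hypothetical callable for the NMS post-processing.
    """
    cnn_time = 0.0
    nms_time = 0.0
    count = 0
    for image in images:
        t0 = time.perf_counter()
        raw = run_cnn(image)
        t1 = time.perf_counter()
        run_nms(raw)
        t2 = time.perf_counter()
        cnn_time += t1 - t0
        nms_time += t2 - t1
        count += 1
    return (cnn_time + nms_time) / count


def frames_per_second(avg_seconds_per_image):
    """Convert an average per-image time into frames per second."""
    return 1.0 / avg_seconds_per_image


if __name__ == "__main__":
    # Stand-in workload: 1000 dummy "images" and trivial placeholder stages.
    dummy_images = range(1000)
    avg = average_inference_time(dummy_images,
                                 run_cnn=lambda img: [img] * 100,
                                 run_nms=lambda raw: sorted(raw))
    print("avg seconds per image:", avg)
    print("FPS:", frames_per_second(avg))
```

On a GPU, the forward pass should be synchronized before the timestamp is taken; otherwise the asynchronous kernel launch makes the CNN time look shorter than it is.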

IV. DISCUSSION
According to our observations, the detection accuracy of MRFDet improves mainly because of the proposed MRFPN and contextual features module. First, we merge the multi-layer feature maps and the backbone base features through alternating blocks of RCU and MLFF modules to extract more robust multi-scale, multi-level features. Then, the constructed features are fed into the CFM to minimize the parameters and add robustness to object detection. In contrast, existing detectors [3,27,31] rely only on increasing the depth of the backbone or adding extra layers. Our method therefore achieves superior detection performance. In particular, the multi-level, multi-scale features handle the variation in appearance complexity across object instances better. The proposed MRFPN can learn effective features for detecting objects with large variations in appearance and scale. For example, consider an input image that contains persons, vehicles, and traffic signals of different sizes. Some of the findings are as follows: 1) A larger person has a higher and stronger activation value in the feature map than a smaller one. 2) The small person, the traffic signal, and the vehicle have substantial activation values in feature maps of the same scale. 3) The person, the vehicle, and the traffic signal have strong activation values in the feature maps of the highest level, the middle level, and the lowest level, respectively. From these observations, the proposed method can effectively learn sensitive features to deal with variations in scale and appearance complexity across object instances. Multi-level, multi-scale features are essential for identifying objects of comparable size but different appearance.

V. CONCLUSION
A multi-level refinement feature pyramid network is proposed to identify objects with different scales and complex appearances. The proposed strategy consists of three modules. The first module uses a stack of three residual convolutional units to construct multi-scale, multi-level features from the feature maps of the backbone network and the base features (i.e., constructed from VGG-16). The second module standardizes the multi-level features, after up- or down-sampling them to a similar scale, into a feature pyramid. Finally, the contextual features module strengthens the resulting feature pyramid for object identification. We achieve significantly improved scores compared to other single-stage detectors on the MS-COCO dataset (i.e., 45.2 AP with a multi-scale inference strategy). The effectiveness of the proposed architecture is demonstrated by the results of the ablation studies. However, there is still room for improvement in detection; for example, a GAN could be used to reconstruct high-resolution deep features, or the up-sampling layer could be optimized using interpolation techniques.