Integrated Feature Pyramid Network With Feature Aggregation for Traffic Sign Detection

Traffic sign detection is a critical task in the visual system of the Advanced Driver Assistance System (ADAS) and the Automated Driving System (ADS). Although general object detection has achieved promising results with the Feature Pyramid Network (FPN) in recent years, we observe that FPN cannot obtain satisfactory results in traffic sign detection because the size and class distributions of traffic signs are extremely unbalanced. To overcome this problem, this paper proposes a novel Plug-and-Play neck network, the Integrated Feature Pyramid Network with Feature Aggregation (IFA-FPN), based on the statistical characteristics of traffic signs. First, a lightweight operation is introduced to fully utilize the model and improve its inference speed. Second, an Integrated Operation (IO) is introduced to solve the imbalance of Regions of Interest (RoIs) across pyramid levels. Third, a Feature Aggregation (FA) structure is introduced to strengthen the representation capacity of the feature maps, thereby enhancing the robustness of the network against the size discrepancy of traffic signs. Experiments are performed on three mainstream datasets: the German Traffic Sign Detection Benchmark (GTSDB), the Swedish Traffic Sign Dataset (STSD), and the Tsinghua-Tencent 100k dataset (TT100k). The experimental results demonstrate the superiority of the proposed IFA-FPN in traffic sign detection. Specifically, when IFA-FPN is applied to Cascade RCNN, it achieves 80.3% mAP on GTSDB, surpassing FPN by 9.9%; 65.2% mAP on STSD, surpassing FPN by 3.5%; and 93.6% mAP on TT100k, surpassing FPN by 1.6%.


I. INTRODUCTION
With the development of driver-assistance systems and autonomous vehicles, Traffic Sign Detection (TSD) has been heavily studied over the past decade. A suitable traffic sign detection system helps vehicles perceive the surrounding environment. In the Advanced Driver Assistance System (ADAS), the traffic sign detection system reminds drivers of traffic constraints. In the Automated Driving System (ADS), in addition to perceiving the surrounding environment, the traffic sign detection system can provide traffic sign locations to the vehicle navigation system, where they can serve as distinct landmarks for generating a High Definition Map (HD Map).
The appearance of traffic signs is designed to attract human attention easily and quickly. Methods [1]-[6] utilize the appearance characteristics of traffic signs to extract better features in the feature extraction step. These hand-crafted feature-based methods are not robust enough to distinguish real signs from fake ones in the real world, because many objects look similar to traffic signs. It is hard to represent the distinguishing characteristics of traffic signs with low-level hand-crafted features.
(The associate editor coordinating the review of this manuscript and approving it for publication was Sudhakar Radhakrishnan.)
Thanks to the development of deep learning, object detection using Convolutional Neural Networks (CNNs) has made remarkable achievements. CNN-based architectures such as Fast R-CNN [7], Faster R-CNN [8], Cascade R-CNN [9], Single Shot MultiBox Detector (SSD) [10], and YOLO [11] have become mainstream detectors. Unlike general objects, traffic signs are relatively small. As shown in Fig. 2, most traffic signs in the Swedish Traffic Sign Dataset (STSD) [5], [6] and the Tsinghua-Tencent 100k dataset (TT100k) [12] are smaller than 100 pixels in scale, which means that most traffic signs occupy no more than roughly 0.8% of an image in STSD and roughly 0.2% in TT100k. Detecting small objects is more challenging than detecting large ones because a CNN extracts deeper semantic features through multiple levels of convolution and pooling. As a result of these operations, small objects survive only in the shallow layers, whose features lack deep semantic information and are therefore not powerful enough for complex traffic scenes.
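The area fractions quoted above can be checked with a few lines of arithmetic. The sketch below assumes a square sign of 100 × 100 pixels and the image resolutions given in the dataset description (1280 × 960 for STSD, 2048 × 2048 for TT100k); the helper name `area_fraction` is introduced here only for illustration.

```python
def area_fraction(side, width, height):
    """Fraction of a width x height image covered by a side x side square sign."""
    return (side * side) / (width * height)


# Assumption: a 100 x 100 pixel sign is used as the upper bound on sign size.
stsd = area_fraction(100, 1280, 960)      # STSD images are 1280 x 960
tt100k = area_fraction(100, 2048, 2048)   # TT100k images are 2048 x 2048

print(f"STSD:   {stsd:.2%}")    # prints "STSD:   0.81%"
print(f"TT100k: {tt100k:.2%}")  # prints "TT100k: 0.24%"
```

Under these assumptions the bounds come out slightly above the rounded figures in the text, so the quoted percentages are best read as approximate.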
To obtain deeper semantic features in shallow layers, many works [13]-[18] use the Feature Pyramid Network (FPN) [19] or a feature fusion architecture to merge deep and shallow feature layers. Detection models without FPN extract features using only the bottom-up pathway $C_i$, so strong semantic features exist only in the deep layers $C_4$ and $C_5$. FPN merges the feature maps of a top-down pathway $P_i$ with lateral connections $L_i$ to build the high-level semantic feature maps $P_2$-$P_5$ used for the subsequent predictions. As a result, the shallow feature layers $P_2$-$P_4$ contain semantic features as strong as those of the deep feature layers $C_5$, $P_5$, and $P_6$.
The FPN [19] is designed for general objects and achieves promising results in general object detection. However, traffic signs are relatively small and their size distribution is unbalanced, and we observe that FPN cannot achieve satisfactory performance in traffic sign detection. To design a more suitable neck network for traffic sign detectors, we analyze the statistical characteristics of traffic signs, including their size distribution and the usage of the pyramid levels $P_2$-$P_5$ in the Region-of-Interest (RoI) Alignment step. The architecture of the proposed Integrated Feature Pyramid Network with Feature Aggregation (IFA-FPN) is shown in Fig. 1. The IFA-FPN is designed based on the following three ideas:
1) We found that deep pyramid levels play a minor role in traffic sign detection, and therefore we remove the deep layers to reduce inference time.
2) We found that dispersedly mapping RoIs into different pyramid levels is unsuitable for traffic signs because the usage of the pyramid levels is extremely unbalanced, which leads to weak generalization ability of the infrequently used levels. To solve this problem, we propose an Integrated Operation (IO) that integrates all RoIs into a specific pyramid level. Although [19] demonstrates that dispersed mapping of RoIs helps general object detectors, we aim to demonstrate that integrated mapping of RoIs is more suitable for traffic sign detection.
3) In FPN [19], the pyramid level $P_2$ only needs to represent features of traffic signs whose size lies in (0, 112). In IFA-FPN, $P_2$ needs to represent traffic signs of all sizes because of the proposed integrated operation, so enhancing the feature representation capability of $P_2$ is necessary. Therefore, a Feature Aggregation (FA) structure is introduced to strengthen the feature representation capability of $P_2$ by aggregating features from different depths, so as to enhance the robustness of the network against size discrepancy.
The contributions of this work are summarized as follows:
1) To overcome the size and class imbalance of traffic signs, we propose an IO that integrates RoIs of all scales into a single pyramid level. To the best of our knowledge, this paper is the first to show that integrated mapping of RoIs improves traffic sign detection performance.
2) To better represent data with large variance in size, we propose the FA structure, which increases the feature representation capacity of a layer by attaching FA before that layer.
4) Comprehensive experiments are conducted to evaluate the proposed method on three mainstream datasets: GTSDB [20], STSD [5], [6], and TT100k [12]. The proposed method achieves superior performance on the STSD and TT100k datasets. Specifically, it obtains 80.3% mAP on GTSDB, 65.2% mAP on STSD, and 93.6% mAP on TT100k.
The remainder of this paper is organized as follows. Section II reviews the related work. Section III describes the proposed method. Section IV introduces the datasets, the evaluation metrics, and the experimental details and analyses. Finally, Section V concludes this paper.

II. RELATED WORK
Because traffic signs are designed to attract human attention easily and quickly, they have regular shapes and highly saturated colors. Traditional methods [1]-[4] utilize these appearance characteristics to extract better features. Reference [1] uses early visual features, namely the red, green, and blue channels of the input image, to create three sets of feature maps (color-pair opponency maps, center-surround difference maps, and local orientation maps), which provide robust features for the subsequent classifier. Building on the feature extraction method of [1], [2] further proposes enhanced color-pair opponency maps based on the categories of traffic signs. After obtaining robust features, traditional methods apply various classifiers to them in pursuit of a robust detector. Traditional methods have two shortcomings. First, hand-crafted features are not robust enough to distinguish traffic signs in the real world. Second, running a complex feature extractor and classifier is time-consuming.
Deep learning-based traffic sign detectors have made huge progress because they address both shortcomings well. They use a CNN to learn useful and generalized features automatically by training on a large number of images. Reference [21] first adopts a fully convolutional network (FCN) [22] to obtain potential traffic sign regions and then uses a CNN to classify each region. It achieves good performance, but the FCN makes the computation expensive. Lu et al. [14] propose two sub-networks for traffic sign detection. First, an Attention Proposal Modeler (APM) finds attention regions that are likely to contain traffic signs. Then, an Accurate Locator and Recognizer (ALR) localizes and classifies the traffic signs in these regions. The computation cost is low because the high-resolution images are resized to a lower resolution in the APM, but the recall is not satisfactory. Subsequently, a popular way to improve the detection of small traffic signs has been to combine shallow and deep feature maps. Yuan et al. [15] propose a multi-resolution conv-deconv feature fusion network that connects convolution and de-convolution layers to enlarge the feature maps and obtain higher-level semantic features simultaneously. Tian et al. [16] propose a multi-scale recurrent attention network that includes a multi-scale attention module and a recurrent attention module. Like [15], [16] obtains multi-scale feature maps through de-convolution, which is time-consuming. Instead of de-convolution, Tabernik and Skočaj [18] adopt FPN to generate high-resolution feature maps by up-sampling. In addition, [18] extends the detector with several improvements, including data augmentation and Online Hard-Example Mining (OHEM) [23].
Because the size and class imbalance in traffic sign datasets is extreme, the above methods [2], [5], [15], [16], [18], [21] did not use the complete datasets for evaluation. References [5], [15], [18], [21] only aim to detect large and visible traffic signs; for example, they only consider the visible traffic signs in STSD that are at least 50 × 50 pixels. Reference [16] classifies traffic signs into superclasses rather than classes. Specifically, [16] divides the traffic signs in GTSDB [20] into four superclasses (prohibitory signs, mandatory signs, danger signs, and others), although GTSDB provides 43 classes in total. We consider classifying traffic signs into superclasses impractical for real-world applications because traffic signs in the same superclass still carry different information, such as "speed limit 20" and "speed limit 80". References [2], [5] only consider six main classes, and [18], [21] consider ten classes in STSD, although STSD provides 20 classes in total. These inconsistencies in the evaluation protocols make comparison difficult.
In this paper, based on the distribution characteristics of traffic signs, we propose IFA-FPN by modifying the existing FPN [19] structure so that it is suitable for extracting traffic sign features. The proposed IFA-FPN is a Plug-and-Play neck network that can be applied to mainstream object detectors to improve their performance. For comprehensive experiments, IFA-FPN is evaluated on three mainstream traffic sign detection datasets. For a fair comparison, all results are reproduced with the OpenMMLab Detection Toolbox and Benchmark (MMDetection) [13] under the same hardware environment on our local computer. The details of the datasets are described in Section IV.

III. PROPOSED METHOD
The proposed IFA-FPN is designed based on the FPN structure, but it addresses what FPN cannot do well in traffic sign detection. In this section, the feature pyramid structure for feature extraction is described first. Then, we introduce the motivation and details of the proposed Integrated Operation (IO). Finally, three types of multi-scale Feature Aggregation (FA) structures are described.

A. FEATURE PYRAMID STRUCTURE
Detectors such as Fast RCNN [7] and Faster RCNN [8] without FPN only use the features of the bottom-up pathway $C_i$ to predict classes. They cannot achieve satisfactory performance because the semantic features in the shallow layers of $C_i$ are weak.
To enhance network performance, the top-down pathway $P_i$ and the lateral connections $L_i$ are built to generate high-level semantic feature maps, and prediction is performed on the generated maps $P_i$. With the help of $P_i$ and $L_i$, the shallow feature layers $P_2$ and $P_3$ contain semantic features as strong as those of the deep feature layer $P_4$, as shown in Fig. 1. The top-down feature $P_i$ is computed from $P_{i+1}$ and $C_i$ by the $i$-th fusion operation $F_i$, which consists of three steps, shown in the top right of Fig. 1. The first step up-samples $P_{i+1}$. The second step merges the up-sampled $P_{i+1}$ with the lateral information by element-wise addition. The third step processes the merged feature maps with a 1 × 1 convolution layer. $P_{I+1}$ is a stride-two down-sampling of $P_I$. $L_i$ denotes the $i$-th lateral connection, which is either a 1 × 1 convolution layer, Conv 1 × 1(·), or, for $L_2$, the feature aggregation module FA(·) described in Sec. III.C.
Compared with the original FPN, the deep pyramid levels $C_5$, $P_5$, and $P_6$ are removed in the proposed IFA-FPN to fully utilize the model and improve its inference speed, for two reasons. First, deep pyramid levels play a minor role in the feature extraction step: deeper features cannot provide more accurate and useful information about traffic signs because traffic signs are relatively small, as shown in Fig. 2. Second, deep pyramid levels also play a minor role in the RoI Alignment step. The usage of the pyramid levels $P_2$-$P_5$ in the RoI Alignment step of the original FPN is reported in Fig. 3: in STSD, 0.01% of RoIs are mapped into $P_4$ and $P_5$, and in TT100k, 0.02% of RoIs are mapped into $P_4$ and $P_5$. Considering the trade-off between accuracy and efficiency, the deep pyramid levels are not used in the proposed IFA-FPN.
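The three-step fusion described above can be sketched on toy data. The block below is a minimal illustration, not the authors' implementation: it assumes the relation $P_i = F_i(P_{i+1}, L_i(C_i))$, single-channel feature maps stored as nested lists, scalar 1 × 1 convolution weights, nearest-neighbour up-sampling, and a top level produced by its lateral connection alone. All function names are hypothetical, and a plain 1 × 1 convolution stands in for FA at level 2.

```python
def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a single-channel map (list of rows)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def conv1x1(x, weight=1.0):
    """Single-channel 1x1 convolution, i.e. a per-pixel scaling."""
    return [[weight * v for v in row] for row in x]


def fuse(p_next, lateral, weight=1.0):
    """F_i: up-sample P_{i+1}, add the lateral map element-wise, apply a 1x1 conv."""
    up = upsample2x(p_next)
    merged = [[a + b for a, b in zip(r_up, r_lat)]
              for r_up, r_lat in zip(up, lateral)]
    return conv1x1(merged, weight)


def top_down(c_maps, laterals):
    """Build {level: P_i} from {level: C_i} and {level: lateral function L_i}."""
    levels = sorted(c_maps, reverse=True)      # deepest level first
    top = levels[0]
    # Assumption: the top level has no P_{i+1} above it and uses its lateral only.
    p = {top: laterals[top](c_maps[top])}
    for i in levels[1:]:
        p[i] = fuse(p[i + 1], laterals[i](c_maps[i]))
    return p


# Toy maps for the three levels kept in IFA-FPN (C_2, C_3, C_4).
c_maps = {
    2: [[0.0] * 4 for _ in range(4)],
    3: [[1.0, 1.0], [1.0, 1.0]],
    4: [[4.0]],
}
# Stand-in laterals: level 2 would use the FA module instead of a 1x1 conv.
laterals = {2: conv1x1, 3: conv1x1, 4: conv1x1}
p = top_down(c_maps, laterals)
```

Each level doubles the spatial resolution of the one above it, so the information in the 1 × 1 map at level 4 reaches every position of the 4 × 4 map at level 2.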

B. THE INTEGRATED OPERATION (IO)
In FPN, RoI Alignment is performed on different pyramid levels according to the scale of each RoI. An RoI is assigned to the pyramid level $P_{\{i|i=k\}}$ according to its scale $s$, which is computed from $w$ and $h$, the width and height of the RoI in the input image. Here $k_0$ is the bottom level to which an RoI can be mapped, with the default $k_0 = 2$ in FPN [19]. Specifically, an RoI with $s < 112$ is mapped to $P_{\{i|i=2\}}$, and an RoI with $112 \le s < 224$ is mapped to the lower-resolution level $P_{\{i|i=3\}}$.
FPN achieves promising results in general object detection but cannot achieve satisfactory performance in traffic sign detection, because dispersedly mapping RoIs into the feature pyramid is unsuitable for this task. To solve this problem, the Integrated Operation (IO) is proposed to integrate all RoIs from the different pyramid levels into a single high-resolution pyramid level, $P_2$. The usage of the pyramid levels $P_2$-$P_5$ in the RoI Alignment step is reported in Fig. 3.
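The difference between dispersed and integrated mapping can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact formula: the scale is taken as $s = \sqrt{wh}$, and the level is computed as $k = \lfloor k_0 + \log_2(s/56) \rfloor$ clamped to $[2, 5]$, where the constant 56 is chosen here only because it reproduces the thresholds 112 and 224 stated above. The function names and the sample RoI sizes are hypothetical.

```python
import math
from collections import Counter


def fpn_level(w, h, k0=2, k_max=5, finest_scale=56.0):
    """Dispersed mapping: choose a pyramid level from the RoI scale s = sqrt(w*h).

    Assumption: finest_scale=56 is picked so that s < 112 maps to P_2 and
    112 <= s < 224 maps to P_3, matching the thresholds quoted in the text.
    """
    s = math.sqrt(w * h)
    k = k0 + math.floor(math.log2(s / finest_scale + 1e-6))
    return max(k0, min(k_max, k))  # clamp to the available levels


def io_level(w, h, target_level=2):
    """Integrated Operation: every RoI is pooled from the same level (P_2)."""
    return target_level


# Hypothetical RoI sizes (width, height) in input-image pixels.
rois = [(20, 24), (60, 50), (110, 100), (130, 120), (250, 240)]
dispersed = Counter(fpn_level(w, h) for w, h in rois)
integrated = Counter(io_level(w, h) for w, h in rois)
```

With dispersed mapping the few large RoIs end up alone on the deeper levels, whereas the integrated mapping sends every RoI, large or small, to $P_2$.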
The IO module has two advantages for detecting traffic signs. First, integrating large RoIs into $P_2$ improves the generalization ability of $P_2$. Compared with small traffic signs, large traffic signs provide better features. The distribution of traffic sign quality in STSD, reported in Fig. 2(a), shows that many small traffic signs are blurred, whereas most large traffic signs are clearly visible. In addition, during feature extraction, small traffic signs lose information more easily than large ones as depth increases because of the max-pooling operator. Thus, integrating the RoIs of large traffic signs into $P_2$ can provide more accurate and useful information, thereby improving small traffic sign detection. Second, IO eliminates the impact of the low generalization ability of $P_3$. Fig. 3 indicates that only a small fraction of RoIs are mapped to $P_3$, which leads to weak prediction ability of $P_3$ because of the lack of training samples.

C. THE LIGHT MULTI-SCALE FEATURE AGGREGATION (FA) STRUCTURE
To make the IFA-FPN work as expected, enhancing the feature representation capability of $P_2$ is necessary, because in IFA-FPN not only the small RoIs ($s < 112$) but also the large RoIs ($s \ge 112$) are assigned to $P_2$. To help $P_2$ better represent data with large variance in size, we convert the lateral connection layer $L_2$ into the proposed feature aggregation structure, which aggregates multi-scale features from different convolutional paths. Three types of feature aggregation structures, Full-FA, Shared-FA, and Light-FA, are introduced step by step.
The baseline lateral connection used in FPN, illustrated in Fig. 4(a), is a 1 × 1 convolution layer. Instead of a single 1 × 1 layer, the proposed Full-FA has a residual shape, containing an identity path $y_0$ to learn simple features and a mapping function $F(\cdot)$ to learn complex features. As shown in Fig. 4(b), given the feature maps $C_2$, the Full-FA learns a residual of $C_2$ with the mapping function $F(\cdot)$, which comprises several convolution operations. Moreover, to achieve multi-scale feature learning, we design $F(\cdot)$ as a multi-stream building block consisting of the convolution streams $y_1$, $y_2$, and $y_3$. The first stream contains a 1 × 1 layer that learns single-scale features (scale = 1). The second stream consists of one 1 × 1 layer and one 3 × 3 layer to learn larger-scale features (scale = 3). The third stream further increases the receptive field (scale = 5) by adding another 3 × 3 layer. Element-wise summation is then applied to aggregate $C_2$, $y_1$, $y_2$, and $y_3$ into the multi-scale feature map $P_2$.
Next, we introduce the Shared-FA and Light-FA illustrated in Fig. 4(c) and Fig. 4(d). Veit et al. [24] reveal that the paths in residual networks show ensemble-like behavior and do not strongly depend on each other. In other words, a residual-shaped architecture can be seen as a collection of different paths. Inspired by [24], we further propose Light-FA to reduce the inference time of the model, because Full-FA is heavy and time-consuming. Light-FA is designed with a residual shape to reproduce the ensemble-like behavior of Full-FA. Its structure is shown in Fig. 4(d). Light-FA is equivalent to Full-FA if the convolutional kernels in Full-FA share weights as in Fig. 4(c).
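Written out from the description above, the Full-FA streams can be sketched as below. This is a reconstruction from the prose, not a copy of the paper's equations. The superscript $(k)$, which marks a layer belonging to stream $k$ with its own weights, is notation introduced here for illustration; $f_1$ is a 1 × 1 convolution and $f_2$, $f_3$ are 3 × 3 convolutions.

```latex
\begin{aligned}
y_0 &= C_2, \\
y_1 &= f_1^{(1)}(C_2), \\
y_2 &= f_2^{(2)}\bigl(f_1^{(2)}(C_2)\bigr), \\
y_3 &= f_3^{(3)}\bigl(f_2^{(3)}\bigl(f_1^{(3)}(C_2)\bigr)\bigr), \\
P_2 &= y_0 + y_1 + y_2 + y_3 .
\end{aligned}
```

One 3 × 3 layer sees a 3 × 3 neighbourhood and two stacked 3 × 3 layers see a 5 × 5 neighbourhood, which is where the scales 1, 3, and 5 of the three streams come from.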
The Shared-FA is expressed in terms of three shared layers, where $f_1$ denotes the first 1 × 1 convolutional layer, $f_2$ denotes the 3 × 3 convolutional layer, and $f_3$ denotes the other 3 × 3 convolutional layer. The Light-FA is expressed with the same three layers. From Eq. (5), Eq. (6) and Eq. (7), it is clear that Light-FA is equivalent to Shared-FA in its convolution operations. Note that the ReLU activations after each convolutional layer are omitted from the notation in Fig. 4. Finally, Light-FA is built to obtain $P_2$ with high representation capacity. In this paper, FA defaults to Light-FA.
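The equivalence claimed above can be checked numerically on a toy example. The sketch below assumes that Shared-FA computes $y_1 = f_1(C_2)$, $y_2 = f_2(f_1(C_2))$, $y_3 = f_3(f_2(f_1(C_2)))$ with shared layers, and that Light-FA chains them as $y_2 = f_2(y_1)$, $y_3 = f_3(y_2)$; in both cases the output is $C_2 + y_1 + y_2 + y_3$. Feature maps are single-channel nested lists, and all names and weights are hypothetical.

```python
def conv1x1(x, w):
    """Single-channel 1x1 convolution: per-pixel scaling by w."""
    return [[w * v for v in row] for row in x]


def conv3x3(x, k):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += k[dr + 1][dc + 1] * x[rr][cc]
            out[r][c] = acc
    return out


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def add(*maps):
    """Element-wise sum of several maps of the same shape."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def make_layers(w1, k2, k3):
    """Three shared layers f1 (1x1), f2 (3x3), f3 (3x3), each followed by ReLU."""
    f1 = lambda x: relu(conv1x1(x, w1))
    f2 = lambda x: relu(conv3x3(x, k2))
    f3 = lambda x: relu(conv3x3(x, k3))
    return f1, f2, f3


def shared_fa(c2, f1, f2, f3):
    """Three parallel streams that recompute the shared layers (6 conv calls)."""
    y1 = f1(c2)
    y2 = f2(f1(c2))
    y3 = f3(f2(f1(c2)))
    return add(c2, y1, y2, y3)


def light_fa(c2, f1, f2, f3):
    """One chain whose intermediate outputs are reused (3 conv calls)."""
    y1 = f1(c2)
    y2 = f2(y1)
    y3 = f3(y2)
    return add(c2, y1, y2, y3)


c2 = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0], [2.0, 1.0, 4.0]]
k2 = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
k3 = [[0.1, 0.1, 0.1], [0.1, 0.2, 0.1], [0.1, 0.1, 0.1]]
layers = make_layers(0.5, k2, k3)
same = shared_fa(c2, *layers) == light_fa(c2, *layers)
```

Because the layers are deterministic and shared, the two functions return identical maps; the only difference is that the chained version evaluates each convolution once, which is the source of the savings attributed to Light-FA.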

IV. EXPERIMENTS
A. DATASETS AND EVALUATION METRICS
To carry out the comprehensive experiments, the proposed IFA-FPN is evaluated on three mainstream traffic sign detection datasets: GTSDB [20], STSD [5], [6] and TT100k [12]. The detailed information of datasets is reported in Table 1.
GTSDB: The German Traffic Sign Detection Benchmark provides 600 images and 815 traffic signs for training, and 300 images and 353 traffic signs for testing. The image size in GTSDB is 1360 × 800. There are 43 classes in total.
STSD: The Swedish Traffic Sign Dataset [5], [6] provides 6617 images and 6651 labeled traffic signs in total. It contains two sets (set1 and set2) of images with resolution 1280 × 960 that were captured from Swedish highways and city roads. Each set contains 5 parts and has 20% labeled images. Set1Part0 is used for training, and Set2Part0 is used for testing in this paper. There are 20 classes in total.
TT100k: TT100k is a large-scale traffic sign detection dataset released by Tsinghua University and Tencent Corporation. Compared with GTSDB and STSD, TT100k provides a large number of images with a high resolution of 2048 × 2048. The number of traffic signs in TT100k is 19.9 and 3.5 times that in GTSDB and STSD, respectively.
As in COCO [25], the scale $s$ of an object is used to separate three size groups, Small ($s < 32$), Medium ($32 \le s < 96$), and Large ($s \ge 96$), when reporting detection results, so that the impact of IFA-FPN at different scales can be analyzed. The scale $s$ is computed as
$$s = \sqrt{w \times h},$$
where $w$ and $h$ are the width and height of a traffic sign, respectively. The mean average precision (mAP), which is commonly used as the evaluation criterion for object detection datasets [25], [26], is used as the evaluation measure in this paper. Average Precision (AP) is the area under the precision-recall curve, which reliably describes the trade-off between precision and recall. AP is calculated for one object class, and mAP is the average of AP over all considered classes. In this paper, a fixed Intersection-over-Union (IoU) threshold of 0.5 is used for computing mAP.
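The size groups and the AP computation can be sketched as follows. This is a simplified illustration, not the evaluation code used in the experiments: the scale is assumed to be $\sqrt{wh}$, the area under the precision-recall curve is computed with all-point interpolation (one common convention, which the text does not specify), and the matching of detections to ground truth at IoU ≥ 0.5 is assumed to have been done already. All function names are hypothetical.

```python
import math


def size_group(w, h):
    """Assign a sign to the Small / Medium / Large group by s = sqrt(w*h)."""
    s = math.sqrt(w * h)
    if s < 32:
        return "small"
    if s < 96:
        return "medium"
    return "large"


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (score, is_true_positive); a detection is a true
    positive if it matches an unused ground-truth box with IoU >= 0.5.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p   # area of one recall step
        prev_recall = r
    return ap


def mean_ap(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, `average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)` evaluates a class with two ground-truth signs and three ranked detections, one of which is a false positive.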

B. IMPLEMENTATION DETAIL
All experiments are run on a desktop with an Intel Core i5-6600 3.30-GHz CPU and one NVIDIA GeForce Titan 1080Ti GPU with 11 GB of memory. MMDetection [13] is used to implement the experiments and evaluate the results. We follow the default pre-processing techniques and hyperparameters of MMDetection, i.e., the data augmentation techniques and anchor settings, for all experiments on the three datasets. The backbone network, ResNet-50 [27] by default, is pre-trained on ImageNet [29] to extract features. Stochastic Gradient Descent (SGD) with a momentum of 0.9 is employed. For GTSDB and STSD, the network is trained for 20 epochs with a learning rate of 0.01 for the first 15 epochs and 0.001 for the remaining epochs, and the training batch size is 2. For TT100k, the network is trained for 10 epochs; the initial learning rate is 0.001, which decreases to 0.0001 at the 8th epoch, and the training batch size is 1 because of the limited GPU memory.
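The training schedules above amount to a single-step learning-rate decay, which can be written down compactly. The sketch below is plain Python rather than an MMDetection configuration; the dictionary layout and the name `step_lr` are hypothetical. It also assumes one reading of "at the 8th epoch", namely that epochs 1 to 7 use the initial rate and epochs 8 to 10 use the reduced one.

```python
def step_lr(epoch, base_lr, last_base_epoch, gamma=0.1):
    """Learning rate for a 1-based epoch index under a single-step decay."""
    return base_lr if epoch <= last_base_epoch else base_lr * gamma


# Settings taken from the text; SGD momentum is 0.9 in all cases.
SCHEDULES = {
    "GTSDB":  {"epochs": 20, "base_lr": 0.01,  "last_base_epoch": 15, "batch_size": 2},
    "STSD":   {"epochs": 20, "base_lr": 0.01,  "last_base_epoch": 15, "batch_size": 2},
    # Assumption: the drop takes effect from epoch 8 onward.
    "TT100k": {"epochs": 10, "base_lr": 0.001, "last_base_epoch": 7,  "batch_size": 1},
}


def lr_per_epoch(name):
    """List of learning rates, one per epoch, for the named dataset."""
    cfg = SCHEDULES[name]
    return [step_lr(e, cfg["base_lr"], cfg["last_base_epoch"])
            for e in range(1, cfg["epochs"] + 1)]
```

Calling `lr_per_epoch("GTSDB")` returns twenty values, fifteen at the initial rate followed by five at one tenth of it.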

C. ABLATION STUDY
In this section, we first analyze the influence of the individual components of IFA-FPN, IO and FA, in terms of mAP and Frames Per Second (FPS). Then, an ablation study on the effect of removing more pyramid feature layers is performed. After that, the three types of FA structure, Full-FA, Shared-FA, and Light-FA, are compared in Table 4 in terms of mAP, FPS, and GPU memory usage. Subsequently, the effect of each skip connection in FA is reported. Finally, we report the performance of IFA-FPN with different backbone networks.

1) THE EFFECT OF INDIVIDUAL COMPONENTS
We evaluate the performance and efficiency of the individual components, IO and FA, in terms of mAP and FPS on the GTSDB, STSD, and TT100k datasets; the results are summarized in Table 2. The top part of Table 2 reports the experiments with Faster RCNN as the detector, and the bottom part reports the results with Cascade RCNN. We show the effect of IO and FA by adding them to the baseline model one by one. Baseline denotes a detector that adopts the original FPN [19] as the neck network. The performance of the baseline is not satisfactory, especially for small and medium traffic signs. After integrating RoIs of all scales into a single pyramid level by IO, the performance of Faster RCNN and Cascade RCNN is enhanced remarkably, especially for small and medium traffic signs. All methods achieve 100.0% mAP in the large size group of GTSDB because of the limited number of traffic signs shown in Table 1. IO also improves the performance on large traffic signs in STSD and TT100k, which indicates that IO brings stable performance enhancements. These results show that the behavior of IO is consistent with the intuition described in Sec. III.B. Moreover, detectors with IO run faster than before because the deep pyramid levels $C_5$, $P_5$, and $P_6$ are removed. These results demonstrate the effectiveness of the proposed IO.
We also evaluate the effect of FA. As reported in Table 2, FA further improves detector performance by large margins. Specifically, "Faster RCNN + IO + FA" improves the mAP from 68.7% to 78.0% on GTSDB and from 58.2% to 60.2% on STSD. IO and FA bring more substantial gains on the small datasets (GTSDB and STSD) than on the big dataset (TT100k) because the size and class imbalance is more pronounced in the small datasets. The experimental results show that each component of our method boosts performance, and that their combination achieves the best performance.

2) THE EFFECT OF REMOVING DIFFERENT PYRAMID FEATURE LAYERS
As mentioned in Sec. III.A, the deep pyramid feature layers of the original FPN [19] are removed in the proposed IFA-FPN to reduce inference time. We perform ablation experiments to investigate the influence of removing more pyramid levels; the results are reported in Table 3. In Table 3, the detector is Cascade RCNN [9], the backbone network is ResNet-50 [27], and the IO and FA modules are not applied. The original FPN uses the full set of pyramid layers $C_{2,3,4,5}$ + $P_{2,3,4,5,6}$. As more deep pyramid feature layers are dropped, the inference speed of the model improves.
The best performance is achieved when the pyramid feature layers $C_{2,3,4}$ + $P_{2,3,4}$ are used. Dropping the additional layers $C_{3,4}$ and $P_{3,4}$ accelerates inference but does not further improve model performance; the performance declines significantly when only $C_2$ + $P_2$ is used. Considering the trade-off between accuracy and efficiency, only $C_5$, $P_5$, and $P_6$ are removed in the proposed IFA-FPN.

3) THE ANALYSIS OF FA STRUCTURE
Table 4 evaluates the architectural design choices of FA shown in Fig. 4. The "w/o FA" entry in Table 4 denotes the baseline module, which adopts a 1 × 1 convolution layer as the lateral connection $L_2$, as illustrated in Fig. 4(a). All types of FA consistently improve the results, which indicates the necessity of enhancing the feature representation capability of $P_2$ and the validity of the proposed FA structure. Among the three FA structures, Light-FA is chosen as the primary structure for two reasons. First, both detectors achieve their best performance on all datasets with Light-FA. Second, Light-FA reduces the inference time and size of the model. Specifically, Cascade RCNN occupies 4962 MB of GPU memory with Full-FA and 4554 MB with Light-FA. The experiments demonstrate the superior performance and efficiency of the proposed FA.
Furthermore, we investigate the effect of the skip connections in Light-FA by removing one of them from the trained model at test time. The effects are measured by the mAP and its fluctuation on STSD, shown in Table 5. We start from the baseline results, listed in the first row for each detector as "All used", and progressively measure the impact of removing each of the skip connections $y_0$, $y_1$, and $y_2$. The fluctuations of the mAP of Faster RCNN and Cascade RCNN are consistent. The performance on the small and medium size groups is greatly affected by removing a skip connection, while the performance on the large size group is barely affected. This indicates that the skip connections play important roles in detecting small and medium traffic signs, especially small ones. The results demonstrate the validity of the proposed FA structure for multi-scale feature learning.

4) ANALYSIS OF SCALABILITY OF IFA-FPN
Table 6 demonstrates that the proposed IFA-FPN can be applied to mainstream object detectors with different backbone networks and improves their performance consistently at a similar inference speed. We compare FPN and the proposed IFA-FPN with the backbone networks ResNet-50, ResNeXt-50, ResNet-101, and ResNeXt-101 [27], [28]. IFA-FPN brings significant improvement over FPN with every backbone network, which is consistent with the ResNet-50 results in Table 2. The experiments demonstrate that IFA-FPN scales well to other detectors and backbone networks and can be regarded as a Plug-and-Play neck network.
To further demonstrate the superiority of the proposed neck network IFA-FPN, we compare it in Table 6 with two state-of-the-art neck networks, Balanced Feature Pyramid (BFP) [30] and Content-Aware ReAssembly of FEatures (CARAFE) [31]. Both BFP and CARAFE are designed on top of the original FPN architecture, and with ResNet-50 they achieve better performance than FPN on GTSDB and STSD. Compared with BFP and CARAFE, the proposed IFA-FPN still brings more substantial and consistent gains on all three traffic sign datasets, GTSDB, STSD, and TT100k. This is because BFP and CARAFE are designed for detecting general objects rather than traffic signs: they still map RoIs dispersedly to different pyramid levels, which this paper shows to be detrimental to traffic sign detection. Because of GPU memory limitations, some results on TT100k, such as those of BFP and of the ResNet-101 and ResNeXt-101 backbones, are not provided and are marked with "-".

D. COMPARISON WITH STATE-OF-THE-ART METHODS

1) COMPARISON WITH FEATURE-BASED METHODS
GTSDB† divides the traffic signs in GTSDB into three superclasses, Prohibitory signs (P.), Mandatory signs (M.), and Danger signs (D.), although GTSDB provides 43 classes in total. STSD† only considers the visible traffic signs in STSD that are at least 50 × 50 pixels. Moreover, STSD† only considers six main classes: PEDESTRIAN CROSSING, PASS RIGHT SIDE, NO STOPPING NO STANDING, 50 SIGN, PRIORITY ROAD, and GIVE WAY. Following the protocols of [2], [4], [5], we evaluate the proposed method and report the results in Table 7. For GTSDB†, the performance is reported as the area under the curve (AUC) for the three superclasses. For STSD†, the performance is reported as recall (Rec.), precision (Pre.), and F1-measure (F1.).
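For reference, the recall, precision, and F1-measure used for STSD† follow the standard definitions, sketched below. The function name and the example counts are hypothetical and are not taken from Table 7.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and
    false-negative counts; each value falls back to 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed signs.
pre, rec, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which makes it a convenient single number for comparing detectors that trade one for the other.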
As reported in Table 7, the feature-based methods wgy@HIT501 [4] and AdaBoost + SVR [2] achieve better performance than our method on GTSDB†, a small dataset with few classes. These two methods assume that the traffic signs within each superclass share the same color and shape, and therefore recognize traffic signs by extracting their color HOG features. Such hand-crafted feature-based methods work well on simple and small datasets but cannot achieve satisfactory performance on the larger dataset STSD†, where our method outperforms the feature-based methods because it uses a CNN to extract features. Hand-crafted features are not robust enough to represent the general characteristics of traffic signs discriminatively, which is why feature-based methods are evaluated only on subsets of high-quality traffic signs or on a subset of classes. Next, we compare our method with CNN-based traffic sign detection methods.

2) COMPARISON WITH CNN-BASED METHODS
Because the evaluation metrics used in state-of-the-art studies differ, we report the performance of the proposed method under two types of evaluation metrics in Table 8 and Table 9. The methods in Table 8 are evaluated using recall (Rec.), precision (Pre.), and F1-measure (F1.), and the methods in Table 9 are evaluated using mAP.
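For readers less familiar with the second metric, the sketch below shows one common way of computing average precision (AP) for a single class as the area under the interpolated precision-recall curve, with mAP being the mean of AP over classes. The exact mAP protocol behind Table 9 (IoU threshold, interpolation scheme) is not restated here, so this all-point interpolation, the function names, and the toy detections are assumptions for illustration only.

```python
def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class.
    scores: detection confidences; is_tp: whether each detection matched
    a ground-truth sign; num_gt: number of ground-truth signs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Toy example: four detections of one class, four ground-truth signs.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
print(f"AP = {ap:.3f}")
print(f"mAP over two hypothetical classes = {mean_average_precision([ap, 1.0]):.4f}")
```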
As shown in Table 8, without bells and whistles, the proposed neck network IFA-FPN greatly improves the F1-measure of Faster RCNN and Cascade RCNN on the small-size group, from 80.1% to 90.3% and from 85.6% to 89.0%, respectively. SSD [10] and MF-SSD [32] cannot achieve satisfactory performance because SSD-based detectors need to reduce the original image to a fixed, smaller resolution before forwarding it to the network.
This downscaling degrades traffic sign detection performance because it discards image information.
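The effect of this downscaling on small signs can be illustrated with simple arithmetic. The sketch below assumes a 2048 × 2048 source image and a 32 × 32 pixel sign, both chosen only for illustration, and computes the size of the sign after the image is resized to a 300 × 300 detector input; the function name is likewise hypothetical.

```python
def resized_box(w, h, src_size, dst_size):
    """Size of a w x h box after resizing an image from src_size to dst_size.
    Sizes are (width, height) tuples in pixels."""
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    return w * scale_x, h * scale_y

# Assumed example: a 32 x 32 sign in a 2048 x 2048 image, resized to 300 x 300.
new_w, new_h = resized_box(32, 32, (2048, 2048), (300, 300))
area_ratio = (new_w * new_h) / (32 * 32)

print(f"sign after resize: {new_w:.2f} x {new_h:.2f} pixels")
print(f"fraction of original sign pixels kept: {area_ratio:.4f}")
```

Under these assumptions the sign shrinks to under 5 pixels per side and keeps only about 2% of its original pixels, which leaves very little information for classification.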
For fair and comprehensive comparisons among different architectures, we performed all experiments in Table 9 under the same hardware limitations in MMDetection on the three datasets. We compare our method with the single-stage detectors SSD300, SSD500 [10], and YOLO [11], whose input images must be resized to 300 × 300, 500 × 500, and 608 × 608, respectively, before being fed into them. Consequently, these single-stage detectors perform poorly on small and medium traffic signs, which make up a large proportion of the datasets. The input image sizes of the other methods are consistent with the original images provided by the datasets. Attention [14] has no open-source implementation; hence, we did not reproduce it locally and directly use the results reported in the paper. When the proposed IFA-FPN is applied to Cascade RCNN, it achieves the best performance: 80.3% mAP on GTSDB, 65.2% mAP on STSD, and 93.6% mAP on TT100k. The proposed IFA-FPN consistently improves the performance of Faster RCNN and Cascade RCNN by a large margin on all three datasets, which demonstrates the effectiveness of IFA-FPN in traffic sign detection.
To illustrate the superiority of the proposed IFA-FPN directly, qualitative detection results of Faster RCNN with FPN and with our IFA-FPN on STSD and TT100k are shown in Fig. 5 and Fig. 6, respectively. The green bounding boxes are true positive (correct) detections, and the red bounding boxes are false positive detections. The predicted class and confidence score of each traffic sign are written on the boxes. We observe that the bounding boxes predicted by IFA-FPN are well aligned with the ground-truth traffic sign regions, which indicates that IFA-FPN outperforms FPN on both STSD and TT100k. On STSD, IFA-FPN reduces false positive detections, as shown in Fig. 5(b) and Fig. 5(e), and detects occluded traffic signs, as shown in Fig. 5(c) and Fig. 5(f). On TT100k, IFA-FPN accurately detects traffic signs deformed by camera distortion, whereas FPN either fails to generate well-fitting boxes for deformed traffic signs or misses them entirely. IFA-FPN also detects the traffic sign 'io' shown in the middle of the third row of Fig. 6, where FPN fails. Moreover, IFA-FPN consistently outputs higher confidence scores than FPN for the same target on both STSD and TT100k. The illustrations in Fig. 5 and Fig. 6 show that IFA-FPN yields more stable results than FPN.

V. CONCLUSION
This paper proposed a Plug-and-Play neck network called IFA-FPN that can be applied to mainstream object detectors to improve traffic sign detection performance while maintaining a similar inference speed. An integrated operation is introduced to overcome the size and class imbalance problem in traffic sign datasets by integrating RoIs of all scales into a certain pyramid level. Three types of feature aggregation structures that reinforce multi-scale feature learning are proposed and compared. Experiments were conducted to evaluate the performance of the proposed method on three mainstream datasets: GTSDB, STSD, and TT100k. The experimental results demonstrate the superiority of the proposed IFA-FPN.
In the future, we will focus on making the network more lightweight to achieve superior performance in both accuracy and efficiency, and we then aim to integrate the proposed method into the ADAS or ADS of a real vehicle. After