Traffic Sign Detection and Recognition Using Multi-Scale Fusion and Prime Sample Attention

Traffic sign detection, though one of the key technologies in intelligent transportation, still faces an accuracy bottleneck because traffic signs are small and diverse. To address this problem, we propose a two-stage CNN object detection algorithm based on multi-scale feature fusion and prime sample attention. We improve the original Faster R-CNN model in terms of feature extraction and sampling strategy. For feature extraction, to improve the ability of the network to detect small objects, we adopt HRNet as the feature extractor. HRNet has four stages: it starts from a high-resolution subnet and repeatedly adds parallel high-to-low resolution subnets to form the later stages. Throughout the process, information is repeatedly exchanged across the parallel multi-resolution subnets to perform repeated multi-scale fusion. For the sampling strategy, we adopt a simple and effective sampling and learning strategy called Prime Sample Attention (PISA), which consists of Importance-based Sample Reweighting (ISR) and Classification-Aware Regression Loss (CARL). PISA introduces the concepts of IoU Hierarchical Local Rank (IoU-HLR) and Score Hierarchical Local Rank (Score-HLR), which rank the importance of positive and negative samples in a mini-batch, respectively. With the proposed method, training focuses on prime samples rather than treating all samples equally. The algorithmic complexity of our method is lower than that of other state-of-the-art methods. In experiments on the TT100K dataset, our method attains comparable or even better detection accuracy and robustness.


I. INTRODUCTION
In recent years, road traffic sign detection has attracted wide attention because of the growing number of traffic accidents caused by drivers ignoring road signs. Academia has conducted in-depth research, and BMW, Mercedes-Benz, and other well-known automobile companies have also invested in the technology; the BMW Road Environment Perception System (REPS) is one example. REPS includes detection of vehicles ahead, pedestrians, and traffic signs. These studies realized automatic detection and recognition of traffic signs through computer vision; however, the detection accuracy still needs improvement.
Reflective materials, solid colors, and simple geometric shapes make traffic signs eye-catching; however, it remains difficult for computers to detect and identify them because their appearance is unstable across conditions such as viewing-angle changes, physical damage, and bad weather. Accurate detection and identification of traffic signs therefore remains a challenge.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
For researchers, there are several technical challenges in achieving high detection accuracy. First, relatively small objects are always difficult for computers to detect within a full image.
Second, many different instructions share a small set of fixed shapes, which makes accurate recognition difficult. For example, the traffic signs in the TT100K dataset [1] come in three shapes (rectangle, triangle, and circle), but they fall into 200 types of instruction.
Additionally, detection accuracy may be affected by multiple factors, such as fluctuation of the object size in the field of view, bad weather, and damage to the traffic sign itself. Fig. 1 shows some samples that are difficult to detect. To address these problems, we design a traffic sign detection system based on an improved Faster R-CNN [2] neural network. Our detection model achieves detection and classification accuracy comparable to state-of-the-art methods. The main contributions of this article are as follows.
First, we adopt HRNet [3], [4] as the feature extractor for traffic sign detection. For small objects, the multi-scale fusion feature extraction layer retains more information than the concatenated feature extraction layer. HRNet maintains a high-resolution representation throughout the process: it starts with a high-resolution subnet in the first stage, adds high-to-low resolution subnets one by one to form more stages, connects the multi-resolution subnets in parallel, and repeatedly exchanges information across them to perform repeated multi-scale fusion. Unlike traditional sequential fusion, which restores resolution from low to high, this scheme keeps the high-resolution branch throughout. Moreover, for traffic signs with few shapes but many types, a high-resolution feature map provides the network with more effective information.
Second, we adopt the PISA (Prime Sample Attention) [5] method to optimize our model. PISA is a simple and effective sampling strategy that uses a simple weighting scheme to make the network focus on the samples that have a greater impact on the training result (AP). Consequently, the learning efficiency of the network improves, and detection accuracy and robustness are enhanced.
The model designed in this paper shows significant superiority over state-of-the-art methods on the TT100K (Tsinghua-Tencent 100K) dataset. The rest of this paper is organized as follows. Section II briefly reviews related work on object detection and small object detection in computer vision. Section III presents the proposed Faster R-CNN approach for traffic sign detection. Section IV discusses our results and ablation studies. Section V concludes our work.

II. RELATED WORK
Traditional image detection technology is based on hand-crafted image features, such as SIFT (Scale Invariant Feature Transform) [6], SURF (Speeded-Up Robust Features) [7], and HOG (Histograms of Oriented Gradients) [8].
The SIFT approach proposed by David G. Lowe, combined with local spatial histogramming and normalization, performed very well in object detection and image matching. Bay et al. proposed the SURF algorithm, which improves interest-point extraction and feature-vector description; it reduces the computational complexity and running time of SIFT while maintaining its performance. Dalal et al. combined Histograms of Oriented Gradients descriptors with a conventional SVM [9] based sliding-window classifier, which achieved good performance in human detection.
Traditional object detection algorithms are still widely used because of their fast computation and low memory footprint [10]-[13]. For example, Anant Ram Dubey et al. [14] used the HOG-SVM method to detect road objects, and Takaki Masanari [15] used the SIFT [6] method to detect traffic signs. However, the accuracy of traditional object detection algorithms cannot compete with that of learning-based algorithms.
With the development of computer vision technology, machine learning and deep learning algorithms have been widely used in object detection because of their high detection accuracy [16]-[20]. Deep learning algorithms learn network models directly from labeled object datasets.
Deep learning detection algorithms include two-stage and single-stage algorithms. The two-stage algorithms include R-CNN [21], Fast R-CNN [22], Faster R-CNN [2], R-FCN [23], and Mask R-CNN [24]. Single-stage object detection algorithms are represented by SSD [25] and YOLO [26]-[28]; such algorithms directly predict object locations and categories without region proposals. However, single-stage algorithms are not as accurate as two-stage algorithms, especially in the detection of small objects.
Faster R-CNN [2] is a two-stage object detection algorithm proposed by Ren et al. in 2015. It mainly consists of a feature extractor, a Region Proposal Network (RPN), RoI pooling, and fully connected layers for classification and regression. Because of its excellent performance in object detection tasks, it is widely used in face, vehicle, pedestrian, and traffic sign detection, among other fields.
Research on deep learning-based traffic sign detection focuses on improving the feature extraction and sampling strategy of convolutional neural networks.
Along these lines, Wang et al. [29] improved the feature extractor of Cascade R-CNN [30]. They adopted ResNet101 [31] as the backbone of Cascade R-CNN, but their model is too complex to achieve real-time video detection. Han et al. [32] improved the feature extractor and sampling strategy of Faster R-CNN [2]. They used a shallow layer of VGG16 [33] as the feature map for the RPN and adopted OHEM [34] to improve the sampling strategy. However, this discards the semantic information of the deep feature maps of VGG16 and reduces the robustness of the model, and OHEM does not significantly improve performance. Jiang et al. [35] improved the loss function of YOLOv3 [28] by adopting GIoU [36] and Focal Loss [37], but the detection accuracy of this single-stage algorithm was too low to meet practical needs.
Based on the above related work, this paper adopts HRNet [3], [4] and HRFPN [3], [4] to improve the feature extractor of Faster R-CNN [2], and adopts the PISA [5] strategy to optimize its learning strategy. Our method attains comparable or even better detection accuracy and robustness than many state-of-the-art methods.

III. APPROACH
The architecture of Faster R-CNN is shown in Fig. 2. The input image is downsampled by the feature extractor to obtain the feature map. The feature map is fed into the Region Proposal Network, which produces several proposals. These proposals, together with the feature map, are fed into the RoI Pooling Layer to obtain the features of each proposal, which are then used in the Prediction Layer. The Classification Layer predicts the category of each proposal, while bounding box regression refines the position of the objects.

A. EXTRACTOR
Traditional feature extraction backbones such as VGG16 and ResNet perform poorly in the detection of traffic signs. One reason is that they make region proposals based only on the last feature map of the extractor, whose receptive field is too large. Taking Faster R-CNN as an example, if VGG16 is used as the extractor, the theoretical receptive field of the feature map output by the RPN is 228 × 228; if ResNet50 is used, the theoretical receptive field is 299 × 299.
We adopted HFM (Hot Feature Map) [38] to visualize the feature maps. The calculation formula of HFM can be expressed as (1). Fig. 3 shows a sample from the traffic sign dataset with a size of 800 × 800 × 3. Fig. 4 shows the visualized feature maps of Fig. 3 generated by VGG16 and ResNet50. Since the human eye is much more sensitive to color images than to grayscale images, we map the grayscale hot feature map to the YB color space in Fig. 4. It can be seen from Fig. 4 that the resolution of the feature map in deep layers is significantly reduced. Although a deep feature map contains rich semantic information about the image, it loses some detailed information about the object, which significantly reduces the small-object detection performance of the network. In other words, a feature extractor with a large receptive field is not suitable for detecting small objects.
In response to the above problems, we found that HRNet, an extractor mainly used for human pose estimation, is remarkably effective for detecting small objects, because multi-scale fusion of feature maps, as in FPN, can significantly improve the ability of neural networks to detect small objects. The structure of HRNet is shown in Fig. 5.
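The 228 × 228 receptive field quoted earlier in this subsection for VGG16 can be checked with a short calculation. The sketch below is a minimal illustration, assuming the standard VGG16 layout up to conv5_3 (3 × 3 convolutions with stride 1 and 2 × 2 max pooling with stride 2) followed by one 3 × 3 convolution at the start of the RPN; the function and variable names are hypothetical and not taken from the paper.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump: spacing, in input pixels, between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


CONV, POOL = (3, 1), (2, 2)  # 3x3 convolution, 2x2 max pooling

# Assumed VGG16 layout up to conv5_3: blocks of 2, 2, 3, 3, 3 convolutions,
# with a pooling layer after each of the first four blocks.
vgg16_conv5_3 = (
    [CONV] * 2 + [POOL]
    + [CONV] * 2 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3
)
rpn_head = [CONV]  # assumed: the 3x3 convolution applied by the RPN

rf_backbone = receptive_field(vgg16_conv5_3)
rf_rpn = receptive_field(vgg16_conv5_3 + rpn_head)
print(rf_backbone, rf_rpn)  # 196 228
```

Under these assumptions each RPN output location sees a 228 × 228 window, which on an 800 × 800 input is far larger than a typical traffic sign.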
The backbone of HRNet includes four stages. The network starts with a series of high-resolution convolution layers, then repeatedly adds and connects parallel multi-resolution subnets to form the second, third, and fourth stages. Throughout the process, information is repeatedly exchanged across the parallel multi-resolution convolutional layers to perform repeated multi-scale fusion.
The specific process of fusion in HRNet is shown in Fig. 6. In the backbone of HRNet, a layer with the same resolution is directly copied to the next layer. A low-resolution feature layer is upsampled bilinearly, and a 1 × 1 convolution then matches its channels to those of the high-resolution layer. A high-resolution feature layer is downsampled with a 3 × 3 strided convolution. After upsampling and downsampling, feature maps of different resolutions are fused by element-wise addition. To reduce information loss during downsampling, pooling layers are not used.
FIGURE 6. The specific process of fusion in the HRNet network. The layer with the same resolution is directly copied to the next layer. We used the bilinear upsample method and a 3 × 3 convolution kernel to downsample.
We adopted a feature pyramid network based on HRNet, HRFPN, to enhance the network's ability to detect small objects. Its architecture is shown in Fig. 5. It mixes the output representations from all four resolutions through a 1 × 1 convolution, producing a 15C-dimensional representation (where C is the channel width of the highest-resolution branch), and then reduces the dimension of the high-resolution representation to 256, similar to FPN [39].
Ke Sun et al. proposed three models in their HRNet paper: w18, w32, and w48, where 18, 32, and 48 denote the channel width of the highest-resolution branch. We adopted w18 as the improved feature extractor of Faster R-CNN; the reasons for this choice are explained in the ablation study. We resized the 3 × 2048 × 2048 images to 3 × 800 × 800 and fed them into HRNet, obtaining feature maps of 18 × 200 × 200, 36 × 100 × 100, 72 × 50 × 50, and 144 × 25 × 25. We then fed them into HRFPN, unified their channels to 256 through a 1 × 1 convolution, and fused them to obtain a 256 × 200 × 200 feature map, from which average pooling layers produce 256 × 100 × 100, 256 × 50 × 50, 256 × 25 × 25, and 256 × 13 × 13 feature maps. These are sent to the RoI Pooling Layer separately.
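The feature-map sizes listed above follow from simple shape arithmetic. The sketch below is only a bookkeeping illustration, not the network itself: it assumes that the first HRNet branch runs at 1/4 of the input resolution, that each further branch doubles the channels and halves the side length, and that each pooling step rounds up (so that 25 becomes 13). The function names are hypothetical.

```python
def hrnet_outputs(width, input_size, branches=4):
    """(channels, height, width) of each parallel branch of the backbone."""
    shapes = []
    for b in range(branches):
        side = input_size // (4 * 2 ** b)  # assumed 1/4 stride for branch 0
        shapes.append((width * 2 ** b, side, side))
    return shapes


def hrfpn_outputs(branch_shapes, out_channels=256, levels=5):
    """Channel count after concatenation and the resulting pyramid shapes."""
    concat_channels = sum(c for c, _, _ in branch_shapes)
    side = branch_shapes[0][1]  # fusion happens at the highest resolution
    pyramid = []
    for _ in range(levels):
        pyramid.append((out_channels, side, side))
        side = -(-side // 2)  # ceiling division: assumed pooling round-up
    return concat_channels, pyramid


branches = hrnet_outputs(width=18, input_size=800)
concat_channels, pyramid = hrfpn_outputs(branches)
print(branches)
print(concat_channels, pyramid)
```

For w18 the concatenated representation has 18 + 36 + 72 + 144 = 270 channels, that is, 15 times the base width, before it is reduced to 256.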
Compared with the traditional sequential top-down fusion strategy, HRNet maintains high resolution instead of recovering resolution from low to high. It performs repeated multi-scale fusion with the help of low-resolution blocks of the same depth and similar level to strengthen the high-resolution representation, so the feature map may be more accurate.

B. SAMPLING STRATEGY
The original RPN samples by randomly selecting some positive and negative samples from all anchors. However, according to Cao et al. [5], the samples in each mini-batch are neither independent nor equally important. Therefore, we adopted a simple and effective sampling and learning strategy called Prime Sample Attention (PISA), which shifts the focus of the training process to prime samples. Our experiments showed that focusing on prime samples is usually more effective than focusing on hard or random samples when training the detection network. According to Cao et al. [5], the positive samples that affect training most are those with higher IoU, while the important negative samples are those with higher classification scores.
PISA introduces the concepts of IoU Hierarchical Local Rank (IoU-HLR) and Score Hierarchical Local Rank (Score-HLR), which rank the importance of positive and negative samples, respectively, after region proposal in each iteration. As shown in Fig. 7, to compute IoU-HLR, we first divide all samples into groups according to their nearest ground-truth object. Next, the samples in each group are sorted in descending order of IoU with the ground truth, which gives the IoU local rank (IoU-LR). Subsequently, samples with the same IoU-LR are collected and sorted in descending order of IoU. Specifically, we collect and sort all top-1 IoU-LR samples, followed by top-2, top-3, and so on. These two steps yield the ranking of all samples.
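The two-step ranking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each positive sample is given as the index of its nearest ground-truth object plus its IoU with that object, and ties are not handled specially. The function name `iou_hlr` and the sample values are hypothetical.

```python
from collections import defaultdict


def iou_hlr(gt_index, iou):
    """Return sample indices ordered from most to least important."""
    # Step 1: group samples by their nearest ground-truth object.
    groups = defaultdict(list)
    for i, g in enumerate(gt_index):
        groups[g].append(i)

    # Step 2: sort each group by IoU (descending) to get the local rank,
    # and collect samples that share the same local rank.
    by_local_rank = defaultdict(list)
    for members in groups.values():
        members.sort(key=lambda i: iou[i], reverse=True)
        for local_rank, i in enumerate(members):
            by_local_rank[local_rank].append(i)

    # Step 3: emit all top-1 samples (sorted by IoU), then top-2, and so on.
    order = []
    for local_rank in sorted(by_local_rank):
        level = sorted(by_local_rank[local_rank], key=lambda i: iou[i], reverse=True)
        order.extend(level)
    return order


# Five positive samples: three around object 0, two around object 1.
gt_index = [0, 0, 0, 1, 1]
iou = [0.90, 0.60, 0.75, 0.80, 0.55]
print(iou_hlr(gt_index, iou))  # [0, 3, 2, 4, 1]
```

In this example sample 4 (IoU 0.55) is ranked above sample 1 (IoU 0.60) because it is the second-best sample of its own object, whereas sample 1 is only the third-best of its object. This is the hierarchical property that distinguishes the ranking from a plain global sort by IoU.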
As shown in Fig. 8, we compute the Score-HLR of negative samples in a similar way. Unlike positive samples, which are naturally grouped by ground-truth (gt) object, negative samples may appear in the background area, so we first group them into clusters using NMS. We then take the highest score over all foreground categories as the score of each negative sample, and perform the same steps as for IoU-HLR.
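The grouping and scoring steps for negative samples can be sketched as follows. The text does not spell out how NMS forms the clusters, so this sketch assumes a greedy scheme: the highest-scoring unassigned box starts a cluster and absorbs every unassigned box whose IoU with it reaches a threshold. The threshold of 0.5, the box format (x1, y1, x2, y2), and all names are assumptions made for illustration.

```python
from collections import defaultdict


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_clusters(boxes, scores, threshold=0.5):
    """Assign each box to the cluster of the box that would suppress it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    cluster = [-1] * len(boxes)
    next_id = 0
    for i in order:
        if cluster[i] != -1:
            continue
        cluster[i] = next_id
        for j in order:
            if cluster[j] == -1 and box_iou(boxes[i], boxes[j]) >= threshold:
                cluster[j] = next_id
        next_id += 1
    return cluster


def score_hlr(boxes, class_scores, threshold=0.5):
    """Rank negative samples: cluster by NMS, then rank hierarchically by score."""
    scores = [max(s) for s in class_scores]  # highest foreground score
    cluster = nms_clusters(boxes, scores, threshold)
    groups = defaultdict(list)
    for i, c in enumerate(cluster):
        groups[c].append(i)
    by_local_rank = defaultdict(list)
    for members in groups.values():
        members.sort(key=lambda i: scores[i], reverse=True)
        for local_rank, i in enumerate(members):
            by_local_rank[local_rank].append(i)
    order = []
    for local_rank in sorted(by_local_rank):
        order.extend(sorted(by_local_rank[local_rank], key=lambda i: scores[i], reverse=True))
    return order


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (51, 51, 61, 61)]
class_scores = [[0.9, 0.1], [0.2, 0.7], [0.8, 0.3], [0.3, 0.1]]
print(score_hlr(boxes, class_scores))  # [0, 2, 1, 3]
```

The two overlapping pairs form two clusters, so the best box of each cluster is ranked first and the runners-up follow.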
PISA consists of two components: Importance-based Sample Reweighting (ISR) and Classification-Aware Regression Loss (CARL). With the proposed method, training focuses on prime samples rather than treating all samples equally.
VOLUME 9, 2021
FIGURE 9. Some categories in the TT100K dataset.
As described by Yuhang Cao et al. [5], the computation of ISR can be expressed as (2)-(4). First, we rank the samples by IoU-HLR or Score-HLR and transform the rank into a real value; this process can be expressed as (2), where u_i is the importance value of the i-th sample of category j, and n_max is the maximum of n_j over all categories, which ensures that samples at the same rank in different categories are assigned the same u_i. We then need a monotonically increasing function to convert the sample importance value u_i into a loss weight w_i, as in (3), where γ is a degree factor indicating the degree to which important samples will be prioritized, and β is the bias that determines the minimum sample weight. Based on the above improvements, the classification loss of Faster R-CNN can be rewritten as (4), where CE is the abbreviation of cross entropy; s and ŝ represent the prediction score and the classification target; and n and m are the numbers of positive and negative samples, respectively. To keep the total loss relatively stable, we normalize w to w′.
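The reweighting steps above can be illustrated numerically. Equations (2)-(4) are not reproduced in this text, so the sketch below assumes the forms given in the PISA paper [5]: u_i = (n_max − r_i) / n_max for a sample with rank r_i (0 for the most important sample), w_i = ((1 − β) u_i + β)^γ, and a normalization that rescales the weights so that the weighted loss sum equals the unweighted one. The function names and the cross-entropy values are hypothetical.

```python
def isr_weights(ranks, n_max, gamma=2.0, beta=0.0):
    """Map hierarchical ranks (0 = most important) to loss weights."""
    weights = []
    for r in ranks:
        u = (n_max - r) / n_max                           # assumed form of (2)
        weights.append(((1 - beta) * u + beta) ** gamma)  # assumed form of (3)
    return weights


def normalize(weights, losses):
    """Rescale weights so the weighted loss sum equals the unweighted sum."""
    scale = sum(losses) / sum(w * l for w, l in zip(weights, losses))
    return [w * scale for w in weights]


ranks = [0, 1, 2, 3]
w = isr_weights(ranks, n_max=4)        # gamma = 2.0, beta = 0 as used for positives
ce = [0.2, 0.4, 0.6, 0.8]              # made-up per-sample cross-entropy values
w_norm = normalize(w, ce)

print(w)  # [1.0, 0.5625, 0.25, 0.0625]
print(sum(wn * l for wn, l in zip(w_norm, ce)), sum(ce))
```

With γ = 2.0 the weight falls off quickly with rank, so the top-ranked samples dominate the classification loss, while the normalization keeps the overall loss magnitude unchanged.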
The role of CARL is to highlight the prime samples while suppressing the others. CARL optimizes localization and classification jointly; it can be expressed as (5), where p_i denotes the predicted probability of the ground-truth class and d_i denotes the output regression offset. L is the commonly used smooth L1 loss. With CARL, the classification branch is also supervised by the regression loss, so the impact of non-prime samples is greatly suppressed and the focus on prime samples is strengthened.
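A small numerical sketch may help show how the classification score modulates the regression loss. Equation (5) is not reproduced in this text, so the code assumes the form given in the PISA paper [5]: v_i = ((1 − b) p_i + b)^k, c_i = v_i divided by the mean of v, and a loss equal to the sum of c_i times the smooth L1 loss of sample i. Offsets are reduced to one scalar per sample for brevity, and all names and values are hypothetical.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 loss of a scalar residual x."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta


def carl_loss(probs, offsets, targets, k=1.0, b=0.2):
    """Classification-aware regression loss under the assumed form of (5)."""
    v = [((1 - b) * p + b) ** k for p in probs]
    mean_v = sum(v) / len(v)
    c = [vi / mean_v for vi in v]  # weights average to 1
    per_sample = [smooth_l1(d - t) for d, t in zip(offsets, targets)]
    return sum(ci * li for ci, li in zip(c, per_sample)), c


probs = [0.9, 0.1]      # predicted probability of the ground-truth class
offsets = [0.5, 0.5]    # predicted regression offsets (scalar for brevity)
targets = [0.0, 0.0]    # regression targets
loss, c = carl_loss(probs, offsets, targets)
print(loss, c)
```

Both samples have the same regression error here, yet the confidently classified one receives roughly three times the weight. In training, gradients of this loss also reach p_i, which is how the regression loss supervises the classification branch.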

IV. EXPERIMENTS
A. DATASET
TT100K [1], provided by Tsinghua University and Tencent Corporation, is a large traffic sign benchmark built from 100000 Tencent Street View panoramas. The dataset contains 9176 images (6105 for training and 3071 for testing). These images are 2048 × 2048 pixels, contain 221 types of traffic signs, and cover large variations in illuminance and weather conditions. Each traffic sign in the benchmark is annotated with a class label, its ground-truth bounding box (gt bbox), and a pixel mask. We performed statistical analysis on the TT100K dataset and summarized the results in Table 1. From Fig. 10(a) and Fig. 10(b), we can conclude that most gt bboxes in TT100K, 92.81%, have an area of less than 9216 pixels. Those with an area from 1024 to 9216 pixels account for 53.42%, and those with an area of less than 1024 pixels account for 39.28%. Some datasets, such as COCO [41], divide objects into three groups by size, namely small (area ∈ [1, 1024)), middle (area ∈ [1024, 9216)), and large (area ≥ 9216). We consider this division unsuitable for TT100K because its images are 2048 × 2048: even a gt bbox with an area of 9216 pixels occupies only about 0.22% of the entire image, which obviously cannot be called a "large object". From the perspective of practical applications, we retained all 221 types of objects in TT100K for detection.
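The size argument above is easy to verify with a few lines of arithmetic. The sketch below encodes the COCO-style area thresholds quoted in the text and computes the share of a 2048 × 2048 image covered by a box at the "large" threshold; the helper names are hypothetical.

```python
def coco_size_group(area):
    """COCO-style size group from a bounding-box area in pixels."""
    if area < 32 ** 2:       # 1024
        return "small"
    if area < 96 ** 2:       # 9216
        return "middle"
    return "large"


def image_share(area, side=2048):
    """Percentage of a side x side image covered by a box of the given area."""
    return 100.0 * area / (side * side)


print(coco_size_group(9216), round(image_share(9216), 2))  # large 0.22

# Side length of a 96 x 96 box after resizing the image from 2048 to 800 pixels.
print(96 * 800 / 2048)  # 37.5
```

A box that COCO would call large therefore covers about 0.22% of a TT100K image, and after resizing to 800 × 800 its side shrinks to 37.5 pixels.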
In addition, Fig. 10(c) is a scatter diagram of the gt bboxes. Using least squares regression, we estimated the approximate interval of the bbox aspect ratio to be (0.34, 1.38). We hoped that using smaller anchor areas and matching aspect ratios in the RPN of Faster R-CNN would improve detection accuracy. Unfortunately, this optimization has little effect: the AP (Average Precision) is 0.237 with the original Faster R-CNN, 0.238 with suitable aspect ratios, and 0.242 with suitable aspect ratios and a smaller anchor size. The detection performance was not improved significantly.

B. TRAINING DETAILS
The experiments were run on NVIDIA TITAN Xp graphics cards (unless otherwise specified, two cards working in parallel) under Ubuntu 16.04 LTS with CUDA 10.0, using the PyTorch 1.3.1 framework with Python 3.7.2.
In preprocessing, we first resized each input image to 800 × 800 and then used a random flip strategy to augment the dataset. In each training epoch, every training image is flipped with a probability of 0.5.
During training, the total number of epochs is 48 and the initial learning rate is 0.02. We used the "linear warmup" [31] method to slowly increase the learning rate to 0.02, with 500 warm-up iterations and a warm-up ratio of 0.001. The decay ratio is 0.1: the learning rate is reduced to 0.002 after 32 epochs and to 0.0002 after 44 epochs. We used the "momentum" method to accelerate gradient descent, with a momentum coefficient of 0.9, and the "weight decay" method to prevent overfitting, with a weight decay coefficient of 0.0001 [40].
In addition, for (3) and (5), we adopted the settings of the original paper after an ablation study: γ_p = 2.0, γ_n = 0.5, and β_p = β_n = 0 for ISR, and k = 1 and b = 0.2 for CARL. Here γ_p and β_p are the degree factor and bias used when ranking positive samples, and γ_n and β_n are those used when ranking negative samples. The specific experimental details are presented in the ablation study.
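The learning rate schedule described above can be written as a single function. This is a sketch under assumptions that the text does not state: during warm-up the rate is interpolated linearly from warmup_ratio × base_lr up to base_lr, epochs are counted as the number of completed epochs, and warm-up is tracked by global iteration. The function name and signature are hypothetical.

```python
def learning_rate(epoch, iteration, base_lr=0.02, warmup_iters=500,
                  warmup_ratio=0.001, steps=(32, 44), decay=0.1):
    """Learning rate after `epoch` completed epochs at global `iteration`."""
    # Step decay: multiply by `decay` once for every milestone already passed.
    lr = base_lr * decay ** sum(epoch >= s for s in steps)
    # Linear warm-up: scale from warmup_ratio up to 1 over the first iterations.
    if iteration < warmup_iters:
        k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
        lr *= 1 - k
    return lr


for epoch, iteration in [(0, 0), (0, 250), (0, 500), (32, 100000), (44, 140000)]:
    print(epoch, iteration, learning_rate(epoch, iteration))
```

Under these assumptions training starts at 0.02 × 0.001 = 0.00002, reaches 0.02 after 500 iterations, and then drops to 0.002 and 0.0002 at the two milestones. The iteration counts in the loop are arbitrary illustrative values.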

C. DETECTION PERFORMANCE AND EFFICIENCY
We evaluated traffic sign detection methods in terms of algorithm complexity, computing speed, accuracy, and robustness. We compared our method with two representative generic object detectors, Faster R-CNN with FPN and Cascade R-CNN, and three state-of-the-art traffic sign detectors proposed by Wang et al., Han et al., and Jiang et al. The experimental results in Section IV-A showed that the AP obtained by the original Faster R-CNN is only 0.237, which is much lower than that of the other state-of-the-art methods, so we used the improved version, Faster R-CNN with FPN, instead of the original one for comparison with our method.
We used the number of parameters, FLOPs (floating-point operations), and inference speed to represent the algorithm complexity and computational efficiency of these methods. The results are shown in Table 2.
Our model has 40.08M parameters, of which HRNet accounts for 21.3M, or about 53%. This is 4.2M fewer than ResNet50 (25.5M parameters) but more than VGG16 (14.7M parameters). However, HRNet performs better than these classic feature extractors. Our model has fewer parameters and FLOPs than the models of Wang et al. [...] simplified VGG16 network. This will reduce the robustness of the model. In addition, the inference speed of our model reaches 26 FPS. Although our method is slower than single-stage algorithms such as SSD and YOLO, its inference speed is still better than that of most traditional two-stage algorithms such as Faster R-CNN and Cascade R-CNN.
We use AP (AP, AP50, AP75) and AR (Average Recall) to evaluate the accuracy of the methods. The results are shown in Table 3. In all experiments, we uniformly used 800 × 800 images as input. We trained the models for 48 epochs, with learning rate decay steps at epochs 32 and 44; the exception is the model of Jiang et al. (improved YOLOv3), whose loss function converges slowly and whose AP becomes relatively stable only after 273 epochs. The AP obtained by our method is 0.352, with AP50 of 0.444 and AP75 of 0.428. All the AP and AR values obtained by the object detection algorithms are relatively low because the dataset has 221 types. Nevertheless, our detection accuracy is still 10%∼20% higher than the current state of the art. For example, the AP obtained by our method is 18% higher than that of Faster R-CNN with FPN and 7% higher than that of Cascade R-CNN.
Fig. 11 illustrates the precision-recall curves of our method and the other methods for AP50, AP75, and Loc (localization errors ignored, but not duplicate detections). The precision-recall curve is a common measure of the performance of object detectors. AP50 and AP75 are the APs when the IoU threshold is set to 0.5 and 0.75, respectively. Loc is an indicator that considers classification accuracy only. All of these are commonly used evaluation indicators for object detection methods. As Fig. 11 shows, the P-R curve of our method is generally higher than those of the other methods. Considering that it uses a much lower input resolution of 800 × 800, our method is still competitive from the precision-recall perspective. The visualization of our detection results is shown in Fig. 14. As can be seen, our method can effectively detect small and multiple objects in images.
In addition, we evaluated the robustness of the model. We used the corruption image generation method of Hendrycks and Dietterich [42] to simulate four types of severe conditions (brightness, frost, fog, and snow), and divided each type into 5 severity levels according to the benchmark in their paper [43]. For example, taking Fig. 3 as the original image, the generated corruption images are shown in Fig. 12.
We denote the AP obtained on the original dataset as "AP Clean" and the AP obtained on the corrupted dataset as "AP Corruption", and use the percentage of "AP Corruption" relative to "AP Clean" to represent the robustness of a method. This process can be expressed by (6). The results of our robustness experiment are shown in Table 4.
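The robustness measure defined above is a simple ratio. The sketch below assumes that (6), which is not reproduced in this text, has the form 100 × AP Corruption / AP Clean, as the verbal definition suggests. The clean AP of 0.352 is the value reported for our method; the corrupted AP of 0.350 is a hypothetical value chosen only for illustration.

```python
def ap_percentage(ap_corruption, ap_clean):
    """AP on corrupted images as a percentage of AP on clean images."""
    if ap_clean <= 0:
        raise ValueError("ap_clean must be positive")
    return 100.0 * ap_corruption / ap_clean


ap_clean = 0.352        # AP of our method on the original test set
ap_corrupted = 0.350    # hypothetical AP on a corrupted copy of the test set
print(round(ap_percentage(ap_corrupted, ap_clean), 2))  # 99.43
```

A value close to 100% means that the corruption barely affects detection, and lower values indicate larger degradation.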
To make the experimental results easy to observe, we plotted the AP percentages of the different methods as a line chart, shown in Fig. 13. In general, our method maintains good performance for traffic sign detection under severe weather.
For brightness, our algorithm is almost unaffected by the corruption when its severity is low. When brightness severity is 1 and 2, the AP percentage is 99.43% and 95.74%, which is second only to Cascade R-CNN with ResNet50. When brightness severity is 3, 4, and 5, the performance is slightly inferior, but the AP percentage is still maintained at 92.33%, 86.93%, and 79.83%, respectively.
For frost, our method obtains AP percentages of 91.76%, 81.25%, 71.59%, 68.47%, and 61.36%, respectively. When severity is 1, the AP percentage is second only to Wang et al. When severity is 2, it is the highest. When severity is 3, it is lower only than Han et al. When severity is above 3, it is slightly inferior but still higher than 60%, which is within the acceptable range.
For fog, our method obtains AP percentages of 87.78%, 82.39%, 77.27%, 74.72%, and 67.90%. When severity is under 5, the AP percentage is second only to Faster R-CNN with FPN, and both exceed 70%. When severity is 5, it is slightly inferior but still higher than 60%.
For snow, our method obtains AP percentages of 86.08%, 67.90%, 64.20%, 52.84%, and 46.59%. Although the AP percentage is always second only to Faster R-CNN with FPN, it falls below 50% when severity is 5. It can be considered that our model has a slightly weaker detection ability for snowy weather and the best robustness for foggy weather.

D. ABLATION ANALYSIS
The main improvements to Faster R-CNN in this paper concern feature extraction and sampling strategy. This section discusses the impact of these two improvements in detail. The ablation experiments are trained for 48 epochs in total, with learning rate decay steps at epochs 32 and 44. The experimental results are shown in Table 5.

1) IMPROVED FEATURE EXTRACTOR
In the HRNet paper [3], Ke Sun et al. provide three backbone widths, namely HRNet-w18, HRNet-w32, and HRNet-w48, whose highest-resolution branches have 18, 32, and 48 channels, respectively. HRNet-w48 has 77.5M parameters, but its detection performance on the COCO [41] dataset is not significantly better, so we do not consider it as a feature extractor for the traffic sign detection model. In Table 5, we compare experiments 1 and 4, 2 and 5, and 3 and 6. With the PISA sampling strategy, the AP obtained with HRNet-w18 is 0.352, which is 0.007 higher than with HRNet-w32. With the OHEM hard negative mining strategy, it is 0.319, which is 0.002 lower than with HRNet-w32. With the random strategy, it is 0.324, which is 0.004 higher than with HRNet-w32. In general, HRNet-w18 performs better. We attribute this to overfitting: HRNet-w32 has too many parameters for the low-resolution 800 × 800 input. For experiments 2 and 5, we believe that the small feature map limited the effectiveness of OHEM.

2) SAMPLING STRATEGY
For HRNet-w18, the AP obtained by the PISA sampling strategy is 0.352, which is 0.033 and 0.028 higher than the OHEM and random strategies, respectively. For HRNet-w32, the AP obtained by the PISA sampling strategy is 0.345, which is 0.025 and 0.024 higher than the OHEM and random strategies, respectively. This indicates that the PISA strategy is more suitable for traffic sign detection than the other sampling strategies.
In addition, we conducted ablation studies on the hyperparameters of PISA. The experimental results are shown in Table 6, where γ and β are for ISR: γ_p and β_p are the degree factor and bias used when ranking positive samples, and γ_n and β_n are those used when ranking negative samples. k and b are for CARL.
The conclusion of the hyperparameter experiment is basically the same as that of the original paper [5]. We therefore adopt γ_p = 2.0, γ_n = 0.5, and β_p = β_n = 0 for ISR, and k = 1.0 and b = 0.2 for CARL. According to the results in Section IV-A, the AP obtained by the original Faster R-CNN is 0.237, while the AP obtained by our model is 0.352, which is 0.115 higher, an increase of about 48.5%.
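The absolute and relative gains quoted above can be reproduced directly from the two AP values given in the text. The helper name below is hypothetical.

```python
def gain(baseline, improved):
    """Absolute gain and relative gain (in percent) of `improved` over `baseline`."""
    absolute = improved - baseline
    return absolute, 100.0 * absolute / baseline


absolute, relative = gain(baseline=0.237, improved=0.352)
print(round(absolute, 3), round(relative, 1))  # 0.115 48.5
```

The relative figure is computed against the baseline AP of 0.237, which is why an absolute gain of 0.115 corresponds to roughly 48.5%.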

V. CONCLUSION
In this paper, we proposed a two-stage CNN traffic sign detection algorithm based on an improved Faster R-CNN. We used the parallel fusion feature extraction network HRNet to improve the feature extractor of Faster R-CNN, and the prime sample attention strategy PISA to improve its sampling strategy. Through the overall design, the algorithmic complexity of our method is lower than that of other state-of-the-art methods. In experiments on the TT100K dataset, our method attains comparable or even better detection accuracy and robustness. In the future, we will continue to speed up our model while maintaining high accuracy, and conduct in-depth research on the feature fusion mechanism of object detection neural networks.
JUNJU ZHANG was born in 1979. He received the Ph.D. degree in optical engineering from the Nanjing University of Science and Technology.
In recent years, he has been responsible for or participated in more than 20 national, provincial, or horizontal development research projects. He has published more than 30 papers in journals and conferences, of which more than 20 were indexed by SCI and EI, applied for five national invention patents, and compiled one national textbook. His main research interests include photoelectric information detection, image signal processing, photoelectric emission material theory, and preparation technology.
WEI HUANG was born in Hubei, China, in 1996. He is currently pursuing the master's degree with the School of Electronic Engineering and Optoelectronic Technology, Nanjing University of Science and Technology. His main research interests include photoelectric detection and image engineering.