An Improved Light-Weight Traffic Sign Recognition Algorithm Based on YOLOv4-Tiny

To address the low detection accuracy and imprecise localization of light-weight networks in traffic sign recognition, an improved light-weight traffic sign recognition algorithm based on YOLOv4-Tiny is proposed. First, the K-means clustering algorithm is improved to generate anchor boxes of appropriate size for the traffic sign data set, which improves the detection recall rate and the target positioning accuracy. Second, a large-scale feature map optimization strategy is proposed, which enriches the feature hierarchy of the network with low-level information, strengthens the feature representation of small targets, and improves the detection accuracy of long-range small targets. Third, to reduce missed detections of highly overlapping targets in the post-processing stage of the model, an improved NMS algorithm is proposed to screen the prediction boxes so that prediction results belonging to different targets are not deleted, which further improves the detection accuracy and recall rate. Experimental results on the TT100K data set show that, compared with the original YOLOv4-Tiny algorithm, the improved algorithm raises mAP and recall by 5.73 and 7.29 percentage points respectively, while the detection speed is maintained at about 87 f/s, which meets the accuracy and real-time requirements of the traffic sign recognition task.


I. INTRODUCTION
Traffic sign recognition is an important research topic for autonomous driving systems and driver assistance systems. Traffic signs carry a great deal of useful information, which prompts drivers to respond correctly to road conditions in real time, greatly reduces the occurrence of traffic accidents and improves driving safety [1]. Therefore, the study of fast and accurate traffic sign recognition in real scenes has important practical value and a wide range of application scenarios.

A. LITERATURE REVIEW
Traffic sign recognition is often carried out in a complex outdoor environment, which is vulnerable to interference from the natural environment and human factors, such as bad weather, varying illumination, and occluded or damaged signs, thus causing recognition difficulties [2]. For this reason, a large number of complex algorithms have been proposed in traffic sign recognition research [3]. Traditional algorithms mainly rely on hand-crafted feature extraction, such as local binary pattern (LBP) [4], Gabor [5], histogram of oriented gradient (HOG) [6], etc., and use support vector machine (SVM) [7], AdaBoost [8] and other classifiers to complete traffic sign recognition. However, in complex outdoor environments, hand-crafted feature extraction cannot meet practical needs.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudhakar Radhakrishnan.
In recent years, deep learning models have gradually become the main approach in the field of target detection, and convolutional neural networks have achieved remarkable results in this field [9], [10]. Target detection models based on deep learning can be divided into two types. The first is the two-stage detection algorithm based on region proposal, such as region convolutional neural networks (R-CNN) [11], Fast R-CNN [12], Faster R-CNN [13], etc. A two-stage detection algorithm first generates target candidate boxes, and then classifies and regresses them. The second is the single-stage detection algorithm, such as you only look once (YOLO) [14]-[17] and single shot multibox detector (SSD) [18], which predicts object classes and locations directly from the features extracted by the network. In comparison, the two-stage detection method has high precision but slow speed, whereas the single-stage detection algorithm is fast but less accurate.
At present, many scholars apply target detection algorithms to traffic sign detection. Zuo et al. [19] use Faster R-CNN to detect traffic signs. You et al. [20] tailor the network to reduce the computational complexity of the SSD algorithm and apply it to traffic sign detection. Yang et al. [21] and Yuan et al. [22] extract the region of interest from the input image by adding an attention module to the convolutional neural network, which refines the feature extraction of traffic signs under complex backgrounds. Zhang et al. [23] make effective use of low-level fine-grained features to achieve accurate target positioning through image enhancement and the introduction of a spatial pyramid pooling (SPP) module in YOLOv3.

B. MAIN WORK
Although the existing deep learning methods have achieved some results in the task of traffic sign detection, they still have limitations in complex natural environments. For example, traffic sign models designed for real-time detection have low recall and detection accuracy, and the shooting angle can make traffic signs overlap, so that some targets are missed. To solve these problems, this paper proposes several improvement strategies based on YOLOv4-Tiny, the lightweight version of YOLOv4, so as to improve the detection accuracy and the robustness of the model while still meeting the real-time requirement. The main contributions are as follows:
(1) The K-means clustering algorithm is improved to generate anchor boxes. Generalized IoU (GIoU) [24] replaces intersection over union (IoU) in the distance calculation formula of the K-means clustering algorithm. GIoU takes the non-overlapping area into account: when the two boxes do not completely coincide, GIoU introduces a non-overlapping area term, which better reflects shape information, thus improving the recall rate of the model and accelerating its convergence.
(2) A large-scale feature map optimization strategy is proposed, and the sizes of the two output feature maps of YOLOv4-Tiny are changed from the original 19 × 19 and 38 × 38 to 38 × 38 and 76 × 76, so as to improve the detection accuracy of small targets.
(3) An improved algorithm based on soft non-maximum suppression (soft-NMS) [25] is proposed. To address the missed detections caused by overlapping traffic signs in the traffic sign data set, the score reset function is improved, which reduces the missed detection rate and improves the generalization of the model.

II. ALGORITHM MODEL
YOLOv4-Tiny is a simplified version of YOLOv4. Compared with YOLOv4, YOLOv4-Tiny has a faster detection speed but lower accuracy. Fig. 1 shows the network structure of YOLOv4-Tiny. Compared with YOLOv4, the backbone network of YOLOv4-Tiny is greatly simplified. A feature pyramid network produces feature maps at two scales, obtained by 32× and 16× down-sampling, for target detection, which improves the detection speed.
There are several basic components in the YOLOv4-Tiny network structure. The CBL module consists of a convolution (Conv), batch normalization (BN) and the Leaky ReLU activation function. The down-sampling CBL module sets the stride of the convolution kernel to 2 to achieve down-sampling. The CSP module is modeled on the CSPNet network structure and is composed of CBL modules and a Concat tensor splicing module, so as to better integrate feature information.
Assuming the input image size is 608 × 608 × 3, the details of the backbone network parameters of YOLOv4-Tiny are shown in Table 1.

III. IMPROVED NETWORK MODEL OF YOLOV4-TINY

A. IMPROVED K-MEANS CLUSTERING ALGORITHM
In order to improve the recall rate of the algorithm model, the anchor box mechanism is introduced [13]. The anchor box is an initial candidate box with a fixed size and aspect ratio. The design of the anchor box will directly affect the convergence difficulty of the loss function in model training, thus affecting the detection accuracy and speed of the model.
The size and aspect ratio of anchor boxes are affected by the size of all real boxes in the data set. Therefore, for different data sets, it is necessary to select the appropriate anchor boxes in order to make the model training stable and accelerate the convergence.
In the process of clustering iteration, the K-means clustering algorithm [26] uses distance as the similarity index to find K classes in the given data set, and the center of each class is obtained as the mean value of all data points in the class. The standard K-means algorithm uses the Euclidean distance between two samples. The original YOLOv4-Tiny algorithm uses a distance based on IoU when clustering the candidate boxes on the MS COCO data set [27]. In this section, GIoU is used as the distance basis because IoU cannot distinguish different alignments between two boxes. As shown in Fig. 2, the aspect ratios of the two red boxes are different: the aspect ratio of the left red box is 2 and that of the right red box is 1.78, but the IoU of the red box and the blue box is 0.333 in both Fig. 2(a) and Fig. 2(b). Therefore, IoU alone cannot distinguish the difference in the aspect ratio of the two red boxes. GIoU takes the non-overlapping area into account: when the two boxes do not completely coincide, GIoU introduces a non-overlapping area term, which better reflects the shape information. If GIoU is used as the distance measure, the GIoU values in Fig. 2(a) and Fig. 2(b) are 0.0833 and 0.1759 respectively, so the two cases can be told apart.
The calculation formula is shown in equation (1).
where D(x_i) is the distance between the width and height (w_i, h_i) of each sample point x_i and the width and height (w_c, h_c) of the clustering center.
IoU is the intersection over union of the sample box and the cluster box, A represents the sample box, B denotes the cluster box, C is the smallest rectangular box that simultaneously encloses the two boxes A and B, and C \ (A ∪ B) represents the difference between the area of the rectangular box C and the union area of the two boxes A and B.
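The clustering described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the distance takes the form D = 1 − GIoU with GIoU = IoU − |C \ (A ∪ B)| / |C| (the standard definition from [24]), and that boxes are compared by width and height only, anchored at a shared corner as is common in YOLO-style anchor clustering. The function names `giou_wh` and `kmeans_anchors` are hypothetical.

```python
import numpy as np

def giou_wh(boxes, centers):
    """GIoU between (n, 2) boxes and (k, 2) centers given as (width, height).

    Assumption: every box is anchored at the same corner, so only the
    width and height matter.
    """
    w, h = boxes[:, None, 0], boxes[:, None, 1]
    wc, hc = centers[None, :, 0], centers[None, :, 1]
    inter = np.minimum(w, wc) * np.minimum(h, hc)      # |A ∩ B|
    union = w * h + wc * hc - inter                    # |A ∪ B|
    enclose = np.maximum(w, wc) * np.maximum(h, hc)    # |C|
    return inter / union - (enclose - union) / enclose

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """K-means on (width, height) pairs with the assumed distance 1 - GIoU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = (1.0 - giou_wh(boxes, centers)).argmin(axis=1)
        # Each center becomes the mean of its members; an empty cluster keeps its center.
        updated = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return centers[np.argsort(centers.prod(axis=1))]   # sorted by area

# Toy (width, height) samples in pixels; hypothetical values, not TT100K data.
samples = [[10, 10], [11, 11], [10, 11], [50, 50], [52, 50], [51, 49]]
anchors = kmeans_anchors(samples, k=2)
```

Under these assumptions, a 2 × 1 box compared with a 1 × 2 box has IoU = 1/3 and GIoU = 1/3 − 1/4 = 1/12 ≈ 0.0833, which illustrates how the extra term penalizes mismatched shapes that IoU alone scores identically.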

B. LARGE SCALE FEATURE MAP OPTIMIZATION STRATEGY
The output layer of the original YOLOv4-Tiny network produces feature maps at two scales, obtained by 32× and 16× down-sampling. The receptive field [29] of a feature-map element is the region of the input image that influences it, so the deeper the network layer, the larger the receptive field. Deep, low-resolution feature maps have larger receptive fields and are used to detect large targets. Shallow, high-resolution feature maps have smaller receptive fields and are rich in spatial information, so they are more suitable for detecting small targets. For the 608 × 608 input used in this paper, the feature map with a resolution of 19 × 19 is responsible for detecting large targets, and the feature map with a resolution of 38 × 38 is responsible for detecting small targets. Since the traffic sign data set contains many small targets, this section makes corresponding improvements.
The pooling layer MaxPool3 is deleted, so that the 32× down-sampling becomes 16× down-sampling, and the network path that originally followed layer CSP3 is re-routed to layer CSP2, so that the second output is 8× down-sampled. The sizes of the two output feature maps of the YOLOv4-Tiny detection network are thereby changed from the original 19 × 19 and 38 × 38 to 38 × 38 and 76 × 76, so as to improve the detection accuracy of small targets. The improved network structure is shown in Figure 3.
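The relation between down-sampling factor and output size can be checked with simple arithmetic. The following sketch assumes only that each output side length equals the input side length divided by the total stride; the helper name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, strides):
    """Side length of each output feature map for a square input."""
    return [input_size // s for s in strides]

original = feature_map_sizes(608, (32, 16))   # 32x and 16x down-sampling
improved = feature_map_sizes(608, (16, 8))    # 16x and 8x down-sampling

# Total number of grid cells over both output scales.
cells_original = sum(n * n for n in original)
cells_improved = sum(n * n for n in improved)
```

For a 608 × 608 input this gives 19 and 38 for the original strides and 38 and 76 for the modified ones, so the detection grid becomes four times denser (1805 cells versus 7220 cells), which is what gives small targets a finer spatial resolution.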

C. IMPROVED NMS ALGORITHM TO SCREEN PREDICTION BOXES
The idea of non-maximum suppression algorithm [30] is that in the post-processing stage of model detection, for a target object, the one with the highest confidence score among all prediction boxes of the target is selected as the benchmark prediction box, and a threshold is set. For the prediction box that overlaps with the benchmark prediction box, the prediction box whose overlap degree is greater than the threshold value is deleted, the prediction box whose overlap degree is less than the threshold value is retained, and all the prediction boxes without overlap are retained.
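The procedure above can be sketched as follows. This is a minimal illustration of standard greedy NMS, assuming boxes in (x1, y1, x2, y2) format; the function names, the example boxes and the threshold of 0.5 are hypothetical choices, not values from this paper.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (n, 4) array of boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, delete boxes overlapping it above the threshold."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        benchmark = order[0]                  # highest-scoring remaining box
        keep.append(int(benchmark))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[benchmark], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of about 0.82 and is deleted outright, regardless of whether it actually belongs to a different object.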
The NMS algorithm has obvious disadvantages. Firstly, it requires a manually set threshold, which is chosen by experience. Secondly, when similar targets are dense and the detected objects overlap heavily, the overlap between prediction boxes is high, and the NMS algorithm tends to delete a prediction box belonging to another target, resulting in missed detection. As shown in Figure 4, the two traffic signs, a 20 km/h speed limit and a prohibition of trucks, overlap heavily, and only one target box is retained after NMS processing, which leads to missed detection.
To address these problems of the NMS algorithm, this paper proposes an improved algorithm based on soft-NMS [25] to screen the prediction boxes of YOLOv4-Tiny. In the original soft-NMS algorithm, the confidence score of a prediction box whose overlap exceeds the threshold is reduced by the score reset function, rather than the box being deleted directly. With this method, when an unprocessed prediction box overlaps most of the benchmark prediction box, it receives a low confidence score. Conversely, if there is only a small amount of overlap, the original confidence score is not significantly affected.
The score reset function can be expressed in two forms. One is linear weighting, as shown in equation (4), which can be expressed as f_x(IoU(M, b_i)). The other is Gaussian weighting, as shown in equation (5), which is expressed as f_g(IoU(M, b_i)).
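To make the two weighting forms concrete, the sketch below implements soft-NMS with a pluggable decay function. It assumes the linear and Gaussian forms published with the original Soft-NMS method [25]: the linear form multiplies the score by (1 − IoU) once IoU reaches a threshold N_t, and the Gaussian form multiplies it by exp(−IoU² / σ). The function names, the value N_t = 0.3 and the score threshold of 0.001 are hypothetical choices. The reset function proposed in this paper is not sketched, because its exact form is not stated in this text; any callable with the same signature could be passed as `decay`.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (n, 4) array of boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def linear_decay(iou, nt=0.3):
    """Linear weighting: unchanged below the threshold nt, scaled by (1 - IoU) otherwise."""
    return np.where(iou < nt, 1.0, 1.0 - iou)

def gaussian_decay(iou, sigma=0.5):
    """Gaussian weighting: continuous decay exp(-IoU^2 / sigma)."""
    return np.exp(-(iou ** 2) / sigma)

def soft_nms(boxes, scores, decay=gaussian_decay, score_threshold=0.001):
    """Soft-NMS: lower the scores of overlapping boxes instead of deleting them."""
    scores = np.asarray(scores, dtype=float).copy()
    remaining = np.arange(len(scores))
    keep = []
    while remaining.size > 0:
        best = remaining[np.argmax(scores[remaining])]   # benchmark box M
        keep.append(int(best))
        remaining = remaining[remaining != best]
        if remaining.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[remaining])
        scores[remaining] *= decay(overlaps)             # score reset
        remaining = remaining[scores[remaining] > score_threshold]
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept, new_scores = soft_nms(boxes, scores)
```

With Gaussian weighting, the heavily overlapping second box survives with a reduced score instead of being discarded, so a later score threshold decides whether it is reported.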
In the experiments, Gaussian weighting is mostly used. Based on the Gaussian weighting function, a new score reset function is proposed in this paper, as shown in equation (6), which can be expressed as f_t(IoU(M, b_i)).
where A is a manually set coefficient. According to the experiments conducted by the authors of Soft-NMS, Soft-NMS performs better when σ = 0.5. To ensure that the original confidence score is not greatly affected when the overlap IoU(M, b_i) between the prediction box b_i and the benchmark prediction box M is small, experiments show that A = 2 gives the better effect. For these settings, the exponential parts of the score reset functions are compared in Figure 5.
As can be seen from Figure 5, when IoU(M, b_i) is small, the attenuation of the y value is small. When IoU(M, b_i) is large, f_t(IoU(M, b_i)) decreases more than f_g(IoU(M, b_i)), that is, the score reset function attenuates more strongly, which indicates that the larger the overlap IoU(M, b_i) between the prediction box b_i and the benchmark prediction box M, the more severely the confidence score of b_i is attenuated. For a prediction box with high overlap, this helps its confidence score decay below the score threshold, which speeds up the screening of prediction boxes.

IV. MODEL PERFORMANCE EVALUATION

A. DATA SET AND LABORATORY ENVIRONMENT
The Chinese traffic sign data set TT100K contains 9176 images with 221 annotation categories. The resolution of the images is 2048 × 2048. Because of this high resolution, the original images are cropped in this experiment, and the size of the cropped images is 608 × 608. Because the amount of data is seriously imbalanced among the categories in the data set, only the 45 categories of traffic signs with a large amount of data are selected for recognition in this experiment. Figure 6 shows the traffic sign categories in the TT100K data set, in which "*" stands for the different numbers of signs of the same type. For example, the speed limit sign "pl*" includes pl25, pl30, pl35, etc.
The experimental training environment configuration is as follows: Intel Xeon W-2135 CPU, NVIDIA GeForce RTX 2080TI graphics card, 32G memory, Windows10 operating

B. EXPERIMENTAL SCHEME AND RESULT ANALYSIS
This paper focuses on a light-weight traffic sign recognition model, which makes practical application possible. Taking YOLOv4-Tiny as the baseline, the proposed strategies are combined in different ways, and each combination is trained and evaluated on the TT100K data set, so as to improve the detection accuracy of the model as much as possible while ensuring real-time performance. The test results of the different models on the TT100K data set are shown in Table 2.
As shown in Table 2, on the TT100K data set the mean average precision (mAP) of the original YOLOv4-Tiny (experiment 1) is 46.34%, and the mAP with the improved clustering anchor boxes (experiment 2) is 47.16%, an increase of 0.82 percentage points, while the recall rate increases by 2.38 percentage points. Because the traffic sign data set contains a high proportion of small-scale targets, combining the large-scale feature map optimization strategy with the improved clustering anchor strategy achieves a good detection effect: in experiment 3 the mAP reaches 51.02%, which is 4.68 percentage points higher than in experiment 1, and the recall rate reaches 62.01%, which is 4.78 percentage points higher than in experiment 1. With this optimization strategy, the network makes fuller use of shallow feature information and learns small targets more thoroughly, which improves the detection performance of the model.
Considering the partial occlusion or overlap of traffic sign targets in the data set, the soft-NMS algorithm is introduced in experiment 4, and the mAP reaches 51.39%, a further improvement over experiment 3. Since soft-NMS has a positive effect on accuracy, experiment 5 adopts the improved soft-NMS strategy on this basis. Its mAP reaches 52.07%, which is 0.68 percentage points higher than experiment 4 and 5.73 percentage points higher than the original YOLOv4-Tiny (experiment 1). Its recall rate reaches 64.52%, which is 2.2 percentage points higher than experiment 4 and 7.29 percentage points higher than experiment 1. Moreover, the detection speed of the models is maintained at about 87 f/s, which is basically unaffected.
For comparison, the test results of experiment 1 and experiment 5 on the TT100K data set are counted, as shown in Table 3, where class represents the categories in the data set, AP1 represents the average precision (AP) value of each category in experiment 1, and AP2 represents the AP value of each category in experiment 5.
According to Table 3, the model of experiment 5, which uses the improved strategies of this paper, performs better than the model of experiment 1, which uses the original YOLOv4-Tiny. The average precision of more than 65% of the 45 categories in the TT100K data set (shown in black or red bold) is improved to varying degrees, and the AP values of 8 categories (shown in red bold) are improved by more than 15%, namely il100, il80, io, p11, p26, p3, pn and po. The PR curves of these 8 categories in experiment 1 and experiment 5 are shown in Figure 7.
In Figure 7, the red curve marked with circles is the PR curve of experiment 1 (the original YOLOv4-Tiny model) for each category, while the black curve marked with triangles is the PR curve of experiment 5 (the improved YOLOv4-Tiny model combining the optimization strategies proposed in this paper). For each traffic sign category, the area enclosed by the PR curve and the coordinate axes is the AP value of that category. The closer the curve lies to the upper right, the larger this area, the higher the corresponding AP value and the better the performance. As can be seen from the figure, the model of experiment 5 achieves good detection performance in all 8 traffic sign categories, and the AP value of each category is significantly improved compared with experiment 1.
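The area-under-curve reading of AP can be sketched numerically. The example below assumes the common all-point interpolation convention (the precision is first replaced by its monotone envelope, then integrated over recall); this convention is an assumption, since the exact evaluation protocol is not stated here, and the function names and sample curve are hypothetical.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under a PR curve using all-point interpolation (assumed convention)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Precision envelope: at each recall, the maximum precision at any higher recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_values):
    """mAP is the mean of the per-category AP values."""
    return float(np.mean(ap_values))

# Toy PR curve with two operating points, recall sorted in ascending order.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

For this toy curve the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75, and averaging such values over all categories gives the mAP.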
The overall detection performance and some other evaluation indicators of the improved YOLOv4-Tiny algorithm, YOLOv3-Tiny and YOLOv4-Tiny are compared. The results are shown in Table 4 and Table 5. As can be seen from Table 4, in terms of detection accuracy, the improved YOLOv4-Tiny algorithm is 7.92 and 5.73 percentage points higher than the YOLOv3-Tiny algorithm and the YOLOv4-Tiny algorithm respectively. In terms of detection speed and model size, it has obvious advantages over YOLOv3-Tiny.
It can be seen from Table 5 that some evaluation indicators of the improved YOLOv4-Tiny algorithm are clearly better than those of the other two models. Among them, the F1-measure is the harmonic mean of precision and recall, which integrates the two indicators to measure the quality of the model. The higher the F1-measure, the more effective the model. The detection results show that, compared with the original YOLOv4-Tiny algorithm, the proposed method has better target location and recognition effect and higher detection accuracy. Moreover, the selected test images are real road scenes, which also indicates the robustness of this method.
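As a small worked illustration of the F1-measure, the sketch below computes the harmonic mean of precision and recall. The recall of 0.6452 is the value reported for experiment 5, whereas the precision of 0.60 is a hypothetical placeholder used only to show the calculation; the function name is also hypothetical.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recall from experiment 5; the precision value is a hypothetical placeholder.
example_f1 = f1_measure(precision=0.60, recall=0.6452)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot obtain a high F1-measure by raising recall at the expense of precision, or the reverse.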

V. CONCLUSION
In this paper, based on the framework of the YOLOv4-Tiny algorithm and targeting both the characteristics of the traffic sign data set and the shortcomings of the original YOLOv4-Tiny algorithm in traffic sign detection, three improvement strategies are proposed: an improved K-means clustering algorithm that generates anchor boxes suited to the traffic sign data set, a large-scale feature map optimization strategy, and an improved soft-NMS algorithm that screens the prediction boxes to overcome the shortcomings of the NMS algorithm in the post-processing stage of the model. Together they improve the detection accuracy while ensuring real-time traffic sign recognition. The experiments show that the mAP and recall of the improved YOLOv4-Tiny algorithm on the TT100K data set are 5.73 and 7.29 percentage points higher than those of the original YOLOv4-Tiny model respectively, reaching 52.07% and 64.52%, which clearly improves the detection accuracy and recall rate and provides an experimental basis for the practical application of subsequent traffic sign recognition algorithms.