Multi-Feature Fusion and Enhancement Single Shot Detector for Traffic Sign Recognition

Road traffic sign detection and recognition play an important role in advanced driver assistance systems (ADAS) by providing real-time road sign perception information. In this paper, we propose an improved Single Shot Detector (SSD) algorithm based on multi-feature fusion and enhancement, named MF-SSD, for traffic sign recognition. First, low-level features are fused into high-level features to improve the detection of small targets in the SSD. We then enhance the features channel by channel, strengthening effective channel features and suppressing invalid ones. The algorithm obtains good results on domestic traffic signs captured in real scenes. The proposed MF-SSD algorithm is also evaluated on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. The experimental results show that MF-SSD has advantages in detecting small traffic signs. Compared with existing methods, it achieves higher detection accuracy, better efficiency, and better robustness in complex traffic environments.


The associate editor coordinating the review of this manuscript and approving it for publication was Jiankang Zhang.

I. INTRODUCTION
The detection and recognition of road traffic signs are important in advanced driver assistance systems (ADAS) [1] for enhancing driving safety. Traffic signs usually have specific shapes (circles, squares, and triangles) and colors (red, blue, and yellow) that stand out visually in road environments, so traffic sign detection methods can be divided into color-based and shape-based methods [2]-[5]. In color-based methods, RGB images are usually converted into other color spaces, such as HSI [6], CIELab [7], and HSL [8]. The traffic signs are then extracted via color threshold segmentation through intelligent data processing [9]. Color-based detection methods are usually vulnerable to complex lighting conditions in the traffic scene. In shape-based traffic sign detection, the geometric contour of the traffic sign is detected by geometric symmetry [7], [8], [10], [12]. Compared with template matching in complex lighting environments, geometric moment invariant detection has better adaptability, but requires higher computational complexity. Nevertheless, the recognition rate of these methods should be further improved.
In recent years, deep convolutional neural networks (CNNs) for feature extraction have received much attention [13]-[15]. Benchmark works include GTSRB [16] and GTSDB [17]. The faster region-based CNN (Faster R-CNN) [18] is a representative two-stage target detection framework that has become popular for object detection, but it still has difficulty detecting small objects. Some new methods have also been proposed recently to identify traffic signs [19]-[21].
The recognition of small traffic signs plays an important role in the safety of intelligent transportation systems (ITS), but it is difficult: a small sign occupies only a small proportion of the image, so it suffers from low resolution and noise. For instance, although detectors evaluated on PASCAL VOC and MS COCO achieve satisfactory performance for large objects, small object detection is still a challenge [22]. In this paper, we propose a small traffic sign recognition method that differs from previous methods based on the GTSRB and GTSDB datasets.
The reasons for the difficulty of small target detection are summarized below.
A small target occupies fewer pixels with fewer features and is difficult to detect.
In CNN methods, low-level features contain more small-target information but less semantic information, whereas high-level features contain abundant semantic information but less small-target information. Consequently, small targets are not easy to detect.
High computational complexity is required to detect small objects.
The SSD algorithm is not effective in detecting small objects. To improve the detection accuracy and speed for small objects, this paper proposes an improved SSD algorithm that jointly exploits feature fusion and feature enhancement, named MF-SSD. The proposed traffic sign detection process is shown in Fig. 2. We fuse low-level features into high-level features to enhance the detection performance and detection efficiency of small targets in SSD.

II. RELATED WORK
In recent years, deep convolutional neural networks have been successfully applied to object recognition and target detection, with AlexNet being the representative case [23]. In 2012, Krizhevsky et al. demonstrated the CNN's ability to significantly improve image classification accuracy in the ImageNet Large Scale Visual Recognition Challenge. Inspired by AlexNet, Girshick et al. [24] proposed a deep learning model named R-CNN, which has also been applied to target detection. The model first uses a selective search algorithm to compute the candidate regions of an image and then feeds all candidate regions into the R-CNN model. Features are extracted from each candidate region and classification is completed by an SVM [25]. Moreover, the model uses a bounding box regression algorithm to calculate the coordinates of candidate regions and is tested on the PASCAL VOC target detection set. The average accuracy is about 20% higher than that of non-neural-network algorithms.
Model pre-training is also applied in the above method. First, the network weights are initialized on a subset of ImageNet, and the network is then fine-tuned on the PASCAL VOC dataset. Although the accuracy of R-CNN is greatly improved in this way, the computational cost is high because there are about 2,000 candidate regions in each image. In the application of SPPnet [26] to target detection, Microsoft Research Asia first computes the positions of the candidate regions mapped onto the feature map of the highest convolution layer; a pooling layer based on the SPP algorithm is then used to reduce the dimension, and finally a feature layer of a specific size is obtained. Although its accuracy is similar to that of R-CNN, the running time is greatly reduced. In 2015, Ross Girshick further combined the idea of SPPnet with R-CNN to propose a convolutional neural network model, Faster R-CNN [18], and replaced the SVM classifier with softmax [27] regression to reduce the space and time overhead. Training no longer needs to be carried out in separate stages, and the detection process is more efficient and accurate. After training and testing on the GPU, the experimental results show that the extraction time of candidate regions is significantly shortened, the detection time is shortened to one-tenth, and the classification accuracy is increased.
In 2016, Liu et al. combined the structure of the YOLO network with Faster R-CNN and proposed the SSD (Single Shot MultiBox Detector) target detection algorithm [28], [29]. The SSD network is much faster than Faster R-CNN, and its working mode is significantly different. Faster R-CNN [18] uses region proposals to generate candidate regions and uses a classification algorithm to generate target boxes in each candidate region. In contrast, the SSD algorithm generates target bounding boxes of various sizes directly on the whole image and uses non-maximum suppression to merge highly overlapping bounding boxes into one. Candidate region generation is thus turned into a linear regression problem of finding the predicted box closest to the target, which improves the calculation speed and accuracy.
In 2017, SENet and the SE module were proposed [30]. SENet enables the network to enhance effective channel features and suppress invalid channel features according to global information. The SE module is not a complete network structure but a sub-structure that can be embedded in other classification or detection networks. Embedding the SE module into the ResNet network, as reported in the literature, took first place in the ILSVRC 2017 classification task. The SE module works by learning feature weights according to global information, so that the weights of effective channel features increase and the weights of ineffective channel features decrease. Although embedding the SE module in the original classification or detection model adds some parameters and computation, the additional cost is very small.
Recently, some small object detection methods based on the original Faster R-CNN have been proposed, e.g., multi-scale input [31], multi-scale detectors [32], [33], multi-task learning [34], [35], and multi-scale features [36]-[38]. However, these methods easily lead to heavy computation time in the training stage. To enhance the representation of small objects in the feature map, the multi-scale input method [31] produces a high-resolution feature map. In references [32] and [33], a multi-scale detector is used to extract features from multiple consecutive layers to increase context information. However, the multi-scale detector also increases the computation cost in the training and testing stages. In references [34] and [35], multi-task learning is used to improve detection performance. However, the feature map is only the output of the last layer, and the information it contains is not sufficient for small object detection. By combining the features of different layers, the representation of small objects in the feature map can be effectively enhanced. For this reason, the multi-scale feature method [36]-[38] has attracted more attention than other methods in the field of small object detection.
Most of the existing SSD improvement algorithms are based on feature fusion. In reference [39], the RSSD network structure is proposed, in which low-level features are fused into high-level features and high-level features are fused into low-level features. Reference [40] studies the FPN network: features are extracted from the image bottom-up to construct a set of pyramid features, and feature fusion is then realized by up-sampling, which improves the target detection accuracy. In reference [41], multiple low-level features are fused to enhance SSD detection of small targets.

III. OUR PROPOSED APPROACH
Figure 1 shows the scheme of SSD, while the proposed MF-SSD, which is significantly modified to improve the performance of small traffic sign recognition, is illustrated in Fig. 2. The overall framework of the algorithm is shown in Fig. 2. On the basis of the SSD framework, a feature fusion layer is added, and the SE module is added to the feature extraction layer after fusion. The detailed process is described next. Note that the feature fusion method fuses low-level features into high-level features. In MF-SSD, the features at conv4_3 and fc7 are taken as low-level features, and they are fused into the features at fc7, conv6_2, conv7_2, conv8_2, and conv9_2 by a pooling operation. The subsequent feature fusion steps are performed in a similar manner. The advantage lies in the shared pooling features, which strengthen the relationship between layers. In the following, we use the feature fusion from conv4_3 to fc7 as an example to illustrate the details.
As shown in Fig. 4, $X \in \mathbb{R}^{W_1 \times H_1 \times C_1}$ represents the feature map at conv4_3 and $Y \in \mathbb{R}^{W_2 \times H_2 \times C_2}$ represents the feature map at fc7. $X$ is transformed into $U \in \mathbb{R}^{W_2 \times H_2 \times C_2}$ through a pooling operation, and $U$ and $Y$ are then combined into $Z \in \mathbb{R}^{W_2 \times H_2 \times C_3}$ through a concatenation operation. In the pooling operation, $U_c$ denotes channel $c$ of the feature map $U$, and $U_c(x, y)$ denotes its value at width $x$ and height $y$; $X_c$ and $X_c(i, j)$ are defined analogously. The pooling kernel is $k \times k$, the stride is $k$, and the padding is 0.
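The pooling step can be illustrated with a small sketch. It is only an illustration under stated assumptions: the text does not say which pooling type is used, so max pooling is assumed here, and the names `FeatureMap` and `max_pool` are hypothetical. The sketch pools each channel independently with a $k \times k$ kernel, stride $k$, and no padding, so it models only the change in spatial size, not any change in channel count.

```python
from typing import List

# A feature map stored as nested lists, indexed [channel][row][col].
FeatureMap = List[List[List[float]]]


def max_pool(x: FeatureMap, k: int) -> FeatureMap:
    """Apply k x k max pooling with stride k and padding 0 to every channel.

    Assumption: max pooling. Rows and columns that do not fill a complete
    k x k window are dropped, which is what padding 0 implies.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    pooled = []
    for channel in x:
        rows = len(channel) // k
        cols = len(channel[0]) // k
        pooled.append([
            [
                max(channel[r * k + i][c * k + j]
                    for i in range(k) for j in range(k))
                for c in range(cols)
            ]
            for r in range(rows)
        ])
    return pooled


# A single-channel 4 x 4 map pooled with k = 2 becomes a 2 x 2 map.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]]
u = max_pool(x, 2)
print(u)
```

With $k = 2$ the spatial size is halved in each direction, which is the kind of reduction needed to bring a larger low-level map down to the size of a smaller high-level map before the two are combined.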
In the concatenation operation, $\bullet$ denotes concatenation: $F_{conv4\_3}$ and $F_{fc7}$ are concatenated along the channel dimension, giving $Z \in \mathbb{R}^{W_2 \times H_2 \times C_3}$. After feature fusion, the features at fc7, conv6_2, conv7_2, conv8_2, and conv9_2, which are rich in semantic information, additionally contain low-level features. Because low-level features carry more small-target information, adding them helps the subsequent detection of small targets. Note that $U$ and $Y$ need batch normalization (BN) before concatenation so that the different feature maps have consistent scales. The batch normalization algorithm is listed in Table 1.
The principle of the SE feature enhancement module is as follows. Figure 5 illustrates the process of using the SE module to enhance the features at conv4_3: the feature $U$ at conv4_3 is enhanced to $Y$. The operation that converts feature $X$ into feature $U$ is a convolution operation that belongs to SSD itself.
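For the fusion step, the normalization and channel-wise concatenation of $U$ and $Y$ can be sketched as follows. This is a simplified illustration rather than the algorithm of Table 1: the statistics are computed from a single feature map instead of a mini-batch, the learned scale and shift of batch normalization are fixed to 1 and 0, and the names `normalize_channels` and `concat_channels` are hypothetical.

```python
import math
from typing import List

# A feature map stored as nested lists, indexed [channel][row][col].
FeatureMap = List[List[List[float]]]


def normalize_channels(x: FeatureMap, eps: float = 1e-5) -> FeatureMap:
    """Standardize each channel to zero mean and (almost) unit variance.

    Simplification of batch normalization: per-map statistics, with the
    learned scale (gamma) fixed to 1 and the learned shift (beta) to 0.
    """
    out = []
    for channel in x:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([[(v - mean) * scale for v in row] for row in channel])
    return out


def concat_channels(u: FeatureMap, y: FeatureMap) -> FeatureMap:
    """Concatenate two maps of equal spatial size along the channel axis."""
    if (len(u[0]), len(u[0][0])) != (len(y[0]), len(y[0][0])):
        raise ValueError("spatial sizes must match before concatenation")
    return u + y


# Two maps with very different value ranges (illustrative numbers only).
u = [[[0.0, 100.0], [200.0, 300.0]]]          # 1 channel, large values
y = [[[0.1, 0.2], [0.3, 0.4]],                # 2 channels, small values
     [[1.0, 1.0], [2.0, 2.0]]]
z = concat_channels(normalize_channels(u), normalize_channels(y))
print(len(z), len(z[0]), len(z[0][0]))
```

Without the normalization, the channels coming from `u` would dominate the concatenated map simply because their values are larger, which is the scale mismatch the text refers to.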
The operation that converts feature $U$ into feature $Y$ is the Squeeze-and-Excitation (SE) enhancement operation. After enhancement by the SE module, the size of feature $U$ is unchanged, while the weights of its different channels are changed. Finally, the enhanced feature $Y$ is sent to the detector for subsequent classification and detection.
$F_{conv\_3}$ is the convolution operation of conv4_3, in which the convolution kernel is $3 \times 3$, the padding is 1, and the stride is 1. $X$ and $U$ are both three-dimensional matrices of size $W \times H \times C$, so the size of the feature map is unchanged by the conv4_3 convolution operation.
The formula of $F_{conv\_3}$ is as follows:
$$u_c = v_c * X = \sum_{s=1}^{C} v_c^{s} * X^{s}$$
where $v_c$ denotes the $c$-th convolution kernel, $X^{s}$ denotes the $s$-th channel of the input, $U$ denotes the three-dimensional matrix with $C$ channels of size $W \times H$, $u_c$ denotes the two-dimensional matrix of the $c$-th channel of $U$, and $C$ denotes the number of channels. The operation after $F_{conv\_3}$ is the squeeze operation, for which global average pooling is adopted. The formula is as follows:
$$z_c = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)$$
The number of channels of $z$ is $C$, the same as the number of channels of the input feature map. $z$ carries global information to some extent, representing the response intensity of each channel in the feature map.
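The squeeze step above, together with the excitation and channel re-weighting steps of the SE module, can be sketched in a few lines. This is a minimal sketch under assumptions: the activation between the two fully connected layers is taken to be ReLU, as in the original SENet design; the weight matrices `w1` and `w2` are hypothetical fixed values (in a trained network they are learned); and all function names are illustrative.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed [channel][row][col]
Matrix = List[List[float]]


def squeeze(u: FeatureMap) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in u]


def matvec(w: Matrix, v: List[float]) -> List[float]:
    """Fully connected layer without bias: matrix-vector product."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def excite(z: List[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Two fully connected layers, ReLU in between (assumed), then sigmoid."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]


def scale(u: FeatureMap, s: List[float]) -> FeatureMap:
    """Multiply every channel of u by its weight in s."""
    return [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(u, s)]


def se_block(u: FeatureMap, w1: Matrix, w2: Matrix) -> FeatureMap:
    return scale(u, excite(squeeze(u), w1, w2))


# Two-channel 2 x 2 example with hand-picked (hypothetical) weights.
u = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]        # C = 2 channels reduced to 1 hidden unit
w2 = [[1.0], [-1.0]]     # 1 hidden unit expanded back to C = 2
s = excite(squeeze(u), w1, w2)
y = se_block(u, w1, w2)
print(s)
```

The output has the same size as the input; only the per-channel weights differ, so with these weights the first channel is largely kept and the second is largely suppressed.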
The squeeze operation is followed by an excitation operation, which converts the second matrix in Fig. 5 into the third matrix in Fig. 5. The excitation formula is as follows:
$$s = \sigma(W_2 \delta(W_1 z))$$
where $z$ is the output of the preceding squeeze operation, $W_1 z$ denotes a fully connected layer, and $W_2 \delta(W_1 z)$ denotes a second fully connected layer applied to the activated output of the first. Finally, the sigmoid function $\sigma$ is applied to the result, and $s$ of dimension $1 \times 1 \times C$ is obtained. The final operation multiplies $U$ by $s$ as a channel weight. The formula is as follows:
$$y_c = s_c \cdot u_c$$
where $s_c$ denotes the $c$-th element of $s$.
Model evaluation index: precision ($P$), recall ($R$), mean Average Precision (mAP), and the F1-measure are used. mAP is the mean over all classes of the average precision. The F1-measure is
$$F_1 = \frac{2PR}{P + R}$$
where
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} \tag{12}$$
and $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
As far as the location of the target is concerned, it is necessary to introduce the Intersection over Union (IoU) to decide whether a detection counts as a positive or a negative case. The formula is as follows:
$$IoU = \frac{Gt \cap Dr}{Gt \cup Dr}$$
where $Gt \cap Dr$ is the intersection of $Gt$ and $Dr$, and $Gt \cup Dr$ is the union of $Gt$ and $Dr$. The range of IoU is 0 to 1. In this paper, the IoU threshold is set to 0.5. Once the detection position and label location are obtained, the target position is determined accordingly.
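The evaluation quantities can be made concrete with a short sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, reads $Gt$ and $Dr$ as the ground-truth box and the detection result, and treats a detection as positive when its IoU reaches the 0.5 threshold; these conventions and the function names are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

IOU_THRESHOLD = 0.5  # threshold stated in the text


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(gt: Box, dr: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(gt[2], dr[2]) - max(gt[0], dr[0]))
    inter_h = max(0.0, min(gt[3], dr[3]) - max(gt[1], dr[1]))
    inter = inter_w * inter_h
    union = area(gt) + area(dr) - inter
    return inter / union if union > 0 else 0.0


def is_positive(gt: Box, dr: Box) -> bool:
    """Assumed convention: IoU at or above the threshold is a positive."""
    return iou(gt, dr) >= IOU_THRESHOLD


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gt = (0.0, 0.0, 10.0, 10.0)
print(iou(gt, (5.0, 0.0, 15.0, 10.0)))   # half-overlapping box
print(iou(gt, (2.0, 0.0, 12.0, 10.0)))   # mostly overlapping box
print(precision_recall_f1(tp=8, fp=2, fn=2))
```

In the example, the first detection overlaps the ground truth by one third and is rejected, while the second overlaps by two thirds and is accepted; the counts of accepted and rejected detections then feed the precision, recall, and F1 formulas.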

A. DATASETS
Two datasets are used in this paper. One is a domestic (Chinese) traffic sign dataset, which contains 1,465 pictures taken from real city scenes. At present, seven classes of image samples are labeled, as shown in Fig. 6: right, straight, stop, nohonk, crosswalk, left, and background. The other is the German Traffic Sign Detection Benchmark (GTSDB) (Fig. 7), with 1,000 pictures and 43 classes of signs, as shown in Fig. 8. To test the detection effect of MF-SSD on each kind of traffic sign and to evaluate our method for both small and large traffic sign detection, we divided the traffic signs into three size groups: small (0-32 pixels), medium (32-96 pixels), and large (96-200 pixels). In addition, it is worth noting that all the traffic signs used occupy less than 1% of the original image.
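The size grouping can be written as a small helper. The text does not say which dimension of a sign is measured or how the shared boundaries (32 and 96 pixels) are assigned, so this sketch assumes the longer side of the bounding box is used and that each boundary value belongs to the larger group; both choices, and the name `size_group`, are assumptions.

```python
def size_group(width: float, height: float) -> str:
    """Assign a traffic sign to a size group by its longer side in pixels.

    Assumed boundaries: small is below 32, medium is 32 up to (not
    including) 96, large is 96 up to and including 200. Sizes above 200
    fall outside the groups defined in the text.
    """
    size = max(width, height)
    if size <= 0:
        raise ValueError("width and height must be positive")
    if size < 32:
        return "small"
    if size < 96:
        return "medium"
    if size <= 200:
        return "large"
    return "unassigned"


for w, h in [(20, 25), (32, 10), (96, 96), (250, 10)]:
    print((w, h), size_group(w, h))
```

Grouping the test signs this way makes it possible to report precision and recall separately for each size group, which is how the small-sign results are compared later in the paper.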

B. DETECTION PERFORMANCE
As seen in Fig. 9, our model obtains good detection results for Chinese traffic signs.
We use the GTSDB dataset for the comparative experiments. Table 2 provides detailed test indicators for each of the five methods, demonstrating that MF-SSD achieves the best performance in most categories. The experiments were run on a Linux PC with an Intel Core i5-8400K, 8 GB of memory, and one GeForce GTX 1060 GPU.
We evaluated traffic sign detection performance in terms of recall and accuracy.
Examples of detection using five different models in a road scene are illustrated in Fig. 10. All detections are correct in the examples. As can be seen from the figure, our method has the highest detection rate.
Next, we divided the traffic signs into three size categories: small (0-32 pixels), medium (32-96 pixels), and large (96-200 pixels). For more intuitive comparisons, we also use the F1-measure as an additional metric. To verify the effectiveness of the method, we compared MF-SSD with SSD, Faster R-CNN, and FSSD [40]. Faster R-CNN is a detection method for multi-scale objects that achieves good performance on the MS COCO and PASCAL VOC datasets. FSSD, as proposed by Zuoxin Li and Fuqiang Zhou, has higher accuracy and speed than the conventional SSD by a large margin. Table 3 compares the performance of these methods on the different traffic sign size groups. The precision values obtained by the proposed MF-SSD for the small and medium sizes are 28.8 and 67.5, respectively. The precision value for the large size is 82.6, which is superior to that of the other methods. This shows that MF-SSD can accurately identify small traffic signs as well as medium and large ones. Figure 11 shows partial visualization results on the test dataset under different weather conditions. Each traffic sign instance is very small, accounting for less than 1% of the whole scene; nevertheless, our method can recognize them accurately.
VOLUME 8, 2020
The German Traffic Sign Detection Benchmark (GTSDB) used in this paper is well accepted and widely used for comparing traffic sign detection methods in the literature. GTSDB includes natural traffic scenes recorded on various types of roads (roads, villages, cities) during daytime and dusk and under numerous weather conditions. The dataset consists of 900 complete images containing 1,206 traffic signs, which are divided into 600 training images (846 traffic signs) and 300 test images (360 traffic signs). Figure 6 shows some images of the dataset. We divided the traffic signs into three size categories, small (0-32 pixels), medium (32-96 pixels), and large (96-200 pixels), and tested the detection effect of MF-SSD on each category.

V. CONCLUSION
To improve small target detection performance, this paper proposed an improved algorithm named MF-SSD, which combines low-level features with high-level features and adds the SE module to improve the detection accuracy. The experimental results verified that the proposed method outperforms conventional methods in detecting small objects with respect to detection accuracy and efficiency. Although our method greatly improves the detection of small targets, there is still considerable room to improve its accuracy for real-time applications. In future work, we will continue to improve the algorithm and strive to apply the framework to the domestic traffic sign dataset to achieve real-time application.
YUSHENG FU received the bachelor's degree in avionics engineering from the Air Force Engineering University, Xi'an, China, in 1995, and the master's and Ph.D. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 2000 and 2004, respectively. Over the past five years, a total of 10 million yuan has been spent on scientific research, with an average annual expenditure of more than 1 million yuan. More than ten academic articles have been published, including three SCI articles and more than ten EI articles. His research over the past five years has mainly focused on signal processing, aeroelectronics, biomedical electronics engineering, and, more recently, network science and technology. He is a Reviewer of Circuit System and Signal Processing and other academic journals.
CHUNHUI REN received the bachelor's, master's, and Ph.D. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 1992, 1998, and 2006, respectively. She has published more than ten academic articles, including several articles published in SCI (including SCIE) source journals and several EI-retrieved journals and conference papers. In recent years, her research has mainly focused on electronic countermeasures, statistical signal processing, non-cooperative signal processing, and other fields.
XIN XIANG received the Ph.D. degree from Xidian University, Xi'an, China. He is currently a Professor with Air Force Engineering University, Xi'an. His research interest is in communication signal processing.