SADANet: Integrating Scale-Aware and Domain Adaptive Networks for Traffic Sign Detection

Enabling traffic sign detection in real environments is of great importance for large-scale traffic scene understanding. The main difficulty is the large intra-category variance in features caused by instances of different spatial scales and appearances. In this paper, a traffic sign detection framework using a scale-aware and domain adaptive network (SADANet) is proposed, which seamlessly combines a multiscale prediction network (MSPN) with a domain adaptive network (DAN) in a tightly coupled manner to tackle this challenge. Specifically, the MSPN extracts multiscale features: it fully utilizes low-level location and high-level semantic information, and incorporates both context information and instance-specific content awareness into the scale transformation. The DAN makes features domain invariant without labeled test data: it effectively aligns the domain distributions at different scales by leveraging the mapping relationship between the image representation and the multiscale features. Experimental results show that SADANet is effective for traffic sign detection and is competitive with state-of-the-art methods.


I. INTRODUCTION
Intelligent transportation is dedicated to building a safe, convenient, efficient, and green transportation system. The development of multisource traffic big data makes the implementation of intelligent transportation systems (ITS) feasible, and much relevant research has emerged [1]-[5]. Traffic scene perception and understanding based on image data are essential components of intelligent vehicles and infrastructure, as well as of modern traffic management.
Traffic sign detection is one of the critical tasks in traffic scene understanding. Its purpose is to localize and classify all traffic sign instances in images collected by a camera. Although the advancement of intelligent transportation has brought much attention to research on traffic sign detection [6]-[8], detecting traffic signs in the wild remains an open issue, owing to the different scales of targets and the complexity of application scenarios. (The associate editor coordinating the review of this manuscript and approving it for publication was Xiaobo Qu.)
The first challenge is scale variation. Scale problems are pervasive in object detection because the distance between camera and target varies. Large traffic sign instances generally exhibit completely different visual features from small ones, which considerably affects detection performance (Fig. 1(a)). To address scale variance, detection methods based on a single feature map [9], [10], a pyramidal feature hierarchy [11], and the feature pyramid network [12] were proposed successively. Although existing methods partly satisfy the requirements of multiscale object detection, they lose geometric details during feature extraction and overlook context information and feature consistency in the scale transformation, which reduces the expressiveness of the features and is adverse to the detection of multiscale objects.
Another challenge is the dataset bias caused by large variance in backgrounds, illumination, and image quality. As the standard approach in object detection, deep neural networks remain dependent on abundant annotated data for supervised learning. The complexity and randomness of real traffic environments, however, cause a considerable domain shift between the training and test data. Such domain inconsistency harms model generalization and causes a significant performance decrease, as shown in Fig. 1(b). Although collecting additional training data can improve performance, generating annotations is time consuming and expensive. Various unsupervised domain adaptation methods, which alleviate the impact of domain shift without using ground-truth labels in the target domain, have been proposed to address this problem [13]-[16]. However, little of this research has addressed object detection, and existing domain adaptation models for object detection are based on a single feature map, overlooking the influence of scale awareness on the image-level representation.
In this paper, a scale-aware and domain adaptive network (SADANet) is proposed to achieve traffic sign detection in real, fluctuating traffic scenes. This simple yet effective framework integrates two components, one for each of the problems above. The multiscale prediction network (MSPN) obtains multiscale feature maps. To reduce the loss of detailed information during feature extraction, low-level geometric and high-level semantic features are densely connected. Meanwhile, context information and instance-specific content awareness are aggregated in the scale transformation to avoid semantic inconsistency of the features across scales. An adaptive feature weighting model (AFWM) is introduced to better exploit the region of interest (ROI) features from the different feature branches, providing better ROI features for subsequent location regression and classification.
The domain adaptation network (DAN) addresses the cross-domain object detection problem through unsupervised domain adaptation. Based on the covariate shift assumption, the divergence between the two domains is minimized at the image and object levels, and the domain distributions at different scales are effectively aligned by leveraging the mapping relationship between the image representation and the multiscale features. In each component and each branch, a domain classifier is trained, and an adversarial training strategy is adopted to optimize the feature extraction parameters of the MSPN, which thereby learns robust, domain-invariant features.
The main contributions of this work can be summarized as follows:
1. A new multiscale feature extraction network that effectively integrates low-level location information and high-level semantic information, and combines context information and instance awareness in the scale transformation, providing more expressive features for object detection.
2. An adaptive feature weighting model that adaptively fuses the ROI features from the different feature-level branches, providing more effective ROI features for object classification and location regression.
3. A domain adaptation model with two domain adaptation components at the image and object levels, in which the multiscale image-level branches are designed to effectively alleviate the domain conflict.
The rest of this paper is organized as follows. Section II briefly reviews previous work on multiscale object detection, domain adaptation, and traffic sign detection. Section III presents the details of the proposed scale-aware and domain adaptive network (SADANet). Section IV reports a series of experiments and their results. Finally, Section V concludes the paper.

II. RELATED WORK

A. MULTISCALE DETECTION METHODS
Handling multiple scales is one of the most effective techniques for improving detection accuracy. Fast R-CNN, Faster R-CNN, and R-FCN [17] use convolutional features from the top layer to detect objects via region proposals of different scales. However, the receptive field of each layer is fixed, so predicting objects of different scales from a single-level feature map is inappropriate. The single-shot multibox detector (SSD [11]) predicts from multiscale feature maps taken from different layers, but its "multiscale" capability is limited: it is powerless for extremely small objects. The feature pyramid network (FPN [12]) uses a top-down architecture to fuse feature information from different layers. Zhou et al. [18] proposed a new scale transformation layer and developed a scalable detection network. Tang et al. [19] proposed DeFusionNet for defocus blur detection, which extracts multiscale features by repeatedly fusing and refining multiscale deep features. Li et al. [20] proposed a trident network (TridentNet) with three branches to generate scale-specific feature maps. In the feature fusion process, these methods seriously lose the geometric details of the object, which reduces the representational power of the multiscale features and is adverse to the detection of small objects. They are therefore not well suited to traffic sign detection.

B. DOMAIN ADAPTATION FOR OBJECT DETECTION
Domain adaptation is a common requirement in computer vision. Cross-domain learning with deep representations has been studied extensively, especially for unsupervised settings, but domain adaptation for object detection is still in its infancy. Chen et al. [21] proposed a domain adaptation model based on Faster R-CNN, which reduces domain shift by aligning the distributions of the source and target domains at the image and instance levels. RoyChowdhury et al. [22] used trackers to generate pseudo-labels for self-training, adapting existing object detectors to new target domains without supervision. Kim et al. [23] introduced a learning paradigm for object detection combining domain diversification (DD) and multidomain-invariant representation learning (MRL) to alleviate long-standing limitations of domain adaptive methods. Chen et al. [24] proposed a progressive feature alignment network (PFAN) to gradually and effectively align discriminative features across domains by exploiting intra-class variation in the target domain. Saito et al. [25] proposed aligning the source and target distributions using task-specific decision boundaries. However, existing domain adaptation models for object detection are based on a single feature map, which overlooks the influence of scale awareness on the image representation.

C. TRAFFIC SIGN DETECTION
Traffic sign detection has shifted entirely from traditional methods to deep learning. Sermanet and LeCun [6] first applied a convolutional neural network (CNN) with a multiscale architecture to traffic sign classification and achieved excellent results. In [7] and [8], CNN methods were applied to traffic sign detection. Zhu et al. [26] used a fully convolutional network (FCN) [27] and Edge Boxes [28] to extract rough and fine regions of interest, respectively, and used a CNN to classify the regions. Yang et al. [29] introduced an attention network (AN) into Fast R-CNN to obtain regions of interest, producing detection results on the feature maps extracted by the AN. Tabernik and Skočaj [30] improved on Mask R-CNN [31], effectively raising the detection accuracy for small traffic signs. Kumar [32] applied the capsule network [33] to improve detection of blurred, rotated, and deformed traffic signs. Li and Wang [34] combined Faster R-CNN with MobileNets [35] to speed up detection, and designed an efficient and robust asymmetric convolutional-kernel CNN to recognize traffic signs. Tian et al. [36] introduced an attention mechanism into the traffic sign detection task and improved accuracy by incorporating local context information. In [37], traffic signs were regarded as small targets in a specific mode, and detection was cast as a region-sequence classification and regression task; the local sequence of regions was modeled explicitly with an attention mechanism to obtain more context information and improve accuracy. Pei et al. [38] proposed a multiscale deconvolutional network (MDN) for real-time traffic sign detection based on a conditional random field (CRF) network and an improved FPN. Zhang et al. [39] proposed an end-to-end detection network for Chinese traffic signs based on YOLO9000 [40].
To reduce the computational complexity of the top convolution layer, multiple 1 × 1 convolutions were used to segment the input image into a dense grid, which improved the detection performance for small-scale traffic signs.
Although the above methods achieve excellent performance, they are all trained on abundant annotated data and tested on the same dataset. In real scenes, traffic conditions are highly random, which makes traffic sign detection difficult and causes a significant performance decrease. Making traffic sign detection algorithms applicable to real traffic environments therefore requires further study.

III. PROPOSED METHOD
The overall architecture of the proposed SADANet is illustrated in Fig. 2. It seamlessly combines the multiscale prediction network (MSPN) with the domain adaptation network (DAN). The MSPN extracts multiscale features from the input image, and the DAN alleviates domain conflict by learning robust, domain-invariant features. The architectures of the MSPN and DAN, together with implementation details, are described in the following subsections.

A. MULTISCALE PREDICTION NETWORK (MSPN)
The scales of object instances can vary over a wide range, which is one of the core challenges of traffic sign detection. Recently, the scale problem has typically been addressed by integrating information across feature maps at all scales. However, these methods lose geometric details in feature extraction and ignore context information and feature consistency in the scale transformation, which reduces the expressiveness of the features. To address these shortcomings, a multiscale prediction network (MSPN) was developed; its structure is shown in Fig. 3.

1) BASE NETWORK
We adopt densely connected convolutional networks (DenseNet [41]) as the base network. Increasing the number of layers in a deep network enriches the semantic information of deep-level features, which are robust to changes in object pose, occlusion, and local deformation, but it also adds many parameters to be learned. Short connections between layers close to the input and output have been shown to improve effectiveness without expanding the network depth. DenseNet naturally integrates low-level and high-level features within a CNN in a feed-forward manner. This feature-reuse scheme not only improves the flow of extracted features but also substantially reduces the number of parameters.
DenseNet is built from densely connected dense blocks. In each dense block, the input of the l-th layer contains the outputs of all preceding layers, as shown in (1), and its own output is fed to all subsequent layers:

F_l = (BN + ReLU + Conv_3×3)([F_0, F_1, ..., F_{l−1}])    (1)

where [F_0, F_1, ..., F_{l−1}] refers to the concatenation of the feature maps produced in layers 0, 1, ..., l−1, F_l is the output feature map of the l-th layer, and (BN + ReLU + Conv_3×3) is a composite function of batch normalization (BN) [42], rectified linear units (ReLU) [43], and a 3 × 3 convolution layer (Conv_3×3). Integrating low-level and high-level features in this way yields more representative and comprehensive features, preserving both semantic and detail information, which makes the detection network more accurate and efficient. DenseNet-169 was chosen for feature extraction, and the stem block before the first dense block was redesigned for the traffic sign detection application: the input 7 × 7 convolution layer was replaced with two 3 × 3 convolution layers, and the following 3 × 3 max pooling layer was replaced with a 2 × 2 mean pooling layer. Table 1 shows the stem block in detail. Experiments show that using DenseNet-169 as the base network significantly improves traffic sign detection accuracy (Table 3).
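The dense connectivity of Eq. (1) can be sketched as follows. This is a toy NumPy illustration, not DenseNet-169 itself: a random linear map plus ReLU stands in for the BN + ReLU + Conv composite, and the shapes and `growth_rate` are illustrative only.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Toy dense block: layer l consumes [F_0, ..., F_{l-1}] (Eq. (1))."""
    features = [x]                                  # F_0 is the block input
    for _ in range(num_layers):
        concat = np.concatenate(features, axis=0)   # channel-wise concatenation
        # Random linear map + ReLU stands in for BN + ReLU + Conv3x3.
        w = rng.standard_normal((growth_rate, concat.shape[0]))
        f_l = np.maximum(w @ concat, 0.0)
        features.append(f_l)                        # F_l feeds every later layer
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))                   # 16 channels, flattened spatial dim
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
# Channels grow linearly: 16 input + 4 layers * 12 growth = 64.
```

Note how the channel count grows linearly with depth while every earlier feature map, including the raw input, survives unchanged in the output, which is exactly the feature-reuse property exploited above.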

2) SCALE TRANSFORMATION
The feature information of an object appears in different layers, depending on the object's size. As resolution decreases, the geometric details of small objects may disappear completely (the receptive field becomes too large); for instance, with a 1024 × 1024 input image, the feature maps in the last dense block of DenseNet-169 are only 32 × 32. Conversely, if only low-level features were adopted, the lack of semantic information would degrade detection performance. To detect multiscale objects, predictions from feature maps of different resolutions must be combined. Since the feature maps in the low-level dense blocks lack rich and complete semantic information, the feature maps of the last dense block are used and reorganized to recover detailed descriptions alongside their semantic representation. Because the outputs of all layers within a dense block have the same width and height, the scale transformation is embedded directly into the dense block to obtain feature maps of different resolutions, with the high-resolution maps produced by content-aware reassembly of features (CARAFE [44]). Since traffic sign instances in surveillance images are usually divided into three scales (small, medium, large), three feature-level branches FP_1, FP_2, FP_3 are obtained from the last dense block of DenseNet-169, as shown in (2):

FP_1 = F_1,  FP_2 = CARAFE(F_2, σ = 2),  FP_3 = CARAFE(F_3, σ = 4)    (2)

where F_1 is the feature map of DB4-10 (the 10th layer of the 4th dense block), F_2 is the feature map of DB4-20, and F_3 is the feature map of DB4-32. CARAFE is a lightweight yet highly effective upsampling operator that aggregates contextual information within a large receptive field and generates adaptive kernels in an instance-specific, content-aware manner. Specifically, CARAFE predicts an upsampling kernel from the input feature map, different at each location, and then reassembles the features based on the predicted kernel. Assuming the upsampling rate is σ, an input feature map of shape H × W × C yields an output feature map of shape σH × σW × C. For a target location (i', j') of the k-th feature-map branch, the feature after CARAFE is

FP_k(i', j') = Σ_{n=−r}^{r} Σ_{m=−r}^{r} W_{(i',j')}(n, m) · F_k(i'/σ + n, j'/σ + m)    (3)

where r = S/(2σ), S is the size of the input feature map, σ = 2 when k = 2 and σ = 4 when k = 3, and W_{(i',j')} is the predicted upsampling kernel at location (i', j'). This scale transformation reconstructs location-specific local features by a weighted sum, which accounts for context information and feature consistency. Experiments show that CARAFE brings substantial accuracy gains (Table 3).
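The content-aware reassembly step can be sketched as below. This is a single-channel NumPy toy: the real CARAFE predicts kernels with a learned convolutional module, whereas here a fixed random linear map plays that role, so only the reassembly mechanics (per-location kernel, softmax normalization, weighted sum over a source neighborhood) are faithful.

```python
import numpy as np

def carafe_sketch(x, sigma=2, k_up=3, seed=0):
    """Toy CARAFE-style upsampling of a single-channel map x (H x W).

    For every target location, a k_up x k_up kernel is predicted from the
    source neighborhood (random linear map as a stand-in for the learned
    kernel-prediction module), softmax-normalized, and used to reassemble
    the neighborhood into one output value.
    """
    H, W = x.shape
    rng = np.random.default_rng(seed)
    w_pred = rng.standard_normal((k_up * k_up, k_up * k_up))
    r = k_up // 2
    xp = np.pad(x, r, mode="edge")
    out = np.zeros((sigma * H, sigma * W))
    for i in range(sigma * H):
        for j in range(sigma * W):
            si, sj = i // sigma, j // sigma              # nearest source location
            patch = xp[si:si + k_up, sj:sj + k_up].ravel()
            logits = w_pred @ patch                      # content-aware kernel
            kernel = np.exp(logits - logits.max())
            kernel /= kernel.sum()                       # normalized weights
            out[i, j] = kernel @ patch                   # weighted reassembly
    return out

up = carafe_sketch(np.ones((4, 4)), sigma=2)
# Because each kernel is normalized to sum to 1, a constant map is
# upsampled to the same constant - the "feature consistency" noted above.
```

The softmax normalization is the important design choice: it makes each output value a convex combination of source features, so upsampling cannot introduce values outside the range of the local neighborhood.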

3) OBJECT LOCALIZATION AND CLASSIFICATION
The acquired multiscale feature maps are fed to a Faster R-CNN head to predict the traffic sign detections. The region proposal network (RPN) is the proposal module of Faster R-CNN, performing binary classification (object or non-object) and bounding box regression. The reference boxes in the RPN are called anchors and encode both scale and position. To reduce the number of invalid anchors, inspired by [45], an anchor location prediction (ALP) network was added to the RPN. The ALP consists of a 1 × 1 convolution layer and a sigmoid function: it first generates a position score map over the feature map and then predicts the confidence of each pixel (i.e., of its corresponding receptive field). With this design, only a small portion of positions are selected as candidate anchor centers, which significantly reduces the number of anchors. Meanwhile, a set of anchor boxes of different sizes is assigned to each feature-level branch. Traffic signs occupy at most 1/4 of the image, and signs smaller than 32 × 32 pixels (at an image size of 1024 × 1024) are not meaningful to detect. Consequently, the anchor areas were defined as 128², 256² in FP_1; 64², 96² in FP_2; and 16², 32² in FP_3.
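The ALP head described above amounts to a per-pixel linear map over channels (a 1 × 1 convolution), a sigmoid, and a threshold. A minimal sketch, with hypothetical weights `w`, `b` and a hand-built feature map standing in for real network activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anchor_location_prediction(feat, w, b, thresh=0.5):
    """ALP sketch: feat is C x H x W. A 1x1 conv over channels (weights w,
    bias b, both hypothetical here) plus sigmoid gives a position score
    map; only positions above `thresh` become candidate anchor centers."""
    score = sigmoid(np.tensordot(w, feat, axes=([0], [0])) + b)
    centers = np.argwhere(score > thresh)            # (row, col) candidates
    return score, centers

feat = np.zeros((2, 4, 4))
feat[:, 1, 2] = 5.0                                  # one strongly "object-like" cell
score, centers = anchor_location_prediction(feat, w=np.ones(2), b=-1.0)
# Only the single high-activation position survives the threshold.
```

Anchors would then be placed only at `centers`, which is how the design cuts the anchor count while keeping recall.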
The ROI pooling layer combines the feature maps with the proposal boxes output by the RPN to compute proposal feature maps, which are then passed to the fully connected layers that determine the object category. In FPN, the feature for each ROI is obtained by pooling on one particular feature level, usually chosen according to the scale of the ROI. Under this strategy, the neglected feature levels might still have been beneficial to classification or regression. Therefore, an adaptive feature weighting model (AFWM) is proposed to parameterize the ROI pooling procedure, learning to generate more effective ROI features from the feature maps at all scale levels. Specifically, the AFWM generates different spatial weights for the ROI features of the different levels and fuses them adaptively. The architecture of the AFWM is shown in Fig. 4; the model consists only of a convolution layer and a softmax function and thus uses very few parameters. Experiments show that the AFWM improves detection accuracy (Table 3).
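The AFWM can be sketched as follows: one score per level per spatial position (a 1 × 1 convolution, represented here by a hypothetical weight matrix `w`), a softmax across levels, and a weighted fusion of the level-wise ROI features. This is an illustration of the mechanism, not the authors' exact layer.

```python
import numpy as np

def afwm(roi_feats, w):
    """AFWM sketch: roi_feats is a list of K same-shaped C x H x W ROI
    feature maps (one per branch); w is a hypothetical K x C weight matrix
    producing one score per level per position. A softmax across levels
    yields spatial weights that fuse the levels adaptively."""
    stack = np.stack(roi_feats)                      # K x C x H x W
    scores = np.einsum("kc,kchw->khw", w, stack)     # per-level score maps
    e = np.exp(scores - scores.max(axis=0))
    weights = e / e.sum(axis=0)                      # softmax across the K levels
    return np.einsum("khw,kchw->chw", weights, stack)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 7, 7))
fused = afwm([x, x, x], w=rng.standard_normal((3, 8)))
# With identical inputs on every level, the convex weights fuse back to x.
```

Because the weights sum to one at every spatial position, the fusion is a convex combination: levels that agree pass through unchanged, and disagreeing levels are arbitrated per position rather than per ROI.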

B. DOMAIN ADAPTION NETWORK (DAN)
State-of-the-art object detection frameworks depend on abundant annotated data for supervised learning. However, the complexity of traffic scenes brings large changes in object appearance, background illumination, and image quality, which causes a considerable domain shift between the training and test data.
To address this problem, a domain adaptation network (DAN) was developed. The DAN is composed of two domain adaptation components, at the image level and the object level; the image level is multiscale, corresponding to the feature-level branches of the MSPN. The architecture is shown in Fig. 5. Following the usual terminology in domain adaptation, the domain of the training data is defined as the source domain and the domain of the test data as the target domain. Complete annotations (i.e., bounding boxes and object categories) are available in the source domain, while only unlabeled images are available in the target domain. The DAN aims to adapt the MSPN to the unlabeled target domain.
Motivated by [21] and based on the covariate shift assumption, the domain shift is decomposed into an image-level and an object-level component. To counter it, a domain adaptation model with components at the image and object levels was built, and the multiscale image-level branches were designed to account for the influence of scale awareness on the image-level representation. In each component, a domain classifier is trained by minimizing its classification loss, and an adversarial training strategy is applied to learn domain-invariant features. The image-level domain adaptation loss in each branch is a cross-entropy loss:

L^k_img = −Σ_i Σ_{u,v} [ D_i log p^{k,(u,v)}_i + (1 − D_i) log(1 − p^{k,(u,v)}_i) ]    (5)

where D_i is the domain label of the i-th training image (D_i = 0 for the source domain, D_i = 1 for the target domain) and p^{k,(u,v)}_i is the output of the k-th image-level domain classifier at location (u, v). The overall image-level domain adaptation loss is

L_img = Σ_{k=1}^{K} a_k L^k_img    (6)

where K = 3 is the number of feature-map branches and a_k is the weight of the loss in branch k.
To enable the domain classifier to distinguish the domain of a training image, its parameters are optimized to minimize the image-domain classification loss. Meanwhile, the MSPN parameters are optimized to maximize this loss, making the domain classifier unable to distinguish between the domains. This is implemented with a gradient reversal layer (GRL) [46]: the GRL reverses the sign of the gradient during backpropagation, so the MSPN can be optimized adversarially.
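A GRL is simple to sketch: identity in the forward pass, gradient negated (and optionally scaled) in the backward pass. The class below is a minimal NumPy-style illustration of that behavior, not the layer implementation actually used in the paper's Caffe code.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; the backward pass multiplies the
    incoming gradient by -lam. A single backward pass through this layer
    therefore minimizes the domain loss w.r.t. the classifier (above the
    GRL) while maximizing it w.r.t. the feature extractor (below it)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_out):
        return -self.lam * grad_out

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
```

One GRL per domain classifier is enough to turn the whole min-max game into ordinary gradient descent on a single combined loss.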
The object-level domain adaptation component is designed on top of the object-level representation, which in this framework refers to the ROI-based feature vectors. In the MSPN, the ROI feature vectors from the different feature-level branches are fused by the AFWM; these fused ROI-based feature vectors are used as the object-level representation, so as to reduce as far as possible the local domain shift caused by object appearance, size, and viewing angle.
Similar to the image-level component, an object-level domain classifier (ODAC) is trained, and its parameters are optimized to align the object-level distributions. The object-level adaptation loss is

L_obj = −Σ_{i,j} [ D_i log p_{i,j} + (1 − D_i) log(1 − p_{i,j}) ]    (7)

where p_{i,j} is the output of the domain classifier for the j-th region proposal in the i-th training image and D_i is the domain label as before. Similarly, a GRL is inserted before the object-level domain classifier so that this loss adversarially optimizes the parameters of the MSPN. Experiments show that introducing the DAN improves detection accuracy (Table 3) and that the design of the two components is effective (Table 5).
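The object-level adaptation loss described above is an ordinary binary cross-entropy over region proposals; a minimal sketch (array shapes are illustrative):

```python
import numpy as np

def object_level_da_loss(p, d):
    """Binary cross-entropy domain loss averaged over all proposals.
    p: N_images x N_proposals domain-classifier outputs in (0, 1);
    d: per-image domain labels (0 = source, 1 = target), shared by all
    proposals of an image."""
    d = np.asarray(d, dtype=float)[:, None]          # broadcast over proposals
    return -np.mean(d * np.log(p) + (1.0 - d) * np.log(1.0 - p))

# A maximally confused classifier (p = 0.5 everywhere) gives loss ln 2,
# the adversarial optimum the feature extractor is pushed toward.
loss = object_level_da_loss(np.full((2, 3), 0.5), [0, 1])
```

Note that the domain label is per image: every proposal inherits its image's label, which is what lets unlabeled target images contribute to training.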

C. NETWORK OVERVIEW
An overview of the proposed SADANet is shown in Fig.2.
The MSPN detection architecture is augmented with the domain adaptation components (DAN), which directed the design of the SADANet model. The upper part of Fig. 2 is the MSPN: multiscale features are extracted from the input image, and the position and category of each target are predicted from these features. The lower part of Fig. 2 is the DAN: the scale-aware image-level domain classifiers are attached to the feature maps produced by the scale transformation, and the object-level domain classifier is attached to the ROI-wise features. The final training loss of the proposed network is the sum of the individual parts:

L = L_det + λ(L_img + L_obj)

where L_det is the detection loss of the MSPN and λ is a parameter balancing the MSPN and DAN. The network can be trained end to end. The entire network is used in the training phase; in the practical traffic sign detection task, only the MSPN is used. The DAN only optimizes the parameters of the MSPN during training so that domain-invariant features are extracted.

IV. EXPERIMENTS
In this section, the datasets are first introduced, and the effectiveness of each component is then verified separately. The proposed method is compared with Faster R-CNN, DenseNet, and DA Faster R-CNN [21]. All experiments were conducted on a computer with an NVIDIA GeForce GTX TITAN X GPU and 12 GB of memory. The proposed SADANet was implemented on the publicly available Caffe platform [47].

A. DATASET AND IMPLEMENTATION DETAILS

1) DATA DESCRIPTION
All experiments in this paper involve three public datasets: TT100K [48], TSD-MAX [49], and GTSDB [50]. TT100K was collected from the driving-scene perspective, with scene images of 2048 × 2048 pixels. The TT100K training set contains 6105 images and the test set contains 3071 images; each image contains several traffic signs. TT100K was used as the training dataset in all experiments.
TSD-MAX consists of 6023 real-world images divided into 400 groups. Each group contains 10 to 30 images forming a continuous frame sequence of the same scene from the perspective of a forward-moving vehicle, and each image contains several traffic signs. The size of each image is 1280 × 1024 pixels.
GTSDB contains 4500 scene images (3000 for training and 1500 for testing); each image has a uniform size of 1360 × 800 pixels, and each training image contains 0 to 6 traffic signs.

2) IMPLEMENTATION DETAILS
The unsupervised domain adaptation protocol was adopted in this study. The training data were divided into two parts: the source training images with their annotations (bounding boxes and object categories), and the target training images, which are unlabeled.
In addition, to give frequently occurring traffic signs (e.g., speed limit signs) and infrequent ones (e.g., warning signs) equal training probability, a balanced sampling strategy was adopted. All training images were grouped by category, and an image list was generated for each category. Images are randomly selected from the image list of each category for training, so every category has a comparable opportunity to participate in training.
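The balanced sampling strategy can be sketched as: pick a category uniformly at random, then pick an image from that category's list. The category names and counts below are illustrative, not taken from TT100K.

```python
import random

def balanced_sample(category_lists, n, seed=0):
    """Class-balanced sampling: each draw first selects a category
    uniformly, then an image from that category's list, so rare sign
    classes appear in training as often as frequent ones."""
    rng = random.Random(seed)
    cats = sorted(category_lists)                  # deterministic category order
    return [rng.choice(category_lists[rng.choice(cats)]) for _ in range(n)]

# "warning" has 1 image and "speed_limit" has 100, yet each category is
# drawn with probability 1/2 on every step.
lists = {"warning": ["w1"], "speed_limit": [f"s{i}" for i in range(100)]}
batch = balanced_sample(lists, n=1000)
```

Under uniform sampling over images, the lone warning sign would appear about 1% of the time; here it appears in roughly half of all draws.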
For all experiments, the image size was set to 1024 × 1024. The networks were first trained with a learning rate of 0.01 for 47K iterations in the pre-training stage (using the source-domain dataset only), followed by a learning rate of 0.01 for 26K iterations in the joint-training stage (using both the source-domain and target datasets), after which the final performance was reported. Unless otherwise noted, the momentum was set to 0.9 and the weight decay to 0.0005. All models in the comparative experiments were trained with the same schedule and consistent parameter settings, and performance was reported after 73K iterations.

B. EVALUATION INDICATORS
Precision was used to quantitatively evaluate the performance of the different frameworks in traffic sign detection. The precision-recall curve (PRC) and the F-measure are widely used standard measures in object detection [51].
Precision and recall are calculated from three generally recognized quantities from information retrieval: true positives (TP), false positives (FP), and false negatives (FN) [52]. TP and FP are the numbers of correct and false predictions, respectively, and FN is the number of ground-truth regions not proposed. Precision and recall are defined as

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

The PRC is the curve formed by the precision and recall values. F1, an index combining precision and recall, is defined as

F1 = 2 × Precision × Recall / (Precision + Recall)
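These measures reduce to a few lines of code; for example:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 correct detections, 2 false alarms, 2 missed signs.
p, r, f1 = detection_metrics(tp=8, fp=2, fn=2)
```

Sweeping the detector's confidence threshold and recomputing these values at each point traces out the PRC.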

C. EFFECTIVENESS OF MODEL STRUCTURE

1) ABLATION ANALYSIS OF SADANet
To evaluate the effectiveness of the components of the proposed SADANet, ablation experiments were carried out. First, the MSPN and DAN were evaluated, using TT100K as the training dataset and TSD-MAX and GTSDB as the test datasets. Experiments were carried out on same-domain and cross-domain dataset pairs, respectively. The DAN only optimizes parameters and cannot perform object detection on its own; therefore, DenseNet was used in place of the MSPN as a baseline to verify the effectiveness of the MSPN.
The experimental results are shown in Table 2. The average precision of the MSPN reaches 93.34% on the same-domain dataset, about 3% higher than DenseNet. However, when the training and test datasets differ, the precision decreases by about 14%. Compared with the framework without the DAN, the network with the DAN increases average precision by about 0.3% on a single dataset and by about 4% across different datasets. These experiments show that the MSPN-based traffic sign detection network achieves good detection accuracy on a single dataset but degrades across domains, and that the DAN is effective in addressing this problem.

2) EVALUATION OF MSPN
The main components of the proposed MSPN are the base network DenseNet-169, CARAFE, and the adaptive feature weighting model (AFWM). To understand the MSPN better, several ablation experiments were carried out to evaluate how each component affects the final performance. Since the distributions of the detection results on TSD-MAX and GTSDB were consistent, only one is reported. Table 3 shows the experimental results; CLF refers to selecting the feature map of one particular level to obtain the ROI features. One or two components of the MSPN were replaced in turn and tested on the dataset. First, the effect of the base network was investigated: when DenseNet-169 was replaced with VGG-19 [53], the detection result decreased by more than 5% compared with the full MSPN. CARAFE is an upsampling operator, and deconvolution [54] is the most common method for increasing feature resolution, so deconvolution was adopted as the alternative to CARAFE; this change produced a 1% decrease in detection performance. In two-stage multiscale detection, a basic strategy for obtaining ROI features is to pool on one particular feature level, i.e., CLF; replacing the AFWM with CLF decreased average precision by about 2%. As outlined in Table 3, the design of the base network has the greatest impact on the detection results.
The ALP module in MSPN was also proposed, and the RPN was used together with ALP rather than alone. This design effectively reduces the number of anchors, which was verified through experiments. As shown in Table 4, ALP selects a small part of each image as candidate anchor centers, which significantly reduces the number of anchors per image: roughly 20 anchors are eliminated, a reduction of about 40%. Meanwhile, this design also improves the average recall rate by 7%.
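The selection step described above can be sketched as thresholding a predicted location-probability map so that anchors are tiled only at likely cells instead of every grid position. The paper does not specify ALP's internals, so the threshold value and function below are illustrative assumptions:

```python
import numpy as np

def alp_anchor_centers(loc_prob, threshold=0.5):
    """Illustrative ALP-style filtering: keep only grid cells whose predicted
    location probability exceeds a threshold; anchors are then placed at this
    small subset of positions rather than densely at every cell."""
    ys, xs = np.where(loc_prob >= threshold)
    return list(zip(ys.tolist(), xs.tolist()))

# Toy 4x4 probability map: only two cells are likely to contain a sign.
prob = np.zeros((4, 4))
prob[1, 2] = 0.9
prob[3, 0] = 0.7
centers = alp_anchor_centers(prob)
# Dense tiling would place anchors at all 16 cells; here only 2 remain.
```

The recall gain reported in Table 4 would then come from spending the anchor budget on high-probability locations rather than uniformly.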

3) EVALUATION OF DAN
In order to verify the proposed DAN under all domain-transition scenarios, the final detection results of the framework were evaluated with different combinations of components (i.e., image-level adaptation and object-level adaptation). Since the proposed image-level adaptation is multiscale, the image-level feature map at each individual scale was also evaluated for comparison. In these comparisons, TT100K is the training dataset and TSD-MAX is the test dataset.
The results of the different models are shown in Table 5. Specifically, using IDAC alone improved performance by 3.6%, and using ODAC alone by 5.3%, compared with the MSPN. This proves that the image-level and object-level adaptive components effectively reduce the domain shift at their respective levels. Compared with the full multiscale IDAC, the average precision of each single-scale variant IDAC-b_i decreases.
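Adaptive components of this kind are commonly trained adversarially, as in DA Faster R-CNN: a domain classifier sits behind a gradient reversal layer (GRL), so the detector's features are pushed to become indistinguishable across domains. The paper does not spell out IDAC/ODAC internals, so the sketch below shows only this generic mechanism, with hypothetical function names:

```python
import numpy as np

def grl_forward(x):
    """Gradient reversal layer: identity in the forward pass."""
    return x

def grl_backward(grad, lam=1.0):
    """Backward pass: multiply the gradient by -lambda, so minimizing the
    domain loss for the classifier *maximizes* it for the feature extractor,
    driving the features toward domain invariance."""
    return -lam * grad

def domain_loss(logit, is_target):
    """Binary cross-entropy on a scalar domain logit (sigmoid):
    label 1 = target domain, label 0 = source domain."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -np.log(p) if is_target else -np.log(1.0 - p)
```

A multiscale image-level component would attach one such classifier per pyramid scale, which is why Table 5 can compare the full IDAC against its single-scale variants.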

D. COMPARISONS WITH STATE-OF-THE-ART TRAFFIC SIGN DETECTION METHODS
In order to show that SADANet is competitive with other state-of-the-art traffic sign detection methods, the proposed method was compared with DenseNet, Faster R-CNN, and DA Faster R-CNN; DA Faster R-CNN is a domain adaptive object detection framework built on Faster R-CNN. All algorithms were trained and tested on the same hardware platform with the same datasets. To verify the detection performance of the models across datasets, the TT100K dataset was used as the source domain, and the target domain was either the TSD-MAX or the GTSDB dataset.

1) OVERALL PERFORMANCE ANALYSIS
In overall performance, SADANet was preferable to the other algorithms. Fig. 6 shows the statistical results of the evaluation metrics. When the training and test images come from the same dataset, the precision of Faster R-CNN was 89.73%; DenseNet improves the precision by 2.75% through feature reorganization. DA Faster R-CNN has no advantage when there is no domain shift, and its precision was only 89.65%. SADANet adopts both feature fusion and multiscale transformation, and obtains the highest precision (93.59%) among the four methods. When the test dataset differs from the training dataset, the average precision of SADANet exceeded Faster R-CNN by about 6.9%, DenseNet by 4.6%, and DA Faster R-CNN by 3.7%. Because of the large shift between object features in different domains, the robustness of each model is reduced. Although DA Faster R-CNN also adopts a domain adaptive structure, its accuracy was limited by its base network, as shown in some of the experimental results (Fig. 7). The first two rows are detection results on TT100K, the third and fourth rows are results on TSD-MAX, and the last two rows are results on GTSDB. Changes in object size, or in the illumination and resolution of the test scene images, have no significant impact on the performance of the proposed SADANet, whereas Faster R-CNN and DA Faster R-CNN have limitations in small-object detection, and Faster R-CNN and DenseNet are more vulnerable to environmental change.

2) MULTISCALE TRAFFIC SIGNS DETECTION
The detection results of the four algorithms on objects of different sizes are shown in Fig. 8. SADANet has clear advantages over all the compared algorithms. For small objects, the proposed method obtains a detection rate of 82.88%, exceeding DenseNet by 6.62% and Faster R-CNN by 8.25%. For medium objects, the detection rate of the proposed model was 85.39%, compared with 80.81% for DenseNet, 83.86% for DA Faster R-CNN, and 80.17% for Faster R-CNN. For large objects, the proposed method achieves a detection rate of 83.56%. In general, SADANet exceeds the other methods and achieves state-of-the-art performance at all scales, which not only confirms its generalization capacity across domains but also validates its strength in multiscale traffic sign detection.
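Per-scale evaluations like Fig. 8 require assigning each ground-truth box to a size bucket. The paper does not state its exact size splits, so the sketch below uses the common COCO convention (areas of 32² and 96² pixels as thresholds) purely as an assumed example:

```python
def size_bucket(w, h, small_max=32 ** 2, medium_max=96 ** 2):
    """Assign a box to a scale bucket by pixel area. The 32^2 / 96^2
    thresholds follow the COCO convention -- an assumption here, since the
    paper does not specify its small/medium/large boundaries."""
    area = w * h
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"

# A 20x20 sign counts as small, 50x50 as medium, 100x100 as large.
buckets = [size_bucket(20, 20), size_bucket(50, 50), size_bucket(100, 100)]
```

The per-bucket detection rate is then just the ordinary recall computed over the boxes falling in each bucket.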

3) PERFORMANCE IN REAL IMAGES
Images from real scenes were collected, and the four comparative algorithms were trained on TT100K and tested on these real images. Some of the experimental results are shown in Fig. 9. All four algorithms performed well under proper lighting and conditions similar to those of the training images. When the external environment of the test scene differs from that of the training dataset, the gap between the source-domain and target-domain features is large: Faster R-CNN and DenseNet suffer a large number of missed detections. DA Faster R-CNN can extract domain-invariant features, but because its base network extracts incomplete detail information, it readily produces false detections when the appearance of an object is similar to the surrounding environment. SADANet achieves superior detection on traffic signs with distinctive features; however, when there is an excessive discrepancy between the training and test domains, the extracted domain-invariant features still differ from those of the objects to be detected, and false detections can still occur.

4) RUNTIME ANALYSIS
We also measured the training and testing time of the compared models. Specifically, with the same batch size on the TT100K dataset, the training time per epoch is about 4.1 hours for Faster R-CNN and nearly 5.9 hours for SADANet. As for inference, on images of the same resolution Faster R-CNN runs at 13.6 fps, DenseNet at 11.8 fps, DA Faster R-CNN at 12.3 fps, and SADANet at 11.2 fps. The reported values are averages, and all runtimes were measured on the same hardware and datasets. In general, SADANet consistently achieves better accuracy than the existing models at only a small cost in speed.
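Average-fps figures of this kind are typically obtained by timing a batch of images after a short warm-up, so one-time setup costs do not skew the average. A minimal sketch (the function name and warm-up length are our own choices, not the paper's):

```python
import time

def measure_fps(infer_fn, images, warmup=2):
    """Average frames per second of `infer_fn` over `images`, excluding a
    short warm-up that absorbs one-time setup cost (model loading, caches)."""
    for img in images[:warmup]:
        infer_fn(img)                 # warm-up passes, not timed
    start = time.perf_counter()
    for img in images:
        infer_fn(img)                 # timed passes
    elapsed = time.perf_counter() - start
    return len(images) / elapsed

# Usage: fps = measure_fps(model_forward, test_images)
```

Comparing models this way is only fair when, as in the paper, hardware, dataset, and image resolution are held fixed.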

V. CONCLUSION
In this paper, a scale-aware and domain adaptive network (SADANet) was presented for traffic sign detection. The combination of the multiscale prediction network (MSPN) and the domain adaptive network (DAN) effectively predicts the locations and categories of traffic signs. Representative experiments show that SADANet achieves better performance and exceeds most existing algorithms in accuracy. In future work, extending SADANet to more object detection tasks, such as vehicle detection and pedestrian detection, is of interest. Since the feature and scale variations of traffic signs, vehicles, and pedestrians differ from one another, further design of the multiscale prediction network would be needed. On the other hand, the robustness of SADANet should be improved so that it maintains stable performance in more complex environments.