Yolo V4 for Advanced Traffic Sign Recognition With Synthetic Training Data Generated by Various GAN

Convolutional Neural Networks (CNN) achieve excellent performance in traffic sign identification when enough annotated training data are available. The dataset determines the quality of a complete CNN-based visual system. Unfortunately, traffic sign databases for the majority of the world's nations are scarce. In this scenario, Generative Adversarial Networks (GAN) may be employed to produce realistic and varied training pictures that supplement the actual set of images. The purpose of this research is to assess the quality of synthetic pictures created by DCGAN, LSGAN, and WGAN. Our work combines synthetic images with original images to enlarge datasets and verify the effectiveness of synthetic datasets. We use different numbers and sizes of images for training. The Structural Similarity Index (SSIM) and Mean Square Error (MSE) are employed to assess picture quality, and our study quantifies the SSIM difference between the synthetic and actual images. When more images are used for training, the synthetic images exhibit a higher degree of resemblance to the genuine images. The highest SSIM value was achieved when using 200 total images as input and a $32\times 32$ image size. Further, we augment the original picture dataset with synthetic pictures and compare the original-image model to the synthetic-image model. For this experiment, we use the two latest iterations of Yolo, Yolo V3 and Yolo V4. After mixing the real images with the synthetic images produced by LSGAN, recognition performance improved, achieving an accuracy of 84.9% with Yolo V3 and an accuracy of 89.33% with Yolo V4.


I. INTRODUCTION
Traffic sign identification has emerged as a critical research area in computer vision in recent years. It plays a critical role in advanced driver assistance systems, self-driving vehicles, and traffic safety [1]-[3].
In previous research, Convolutional Neural Networks (CNN) [4] have achieved very good results in traffic sign detection and recognition. A considerable quantity of data must be fed to the neural network to ensure that the model is adequately trained and capable of recognizing traffic signs. As a result, if a significant quantity of labeled data is available, developing a CNN-based traffic sign detection model is not difficult. Many researchers have studied and discussed the identification of traffic signs extensively. They also provide many datasets for public use, such as the German Traffic Sign Recognition Benchmark (GTSRB) [5], [6], the Chinese Traffic Sign Database (TSRD), and Tsinghua-Tencent 100K (TT100K) [7]. Researchers usually use open datasets or collect traffic signs on the road for their experiments. Nevertheless, obtaining a huge quantity of high-quality images of traffic signs is not straightforward [8], [9]. It takes a considerable amount of time, whether using a dashcam or filming on site. In addition, the design and color of traffic signs differ by country, so traffic signs must be collected and annotated for each country separately.
Synthesizing images is a prominent problem in computer vision [10], [11]. To acquire more varied and inexpensive training data, traffic sign pictures generated from standard templates have been routinely employed to train machine learning classification algorithms [12], [13].
The Deep Convolutional Generative Adversarial Network (DCGAN) was proposed by Radford et al. [14] in 2016. DCGAN combines the Generative Adversarial Network (GAN) with CNN so that GANs can achieve better and more stable training results. Other variants of GAN are Least Squares Generative Adversarial Networks (LSGAN) and Wasserstein Generative Adversarial Networks (WGAN) [16], [17]. Both of them better address the training instability of the original GAN. Each GAN variant has achieved excellent results in producing synthetic images. Because of the scarcity of training data, our studies use DCGAN, LSGAN, and WGAN to generate synthetic pictures.
For Traffic Sign Detection (TSD), it is very important to detect small objects accurately and quickly. CNN-based detectors such as Faster R-CNN [18], the Single Shot Multibox Detector (SSD) [19], and You Only Look Once (Yolo) [20] can effectively detect and classify objects. Yolo is the most influential under conditions that require fast detection: it offers high-speed detection with high accuracy. The newest version of Yolo, Yolo V4, was proposed in 2020 [21]. The majority of modern scientific models need several GPUs for training with a large mini-batch size, and training with one GPU is usually slow, heavy, and ineffective. Yolo V4 [21] approaches this problem by constructing an object detector that can be trained on a single GPU with a smaller mini-batch size, making it possible to train a fast and precise object detector with a single 1080 Ti or 2080 Ti GPU. This paper analyzes in detail CNN models and feature extractors for object identification, specifically Yolo V3 and Yolo V4. Our study fine-tuned them on a Taiwan prohibitory sign dataset that we created to detect traffic signs. Our dataset consists of no entry (Class P1), no stopping (Class P2), no parking (Class P3), and speed limit (Class P4) signs. We have been unable to locate a research article that assesses a large number of deep-learning-based object detectors expressly tuned to the traffic sign recognition problem domain while taking into account crucial aspects such as mAP, IoU, and detection time.
The primary contributions of this research are as follows: (1) High-quality prohibitory sign pictures are synthesized using DCGAN, LSGAN, and WGAN. (2) A CNN-based solution for traffic sign classification tasks is developed, and the CNN training set is augmented with the created synthetic data in order to enhance classification and recognition performance. (3) We propose an experimental setting with various GAN variants to generate synthetic images and then evaluate the synthetic images using SSIM and MSE. (4) The Yolo V3 and Yolo V4 model evaluation includes mAP, detection time, IoU, and floating-point operations (FLOPS). (5) Experiments show that synthetic image data generation using various GANs can improve all models' IoU and performance.
This research is structured as follows. Section II discusses recently published works. Section III details our proposed technique. The experiments and their findings are described in Section IV. A detailed discussion of our research is presented in Section V. In Section VI, conclusions are stated and recommendations for further study are made.

II. RELATED WORKS

A. IMAGE RECOGNITION
Image recognition plays an important role in the field of computer vision. Humans recognize objects easily, but it is very difficult for machines. A machine needs to learn the meaning of each image through careful observation and testing; from the learning results, it can learn how to recognize objects.
Shijin Song et al. [22] proposed an improved CNN network architecture that detects small objects better, requires less computation, and is easier to deploy. They streamlined the CNN network, significantly lowering the model's size and operation time while preserving accuracy. At the same time, the fully connected layer is replaced with a fully convolutional layer, which improves computational efficiency. Tao et al. [23], inspired by the Fully Convolutional Network (FCN), combined it with Yolo to produce a new optimized Yolo. This combined model provides higher detection accuracy in traffic scenes: an average accuracy of 69.3% was obtained on the VOC07 and VOC12 datasets, while the traditional Yolo achieved only 64%.
In traffic scenes, traffic signs are relatively small, which makes precise detection difficult. Hasegawa et al. [24] proposed a better recognition method for complex traffic scenes in Japan. The experiment used Yolo V2 and five different image sizes as input; multiscale images were also employed to make the model steadier and train better. Chih-Chung Hsu et al. proposed a new architecture based on DenseNet [25], called the Common Fake Feature Network (CFFN). The architecture uses pairwise learning to optimize the network, matching real images with fake images, so that the neural network captures the features and distinguishes fake images from real images well. The cross-entropy loss function is used to optimize the classifier. Their experimental results are significantly better than those of other traditional methods.

B. GENERATIVE ADVERSARIAL NETWORKS (GAN)

GAN [27], [28] has two parts that are trained simultaneously, namely the generative model (G) and the discriminative model (D). The discriminative model is used to detect whether a sample contains valid or invalid data. The generative model captures the target information distribution to puzzle the discriminative model [29], [30]. The D model is a binary classifier that classifies the G model's data in the training system as either realistic or unrealistic. G minimizes its loss function by supplying data that D classifies as real, as modeled by Equation (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

A GAN variant that combines this adversarial framework with deep convolutional architectures, called the Deep Convolutional GAN (DCGAN), is more stable in most situations [14], [26].
We employ DCGAN to build synthetic traffic sign pictures in this work [31]. Following that, we integrate the synthetic images with the real images in order to expand our dataset and enhance traffic sign recognition. DCGAN is a baseline model; other models build on it by adding restrictions or making enhancements.
Numerous research efforts have been directed toward different GAN variations to improve the overall performance of GANs. Brock et al. [32] introduced models named BigGANs, which can generate high-resolution, diverse images from heterogeneous datasets; their approach handles high-resolution images and a wide range of samples from the challenging ImageNet dataset. Karras et al. [15] suggested an alternative generator design called StyleGAN. This new generator architecture can dynamically vary the style of the generated picture depending on the latent information in each convolutional layer, and it guides the full picture synthesis process by starting at low resolution and working its way up to high resolution.
Li and Wand [33] proposed an efficient texture synthesis method named Markovian Generative Adversarial Networks (MGANs). It can decode brown noise directly into a realistic texture, and it can also decode pictures into paintings, which increases texture synthesis quality. Bergmann et al. [34] introduced an architecture named spatial GAN (SGAN) that is very well suited for texture synthesis. This approach is capable of creating high-quality texture pictures and fusing numerous diverse source photos into complex textures.
Least Squares Generative Adversarial Networks (LSGAN) have two benefits over regular GANs. First, LSGANs can create higher-quality pictures than regular GANs. Second, LSGANs are more stable throughout the learning process [35], [36]. In practice, training GANs is a difficult challenge because GAN learning is unstable. Recently, many articles have shown that the objective function contributes to the instability of GAN learning [37]. In particular, minimizing the usual GAN objective function can cause vanishing gradients, which makes it difficult to update the generator. LSGANs overcome this barrier by penalizing samples depending on their distance to the decision boundary, which yields more gradients when updating the generator. Furthermore, it has been argued theoretically that the training instability of standard GANs is attributable to the objective function's mode-seeking tendency, while LSGANs display less mode-seeking behavior.
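For reference, the standard least-squares objectives (the common 0-1 coding from the LSGAN literature; a sketch, not reproduced from this paper) replace the cross-entropy terms as follows:

$$\min_D V_{\mathrm{LSGAN}}(D) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[(D(x) - 1)^2\big] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(G(z))^2\big]$$

$$\min_G V_{\mathrm{LSGAN}}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - 1)^2\big]$$

Because the quadratic penalty grows with the distance of D's output from its target, fake samples far from the decision boundary still produce informative gradients for G.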
The Wasserstein Generative Adversarial Network (WGAN) [17] was created to confront instability in training networks [38], which is believed to be associated with undesirably sharp gradients of the GAN discriminator function. Yang et al. [39] adopted the Wasserstein GAN for denoising low-dose CT images and applied it successfully to medical imaging reconstruction. WGAN has also been used in a synthetic-data production module to produce simulated fault signals that incrementally supplement minority fault classes, with the synthetic signals balancing the training dataset.
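As a reference formulation (standard in the WGAN literature, not quoted from this paper), WGAN replaces the GAN objective with the Wasserstein-1 distance between the real and generated distributions:

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions; the original WGAN enforces this constraint by clipping the critic's weights, which is exactly the step that can make gradients vanish or explode when the clipping range is chosen poorly.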

C. PERFORMANCE EVALUATION OF SYNTHETIC IMAGES
The Structural Similarity Index (SSIM) [40], [41] is employed to determine the structural similarity between two images. The traditional SSIM assessment computes the SSIM index of local blocks via a sliding window over the distorted picture; after the SSIM values of the whole picture are obtained, all local block indices are pooled to produce the overall SSIM evaluation value [41], [42]. The SSIM metric, combining brightness, contrast, and structural comparison, is presented in Equation (2):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (2)$$

where $\mu_x$ is the average of $x$, $\mu_y$ is the average of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $C_1$ and $C_2$ are small constants.
The Mean Square Error (MSE) computes the difference between estimated values and the real values of the quantity being estimated, squared pixel-wise. The error is the difference between the estimator's inferred value and the quantity to be estimated, as indicated in Equation (3) [11]:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(P_i - Q_i)^2 \qquad (3)$$

where $P_i$ describes the observed value, $Q_i$ represents the predicted value, and $n$ is the number of data points. In this research, the synthetic images generated by DCGAN, LSGAN, and WGAN are assessed using SSIM and MSE. SSIM takes values between −1 and 1, where higher is better; smaller MSE values, on the other hand, imply a more positive outcome.
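As an illustration only (a minimal sketch assuming scikit-image ≥ 0.19 and two same-sized RGB images; the file names are hypothetical, and this is not the authors' evaluation script), the two metrics can be computed as follows:

```python
from skimage.io import imread
from skimage.metrics import structural_similarity, mean_squared_error

real = imread("real_sign.png")       # hypothetical file names
fake = imread("synthetic_sign.png")  # must have the same shape as `real`

# SSIM lies in [-1, 1]; higher means the synthetic image is structurally closer.
ssim_score = structural_similarity(real, fake, channel_axis=-1)

# MSE >= 0; lower means a smaller pixel-wise error.
mse_score = mean_squared_error(real, fake)

print(f"SSIM: {ssim_score:.4f}  MSE: {mse_score:.2f}")
```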

D. YOLO V3 AND YOLO V4
Yolo V3 was introduced by Redmon and Farhadi [43] in 2018. It splits the input image into N × N grid cells [44] of the same size. Yolo V3 forecasts bounding boxes and probabilities for each grid cell. Also, Yolo V3 utilizes multi-scale fusion to provide predictions, and a single neural network is used to gather and preprocess the holistic picture. For box prediction, dimension clusters are used as prior boxes to which the bounding boxes are assigned. Furthermore, the K-means algorithm is used to perform dimensional clustering on the object boxes in the dataset, yielding nine prior boxes of varying sizes that are distributed uniformly among feature maps of various scales. Additionally, Yolo V3 allows for the establishment of a customized bounding box anchor for each ground truth item [45].
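To make the multi-scale layout concrete, the sketch below (standard Yolo V3 geometry; illustrative, not taken from the paper's code) computes the three detection-grid shapes for a 416 × 416 input and our four prohibitory-sign classes, with the nine prior boxes split three per scale:

```python
# Standard Yolo V3 multi-scale output geometry for a 416 x 416 input.
num_classes = 4                                   # P1-P4 prohibitory signs
anchors_per_scale = 3                             # 9 prior boxes over 3 scales
channels = anchors_per_scale * (5 + num_classes)  # (x, y, w, h, conf) + classes

for stride in (32, 16, 8):                        # the three detection strides
    grid = 416 // stride
    print(f"stride {stride}: {grid} x {grid} x {channels}")
# stride 32: 13 x 13 x 27
# stride 16: 26 x 26 x 27
# stride 8:  52 x 52 x 27
```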
The Yolo algorithm [55] uses a single standard network design throughout the whole process, making it simpler than the R-CNN family of algorithms [56], [57]. Yolo V4 adopts the Yolo V3 head as its one-stage dense prediction: the input image is divided into S × S grid cells of the same size [44], and bounding boxes and probabilities are predicted for every grid cell. Yolo V3 utilizes multiscale fusion to predict over the whole image, adopts a single CNN to process the entire image, and uses dimension clusters to determine the prior boxes.

III. METHODOLOGY
The next sections discuss our system's synthetic data generation methodologies for enhanced traffic sign identification utilizing DCGAN, LSGAN, and WGAN. Figure 1 depicts a high-level overview of the system.
The BBox mark tool [58] was adopted to create bounding boxes for all signs. All classes are labeled, and a single picture may have many marks. In the detection phase, a single-class detector model was used, and each class label corresponds to a single training model. The bounding box labeling tool returns object corner coordinates (x₁, y₁, x₂, y₂). These coordinates differ from Yolo's input format, which is the center point, width, and height (x, y, w, h). As a consequence, the bounding box coordinates must be converted to the Yolo input format. The conversion is based on Equations (5)-(10):

$$dw = \frac{1}{W} \quad (5) \qquad dh = \frac{1}{H} \quad (6)$$

$$x = dw \cdot \frac{x_1 + x_2}{2} \quad (7) \qquad y = dh \cdot \frac{y_1 + y_2}{2} \quad (8)$$

$$w = dw \cdot (x_2 - x_1) \quad (9) \qquad h = dh \cdot (y_2 - y_1) \quad (10)$$

Here H stands for the image's height, dh for the reciprocal of the height, W for the image's width, and dw for the reciprocal of the width. As a result, the float values (x, y, w, h) are relative to the width and height of the picture and range from 0.0 to 1.0.
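As an illustration, a minimal Python sketch of this conversion (an assumed darknet-style helper, not the paper's exact script) follows Equations (5)-(10):

```python
# Convert corner coordinates (x1, y1, x2, y2) into Yolo's normalized
# (x, y, w, h) format, per Equations (5)-(10).
def to_yolo_format(x1, y1, x2, y2, img_w, img_h):
    dw, dh = 1.0 / img_w, 1.0 / img_h   # Eq. (5)-(6)
    x = dw * (x1 + x2) / 2.0            # Eq. (7): normalized center x
    y = dh * (y1 + y2) / 2.0            # Eq. (8): normalized center y
    w = dw * (x2 - x1)                  # Eq. (9): normalized width
    h = dh * (y2 - y1)                  # Eq. (10): normalized height
    return x, y, w, h

# Example: a 100 x 50 pixel box with top-left (50, 100) in a 416 x 416 image.
print(to_yolo_format(50, 100, 150, 150, 416, 416))
# -> (0.2403..., 0.3004..., 0.2403..., 0.1201...)
```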
In the data preparation step, we employ several GANs to produce synthetic prohibitory sign pictures. Additionally, the dataset was divided into four groups. The first group contains solely original photos. The second group contains both the original pictures and DCGAN-generated synthetic pictures. The third group combines the original images with synthetic images produced by LSGAN, and the fourth group mixes the real images with synthetic images generated by WGAN. The complete dataset combinations are shown in Table 1.
Object detection using Yolo V3 and Yolo V4 proceeds as follows.
Step 1: Split the input image into S × S grid cells. By calculating anchor boxes, each grid cell generates K bounding boxes. It predicts B boundary boxes for each grid cell, with one box confidence score per box.
Step 2: Each grid cell detects just one object, regardless of the number of boxes B. Additionally, it forecasts C conditional class probabilities (one per object class).
Step 3: Compute the class confidence score for each box by multiplying the box confidence score by the conditional class probability.

Step 4: Compare the optimum confidence $\mathrm{IoU}^{\mathrm{truth}}_{\mathrm{pred}}$ of the K bounding boxes with the threshold $\mathrm{IoU}_{\mathrm{thres}}$. If $\mathrm{IoU}^{\mathrm{truth}}_{\mathrm{pred}} > \mathrm{IoU}_{\mathrm{thres}}$, the bounding box contains the object; otherwise, it does not.
Step 5: After determining the potential category for the object, the system identifies the specific item. Non-Maximum Suppression (NMS) is employed in our study to drop problematic boxes and redundant outputs among the object detections.
Additionally, NMS is implemented as follows (see the sketch after this list): (1) Sort predictions according to their confidence ratings. (2) Starting with the top score, dismiss any remaining prediction of the same class whose IoU with an already-kept prediction exceeds 0.5. (3) Repeat step 2 until all predictions have been checked. Before training, our models were fine-tuned from pre-trained weights for the Taiwan prohibitory signs, which sped up training by a significant amount.
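The following minimal sketch of this NMS procedure (an assumed helper, not the authors' implementation) uses axis-aligned corner-format boxes:

```python
def iou(a, b):
    """Intersection over union of two corner-format boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(predictions, iou_thres=0.5):
    """predictions: list of (box, score, class_id) tuples."""
    predictions = sorted(predictions, key=lambda p: p[1], reverse=True)  # step (1)
    kept = []
    for box, score, cls in predictions:
        # Step (2): keep only if no already-kept box of the same class
        # overlaps it with IoU > 0.5; step (3) is the loop itself.
        if all(k_cls != cls or iou(box, k_box) <= iou_thres
               for k_box, _, k_cls in kept):
            kept.append((box, score, cls))
    return kept
```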
Step 6: The last step results in a categorized image labeled with the class.

A. DATA GENERATION BY DCGAN, LSGAN, AND WGAN
The DCGAN imposes a list of architectural constraints on the CNN backbone network, ensuring that the network can be trained and that features learned from previously seen pictures can categorize new pictures. DCGAN enhances picture quality by adopting the following improvements. First, DCGAN replaces pooling layers with strided convolutions in the discriminator and fractionally-strided convolutions in the generator; CNN layers are used to extract features. Second, DCGAN employs Batch Normalization (BN) to address the vanishing gradient issue: BN stabilizes gradient propagation in each layer, ensuring that the gradient reaches every layer and preventing the generator from collapsing all samples to a single point. Third, DCGAN uses the Adam optimizer and distinct activation functions, such as ReLU and LeakyReLU, for the different networks. The findings reveal that DCGAN offers improved performance, and DCGAN is widely regarded as the baseline against which other GAN models are compared. The advantages of using LSGAN are as follows: (1) LSGAN improves the original GAN loss function by replacing the cross-entropy loss with a least-squares loss, which corrects two major problems of traditional GANs. (2) LSGAN improves the quality of the resulting images, stabilizes the training process, and speeds up convergence. The ReLU and LeakyReLU activations are used in the generator and discriminator just as in the traditional GAN. On the other hand, the disadvantage of LSGAN is that excessive penalties for outliers reduce sample diversity.
WGAN [59] solves the problem of training instability through its network design: the sigmoid function is removed from the discriminator's last layer, and the Wasserstein distance is used to effectively narrow the gap between the generated and real distributions while ensuring the diversity of the resulting images. The disadvantage of WGAN is the longer training time; moreover, with inappropriate weight clipping, gradients may vanish or explode [60].

B. EXPERIMENT SETTING
This study groups the experiments according to the total number of images used for training. The first set consists of 200 photos at 64 × 64 and 32 × 32 pixel dimensions; for each size, 1,000 synthetic images are then generated. The second set uses 100 images at 64 × 64 and 32 × 32 dimensions, again generating 1,000 synthetic images for each size. The last set uses 50 images at 64 × 64 and 32 × 32 dimensions, likewise producing 1,000 synthetic images per size. The image sizes were chosen to reflect the fact that traffic signs often constitute a minor fraction of the picture. Table 2 describes the experiment settings of the various GANs in our work.

IV. EXPERIMENTS AND RESULTS

A. DATA GENERATION RESULTS
The training environment for data generation by the various GANs was an Nvidia RTX 2070 Super GPU accelerator, an AMD Ryzen 7 3700X 8-core CPU, and 32 GB of DDR4-3200 RAM. Our method is implemented in Torch and TensorFlow [61]. The training settings are the same for DCGAN, LSGAN, and WGAN: the generative and discriminative networks are trained with the Adam optimizer [20] with β1 = 0.5, β2 = 0.999, and a learning rate of 0.0002.
The batch size is 32, the hyperparameter λ is set to 0.5, and layer normalization is used. Training runs for 2,000 iterations. The total numbers of images are 200, 100, and 50, and the image sizes are 64 × 64 and 32 × 32 for both input and output. Figure 2, Figure 3, and Figure 4 display synthetic traffic sign images generated by DCGAN, LSGAN, and WGAN at sizes (a) 32 × 32 and (b) 64 × 64. The synthetic pictures are quite authentic: it is hard to determine which picture is fake and which is genuine, and the photos appear crisp, natural, and realistic. The synthetic pictures created by the different GAN techniques will be combined with the actual pictures for training to improve the performance of the traffic sign recognition system.
Our research evaluated the data-generating capabilities of the several GANs by comparing the synthesized pictures to their matching actual photos. To determine the SSIM value, we produced a picture of the same type as the original picture and compared the two. SSIM accounts for luminance and contrast masking; however, the error computation relies on strong correlations between nearby pixels, and the metric is computed over small picture windows. Table 3 presents the complete SSIM and MSE calculations for the various GANs. Based on Table 3, the highest SSIM value was achieved when using 200 total images as input and a 32 × 32 image size, with LSGAN yielding the maximum SSIM and minimum MSE.

B. DATASET
The dataset is split 70% for training and 30% for testing, and the experiment with the Taiwan prohibitory signs was performed using original photos, synthesized pictures, and a mixture of the two.
We use 200 synthetic images from each GAN and mix them with the real images. The original images consist of 235 no entry images, 250 no stopping images, 185 speed limit images, and 230 no parking images. For the synthetic images, we use 100 images at 64 × 64 and 100 images at 32 × 32. Table 4 explains the Taiwan prohibitory signs in detail. In our experiment, Yolo V3 uses width = 416 and height = 416, while Yolo V4 uses width = 512 and height = 512, so the images are resized to 416 × 416 for Yolo V3 and 512 × 512 for Yolo V4.

C. YOLO V3 AND YOLO V4 TRAINING RESULTS
During the training stage, our work fine-tunes the Yolo V3 and Yolo V4 models using a learning rate of 0.001, a learning rate decay of 0.1 at each scheduled step, and a momentum of 0.9. To avoid over-fitting, we apply cross-validation and early stopping: 5-fold cross-validation is a conventional procedure for obtaining the out-of-sample prediction error, and early stopping limits the number of iterations that may be performed before the learner begins to overfit. This experiment applies max_batches = 8000 iterations, policy = steps, steps = 6400, 7200, scales = 0.1, 0.1, momentum = 0.949, decay = 0.0005, and mosaic = 1. Typically, an m-class object detector needs 2000 × m as the maximum number of batches, so the training process stops at 8000 iterations (2000 × 4 classes). Further, the scales (0.1, 0.1) are applied at the step iterations during training, so the learning rate becomes learning rate × scales[0] × scales[1] = 0.001 × 0.1 × 0.1 = 0.00001, and the learning rate value is updated on this schedule.
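A minimal sketch of this schedule (assuming the darknet-style "steps" policy described above; not the framework's actual code) shows how the learning rate decays across the two steps:

```python
# Base learning rate 0.001, steps at iterations 6400 and 7200,
# each step multiplying the rate by its scale of 0.1.
def stepped_lr(iteration, base_lr=0.001, steps=(6400, 7200), scales=(0.1, 0.1)):
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale  # each passed step shrinks the learning rate
    return lr

print(stepped_lr(1000))  # 0.001   (before any step)
print(stepped_lr(7000))  # ~0.0001 (after the first step)
print(stepped_lr(7900))  # ~1e-05  (after both steps: 0.001 x 0.1 x 0.1)
```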

$$\mathrm{IoU} = \frac{\mathrm{Area}_{\mathrm{pred}} \cap \mathrm{Area}_{\mathrm{gt}}}{\mathrm{Area}_{\mathrm{pred}} \cup \mathrm{Area}_{\mathrm{gt}}} \qquad (11)$$

The output examples can be classified into three types: true positives (TP), the number of correctly recognized samples; false positives (FP), the number of samples with incorrect identification; and false negatives (FN), the number of unrecognized (missed) samples. Precision and recall are defined in Equations (12)-(13) [64], [65]:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (12)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (13)$$
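As a small worked illustration of Equations (12)-(13) (the counts below are illustrative only, not results from the paper):

```python
def precision(tp, fp):
    return tp / (tp + fp)   # Eq. (12): correct detections / all detections

def recall(tp, fn):
    return tp / (tp + fn)   # Eq. (13): correct detections / all ground truths

print(precision(58, 2))  # e.g., 58 correct detections and 2 false alarms
print(recall(58, 4))     # e.g., 58 correct detections and 4 missed signs
```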
The loss function used during training is given in Equation (14):

$$\begin{aligned}
\mathrm{Loss} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\mathrm{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\mathrm{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\mathrm{obj}}_{ij} (c_i - \hat{c}_i)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I^{\mathrm{noobj}}_{ij} (c_i - \hat{c}_i)^2 \\
&+ \sum_{i=0}^{S^2} I^{\mathrm{obj}}_{i} \sum_{c \in \mathrm{classes}} (p_i(c) - \hat{p}_i(c))^2 \qquad (14)
\end{aligned}$$

where $I^{\mathrm{obj}}_{i}$ denotes whether an object appears in cell $i$, and $I^{\mathrm{obj}}_{ij}$ denotes that the $j$-th bounding box predictor in cell $i$ is responsible for the prediction. Next, $\hat{x}, \hat{y}, \hat{w}, \hat{h}, \hat{c}, \hat{p}$ express the predicted bounding box's center coordinates, width, height, confidence, and category probability; the true labels are the variables without the hat. Furthermore, our work sets $\lambda_{\mathrm{coord}}$ to 0.5, indicating that the width and height errors contribute less to the calculation, and $\lambda_{\mathrm{noobj}} = 0.5$ is used to mitigate the effect of the many grid cells devoid of objects on the loss value. Additionally, this study separated the models into four groups, each with its own dataset. The first group uses just the original pictures, whereas the second group uses both the original pictures and the DCGAN synthetic pictures. The third group combines the original images with the LSGAN synthetic images, and the last group combines them with the synthetic images generated by WGAN. The training performance results are shown in Table 5. The Group 3 (original images, LSGAN) dataset obtains the maximum mAP, around 99.98% with an IoU of 90.35% for Yolo V4, followed by Yolo V3 at 99.83% with an IoU of 73.11%. As a result, Table 5 demonstrates that combining authentic and synthetic pictures strengthens all models and increases the IoU and mAP percentages. This study used IoU to determine the extent to which the predicted boundary overlaps with the ground truth, the boundary of the actual object. Yolo V4 demonstrated superior mAP over Yolo V3 in almost all testing groups.

V. DISCUSSIONS
We use sixty prohibitory sign pictures in various sizes and environments to test Yolo V3 and Yolo V4. Table 6 presents the testing accuracy of the experiments using images beyond our datasets. Yolo V4 is generally more accurate: it improves on the accuracy of Yolo V3 in all groups except Group 1. The highest average accuracy is obtained by Group 3 (Original Image, LSGAN), with 89.33% for the Yolo V4 model, followed by Yolo V3 at 84.9%. In Group 3, there were seven detection errors with Yolo V3 but only two with Yolo V4. The second-highest accuracy is achieved by Group 4 (Original Image, WGAN), with 88.2% for Yolo V4 and 77.41% for Yolo V3; in Group 4, there were seven detection errors with Yolo V3 and two with Yolo V4. The worst group in our experiment was Group 1, which used only the original images. Group 3 shows the best accuracy because it combines the LSGAN synthetic images with the original images: as shown in Table 3, the synthetic images generated by LSGAN have the maximum SSIM value and the minimum MSE, which carries over to the recognition performance.
A big dataset containing both the original pictures and the synthetic pictures created by the several GANs improves the detection and recognition performance of both versions. Overfitting occurs when a neural network is trained with an inadequate dataset, and small datasets may also hamper a neural network's ability to locate the object. One technique that makes the input pictures simpler to learn is to apply noise or create synthetic images during preparation. The inclusion of noise during training improves the training phase and minimizes generalization errors. Therefore, combining original images and synthetic images in the dataset improves object recognition performance. Figure 6 shows the results of Group 3 (Original Image, LSGAN): every model identifies all classes properly, with varying bounding box coordinates and accuracies, and the best detection reaches 99% accuracy. Figure 7 presents the corresponding results for Group 4 (Original Image, WGAN).

VI. CONCLUSION
The major goal of this study is to explore the quality of synthetic pictures generated by DCGAN, LSGAN, and WGAN. Our work combines synthetic images with original images to enlarge datasets and verify the effectiveness of synthetic datasets, using different numbers and sizes of images for training. Our research investigates and analyzes CNN models for object identification paired with different backbone architectures and feature extractors, notably Yolo V3 and Yolo V4. This experiment investigates the detectors' primary characteristics, such as precision, accuracy, detection time, workspace size, and BFLOP numbers. Meanwhile, we develop a CNN-based solution for traffic sign classification and expand the CNN training set with the generated synthetic data to improve classification and identification outcomes. Yolo V4 is generally the most accurate model when using the original images together with the synthetic images produced by LSGAN. Our study demonstrates that training with a mixture of original and synthetic pictures improves traffic sign identification ability.
Based on our experimental results, we summarize as follows: (1) The best dataset in the experiment is Group 3, which combines the original images with the synthetic images produced by LSGAN. (2) The highest SSIM value was achieved when using 200 total images as input and a 32 × 32 image size. (3) After combining the original images with the synthesized images produced by LSGAN, the recognition performance improved, reaching an accuracy of 84.9% with Yolo V3 and 89.33% with Yolo V4. (4) The addition of noise during training benefits the training phase and minimizes generalization errors; hence, integrating the various original and synthesized images in the dataset improves object identification performance.
In the future, we want to combine synthetic images of different sizes for training; currently, only a total input of 200 images and 2,000 training iterations were used. Through a model trained on synthetic images of different sizes, we will understand how the size or quality of synthetic images affects the model. We will also compare against other traffic sign benchmarks to demonstrate the advantages of synthetic imagery. Future research will additionally try other detection methods combined with Explainable AI (XAI) and other GAN methods (BigGAN, StyleGAN, MGANs).