A Robust Deep Learning Approach for Automatic Iranian Vehicle License Plate Detection and Recognition for Surveillance Systems

Detecting vehicles' license plates and recognizing the characters inside them has always been a challenging task due to various conditions. These conditions include different weather and illumination, inevitable data acquisition noise, and other challenging scenarios, such as the demand for real-time performance in state-of-the-art Intelligent Transportation Systems (ITS) applications. This paper proposes a method for vehicle License Plate Detection (LPD) and Character Recognition (CR) as a unified application that delivers high accuracy and real-time performance. The system is designed for Iranian vehicle license plates, which are characterized by different resolutions and layouts, rare digit/character classes, various background colors, and different font sizes. In this regard, the system uses a separate fine-tuned You Only Look Once (YOLO) version 3 network for each of the mentioned phases and extracts Persian characters from input images in two automatic steps. For the training and testing stages, a wide range of vehicle images in both challenging and straightforward conditions has been collected from practical systems installed as surveillance applications. Experimental results show an end-to-end accuracy of 95.05% on 5719 images. The test data include both color and grayscale images of vehicles at different distances and shooting angles, with various brightness and resolution levels. Additionally, the system performs the LPD and CR tasks in an average of 119.73 milliseconds on real-life data, demonstrating real-time performance and practical applicability. The system is fully automated, and no pre-processing, calibration, or configuration procedures are needed.


I. INTRODUCTION
Nowadays, Intelligent Transportation Systems (ITS) are known as a vital means for municipalities and governments to apply modern traffic planning policies. ITS are mostly integrated with video cameras installed at different locations on roadways and in urban areas for surveillance and enforcement purposes [1]. Since the outputs of these cameras are video frames or sets of images, extracting valuable results from them using machine vision and Digital Image Processing (DIP) techniques has led to the development of numerous applications for ITS. Among these applications, Automatic License Plate Recognition (ALPR) is considered a major step [2]-[6]. ALPR refers to the process of recognizing the characters inside vehicles' license plates through machine vision and artificial intelligence methods. Some of the most common applications in which the license plate recognition process is essential are traffic monitoring systems [7]-[9], navigation and vehicle tracking [10], [11], toll payments and control systems in public/private parking areas [12], [13], and the identification of stolen vehicles [14], [15]. In this regard, the correct extraction of characters from license plates is essential for taking further actions such as enforcement and prosecution.

The associate editor coordinating the review of this manuscript and approving it for publication was Xian Sun.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

A typical ALPR system consists of three main modules to provide reasonable outcomes. These three steps, known as License Plate Detection (LPD), Character Segmentation (CS), and Optical Character Recognition (OCR), can be found in most of the approaches [2], [3], [5], [7], [14], [15].
In the first step of the pipeline, i.e., LPD, the candidate regions containing vehicles' license plates are detected and cropped for further processing. This phase plays a crucial role in the final performance of the system, because the misclassification of non-license-plate objects propagates errors to later stages. Thus, the ideal output of this step is a set of correctly detected and cropped license plate areas with no misclassified objects or loss of valuable data, e.g., characters. In the character segmentation stage, the system receives the extracted license plates from the LPD stage and produces a set of cropped areas containing the characters. Consequently, the main goal of CS is to detect the characters inside the license plate area and separate them from other portions of the license plate, such as a logo or blank fields. Finally, the third phase, i.e., OCR, is responsible for classifying and recognizing each of the segmented characters from the previous stage. The more characters the system recognizes correctly, the higher the total accuracy of the ALPR system will be.
There are numerous methods for each of the mentioned stages in an ALPR system. Some of the most widely used techniques for LPD are edge and contour detection [17]-[20], texture analysis [21], and color and geometric feature extraction [7], [22]. However, ALPR systems face performance deterioration in complex circumstances. Some of the main challenges include occlusion, various license plate layouts and languages, different sizes and width-to-height ratios of the license plates, noisy or dirty input images, and a variety of illumination and weather conditions [4]. Employing an appropriate method or algorithm for each of the mentioned steps to handle these challenges with the lowest possible side effects is a formidable task.
The layout of typical vehicle license plates in Iran comprises two sections. The leftmost section provides a unique string of digits and characters for the vehicle, which includes two digits on the left, an alphabetic character in the middle, and three digits on the right. The digits in this section range from 0 to 9, and there are 26 classes of characters for the middle part, eight of which are extremely rare (such as those for diplomatic, protocol, and service vehicles). In the rightmost section, there is a two-digit code in a smaller font referring to the city where the vehicle owner lives. Here, the digit ''0'' can only appear as the right digit of the license plate's rightmost section. Furthermore, depending on the type and usage of the vehicle, the background color of the license plates may vary. Fig. 1 shows some typical Iranian vehicle license plates. As can be seen, the license plates of free trade zone vehicles have a different width-to-height ratio and layout, containing only numbers along with their transliterated English values. It should be noted that the layout designed for motorcycles is completely different, and recognizing them is not the purpose of this manuscript. These variations make the LPD and CR processes for Iranian vehicle license plates a formidable challenge.
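To make the layout rules above concrete, the following sketch validates a decoded plate against them. The transliterated letter field and the regular expression are our own illustrative choices, not part of the paper's pipeline:

```python
import re

# Hypothetical validator for the typical Iranian plate layout described above:
# two digits, one letter (transliterated here, e.g. "B" or "Alef"), three
# digits, and a two-digit city code whose left digit cannot be "0".
PLATE_PATTERN = re.compile(r"^\d{2}[A-Z][a-z]*\d{3}[1-9]\d$")

def is_valid_plate(two_digits, letter, three_digits, city_code):
    """Check a decoded plate against the layout rules in the text."""
    return bool(PLATE_PATTERN.fullmatch(f"{two_digits}{letter}{three_digits}{city_code}"))
```

For example, `is_valid_plate("12", "B", "345", "68")` passes, while a city code of `"08"` is rejected because ''0'' may not appear as its left digit.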
The main goal of this paper is to introduce a robust method for ALPR, which presents very high accuracy and real-time performance on realistic data. The proposed method utilizes Deep Neural Networks (DNNs) and treats both the LPD and CR phases as classification tasks. In this regard, in the LPD stage, if one or multiple objects containing the visual features of Iranian vehicle license plates at various distances are found in a given image, they will be cropped and separated from the input data. Similarly, the recognition of characters and digits inside the cropped license plate areas is accomplished through a classification procedure. The proposed method uses the You Only Look Once (YOLO) [23] approach for each of the mentioned stages. The main contribution of the proposed method is implementing a real-time, end-to-end system for Iranian license plate detection and character recognition with high accuracy on realistic surveillance data.

The rest of the paper is organized as follows: Section II briefly reviews some background information and introduces several related approaches. In Section III, the main architecture of the proposed method is presented. Section IV provides the experimental results and discusses the major challenges. Finally, the paper is concluded in Section V.

II. BACKGROUND AND RELATED WORK
A. DEEP LEARNING APPROACHES IN MACHINE VISION
Deep architectures refer to the structure of DNNs, which contain hundreds or even thousands of hidden layers between the input and output layers. The hidden layers progressively extract valuable information from their previous layers by applying high-level filters [24]. In the fields of machine vision and object detection, Convolutional Neural Networks (CNNs) are the most popular branch of DNNs and have attracted huge interest among researchers in recent years [25]. These networks have proven to provide superb accuracy compared to other supervised learning models, such as Support Vector Machines (SVMs), in solving classic machine vision problems. By feeding data to CNNs, the relations between the discriminative features are learned, and the weights of the neurons are updated via the backpropagation algorithm [25]. The various types of layers in CNNs include convolution layers for high-level feature extraction (i.e., kernels), fully-connected layers for non-linear feature learning, and pooling layers for reducing the spatial dimensions of the input tensor.
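As a minimal illustration of two of the layer types discussed here, a "valid" 2-D convolution and a max-pooling step can be sketched in plain NumPy; this is a didactic toy, not the paper's implementation:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: slides a feature-extraction kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: reduces the spatial size of the tensor."""
    h, w = feature_map.shape
    return feature_map[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))
```

A fully-connected layer would then flatten the pooled map and multiply it by a learned weight matrix.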
Since the number of candidate regions produced by CNN-based detectors is large [25], other extensions of these networks have been introduced. The Region-based Convolutional Neural Network (R-CNN) [26] is a state-of-the-art approach that reduces the number of candidates by merging neighboring regions using the Selective Search greedy algorithm. The R-CNN approach has further optimized extensions, including Fast R-CNN [27] and Faster R-CNN [28]. In Fast R-CNN, the training process has been improved by indirect feeding of data along with employing a SoftMax probability layer and a Region of Interest (RoI) structure in the pooling layers. The process has been improved further in Faster R-CNN by utilizing a Region Proposal Network (RPN) instead of the Selective Search algorithm of R-CNN. Moreover, other deep learning-based approaches offer different solutions for classification and feature extraction goals. YOLO [23] is the most well-known approach, providing substantial performance and accuracy by segmenting the input image into equal-sized grid cells, each with a probability measure of containing a trained object. By looking at the whole image at once, this approach can achieve almost real-time performance, which is barely reachable with R-CNN approaches. Later versions, including YOLO v.2 [29] and YOLO v.3 [30], provide even better results in terms of both accuracy and performance. The second version can classify more than 9000 objects thanks to an 11-layered architecture trained on ImageNet and a 19-layered network for classification through the Common Objects in Context (COCO) dataset [31]. The newest version of YOLO, known as YOLO v.3, provides a trade-off between performance and accuracy. YOLO v.3 uses the Darknet framework and is pre-trained on the COCO dataset. Compared to the previous version of YOLO, the third version consists of 53 layers as the trained network and another 53 layers for object detection.
According to [30], the main contribution of YOLO v.3 was to increase accuracy rather than performance by redesigning the underlying Darknet. YOLO v.3 utilizes 53 Darknet layers trained on ImageNet along with another 53 object detection layers, which makes this 106-layered network stronger than previous versions. In addition, since YOLO v.3 down-samples the input image at three strides (i.e., 32, 16, and 8, respectively), the detection process takes place at three different scales. Accordingly, and based on our analysis, YOLO v.3 can be a great choice for ALPR systems, especially for the CR phase, where the objects inside the license plates are rather small.
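The three detection scales follow directly from the strides: for the 416 × 416 input used later in this paper, the grids are 13 × 13, 26 × 26, and 52 × 52, the last of which is what helps with small objects such as plate characters:

```python
# Grid sizes at YOLO v.3's three detection strides for a 416 x 416 input.
INPUT_SIZE = 416
STRIDES = (32, 16, 8)

grids = [INPUT_SIZE // s for s in STRIDES]
print(grids)  # [13, 26, 52]
```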

B. RELATED WORKS
Before the popularization of DNNs in object detection, an enormous body of research was devoted to traditional DIP methods. In this regard, since deep learning approaches provide excellent accuracy compared to traditional image processing methods, we have only focused on the first and third categories of ALPR systems. In [32], a new two-stage workflow is introduced, which eliminates the need for annotation and segmentation processes. This approach utilizes a hierarchical structure, where in the first step the license plate candidates are detected using a weak sparse network, and the validation takes place using a strong CNN. This model has a strong statistical foundation and can handle realistic data challenges; however, the implementation cost is increased due to the structure of the merged stages. The authors in [33] proposed a Multi-Task Convolutional Neural Network (MTCNN) with the advantages of high accuracy and low computational cost. Their network consists of three sub-networks known as P-Net for obtaining the license plate candidates, R-Net for training and fine-tuning the networks, and O-Net for localizing the license plate corners. Finally, the candidate(s) with the highest score of being a license plate are extracted using the Non-Maximum Suppression (NMS) algorithm. Zhang et al. presented a deep LPD method consisting of both CNNs and Recurrent Neural Networks (RNNs) [34]. Their method is also able to recognize the characteristics of the characters and predict their locations using a vertical anchor mechanism. Through this structure, they have obtained superb results in challenging scenarios such as illumination variation and blur. Several recent studies have utilized different versions of YOLO for the LPD and CR stages, as presented in [44] and [45].
A study in [35] presented a novel system for multinational license plate recognition with a three-layered architecture, including LPD, Unified Character Recognition (UCR), and layout detection. In this approach, the authors utilized a simplified YOLO v.3 for the LPD stage and another YOLO v.3 with a Spatial Pyramid Pooling (SPP) block for the CR stage. Another YOLO object detector for ALPR is presented in [36], which works as an online framework to process video frames for detecting and tracking targets. The authors in [37] proposed a method for Chinese vehicle license plate recognition, which employs a YOLO v.2 detector for the LPD stage and a Convolutional Recurrent Neural Network (CRNN) for the CR stage. Their recognition architecture consists of a CNN for context feature extraction and a two-layered Gated Recurrent Unit (GRU) for decoding the feature sequences. Hendry et al. [38] proposed a combination of YOLO and sliding-window processes. In their method, each character of Taiwan's vehicle license plates is detected by the sliding window, and the YOLO framework then classifies the window.
It has to be mentioned that although some of the presented ALPR approaches may have achieved great performance, they have not been evaluated under realistic circumstances, where complexity is inevitable. Most of the employed datasets for the training and testing phases include thousands of vehicle images captured in standard and controlled conditions with low acquisition noise and appropriate brightness and contrast. It should be emphasized that, to the best of our knowledge, no one has presented a deep learning approach for Iranian LPR that provides both superb accuracy and performance. The main reasons behind this claim are the variation of Iranian vehicle license plates and the lack of realistic training data with annotations and determined classes. These reasons motivated us to propose a novel approach for Iranian vehicle LPR with substantial accuracy and real-time performance, along with the ability to become a commercial product for real-condition surveillance. In contrast with most approaches, where there are inter-dependencies among the stages of the ALPR process, the proposed approach provides end-to-end recognition.
The authors of this paper have presented several published works in the field of deep learning-based ITS applications. In [16], we proposed a license plate localization method based on YOLO v.3, which is similar to the approach presented in this paper. Experimental results demonstrated the robustness of our approach with precision and recall values of 0.979 and 0.972. As another approach, we proposed a deep model for vehicle detection based on Faster R-CNN, which can detect vehicles by their visual features [39]. For that research, we employed the Cars Dataset provided by Stanford University for training and testing purposes. Additionally, in [40], we presented a review of different approaches for ALPR using deep neural networks. In the mentioned paper, we discussed a variety of methods and techniques with their corresponding pros and cons in providing a well-designed ALPR system with high accuracy and acceptable performance.

III. PROPOSED METHOD
In this section, the workflow for extracting the character sequence from vehicles' license plates is introduced. The proposed system is intended to be used as automatic commercial software for driveway security gates; thus, we have focused on training the system under such circumstances. The system consists of two deep object detection stages, both implemented with YOLO architectures. Since we need a real-time platform with high accuracy in LPD and CR, the third version of YOLO has been selected for both stages.
According to the descriptions presented in Section II, YOLO v.3 is beneficial to the proposed method, where the license plate localization process should be done in real time and most of the characters inside the license plates are small. Fig. 2 illustrates an overview of the proposed method. According to this figure, the input is an image, a series of images, or video frames, and the output is a sequence of characters referring to the content of the vehicles' license plates. As a vehicle image is fed to the system, some image enhancement methods are applied to improve the image quality and produce a new image that is better in some sense than the original one. After that, the LPD stage is triggered and searches for object(s) with the features of vehicle license plates. In this phase, the system utilizes a pre-trained YOLO v.3 deep architecture to detect all existing license plates. For CR, a second YOLO v.3 deep network is triggered to recognize the characters and digits inside the cropped license plates. As a result, the character segmentation and OCR processes are combined into one classification process for CR, which treats each character or digit as a unique object. If no license plates are detected in the LPD stage, the next modules, including license plate cropping and character recognition, are skipped, and the system waits for another input image.
To explain the diagram in more detail: in the first step, the input image is processed by means of image enhancement techniques. In this stage, a Histogram Equalization (HE) method is adopted to standardize the intensities of the input image by spreading out the most frequent ones, which results in improved contrast. Fig. 3 demonstrates the impact of applying the histogram equalization technique on some of the challenging images. Accordingly, images with dark and indistinguishable license plates have been improved.
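The histogram equalization step can be sketched as follows. This is a plain NumPy rendering of the classic 8-bit HE formula; a production system would typically call an optimized routine such as OpenCV's `cv2.equalizeHist`:

```python
import numpy as np

def equalize_histogram(gray):
    """Histogram equalization for an 8-bit grayscale image (NumPy sketch).

    Spreads out the most frequent intensities via the normalized CDF,
    improving contrast in dark plate regions. Assumes a non-constant image.
    """
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                    # first non-empty bin
    # Classic HE mapping: scale the CDF into the full 0-255 range.
    lut = np.clip(np.round((cdf - cdf_min) / (gray.size - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[gray]
```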
After improving the image quality, the second stage, i.e., deep learning-based license plate detection, is triggered. Fig. 4 shows the structure of the mentioned YOLO v.3 networks of the proposed approach. As can be seen, the input images of each deep network are resized into a 1:1 square image, because the input resolutions in YOLO v.3 can be 320 × 320, 416 × 416, and other multiples of 32. YOLO keeps the aspect ratio of the images; if the input image is not square, it automatically adds black bars to fit an image with a different aspect ratio into a square one. In the proposed method, we have used a resolution of 416 × 416 for the LPD network and 224 × 224 for the CR network. These values have been obtained empirically to acquire the best performance for the system. The deep architecture of YOLO performs a grid segmentation on the image, where the grid cells are equally sized at M × M. The prediction process needs two parameters: the number of bounding boxes N and a confidence score S for each bounding box, known as the ''objectness'', i.e., how likely the box is to contain an object. To measure the S parameter, we calculate the Intersection over Union (IoU) evaluation metric to compare the locations and dimensions of the ground truth and predicted bounding boxes. Equation (1) shows how IoU is calculated:

IoU(A, B) = |A ∩ B| / |A ∪ B|    (1)

where A and B are two bounding boxes. Higher values of IoU indicate that the two bounding boxes have, to a large extent, a similar dimension and location.
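Equation (1) translates directly into a few lines of code for axis-aligned boxes given in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # |A ∩ B|
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                  # |A ∪ B|
    return inter / union if union else 0.0
```

Identical boxes yield 1.0, disjoint boxes yield 0.0, and partially overlapping boxes fall in between.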

A. LICENSE PLATE DETECTION NETWORK
In the first stage, i.e., vehicle license plate detection, the application searches for the existing license plates by their specified visual features. In this regard, the input of the LPD network is a set of color or grayscale images containing vehicles. For better detection of license plates and a better IoU, we changed the last convolution layer of YOLO v.3 based on the one possible class of ''Iranian Vehicle License Plate''. Other parameters, including the confidence threshold and anchors, remained unchanged from the default configurations to achieve beneficial performance. As Fig. 1 illustrates, there are some specific features of Iranian license plates which are extracted by the deep architecture. These features include the width-to-height ratio (4.77:1 for typical license plates and 2:1 for free trade zone vehicles), the logo section with a blue background, the edges of the characters inside the license plate area, and other features extractable by high-pass filters. The system automatically ignores license plates with small or unusual resolutions using a pre-defined threshold. In this stage, there is only one class, i.e., the Iranian vehicle license plate, and all existing objects with the features that represent this class will be detected through a classification solution.
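The resolution/ratio filter described above can be sketched as a simple post-detection check. The threshold values below are illustrative assumptions, not the paper's actual settings:

```python
# Hypothetical post-filter matching the behavior described above: detections
# whose cropped plate is too small, or whose width-to-height ratio is
# implausible, are discarded before character recognition.
MIN_WIDTH, MIN_HEIGHT = 60, 14     # illustrative pixel thresholds
RATIO_RANGE = (1.5, 6.0)           # loosely covers the 2:1 and 4.77:1 layouts

def keep_plate(x1, y1, x2, y2):
    """Return True if a detected plate box is large and plate-shaped enough."""
    w, h = x2 - x1, y2 - y1
    if w < MIN_WIDTH or h < MIN_HEIGHT:
        return False
    return RATIO_RANGE[0] <= w / h <= RATIO_RANGE[1]
```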
After detecting the license plate candidates, they are cropped and stored as separate objects for further processing. By configuring and fine-tuning the LPD deep module according to the presented description, the region of interest for CR is limited to one or several objects. This flow of the architecture results in a huge reduction in the misclassification of characters in the next phase.

B. CHARACTER RECOGNITION NETWORK
The input of the second network is a set of Iranian vehicle license plate images resized to 224 × 224 for the recognition process. As in the LPD network, the system treats the CR phase as a separate classification problem. Accordingly, the segmentation and character recognition stages are combined in this architecture. The second network utilizes YOLO v.3 as well, due to the requirement of classifying small objects, i.e., digits and characters. In this stage, the network should classify 28 different classes, where ten of them are digits and the remaining eighteen are characters. As described in Section I, the font size of the two rightmost items is smaller than that of the other five digits in the leftmost section of Iranian vehicle license plates. Consequently, we have defined a threshold based on the two rightmost digits for the class resolutions. After recognizing all the candidate Persian characters/digits in a given license plate, they are sorted in descending order based on their confidence scores. In this regard, candidates with higher confidence scores have a greater chance of being selected as the recognized class. There is also a threshold θ_CS that leads to discarding candidates with confidence scores less than a pre-defined value. We have chosen 0.6 for θ_CS, which means that only candidate classes with confidence scores above 0.6 are acceptable; the remaining classes are kept in an ordered queue, named the rejection list. This process results in multiple predicted classes with a high probability of correct classification. If the number of final candidates with acceptable confidence scores is more than eight, i.e., the number of characters in a typical Iranian car license plate, only the first eight are chosen and concatenated to produce the final character sequence.
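The confidence filtering, rejection list, and eight-candidate cap described above can be sketched as follows. The final left-to-right reordering by box position is our assumption of how the character string is assembled, since the concatenation must follow the plate's spatial order:

```python
# Illustrative sketch of the CR post-processing: candidates at or below the
# confidence threshold go to the rejection list, the rest are capped at eight
# and ordered left to right to form the plate string.
THETA_CS = 0.6

def decode_plate(candidates):
    """candidates: list of (label, confidence, x_center) tuples."""
    accepted = [c for c in candidates if c[1] > THETA_CS]
    rejected = sorted((c for c in candidates if c[1] <= THETA_CS),
                      key=lambda c: c[1], reverse=True)     # rejection list
    # Keep the eight highest-confidence detections...
    accepted = sorted(accepted, key=lambda c: c[1], reverse=True)[:8]
    # ...then read them in spatial order to build the final sequence.
    sequence = "".join(label for label, _, _ in sorted(accepted, key=lambda c: c[2]))
    return sequence, rejected
```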

C. DATASETS
It has to be noted that the training data are color samples with various resolutions. However, we discuss the performance of the system on grayscale images in Section IV. The main reason for collecting color images is that, to the best of our knowledge, most commercial ALPR systems use color video cameras. For night-time license plate recognition, these systems need illumination sources to light the surveillance zone. In contrast, infrared cameras produce grayscale images that can provide a better representation of the scene in low-illumination conditions, such as at night and in the late afternoon.
Additionally, the training data are collected from realistic camera outputs at tollbooths and organization entrance gates. These wide-angle security cameras are typically mounted 1.5 to 2.5 meters above the ground and cover a range of 3 to 10 meters. Compared to the high-quality cameras installed on highways or at city junctions, where the coverage area can exceed 15 meters, the cameras used at entrance gates aim to capture images of the vehicle with a quality suitable for character recognition. We divide this section into two parts to introduce the datasets used for training and testing the system.

1) LICENSE PLATE DETECTION DATASETS
To train and evaluate the proposed method on realistic Iranian license plate images, we have utilized three different datasets covering various illumination, shadow, noise, and weather conditions. Table 1 illustrates the main characteristics of these datasets in detail. The first dataset was gathered from the Fashtam semi-industrial complex with the contribution of the Guilan Science and Technology Park (GSTP). Most of the images in this dataset were captured in normal light circumstances under different weather conditions. The second dataset was gathered from the Roudbar toll road under the supervision of the Iranian Traffic Police, and contains images of vehicles entering the toll collection zone. The image acquisition system on this toll road is a camera mounted on a two-meter pole covering a ten-meter distance from the tollbooth. Since these two datasets include all the possible characteristics and challenges of real-condition vehicle surveillance systems, we have utilized them for both the training and testing stages. Finally, to evaluate the performance of the proposed method on a wide range of real-condition data, we collected several images of vehicles with Iranian license plates. Some of these images were captured using mobile phone cameras, while others were gathered from realistic surveillance systems. In contrast to the two previous datasets, which are utilized for both training and testing purposes, the images of the third dataset are not used for training the model. It should be noted that the third dataset includes 1249 grayscale images to evaluate the performance of the system on the output of infrared cameras.
Accordingly, these instances are collected explicitly for training and evaluating of Iranian ALPR systems. The dataset consists of more than 15000 vehicle images, captured in realistic environmental conditions. In contrast with some of the large-scale Iranian vehicle license plate datasets [41]- [43], our datasets cover all types of Iranian vehicle license plates, even free trade zone license plate samples. On the other hand, these images are the output of real functional systems, which makes them more challenging compared to the standard conditions. Fig. 5 shows some instances of the mentioned datasets as well. As can be seen, different shooting angles and vehicle orientations, weather and illumination variations, harsh light and shadow conditions, various types of license plates, and both color and grayscale samples are included.
Since there were highly limited instances of some of the particular Iranian vehicle license plates, we ignored them when training the character recognition stage. Accordingly, eight out of the 26 license plate variations, belonging to armed forces (three variants), political (two variants), protocol, agricultural, and historic vehicles, were not taken into account for CR training. The remaining instances, including private cars (13 different characters), taxis, public vehicles, governmental vehicles, police vehicles, and private cars of people with disabilities, exist in these datasets.

2) CHARACTER RECOGNITION DATASET
For the character recognition stage, we have utilized a dataset containing all the digits and typical characters existing on Iranian vehicle license plates. To this end, we used all the cropped license plates produced by the LPD deep network. The characteristics of the CR dataset are presented in Table 2. As can be seen, the dataset contains ten digits, 17 characters, and one symbol for the private cars of people with disabilities. We have considered an equivalent label for each of the mentioned classes.
According to Table 2, there were only a limited number of samples for some of the character classes. These classes include ''0'' (only possible as the rightmost character of the license plate), ''Alef'' (governmental vehicles), ''M'', ''N'', ''Ayn'' (public vehicles), ''Taxi'', and ''Janbaz'' (private cars of people with disabilities). To provide more samples of these classes, we used data augmentation based on reference [46], as mentioned in Table 2. Each of the scarce character classes was fed to the designed augmentation module to artificially increase the size of the dataset. We applied Gaussian Blur, Contrast Normalization, Additive Gaussian Noise, Multiply, Rotation, Translation, Scaling, and Shear augmentation techniques to enlarge the scarce classes. Fig. 6 shows some of the augmented license plates. As can be seen, the output license plates cover a more extensive range of poses and states of the mentioned characters, which may occur in practice.
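As a sketch of the augmentation loop, two of the listed operations (Additive Gaussian Noise and Multiply) are easy to express in plain NumPy. The actual module follows [46] and also applies blur, contrast normalization, and affine transforms, so treat the parameters below as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Two illustrative operations from the list above: additive Gaussian
    noise and a random brightness multiply. Parameters are assumptions."""
    noisy = img.astype(float) + rng.normal(0.0, 10.0, img.shape)
    scaled = noisy * rng.uniform(0.8, 1.2)
    return np.clip(scaled, 0, 255).astype(np.uint8)

def enlarge_class(samples, target_size):
    """Repeat augmentation until a scarce class reaches target_size samples."""
    out = list(samples)
    while len(out) < target_size:
        out.append(augment(samples[len(out) % len(samples)]))
    return out
```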

D. TRAINING THE DEEP NETWORKS
To train the LPD deep network, we used 85 percent of the images from the Fashtam complex and Roudbar toll road datasets (4437 and 5416 instances, respectively). The training samples were selected randomly from each dataset, and the remaining 15 percent were not fed to the network. After gathering the data, a labeling process to determine the classes, i.e., Iranian vehicle license plates, was needed. We employed the YOLO BBox Annotation Tool [47] for the annotation process. First, the annotation/labeling stage was performed manually on a set of 1200 random instances of the training dataset, and the LPD network was trained on this small portion of the data. After generating a primitive LPD model, the remaining data were fed to the network to produce annotations automatically, and a validation check was applied to them. Finally, all of the training samples, including 9853 images, were fed to the LPD network to produce the final model. To train the network, we employed a personal computer with the specifications presented in Table 3. The whole system was implemented in Python 3.7, and we utilized CUDA, TensorFlow-GPU, and cuDNN for the implementation.
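The random 85/15 split described above can be sketched as follows; the fixed seed is an illustrative choice for reproducibility, not a value from the paper:

```python
import random

def split_dataset(samples, train_fraction=0.85, seed=42):
    """Random train/test split, as used for the Fashtam and Roudbar data."""
    shuffled = samples[:]                    # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```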
YOLO v.3 uses three YOLO layers, each with 255 outputs by default, making it able to cover the 80 classes of the COCO dataset. The relation between the number of outputs and the anchors of the deep network is shown in Equation (2):

outputs = (BC + OC + CC) × A    (2)

where BC refers to the box coordinates, OC refers to the object confidence, CC shows the class predictions, and A is the number of anchors. In this regard, the default configurations of YOLO v.3 are BC = 4, OC = 1, and CC = 80; by multiplying the sum of these values by the number of anchors, i.e., 3, the 255 outputs are obtained. For tuning the deep networks, since the number of classes in the proposed method is different, we configured these values based on our class counts to filter the output results. For LPD, there is only one class, so the output becomes (4 + 1 + 1) × 3, which is equal to 18. Similarly, in CR, there are 28 classes, which results in (4 + 1 + 28) × 3, or 99 outputs. The learning rate for the LPD phase was 1 × 10⁻⁴, and the number of epochs for this task was 100. However, the training stopped after 34 epochs, when there was no considerable improvement in the loss function (the loss did not improve beyond 1.65037).
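Equation (2) can be checked with a one-line helper; the three cases below reproduce the 255, 18, and 99 output counts mentioned in the text:

```python
def yolo_outputs(num_classes, box_coords=4, obj_conf=1, anchors=3):
    """Outputs per YOLO layer: (BC + OC + CC) * A, as in Equation (2)."""
    return (box_coords + obj_conf + num_classes) * anchors

print(yolo_outputs(80), yolo_outputs(1), yolo_outputs(28))  # 255 18 99
```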
The training process was similar to that of the LPD network and was performed on the same hardware. The number of classes in this phase was 28. Thus, we first selected at least 100 license plate samples containing each of the classes for manual annotation. Similar to the previous stage, we recursively performed the automatic annotation phase along with manual validation. Finally, the dataset presented in Table 2 was produced and fed to the second deep network for training. In this stage, the training process for CR converged in 45 epochs, after which the loss did not improve beyond 2.46. Fig. 7 shows some samples of the manual/automatic labeling outputs.

IV. PERFORMANCE EVALUATION
To validate the effectiveness of the proposed method in both Iranian vehicle license plate detection and Persian character recognition, we employed the datasets described in the previous section. For LPD, as noted before, 85 percent of the Fashtam complex and Roudbar toll road dataset samples were used for training, and the remaining 15 percent were reserved for evaluation. This portion comprises 1737 images, of which 782 instances are from the Fashtam complex and 955 belong to the Roudbar toll road dataset. The third dataset was gathered by the authors of this paper for evaluation purposes only and did not participate in the training phase. For CR, the same train/test split was applied, and 15 percent of the data introduced in Table 2 were used for evaluation.
To evaluate the performance of the system, we used the Precision and Recall measures. Precision is calculated by dividing the number of correctly classified samples (license plates in LPD and characters/digits in CR) by the total number of classified instances. Similarly, Recall is calculated by dividing the number of correctly classified samples by the total number of ground-truth instances. Precision and Recall are mathematically defined as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively.
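The two measures translate directly into code (a minimal sketch with illustrative counts, not the paper's actual confusion values):

```python
def precision(tp, fp):
    """Fraction of predicted detections that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth instances that were found."""
    return tp / (tp + fn)

# Illustrative example: 95 correct detections, 5 false alarms, 2 misses
print(precision(95, 5))  # 0.95
print(recall(95, 2))
```

Note that a detector can trade one measure for the other; reporting both gives a fuller picture than either alone.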

A. LICENSE PLATE DETECTION
The performance of the LPD network was evaluated on the proposed datasets. In this experiment, we checked the detected bounding boxes referring to the license plates against the coordinates available in the ground-truth labels. Table 4 shows the performance of the system for Iranian vehicle license plate detection on the different datasets. As the evaluation criterion, the Intersection over Union (IoU) introduced in Equation (1) was used to measure the area of overlap between the ground-truth labels and the generated bounding boxes. According to Table 4, the precision value is considerably higher than that of the practical LPD systems introduced in the previous section.
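The IoU criterion used above can be sketched as follows for axis-aligned boxes given as corner coordinates (a standard formulation, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap at all
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # 2500 / 17500
```

A detection is usually counted as a true positive only when its IoU with a ground-truth box exceeds a fixed threshold (commonly 0.5).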
Additionally, Fig. 8 shows some sample outputs of the LPD stage. The second row of the figure presents grayscale images gathered from highway cameras, in which the system was able to detect the license plates as well, even though there were no grayscale images in the training set. This shows that the proposed LPD module can also work on the outputs of infrared cameras.

B. PERSIAN CHARACTER RECOGNITION
Table 5 summarizes the performance of the CR deep network. The experiment was performed on the 15 percent of the data held out from the Persian characters dataset introduced in Table 2. According to this table, the precision values for Persian digits and characters are nearly equal, while the recall for digits is better than that for characters. The results indicate that the CR deep classifier can correctly recognize the characters in most cases.
Additionally, Fig. 9 shows some sample outputs of the character recognition stage. As can be seen, the system can effectively find the characters and numbers across the variations of Iranian vehicle license plates. The shooting angle and resolution of the license plate images did not affect the accuracy to a large extent. It is worth noting that for the license plates of free trade zone vehicles, only the Persian digits have been classified; their translated English sequence is ignored due to a lack of training data.

C. END-TO-END RECOGNITION ACCURACY
This experiment indicates the performance of the proposed system for commercial usage. The system should correctly detect and crop the existing license plates and then extract the correct text from them. In this regard, if the LPD stage fails to detect all the existing license plates, or the CR deep network fails to correctly classify all the characters and digits compared to the ground-truth classes, the end-to-end accuracy degrades. For evaluation, the LPD-stage test data presented in Table 4 were fed to the system, and the number of correctly extracted character sequences was calculated. Table 6 illustrates the end-to-end experiment results in detail. The test data contain vehicle images captured under different illumination and brightness conditions, at various resolutions, and from different shooting angles. Since the dataset contains images of different sizes, we argue that our method is not dependent on image resolution and can cope with low-quality, small images to a large extent.
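Under this protocol, a prediction counts only when the full extracted character sequence matches the ground truth exactly; a single wrong digit fails the whole plate. A minimal sketch of that metric (function name and sample sequences are ours):

```python
def end_to_end_accuracy(predicted, ground_truth):
    """Fraction of plates whose full character sequence matches exactly."""
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# Hypothetical sequences: the third plate has one wrong digit, so it fails
preds = ["12B345", "67C890", "11A222"]
truth = ["12B345", "67C890", "11A999"]
print(end_to_end_accuracy(preds, truth))  # 2 of 3 plates correct
```

This all-or-nothing criterion is stricter than per-character accuracy, which is why end-to-end numbers are typically lower than the individual LPD and CR scores.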

D. EXECUTION TIME
To demonstrate the real-time performance of the proposed method, we measured the time distribution over a random sample of 8000 images from the introduced datasets. Fig. 10 shows the evaluation results in detail. Accordingly, the license plates of most vehicles were detected within 50 milliseconds, while the characters and digits were recognized within 80 milliseconds in most cases. These execution times were recorded automatically at the point where the outputs of each deep network were produced. In this regard, the LPD network requires an average of 54.18 milliseconds to detect the license plates inside an image; the number of license plates present, the resolution of the input image, and challenging factors like low illumination have significant impacts on this stage. Furthermore, an average of 73.78 milliseconds is needed for the CR network to classify all the characters inside a given license plate image. The end-to-end average time to extract a character sequence from a given vehicle image is 119.73 milliseconds, which demonstrates the real-time performance of the proposed method.
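Recording per-stage latency at the point each network produces its output can be sketched as below; the stage stubs are hypothetical placeholders standing in for the two networks, not the paper's implementation:

```python
import time

def timed(fn, *args):
    """Run a pipeline stage and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

# Hypothetical stage stubs; in the real system these invoke the deep networks
def detect_plates(image):
    return ["plate_crop"]

def recognize_characters(crop):
    return "12B345"

crops, lpd_ms = timed(detect_plates, "frame.jpg")
text, cr_ms = timed(recognize_characters, crops[0])
print(f"end-to-end: {lpd_ms + cr_ms:.2f} ms")
```

Summing the two stage times per image and averaging over the sample yields the end-to-end figure reported above.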

E. COMPARISON OF OUR FINE-TUNED DNN WITH OTHER DNNS
In this experiment, we analyzed the speed and accuracy of the proposed two-stage method when different deep learning-based object detectors are employed. To this end, we substituted several well-known deep networks, including Fast R-CNN, Faster R-CNN, YOLO, and YOLO v.2, for each of the proposed fine-tuned YOLO v.3 architectures in both the LPD and CR phases. The experiment was an end-to-end recognition analysis, and the test data were the same as those introduced in Section IV, part C. Table 7 reports the accuracy and speed of the proposed method compared to the other deep learning architectures. Accordingly, the proposed method provides the best end-to-end recognition accuracy, owing to its 106-layer structure, and processes the input image in real time, with an average end-to-end recognition time of under 120 milliseconds. With YOLO v.2, the execution time decreases somewhat, but the accuracy decreases as well. Given the trade-off between real-time performance and high accuracy, choosing YOLO v.3, as employed in the proposed method, seems reasonable.

F. DISCUSSION
Although the proposed method provides significant accuracy and performance for the task of Iranian vehicle license plate detection, some challenging scenarios lead to failures in localizing the license plates or to misclassification of the characters. The system failures are due to the low variety of training data in some cases, difficulties in tuning the network, and the impacts of brightness and illumination. In this regard, Fig. 11 shows some of the images in which the LPD deep network failed to localize the license plates. As can be seen in sample A, if the image acquisition stage provides only a partially visible license plate, the LPD network fails to detect it. Additionally, intense illumination along with the reflection of sunlight from the license plate negatively affects the performance in both color (samples D and F) and grayscale (sample B) images and results in failure or misclassification of the license plates.
In night-time situations, low illumination (sample E) may reduce the performance of this stage as well, and the images may even contain blurring artifacts (sample C) caused by the beams of the vehicles' headlights. Furthermore, the proposed method failed to localize the license plate in sample G because the border of the license plate is occluded by snow. Finally, sample H shows a fake license plate whose appearance differs from that of a real one.
In the Persian character recognition stage, on the other hand, most of the misclassifications were due to the low resolution of the input cropped license plate, which leads to failures in recognizing the characters. As mentioned in Section I, the accuracy of the LPD stage directly impacts the CR model's precision. Thus, if the cropped license plates belong to vehicles far from the camera, the inner characters may not be legible even when the LPD model detects the license plate. Fig. 12-A shows an instance of a small license plate with unreadable inner text, which leads to a CR output failure. The second important factor for correct classification in this stage is the similarity among some of the individual digits and characters of the Persian language. In our experiments, the digits ''۲'' (labeled as ''2'') and ''۳'' (labeled as ''3'') are misclassified as each other in some cases, especially when there is a blurring effect on the license plate. Similarly, the characters ''س'' (labeled as ''Sin'') and ''ص'' (labeled as ''Sad'') face the same challenge. Another problematic case is ''۰'' (labeled as ''0''), which is very tiny. This digit could not be recognized in some cases, or it could be mistaken for the screws used to attach the license plate to the bumper. These challenging scenarios are illustrated in Fig. 12-B.
We believe most of the mentioned challenges can be handled by data augmentation and dataset clean-up. By widening the range of correctly labeled training samples, i.e., reviewing the labels and providing even more variations of vehicle images and cropped license plates, an increase in accuracy for both the LPD and CR models is expected. Additionally, by re-tuning the deep models, adjusting or resetting the existing hyperparameters, and applying techniques like learning rate decay, we can mitigate overfitting as well.
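As one concrete re-tuning option, an exponential learning-rate decay schedule can be sketched as follows (the starting value matches the LPD rate of 1 × 10^-4 used above; the 5 percent per-epoch decay rate is an illustrative assumption, not a value from the paper):

```python
def exponential_lr(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs with per-epoch exponential decay."""
    return initial_lr * (decay_rate ** epoch)

# Starting from 1e-4 and shrinking 5 percent per epoch
for e in (0, 10, 34):
    print(f"epoch {e}: lr = {exponential_lr(1e-4, 0.95, e):.2e}")
```

Shrinking the step size late in training lets the optimizer settle into a minimum instead of oscillating around it, which is one way such a schedule can reduce overfitting of the final model.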

V. CONCLUSION
This paper proposed a unified License Plate Detection (LPD) and Character Recognition (CR) system for Iranian vehicle license plates that offers both real-time performance and high accuracy. Our method utilizes two sequential YOLO v.3 deep networks for this purpose. We trained the system with realistic data acquired under various weather, noise, and illumination conditions. These data were collected from operational roadway security cameras installed for surveillance purposes. Accordingly, a vast range of images captured under both challenging and straightforward realistic conditions was covered in the training process.
The proposed approach was tested on separate realistic data, yielding accurate results for the LPD stage, while the precision and recall measures for the Persian character recognition stage were 0.979 and 0.991, respectively. Furthermore, the system works in real time, with an average processing time of 119.73 milliseconds per image or video frame to extract the license plate character sequence.