License Plate Detection Using Convolutional Neural Network–Back to the Basic With Design of Experiments

Automatic License Plate Recognition (ALPR) is one of the applications that has benefited hugely from Convolutional Neural Network (CNN) processing, which has become the mainstream processing method for complex data. Many ALPR studies have proposed new CNN model designs and post-processing methods with various levels of performance. However, models that perform well in more general object detection and recognition tasks, such as YOLOv3 and SSD, could be effectively transferred to the license plate detection application with a small effort in model tuning. This paper focuses on the design of experiments (DOE) of training parameters in transferring the YOLOv3 model design and optimising the training specifically for license plate detection tasks. The parameters are categorised to reduce the DOE run requirements while gaining insights into the YOLOv3 parameter interactions, in addition to seeking optimised training settings. The results show that the DOE effectively improves the YOLOv3 model to fit the vehicle license plate detection task.


I. INTRODUCTION
Automatic License Plate Recognition (ALPR) has been an active field of research in computer vision applications. With the emergence of Machine Learning (ML) methods, specifically the Convolutional Neural Network (CNN), ALPR has become much more robust and reliable than with traditional hard-coded image processing techniques. ALPR applications have benefited from recent innovations and research focused on real-time CNN inferencing, such as the YOLOv3 [1] and SSD [2] methods. Meanwhile, much ALPR research focuses on custom CNN models or post-processing methods to tackle different ALPR problems arising from the format or geo-specific conditions of the license plate (LP). ALPR can be divided into LP detection and character recognition, each with its own implementation challenges. In the character recognition task, the characters on the LP could be in different languages, such as Arabic and Chinese characters, or in other formats such as italic or non-standard fonts. Meanwhile, for LP detection (also known as localisation), vehicle LPs could have different sizes, shapes, orientations, conditions, and colours. The detection process is similar to that of the You-Only-Look-Once (YOLO) or Single-Shot Detector (SSD) algorithms, which improve CNN localisation performance and could be effectively transferred to the ALPR task with some effort on data engineering. A complete ALPR system relies on both LP detection and character recognition, although some works focus only on the LP detection stage. Newer research attempts to eliminate the traditional cascaded processing, i.e. LP detection followed by character recognition. A CNN model that achieves one-pass end-to-end LP detection and character recognition is favourable for real-time processing.
There are several performance metrics in the ALPR task. Average precision (AP) is the primary metric of interest for bounding box regression in LP detection, i.e. localising the LP. Character recognition performance is based on the usual accuracy or recall metrics, obtained by comparing predictions against the ground truth LP labels. The algorithm's execution time is another comparable performance metric for real-time processing, but it is highly dependent on the computing hardware. Many ALPR techniques cannot be compared on a precise like-for-like basis because of geo-specific LP datasets, and the variety of proposed CNN models and processing techniques further limits the comparable metrics.
Despite the major innovations in ML models specifically for the ALPR task, no one has attempted to investigate the effect of the ML model training parameters, which could play a significant role in the ALPR performance itself. This article intends to bring ALPR research back to the basics with Design of Experiments (DOE) by understanding the training parameters' correlations and optimising the training through the DOE process. The YOLOv3 algorithm is utilised, and its training parameters are studied for the LP detection task without modifying the backbone CNN model. It is shown that the DOE effectively tunes the YOLOv3 algorithm to fit the LP detection task across a wide variety of LP conditions.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wang.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORKS

A. TRANSITION OF ALPR TO DEEP LEARNING ALGORITHM
Early ALPR systems mostly relied on hand-crafted algorithms. Image processing techniques such as edge detection [3] and coefficient correlation [4] were common, as well as ML algorithms such as k-nearest neighbour [5], the sparse autoencoder method [6], support vector machine (SVM) and artificial neural network (ANN) [7]. The paradigm shifted when a subset of ML algorithms, i.e. CNN deep learning, started performing well and CNN computation became more viable. Unlike ANN, CNN can process multi-dimensional data such as images. Initial findings from [6] concluded that the accuracy would improve with more training data, since CNN is a data-driven algorithm that differs from traditional hand-crafted coding. Then, [8] attempted to recognise LPs and their characters with a single CNN by retraining AlexNet, which was one of the most popular CNN algorithms in 2017. They trained the CNN with custom cropped images of car LPs and achieved 95.24% accuracy. Another similar study also used CNN-based character classification to replace traditional OCR, proving that CNN can classify characters from blurry LP images [9]. Vehicle ALPR research in [10] deployed deep learning algorithms extensively, with CNN and long short-term memory (LSTM). CNN was used to extract features, then LSTM was trained to process the features to recognise the characters. They could discriminate both private and public car plates of different colours and recognise the characters. Similarly, [11] used simpler parallel CNNs to identify the nature of the car plate, such as type, dimensions and colour, then used LSTM to recognise the car plate characters, achieving 99.8% precision. Reference [12] combined edge detection and CNN as a hybrid processing pipeline to enhance ALPR performance.
It was realised that although CNN is superior in performance, the cost of computing is prohibitive in a real-world scenario. Image sizes are limited to the CNN's input size, and its classification performance relies on cropped image batches. However, this does not translate into the ability to localise and identify car plates in a large image area, which is the more practical use case in ALPR applications. Thus, [13] tried to speed up the recognition process with regional CNN (R-CNN) but achieved a precision of only 0.4 out of 1 on a single large image, primarily due to R-CNN's limitations. Newer research from [14] showed that a much better variation of R-CNN, called Mask R-CNN, is capable of a complete ALPR task at a comparable 98% precision and recall.
Data processing pipeline unification has been done by [15] to fully utilise CNN to detect and recognise the LP characters, bypassing any unnecessary architectures for different tasks but showing that CNN favours detection tasks but not character recognition. Another research exploited big data (about 250k images), namely Chinese City Parking Dataset (CCPD) with one pass CNN much like SSD to recognise and localise the LP. It is proven to be effective and robust for various environments (blurry, angled, tilted LP), avoiding recurrent CNN computation like R-CNN, which is the reason for the high computing cost for CNN inferencing [16].
The YOLO algorithm has been of interest for ALPR in recent years. It was introduced by [17] in 2015 and achieved one-pass CNN object classification and localisation. Further revisions of YOLO improved the detection capability and speed. The first use of a YOLO CNN was attempted by [18] to detect LPs of vastly different plate orientations, yielding a 99.5% F1-score. The YOLOv2 algorithm with a modified ResNet50 CNN was proposed by [19] to localise and detect the nature of multi-national LPs (country, size, and languages, without recognising the characters on the LPs), achieving 99.57% detection precision. Reference [20] also used YOLOv2 because they claimed YOLOv3 has more layers that slow down the training, which is not entirely true as it depends on which CNN model is utilised in the YOLO workflow. They only compared the metric of motorcyclist LPs of riders with or without helmets, in which the comparison might not be significant. Nevertheless, they also achieved a 95 to 97.5% precision score with the YOLOv2 algorithm. Similarly, [19] extended the dataset for multi-national and multi-language LPs, reaching 99.57% AP in LP detection. Reference [20] further enhanced datasets by synthesising LPs to overcome the small dataset size and trained a custom CNN model ported to Fast-YOLO to perform ALPR. Reference [21] utilised YOLOv3 for both the LP detection and recognition stages with 95-97% accuracy. Overall, the YOLO-based algorithm is very promising for being repurposed for LP detection. Another work from [22] is very similar to YOLO, but it used a branching method at the end of the CNN to detect LPs.
Besides the YOLO algorithm, CNN could also do image segmentation with up-sampling layers, giving [23] an idea of using an entirely semantic segmentation method to detect and recognise the LP. Their work is specifically on Arabic LP, so it is hard to compare to the CCPD dataset. Reference [23] also uses the CNN segmentation method to extract features and perform complete ALPR with parallel CNNs. Some researchers customised CNN for a particular function instead of a full ALPR task. Reference [24] used CNN to predict the originated states of the vehicle by LPs. R-CNN was also used for customised ALPR, such as detecting LP on non-motor illegal vehicles [25].
The importance of clean data in the CNN application could not be ignored when [11] combined traditional image processing techniques to filter out unnecessary noises and used CNN at the final stage of car plate recognition, achieving 99.6% accuracy. With that acknowledgement, [26] identified that rain streaks might be one of the big problems of ALPR in a real environment. Thus, they first pre-processed images of noisy rain streak with dictionary learning, then only processed the vehicle LP with CNN.

B. THE NATURE OF DATA-DRIVEN ALPR
Deep learning is one of the data-driven programming approaches. Instead of a hard-coded feature extraction algorithm, feeding as much data as possible will ''code'' the necessary feature map. Thus, the reliance on big data is one of the key factors for ML implementation. CNN model architecture plays a vital role in ML, but it heavily depends on the methods or preference of the data processing pipeline.
The ALPR approach could be classified into one-staged and two-staged processes. The two-staged process is more straightforward because the data classes are separated, i.e. the LP itself and its characters, so that the coding can be optimised for each data processing pipeline. This approach is especially common in traditional image processing, where images are considered complex data. A two-staged process usually detects and crops the region of interest (ROI) around the LP to eliminate any unwanted background details, and only then attempts to recognise the characters on the cropped LP images with optical character recognition (OCR). OCR could be based on hard-coded algorithms such as connected component analysis, local binary pattern, temporal matching [27], or CNN-based classification. However, the image processing pipeline of ALPR has been shifting to a one-staged process with the emergence of ML, extracting both the LP and its characters in one pass. One-pass processing is possible with some innovations in CNN model designs and post-processing techniques. In the one-staged process, the CNN model mostly acts only as a feature extractor that preserves the spatial information of the object of interest, i.e. the LP and its characters in ALPR. The classification and bounding box localisation are passed to other post-processing techniques such as non-max suppression (NMS) and intersection over union (IoU) to compute the confidence level and the ROI of the object class within an image.
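The IoU and NMS post-processing steps mentioned above can be illustrated with a minimal sketch. It assumes boxes in corner format (x1, y1, x2, y2) and a greedy suppression rule with an illustrative threshold of 0.5; the function names and the threshold are assumptions for illustration and are not taken from the YOLOv3 implementation used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: returns the indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With two heavily overlapping candidate boxes around the same LP, only the higher-scoring one survives, while a box around a second, distant LP is kept.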

C. THE CHALLENGES OF MALAYSIAN LP
Several ALPR works on Malaysian LPs exist, but none are up to the global trend of ML-based ALPR. There are some unique challenges to implementing ALPR on Malaysian LPs. The first is the availability of the dataset, because there are no known open-source LP images for Malaysian vehicles. The LP images are confidential or owned by specific authorities, and cannot be accessed easily and openly. The second is the inconsistency of the LP format. Many on-road Malaysian LP characters could have different fonts, spacing and placements, even with non-standard stickers or labels, violating the official LP guideline. There also exist many valid LPs with unique characters such as ''XXIV'', ''SUKOM'', ''1M4U'' and ''Putrajaya''. With the increasing number of new on-road vehicles, newer LPs also place letters after the numbers. These varying LP standards render most overseas ALPR techniques inapplicable, because foreign LPs have a fixed number of characters and fixed spacing, which suits processing with the character segmentation method.

III. METHODOLOGY
Many ALPR algorithms had been proposed in previous research, but they hardly discussed the relationships and the reasoning of the related training parameters. In this work, multi-level (2-level and 3-level) factorial DOE would be utilised to study the YOLOv3 training parameters' interactions and optimise the LP detection performance for stage one of ALPR, i.e. LP detection only. Self-prepared Malaysia's vehicle LP dataset will be used since this research has to tackle ALPR problems on Malaysian vehicles specifically. Stage two of ALPR, i.e. character recognition, will not be part of the research for the time being because LP labels are geo-specific and highly dependent on dataset labelling and algorithms.

A. MALAYSIAN VEHICLE LP DATASET
The datasets were obtained by several methods. One is taking photos on the federal highway, each of which contains multiple vehicles in a single 32MP image taken with a DSLR telephoto lens. Some photos were taken with hand-held cameras or smartphone cameras; thus, the photos have mixed sources of sensor noise and quality. Images with clear local LP labels were also downloaded from social websites and some local car auction websites. Some of the downloaded images had watermarks, which were eliminated by manually cropping the ROI of the images. Overall, a total of 10k images were manually collected, processed and labelled. Only the LP spatial locations are labelled as (x,y,w,h), where 'x' and 'y' are the horizontal and vertical locations of the bounding box centre, and 'w' and 'h' are the width and height of the LP bounding box, respectively, as illustrated in Figure 1. The LP characters are simply recorded, and no character-wise bounding box labelling was done. The (x,y,w,h) values are ratios relative to the image size so that the images can be resized without affecting the relative location of the bounding boxes. The relative values can be converted to pixel values before feeding the labels to the YOLOv3 algorithm.
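The conversion from relative labels to pixel values can be sketched as follows. This is an illustrative helper only: the function name is hypothetical, and the output layout (top-left corner plus width and height, in pixels) is an assumption about what the downstream pipeline expects.

```python
def to_pixel_box(x, y, w, h, img_w, img_h):
    """Convert a relative centre-format label (x, y, w, h) to pixel units.

    Returns (left, top, width, height) in pixels. The output layout is an
    assumption for illustration, not the format confirmed by the paper.
    """
    box_w = w * img_w
    box_h = h * img_h
    left = x * img_w - box_w / 2  # shift from box centre to left edge
    top = y * img_h - box_h / 2   # shift from box centre to top edge
    return (round(left), round(top), round(box_w), round(box_h))
```

Because the labels are stored as ratios, the same label can be reused after resizing an image by passing the new image width and height.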
The cropped images of Malaysian LPs are approximately square due to manual cropping, and the height and width of the cropped images are slightly inconsistent. Meanwhile, the YOLOv3 pre-processing is designed to feed images with a video-format aspect ratio, where the image width is bigger than its height. The Malaysian LP images have to be resized to fit into the existing process pipeline, but that would result in feature loss. Thus, the square images were resized minimally so that the image height is slightly smaller than its width, minimising feature loss while still allowing the images to be fed to the YOLOv3 algorithm.

B. THE YOLOv3 PARAMETERS
YOLOv3 is a state-of-the-art real-time object recognition algorithm that can accept various CNN model designs as the backbone feature extractor (with a few CNN backend requirements). The principal technique of YOLOv3 lies in the backend feature pyramid network (FPN) [28] and the anchor box layer, which act as a post-processing pipeline to retrieve spatial information and class confidence from the extracted features. The FPN extracts the spatial information of the convolved feature maps at different scales for multiscale object detection, and the output consists of anchor box features. The output is post-processed with IoU and NMS operations to resolve object bounding boxes. The backbone CNN model is ''SqueezeNet'', originally proposed by [29], and it remains untouched to isolate the parameters specific to the CNN model. The overall structure of the ''SqueezeNet''-based YOLOv3 algorithm is illustrated in Figure 2. YOLOv3 has a few training parameters with default values that can be adjusted in the MATLAB native code, as shown in Table 1. Some parameter values are limited to the original example datasets and are not strictly tied to the YOLOv3 algorithm.

C. MULTI-LEVEL FACTORIAL DOE
The purpose of the DOE is to study the correlation of the training parameters and their effects on the YOLOv3 LP detection performance. One consideration of DOE on CNN is that a complete epoch of CNN training could take minutes or even hours, depending on the complexity of the CNN algorithm and the computing hardware. The number of runs increases exponentially with the number of factors. There are nine possible parameters for the YOLOv3 training, which contribute nine factors. Referring to (1), a 2-level full factorial DOE would require 512 runs, or 19683 runs in the case of a 3-level full factorial, which is impractical. Thus, reducing either the number of levels or the number of factors is necessary.
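The run counts quoted above follow from the standard full factorial relation, in which the number of runs equals the number of levels raised to the power of the number of factors, multiplied by the number of replicates. Equation (1) itself is not reproduced in this text, so the sketch below assumes that standard form; the function name is hypothetical.

```python
def full_factorial_runs(levels, factors, replicates=1):
    """Number of runs in a full factorial design: replicates * levels**factors."""
    return replicates * levels ** factors


# Nine YOLOv3 training parameters treated as nine factors.
print(full_factorial_runs(2, 9))  # 2-level full factorial
print(full_factorial_runs(3, 9))  # 3-level full factorial
# Three factors at a time, as in the partitioned DOEs.
print(full_factorial_runs(3, 3))
```

Reducing the design from nine factors to three factors per experiment is what brings a 3-level full factorial down to a practical 27 runs.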
Several research papers performed DOE on tuning CNN model parameters, albeit not specifically on ALPR tasks. Reference [30] utilised a new class of 3-level definitive screening design (DSD) proposed by [31] to tackle many factors to identify significant main effects while estimating some of the interaction effects. Standard 2-level fractional factorial design is also used by [32] to reduce the number of runs to optimise the CNN model. However, it is unknown whether the residual data point is normally distributed given such a small 16 runs on seven factors for the analysis of variance (ANOVA) to be valid.
Fortunately, the training parameters could be partitioned into two categories, data-specific and training-specific, as shown in Table 2. Data-specific variables change with data and label size, whilst training-specific variables manipulate the CNN training behaviours. Partitioning the parameters for DOE would reduce the number of runs considerably. Instead of nine factors at once, it would be more practical to isolate a few factors at a time.
The DOEs are an iterative process. The first experiment is executed on the data-specific parameters since there are only three such variables. Then, the best performing data-specific settings are carried over to the subsequent experiments to eliminate the data-specific factors. Even so, there are six training-specific parameters, which might result in a large number of runs. Thus, the next DOEs were designed iteratively, i.e. examining only three parameters at a time in the hope of discovering the interactions of the parameters and optimising a few of them at a time. The factors and aliases for each DOE are listed in Table 3. Some parameters could be sensitive to value changes and cause overfitting, failing the CNN epoch training. The parameter operating ranges are further described for each DOE in Section IV.
The DSD proposed by [31] seems to fit the experiment requirements, but the interaction of parameters is also of interest in this research. Also, the ''numberofEpochs'' might not be of interest, as will be explained in Section IV part A, resulting in a total of five factors only. Thus, DSD is not used. There is another technique, D-optimal designs [33], to reduce the DOE runs, but it is not in the scope of this article.

IV. EXPERIMENTS

A. DOE I
The settings of each factor, their aliases, and the default values of the other parameters for DOE I are listed in Table 4, whilst its ANOVA and interaction plot output are shown in Figure 3.
The default number of epochs was 80, but it was reduced to 10 for all DOEs. The purpose of the DOE is to study the interaction of the parameters and their relative convergence capability, so comprehensive CNN training over all epochs is not required. Reducing the epochs also minimises the CNN training time and gives a faster design cycle for the DOE, since it takes several minutes to complete an epoch. However, a complete 80 epochs of training would be carried out at the end of all DOEs to validate the DOE findings. ''imageAspectRatio'' is limited to a near-square ratio as described in Section III part A. More ''numberofAnchors'' could improve the mean intersection over union of the localisation, thus improving the AP for more object classes. Hence, its upper limit is set to double the lower limit. ''trainTestRatio'' is dependent on the size of the supplied dataset. A bigger dataset could allocate more data for training. A 70% training ratio is common for most ML approaches.
It is found that the image aspect ratio plays a significant role in performance outcomes. One apparent reason is that the images of 4:3 aspect ratio have fewer LP-based features after image resizing. Similarly, near-square images have relatively more pixel-level features than those of 4:3 aspect ratio, better utilising the CNN feature map for performance convergence. A higher train-test ratio also contributes to higher performance as more images are available for training, which explains the slight interaction between the image aspect ratio and the train-test ratio. Unlike the traditional CNN classification task, the YOLO algorithm does not require a validation set since the CNN only does feature extraction and does not directly classify the object. The number of anchors is statistically insignificant in the LP detection task because Malaysian LPs only have a few possible shapes and sizes, and LP is the only class of interest in the ALPR.

B. DOE II
DOE II is a 3-level full factorial experiment with three factors and one replicate, resulting in 27 runs. It is found that a 3-level factorial could provide more insight into the interactions of the factors and give normally distributed data points with just one replicate. An extra 11 runs compared to the previous DOE I is a good trade-off for having additional information in the 3-level factorial interaction plots while ensuring a normal data distribution for the ANOVA outputs to be valid. Since the previous experiment already provided an insight into the parameters' interactions, it will not be redone as a 3-level factorial DOE. The settings of each factor, their aliases, and the default values of the other parameters for DOE II are listed in Table 5, whilst its ANOVA and interaction plot output is shown in Figure 4. The best parameter settings from DOE I were utilised in DOE II, i.e. ''imageAspectRatio'', ''numberofAnchors'' and ''trainTestRatio'' are 1:0.98, 6 and 0.7, respectively.
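The 27 runs of a 3-level, three-factor full factorial design can be enumerated as in the sketch below. It uses coded levels (-1, 0, +1) rather than the actual settings from Table 5, which are not reproduced here; the helper name is hypothetical, and the factor names follow the parameters examined in DOE II.

```python
from itertools import product


def full_factorial(factors, levels=(-1, 0, 1)):
    """Enumerate every combination of coded levels for the given factors."""
    return [dict(zip(factors, combo))
            for combo in product(levels, repeat=len(factors))]


# Coded design matrix for the three DOE II factors.
runs = full_factorial(["miniBatchSize", "warmupPeriod", "penaltyThreshold"])
print(len(runs))  # one training run per row of the design matrix
print(runs[0])    # first run: every factor at its lowest coded level
```

Each dictionary corresponds to one training run; mapping the coded levels to the real parameter values is left to the experimenter.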
The ''miniBatchSize'' is set as 16 and 32 since the computing memory is the only limiting factor. Although the original value for ''warmupPeriod'' is 41.6% of the total epochs, a lower value will have a faster ramp to the target learning rate, thus having better initial convergence but might risk CNN overfitting. The ''penaltyThreshold'' is a ceiling for applying penalty function to the CNN model. A higher threshold will improve the object detection confidence score but decrease the anchor box detection overlapping tolerance, lowering the AP.
One common assumption for CNN training is that the image batch size could be larger with a higher learning rate, given enough hardware memory space on a computer. The learning rate is indirectly adjusted by the warm-up period: a lower warm-up period results in a faster learning rate increment. According to the factorial plots, the interaction of ''miniBatchSize'' and ''warmupPeriod'' has no significance to the overall AP. Rather, the batch size itself has a significant effect, as it is the number of images fed forward to the CNN model in every training iteration. Each feedforward pass is followed by a loss gradient update, in which the loss function tries to converge the CNN weights. A smaller batch size could achieve higher AP because it could have better convergence to the local minima.
Conversely, a larger image batch size could result in excessive feature generalisation, converging to global minima only. Large-batch methods tend to converge to sharp minimizers of the training and testing functions, and sharp minima lead to poorer generalisation [34]. The ''warmupPeriod'' has some significance according to the Pareto chart in Figure 4, but the loss curve shows that a lower value carries the risk of overfitting. It had been pushed to the value of 0.01 but was rolled back to 0.15 because overfitting occurred, i.e. the loss curve increased although it is supposed to converge to zero. It is important to generalise the newly initialised CNN feature map with a slower learning rate at the initial stage before pushing for a higher learning rate for faster CNN model convergence. The ''penaltyThreshold'' is shown to have the most significant influence on the AP score, although it has no interaction with other parameters.
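The effect of the warm-up period on the learning rate can be sketched as a simple ramp. The sketch assumes the warm-up period is expressed as a fraction of the total training iterations and that the ramp is polynomial with an exponent of 4; both the function and the exponent are illustrative assumptions, not the confirmed schedule of the implementation used in this work. The default values 0.001 and 0.15 are the settings discussed in the text.

```python
def warmup_learning_rate(iteration, total_iterations,
                         base_lr=0.001, warmup_fraction=0.15, power=4):
    """Learning rate at a given iteration under a polynomial warm-up ramp."""
    warmup_iters = max(1, round(warmup_fraction * total_iterations))
    if iteration < warmup_iters:
        # Ramp from zero up to the target learning rate.
        return base_lr * (iteration / warmup_iters) ** power
    # After the warm-up, train at the target learning rate.
    return base_lr
```

Under these assumptions, a shorter warm-up reaches the target learning rate earlier, so the newly initialised weights see large updates sooner, which is consistent with the overfitting risk observed for low ''warmupPeriod'' values.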

C. DOE III
DOE III is also a 3-level full factorial experiment with three factors and one replicate, resulting in 27 runs. From DOE I, it is found that increasing the train-test ratio could have statistical significance to the performance outcome. However, it is still possible to push the ratio to 80% instead of limiting it to 70%. Also, it is unknown whether the train-test ratio has any correlation with the L2 regularisation and the learning rate. Thus, the train-test ratio was included in this DOE to extend the study of its behaviour to a 3-level factorial. The settings of each factor, their aliases, and the default values of the other parameters for DOE III are listed in Table 6, whilst its ANOVA and interaction plot outputs are shown in Figure 5.
The ''L2Regularization'' is the magnitude for the weight update gradient, modifying the weight update rate, whilst the ''learnRate'' is the global multiplier for updating the CNN trainable parameters. CNN training is sensitive to the ''L2Regularization'' and ''learnRate'' values, so they are only adjusted in small margins. The interaction plot B * C shows that they have a strong interaction. The ''learnRate'' is favoured at the value of 0.001, and a lower value would result in a performance drop, whilst ''L2Regularization'' is best at the value of 0.0005. ''L2Regularization'' values higher than those listed in Table 6 caused random training failures because a higher value could produce a bigger weight update gradient, indirectly leading to CNN overfitting. Also, a higher ''learnRate'' with a low ''L2Regularization'' caused a significant AP reduction, which explains the strong interaction outcome. A higher ''trainTestRatio'' has been shown to contribute to higher AP even though it has almost no interaction with ''L2Regularization'' and ''learnRate''.
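One way to see why the learning rate and the L2 regularisation factor can interact is the textbook gradient descent update with an L2 penalty, where the two values multiply each other in the weight decay term. The sketch below assumes that standard formulation; it is not taken from the training code used in this work, and the function name is hypothetical.

```python
def sgd_step(weights, grads, learn_rate=0.001, l2=0.0005):
    """One plain gradient descent step with an L2 penalty on the weights.

    Standard textbook form: w <- w - learn_rate * (grad + l2 * w).
    """
    return [w - learn_rate * (g + l2 * w) for w, g in zip(weights, grads)]


# With a zero loss gradient, the weights shrink only through the
# learn_rate * l2 * w term, i.e. the product of the two parameters.
print(sgd_step([1.0, -2.0], [0.0, 0.0], learn_rate=0.1, l2=0.5))
```

Because the decay term scales with the product of the two parameters in this formulation, changing one alters the effect of the other, which is the kind of behaviour an interaction plot is designed to reveal.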

D. FINAL TEST
The purpose of the final test is to validate the findings of the DOE. A full 80 epochs were executed on the YOLOv3 algorithm to validate whether the DOE effectively tunes the performance outcome. There are five tests. Setting E uses the optimum settings from the DOEs, while Setting D uses the original parameters before the DOEs. The remaining Settings A, B and C are non-optimum settings. Each test is repeated with 3-fold cross-validation so that the tests are less likely to be data-dependent. The average of the three runs for each setting was taken as the final score of each test. The results are listed in Table 7. Figure 6 shows the precision-recall curve of each setting for the first-fold run to compare the fitness of the YOLOv3 to the Malaysian LP dataset. Some output LP detection samples are shown in Figure 7.
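The AP score summarises a precision-recall curve as the area under it, and the final score of each setting is the mean over the three folds. The sketch below assumes the commonly used all-point interpolation of the precision-recall curve; the exact AP variant used in this work is not stated, and the function names are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.

    `recalls` must be sorted in ascending order, with matching `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall points.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_over_folds(fold_scores):
    """Average the per-fold AP scores into one final score per setting."""
    return sum(fold_scores) / len(fold_scores)
```

A detector whose precision stays at 1.0 for all recall values up to 1.0 obtains an AP of 1.0, while any drop in precision at high recall reduces the area accordingly.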
The DOE-optimised Setting E delivered the highest AP score of 99.00%, while the default Setting D has 98.53% AP. The improvement is only 0.47 percentage points because the train-test ratio and the warm-up period are the only differences between the two settings. The other settings underperformed in comparison, with AP scores ranging from 95.47% to 98.12%. The precision-recall curve also shows that the optimised setting has the best average curve fitness of all settings, with a minimum precision of 0.993. This shows that the DOEs tuned the performance of the YOLOv3 to detect the LP location. However, the setting could be specific to the Malaysian LP dataset. Other open datasets like CCPD and UFPR are yet to be tested.
SSD is an alternative algorithm for ALPR, but it performed worse than YOLOv3 in initial training with the default settings, achieving only 87.75% AP while taking ten times longer per training epoch. Also, the SSD utilised the ResNet-50 CNN model, so it is not a fair comparison. SSD has some different classes of parameters and might require different DOE strategies.

V. CONCLUSION
A series of simple DOEs had been shown to improve the ALPR performance of YOLOv3 with the CNN model untouched, specifically on the LP detection task. An AP of 99% is achieved for Malaysian vehicle LP detection by strategically tuning the YOLOv3 training parameters. A minor change relative to the stock parameters did improve the performance. Adjusting other settings in the DOEs also provide insights into the interactions of the YOLOv3 parameters. Images with more pixel areas are generally better because more features are available for CNN feature extraction. It is also found that a smaller mini-batch size has a better fitting to local minima, improving the overall AP. The warm-up period is useful in generalising the initial feature map across all image batches before increasing the global learning rate, but a longer learning rate ramp-up time will decrease the overall AP.