License Plate Detection via Information Maximization

License plate (LP) detection in the wild remains challenging due to the diversity of environmental conditions. Prior solutions have nevertheless focused on controlled environments, in which LP images typically appear from an approximately frontal viewpoint and without scene text that might be mistaken for an LP. Even for state-of-the-art object detectors, detection performance is unsatisfactory in real-world environments, which suffer from various types of degradation. To address these problems, we propose a novel end-to-end framework for robust LP detection, designed for such challenging settings. Our contribution is threefold: (1) a novel information-theoretic learning scheme that jointly exploits a shared encoder, an LP detector, and a scene-text (non-LP) detector; (2) a localization refinement module that generalizes the bounding box regression network to complement ambiguous detection results; (3) a large-scale, comprehensive dataset, LPST-110K, representing real-world unconstrained scenes and including scene text annotations. Computational tests show that the proposed model outperforms other state-of-the-art methods on a variety of challenging datasets.


Fig. 1. Detection in wild scenes and an illustration of license plate (LP) vs non-LP class.
A typical image in our LPST-110K, showing unconstrained settings. The first column (a, c) shows detection results of the state-of-the-art RetinaNet [4]. The second column (b, d) shows our results, indicating fewer detection errors and better regression. The last column (e) illustrates the scene-text relation.
I. INTRODUCTION

License plate (LP) detection plays an essential role in many traffic-related applications [1]-[6]. A variety of methods have demonstrated high accuracy in detecting license plates under controlled settings.
While existing detectors have been successfully applied to the LP detection problem, many key challenges remain in unconstrained wild scenarios. For example, real-world LP detection raises the following problems: modifications of prior settings to adapt to the wild, incorrect detection results, ambiguity in classifying objects associated with scene text, low-quality visual data, uneven lighting, motion blur, and others. Such scenarios are nevertheless becoming increasingly common and gaining significant popularity in a variety of applications, including civil security, crowd analytics, law enforcement, and street-view imagery. Despite being the most common scenario, LP benchmarks still do not consider real-world cases, and therefore many problems are not adequately addressed. As a result, state-of-the-art detectors struggle with these images.
To ascertain clearly what makes LP detection difficult, consider the common case in the wild where LPs and scene text appear at the same time as multiple instances (see Figure 1). Based on this observation, we identify two major drawbacks. First, LPs and scene text (non-LP) are not correctly distinguished, which in turn may cause false detections of each other. In fact, the LP is a child class of scene text, so the two must be distinguished, and there must be enough variability to separate the class categories. The existing LP benchmarks, however, neither include scene text in their samples nor explicitly address it in learning and evaluation. Second, the detected bounding box does not always contain all the characters in the LP. LP detection is necessarily linked to downstream tasks such as recognition or de-identification; therefore, sophisticated localization is essential for identifying information. Yet, for such detailed downstream tasks, it remains challenging to localize all the information contained in LPs. Notably, as shown in Fig. 1 (a, c), the state-of-the-art detector exhibits prominent failure cases for scenarios in the wild.
A well-designed LP detection framework should tackle the problems above (see Figure 1 (b, d)). In this paper, we propose an end-to-end framework composed of a single shared feature encoder and two parallel detection branches. The shared encoder learns a global feature across both detection tasks (LP and non-LP, respectively). More specifically, to account for non-LP objects (scene text that is not an LP), our framework is divided into 1) an LP detection network and 2) a non-LP detection network. Different from traditional LP detection models, we explicitly prevent the learning of non-LP objects. To this end, we introduce a novel information-theoretic loss that minimizes the mutual information between the embedding feature and the non-LP distribution that interferes with LP detection. Prior to unlearning the non-LP distribution, we hypothesize that the existence of non-LP is known and that the relevant metadata, such as additional labels corresponding to the semantics of the non-LP instances, are accessible. In this scenario, the discrimination problem between LP and non-LP based on mutual information can be formulated as an adversarial problem. One network is trained to detect the non-LP instances; the other is trained to detect only LP instances, which is the ultimate goal of the overall architecture, while maximizing the discrimination between LP and non-LP based on mutual information. We therefore adopt an adversarial training strategy, achieved by minimizing the mutual information so that LP detection becomes independent of the non-LP distribution. Furthermore, we propose a localization refinement module with a sharing block. This module provides valuable information on the quality of bounding box regression for sophisticated localization.
To summarize, this paper makes the following novel contributions:
• A novel information-theoretic loss for LP detection. We propose a new framework that remains discriminative for detecting LPs even in unconstrained scenes. We note that our mutual-information-based approach can effectively exclude non-LP, resulting in high accuracy (Sec. III.C).
• Localization refinement module. We generalize the bounding box regression network to complement ambiguous detection results. As far as we know, no previous work has utilized regression networks for the refinement of localization (Sec. III.D).
• A novel LP detection dataset. We collect a new large-scale dataset, LPST-110K, containing images captured from unconstrained scenes. To the best of our knowledge, LPST-110K is the first dataset to address LP and scene text simultaneously for LP detection. By evaluating state-of-the-art detection models on LPST-110K, we demonstrate the accuracy improvement of our proposed model compared with other approaches (Sec. IV).

II. RELATED WORKS
In this section, we review deep learning algorithms in intelligent transportation systems (ITS) and the LP detection methods related to ours, covering deep learning in ITS, license plate detection, and license plate detection benchmarks.

A. Deep Learning in ITS
In recent years, deep learning algorithms have achieved impressive results in computer vision [7]-[10]. In many modern transportation systems, deep learning has begun to play a critical role in achieving more robust recognition and surveillance by learning from existing task-specific benchmarks. It addresses complex traffic conditions by fitting non-linear models in a data-driven paradigm on existing benchmarks. Many traditional problems, such as road detection [11], [12], street scene labeling/recognition [13], [14], crowd counting [15], [16], traffic flow estimation [17], [18], and license plate detection [19], [20] and recognition [21]-[23], can be tackled with these techniques. In particular, given suitable benchmarks and detection algorithms, robust license plate detection can help guide a more comprehensive understanding and control of traffic conditions. While researchers have utilized limited benchmarks and universal detection algorithms, we have found that conventional algorithms are not always the solution in every situation. Developing a more robust solution is a non-trivial task, but it is required to outperform current capabilities. We therefore investigate the prior efforts on license plate detection algorithms and benchmarks in the following subsections.

B. License Plate Detection
Early works devoted much effort to improving LP detection performance within the frameworks of image binarization [24], [25], segmentation [26], edge-based models [27], and region-based models [28]. Several of these approaches notably used different hierarchical schemes, detecting a vehicle region as part of extracting the LP region. Nevertheless, these methods do not perform well on complex backgrounds and in unconstrained settings.
More recently, as Deep Convolutional Neural Networks (DCNN) [29], [30] have shown good classification performance, researchers have begun to tackle more complicated situations. In particular, as deep feature-based object detectors [6], [31] have been developed, many studies have started to detect LPs under difficult conditions. Prior-knowledge-based methods built on vehicle detection [19], [32]-[38] have greatly reduced false positives despite background clutter. Data-driven methods [35], [39]-[42] have increased detection accuracy by exploiting useful deep representations with augmentation transforms. Specifically, [20], [35], [41] may be the most similar to ours, because they also focus on unconstrained environments. However, these studies still do not consider the existence of non-LP and have thus not achieved wide adoption. Our work is distinguishable in that we address non-LP instances in unconstrained scenes.

C. License Plate Detection Benchmarks
Many benchmarks for LP detection were designed for both training and testing, and a brief survey is shown in Table I. Representative LP detection datasets include AOLP [43], SSIG [44], PKU [45], CD-HARD [35], UFPR [33] and CCPD [41]. Surprisingly, none of these provide scene-text annotations, even though scene text is a main cause of erroneous detection.
As evident in Table I, our new LPST-110K dataset, described in Sec. IV, provides annotations for all text instances in each image, which no previous dataset has attempted. Moreover, our dataset focuses on rough scenes in uncontrolled environments and is particularly challenging with respect to motion blur, uneven lighting, large slope angles, and low resolution. The closest exceptions are UFPR [33] and CCPD [41], which include many of the aforementioned unconstrained conditions. In particular, CCPD [41] provides a number of samples that no other benchmark can match. Even so, these datasets provide only one to three instances per image, whereas LPST-110K provides from three to as many as 20 LP annotations per image. More importantly, LPs and non-LP texts in LPST-110K are easily confused with each other, making them challenging to detect. To our knowledge, LPST-110K is the first dataset to provide text annotations as well as large numbers of instances (LP and non-LP) per image, collected from unconstrained scenes.

III. PROPOSED METHODOLOGY FOR LICENSE PLATE DETECTION
In this section, we first introduce the problem settings (Section III-A). We then present the license plate detection architecture used in our experiments (Section III-B). In addition, we formulate the loss functions for each part of the architecture in detail (Sections III-C and III-D) and define the overall training procedure (Section III-E). Finally, we illustrate how inference is performed with the proposed model (Section III-F).

A. Problem Settings
To make the descriptions clear, we introduce some notation before presenting the overall idea of the study. Unless noted otherwise, all notation refers to the following terms, and all symbols used in this paper are summarized in Table II. As shown in Fig. 1, our goal is to detect LPs in each image example x ∈ X, where X denotes the input space of images. Each input image x contains LP labels y(x) ∈ Y and non-LP scene-text labels n(x) ∈ N, each comprising a classification label and a 4-tuple of bounding box coordinates. Let X and Y be two random variables taking the values x and y(x), respectively. We also use N and Y to represent the non-LP class that interferes with LP detection and the LP class, respectively. In addition, we define a latent function n : X → N, where n(x) denotes the target non-LP instance of x.
As mentioned above, our proposed network takes the input image x and simultaneously outputs both LP detection results y(x) and non-LP detection results n(x). The input image x is fed into the encoder (ResNet + FPN) for feature extraction, f : X → R^K, where K is the number of features extracted by f, parameterized by θ_f. Additionally, we replace the original RPN structure with two parallel RPN structures: an RPN for LP, g : R^K → Y, and an RPN for non-LP, h : R^K → N. The parameters of these networks are denoted θ_g = [θ_gloc, θ_gcls] and θ_h = [θ_hloc, θ_hcls], comprising the regression and classification sub-network parameters, respectively.

B. Architecture Design
As discussed in Section I, we propose to utilize information-theoretic learning to improve LP detection performance, aiming to construct rich feature representations for complex and challenging scenes. As shown in Fig. 2, our overall architecture is divided into three parts: 1) a backbone network f, 2) an LP detection sub-network g, and 3) a non-LP detection sub-network h. Existing two-stage detectors such as Faster R-CNN consist only of f and g, but our method additionally utilizes h to further maximize the discrimination between LP and non-LP in feature representation learning. We also include a localization refinement module (LRM) while learning g and h. It is worth mentioning that the proposed architecture provides complementary information to minimize the mutual information between the embedding feature and the non-LP distribution and to boost LP-specific detection performance.
The input of our proposed architecture is the image x; the output comprises both LP and non-LP detection results during training, and only LP detection results at inference. A standard deep-learning-based detection network is designed, motivated by [4], [31], [46]. First, a ResNet-50 [47] backbone is combined with an FPN [46] with three upscaling layers for feature extraction as the encoder f. Subsequently, our task-specific detection network, based on the well-known RPN [31], includes two parallel structures (one for LP, g, and the other for non-LP, h), each providing two fully convolutional sub-networks. These RPN sub-networks are attached in parallel to each feature map of the encoder network.
The first is a regression sub-network, which performs bounding box regression for sophisticated localization around objects in the image using the encoder's output f(x); each box is represented by the x- and y-coordinates of the upper-left corner and the x- and y-coordinates of the lower-right corner of the rectangle. The second is a classification sub-network, which produces class-specific confidence scores C_i, where i indexes the classes including the background (assuming the multi-class case). Each anchor box therefore carries i numbers indicating the class probabilities.
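To make the layout concrete, the following is a minimal PyTorch sketch of the shared-encoder/two-branch design described above. It assumes torchvision's resnet_fpn_backbone helper (whose exact signature varies across torchvision versions) as the encoder f; RPNBranch is a hypothetical stand-in for the RPN-style sub-networks g and h, each with a classification and a regression head (anchor generation, matching, and loss computation are omitted).

```python
import torch.nn as nn
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

class RPNBranch(nn.Module):
    """Hypothetical RPN-style branch with a classification and a box head."""
    def __init__(self, in_channels=256, num_anchors=9, num_classes=2):
        super().__init__()
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.loc_head = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, feature_maps):
        # One (class-score, box-delta) pair per FPN level.
        return [(self.cls_head(p), self.loc_head(p)) for p in feature_maps]

class TwoBranchDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder f: ResNet-50 + FPN (256-channel pyramid features).
        self.f = resnet_fpn_backbone('resnet50', pretrained=True)
        self.g = RPNBranch()  # LP branch
        self.h = RPNBranch()  # non-LP branch (used during training only)

    def forward(self, x):
        feats = list(self.f(x).values())
        return self.g(feats), self.h(feats)
```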

C. Mutual Information Maximization via Adversarial Loss
In constrained scenes, a one-class object detection task with only an LP class can simultaneously achieve high precision and localization accuracy, with low false-positive rates and high IoU scores. In unconstrained images, however, there are scene texts that look like LPs as well as arbitrarily shaped LPs, and this phenomenon produces unsatisfactory LP detection performance. Ideally, LP-discriminative features should explicitly ignore non-LP related features inside the learned network. Therefore, for maximizing inter-class variance, the objective is to perfectly remove the following characteristic from the detection network:

I(g(f(X)); n(X)) = 0,    (1)

where I(·;·) denotes the mutual information between two random variables. To handle this problem, our ultimate goal is to learn a network with the following characteristic:

min_{θ_f, θ_g} I(g(f(X)); n(X)).    (2)

We therefore add the mutual information term to the objective function for training the networks. To be specific, during the training process, we explicitly define a classification stage for non-LP, which aims to disentangle the non-LP data distribution from the extracted features. We expect the LP-specific detector to be trained to maximize the inter-class variations related to non-LP images. A good LP detector should therefore learn features that are irrelevant to all non-LP visual representations, especially scene text that is not an LP. We replace g(f(X)) with f(X) because g, the RPN network that determines the detection output, receives f(X) as its input: if the entire network is to treat the non-LP information n(X) as disruptive for LP detection, this property must already hold for the feature f(X) extracted from the input image X. We thus derive the following objective function:

min_{θ_f, θ_g} L_lp(g(f(X)), y(X)) + ι_obj · I(f(X); n(X)),    (3)

where L_lp is the standard detection loss [4], [31], [46], including a Euclidean loss for regression, L_gloc, and a cross-entropy loss for classification, L_gcls, and ι_obj is a trade-off hyper-parameter that controls the relative importance of the two terms.
In information theory, the mutual information term in Eq. (3) can be expressed explicitly as

I(f(X); n(X)) = H(n(X)) − H(n(X) | f(X)),    (4)

where H(·) and H(·|·) are the marginal and conditional entropy, respectively. Here, the marginal entropy H(n(X)) can be eliminated from the objective function because it is a constant, completely independent of θ_f and θ_g during the optimization process. The remaining entropy term in Eq. (4) then reduces to the problem of computing a posterior distribution: we can instead calculate the negative conditional entropy −H(n(X) | f(X)) explicitly with the posterior P(n(X) | f(X)). However, this posterior distribution is still intractable. We can instead approximate the posterior with a parameterized distribution, Q, under an additional desideratum (mutual information constraint):

−H(n(X) | f(X)) ≥ E_X [log Q(n(X) | f(X))].    (5)

The objective is then calculated directly with Q in Eq. (5). Hence, the backbone network f can be trained under the additional desideratum with no change to the basic training procedure, although it is difficult to calculate or optimize the objective while satisfying the constraint in Eq. (5) directly. The intuitive meaning of the mutual information constraint is clear: the smaller the KL divergence between P and Q, the closer Q is to P, indicating that Q captures more information from P as learning gradually continues. Therefore, an approximation of the posterior distribution by the parameterized model Q can be achieved through the KL divergence. Modeled with a tractable distribution, the novel regularization loss L_IT can be written as

L_IT = E_X [−log Q(n(X) | f(X))] + μ · D_KL(P(n(X) | f(X)) || Q(n(X) | f(X))),    (6)

where D_KL denotes the KL divergence and μ is the balancing parameter for the two terms. We approximate the auxiliary distribution Q with the non-LP RPN network h, so that the KL divergence in Eq. (6) is minimized: approximating P(n(X) | f(X)) with the additional network h makes D_KL small and thus renders the problem in Eq. (6) tractable.
To make D_KL(P(n(X) | f(X)) || Q(n(X) | f(X))) as small as possible, we employ the cross-entropy loss between n(X) and h(f(X)) with parameters θ_f and θ_h. The loss of the additional network h in the composition h ∘ f can be obtained as

L_N(h(f(X)), n(X)) = L_hcls + L_hloc,    (7)

where L_hcls (w.r.t. θ_hcls) and L_hloc (w.r.t. θ_hloc) are the classification and localization losses of the h RPN sub-networks, respectively. We note that the mutual information term in Eq. (3) relates to classification rather than to sophisticated localization. For example, embedded features extracted via f rely heavily on non-LP classification features, regardless of the localization results; in an extreme case, even if localization is inaccurate, it suffices to perceive only the non-LP information in the image. We can rewrite the formulation of Eq. (6) by relating it to Eq. (7) in an adversarial manner. Ideally, the LP-specific features of f should confuse h, which aims at detecting the non-LP. Conversely, f leverages the model g to detect only LP by minimizing the detection loss. Namely, we adopt a minimax problem over θ_f and θ_h, encouraging f to encode only LP-specific visual features into the representations, for which the classification capability on non-LP would be harmful. Here, we define the last D_KL term of Eq. (6) as L_IT, which can be rewritten as

L_IT(θ_f, θ_h) = −min_{θ_h} L_N(h(f(X)), n(X)).    (8)

Specifically, we train the detection network to solve the minimax version of Eq. (3) by substituting the information-theoretic term with Eq. (8), and the primal detection loss can be further expressed as

min_{θ_f, θ_g} max_{θ_h} L_lp(g(f(X)), y(X)) − ι_obj · L_N(h(f(X)), n(X)).    (9)

Optimizing this loss function requires an adversarial learning strategy [48], [49] over the networks f, g and h. In addition, we apply a gradient reversal layer (GRL) [50] after f(X).
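The GRL admits a compact implementation: the forward pass is the identity, while the backward pass negates and scales gradients, so a single joint backward pass realizes the minimax in Eq. (9). Below is a minimal PyTorch sketch; the usage comments assume hypothetical modules f, g, h and loss functions named as in the surrounding text.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL) [50]: identity forward, negated
    (and scaled) gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the encoder f.
        return -ctx.lambd * grad_output, None

# Hypothetical usage inside a training step (applied per feature map
# when f outputs a pyramid of maps):
#   feats = f(images)
#   lp_out = g(feats)                              # normal gradients
#   nlp_out = h(GradReverse.apply(feats, iota))    # reversed into f
#   loss = lp_loss(lp_out, y) + non_lp_loss(nlp_out, n)
# h minimizes L_N while f, through the reversal, maximizes it.
```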

D. Localization Refinement Module
To make the bounding box coordinates regressed by the localization sub-networks more accurate, we also introduce a localization refinement process. To provide complementary information about the bounding box during training, we employ a sharing block S(·) that refines the localization features.
We are given a set of feature maps l from the localization sub-networks, where l = [l_gloc, l_hloc] contains the last feature maps of g_loc and h_loc, respectively. Then l is fed into the proposed S for localization refinement, yielding l′ = [l′_gloc, l′_hloc], the refined feature maps corresponding to g_loc and h_loc. Figure 3 shows the process of localization refinement. The architecture of the sharing block S follows three consecutive operations: Batch Normalization (BN) [51], a PReLU [52] activation function, and a 1 × 1 convolution layer. The sharing block S operates on the concatenation of the last feature maps of the localization sub-networks, l_gloc and l_hloc. This gives rise to the layer transition l_S = S(l_gloc, l_hloc), where l_S denotes the output of S. Motivated by [47], we add a skip-connection between the output of the sharing block and the last feature map in each localization sub-network:

l′_gloc = l_gloc + l_S,  l′_hloc = l_hloc + l_S.    (10)

Our refinement module plays two roles. The first is to complement the localization information of each sub-network by maximizing the opportunities for useful conjunctions. In fact, optimizing Eq. (9) makes the localization network h_loc of the non-LP detector h more stable through L_N; that is, h_loc is likely to acquire the ability to accurately locate not only LPs but also scene text that looks like an LP, and is thus likely to complement g_loc. The other role is to push the localization sub-networks to regress precise object boxes.
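A minimal PyTorch sketch of the sharing block and the skip connections in Eq. (10) follows; the 256-channel width matches the head design of our sub-networks and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class SharingBlock(nn.Module):
    """Sketch of the localization refinement module (LRM):
    BN -> PReLU -> 1x1 conv over the concatenated localization feature
    maps, followed by skip connections back to each branch (Eq. 10)."""
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(2 * channels),
            nn.PReLU(),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
        )

    def forward(self, l_gloc, l_hloc):
        # l_S = S(l_gloc, l_hloc) on the channel-wise concatenation.
        l_s = self.block(torch.cat([l_gloc, l_hloc], dim=1))
        # Skip connections: l'_gloc = l_gloc + l_S, l'_hloc = l_hloc + l_S.
        return l_gloc + l_s, l_hloc + l_s
```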

E. Training
A pre-trained CNN model [53] is employed as the backbone network. For stable gradient computation, we optimize the objective function in Eq. (9) in an alternating fashion [48], [54] rather than jointly; the modified optimization objectives for g ∘ f and h ∘ f can be represented as Eq. (11) and Eq. (12), respectively:

min_{θ_f, θ_g} L_lp(g(f(X)), y(X)) − ι_obj · L_N(h(f(X)), n(X))    (11)

and

min_{θ_h} L_N(h(f(X)), n(X)).    (12)

At the beginning of training, g ∘ f detects LPs while still encoding non-LP information, and h, fed by a feature extractor that still carries non-LP information, learns to detect non-LP adequately. As learning progresses, f is led to extract LP-specific features that exclude non-LP information as far as possible, and h increasingly struggles to detect non-LP because f gradually learns to make h perform poorly. At the end of learning, given enough capacity, f extracts only LP-specific feature embeddings while ignoring non-LP information completely. With this embedding, g detects only LPs and h is driven toward a poorly performing detector, as shown in Fig. 4. Further analysis of the proposed method is presented in Sections V-C to V-E.
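The alternating updates of Eqs. (11)-(12) can be sketched as follows. Here f, g, h, detection_loss (L_lp), non_lp_loss (L_N), iota (ι_obj), and the data loader are assumed to be defined as in the surrounding text, and the clipping threshold is an illustrative assumption.

```python
import torch

# Two optimizers for the alternating minimax (Adam, initial lr 1e-4 as in
# the implementation details).
opt_fg = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-4)
opt_h = torch.optim.Adam(h.parameters(), lr=1e-4)

for images, lp_targets, non_lp_targets in loader:
    # Step 1 (Eq. 11): update f and g to detect LPs while confusing h.
    feats = f(images)
    loss_fg = detection_loss(g(feats), lp_targets) \
              - iota * non_lp_loss(h(feats), non_lp_targets)
    opt_fg.zero_grad()
    loss_fg.backward()
    # Gradient clipping for stability (max_norm is an assumption).
    torch.nn.utils.clip_grad_norm_(
        list(f.parameters()) + list(g.parameters()), max_norm=10.0)
    opt_fg.step()

    # Step 2 (Eq. 12): update h alone on features from the frozen encoder.
    with torch.no_grad():
        feats = f(images)
    loss_h = non_lp_loss(h(feats), non_lp_targets)
    opt_h.zero_grad()
    loss_h.backward()
    opt_h.step()
```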

F. Inference
At the testing phase, the h(·) branch is removed. Given a test image X_test, the output of g ∘ f is the detection result obtained via the feature extractor f and the LP detection network g. The output is then represented as LP_result = g(f(X_test)).

IV. NEW BENCHMARK: LPST-110K

There are many datasets [33], [35], [41], [43], [44] available for LP detection. However, these datasets do not provide annotations of scene-text (non-LP) bounding boxes.
We collected images of LPs and scene texts to build the new dataset and benchmark. The dataset focuses on images taken from moving and static cameras, as it is meant to be useful for real-world applications. LPST-110K contains images collected from hundreds of dash cameras and surveillance cameras, mounted in moving vehicles and on buildings, respectively, at locations in East Asia and Europe. We include non-LP scene texts (e.g., traffic signs, wall text, banners, and commercial advertisements) as well as LPs. In doing so, we place no restriction on the settings in which instances are captured (Table I). Each scene-text instance is captured in five images as it or the camera passes by. The dataset contains 110,000 scene text instances across 9,795 images. The scene texts are divided into two classes: 51,031 LP instances and 58,969 non-LP instances. The properties of the dataset are shown in Table I, and samples are shown in Figures 5-7 and 9-10. The data include a 2D bounding box for each instance and a recognition annotation with manually transcribed characters.
Our proposed dataset is challenging in diverse ways: density, image quality, illumination, angle, distance, complex backgrounds, and so on. For example, its density (how densely objects appear per image, reported as LP / LP+non-LP) is closest to real-world scenarios. Compared with existing benchmarks, the densities are: AOLP - 1/1, SSIG - 4.34/4.34, UFPR - 1/1, CD-HARD - 1/1, CCPD - 1/1, LPST-110K - 5.21/11.00. Our dataset is also unique and difficult due to the existence of non-LP, whose presence is the biggest obstacle to LP detection; as we analyze, non-LP instances cause more false-positive errors. The resolution of each image is 1280 (width) × 720 (height) × 3 (channels), which is sufficient for LP-related tasks. The images in LPST-110K are compressed with the H.264 codec, and, unlike most existing LP detection datasets, the tilt angles, distances, illumination, and blur levels are diverse, not just frontal or rear. LPST-110K is thus representative of real-world scenarios where LP detection may be desired.

V. EXPERIMENTS

A. Implementation Details
All the reported implementations are based on PyTorch, and experiments were run on an NVIDIA TITAN X GPU and an Intel Core i7-6700K CPU. For stable training, we use gradient clipping and the Adam optimizer [55] with high momentum. All models are trained with a learning rate of 10^-4 for the first 10 epochs, 5 × 10^-5 for epochs 11-20, and 10^-5 for the remaining epochs. For f, we use ResNet-50 as the backbone, pre-trained on ImageNet [53] except for the last fully connected layer; each level is then fused with the upsampled result from the deeper FPN layer. Finally, we apply a 3 × 3, 256-channel convolution with 'same' padding to produce the features for object detection, and two additional 3 × 3, 256-channel, stride-2 convolutions on the deepest layer of the backbone to detect extremely large objects.

Fig. 4. The training process of g ∘ f. g ∘ f (black, solid line) is trained to detect LPs using f(·) as input so that it can discriminate between samples from the LP data distribution (red, dotted line) and the non-LP data distribution (blue, dotted line). The lower horizontal line is the feature space from which f is sampled; the upper horizontal line is the mixed data distribution of X_LP (LP data) and X_NLP (non-LP data). The upward arrows indicate the mapping (X_LP, X_NLP) → g ∘ f. (a) In the initial state before learning, the mapping is random regardless of the data distribution. (b) At the beginning of training, g ∘ f learns both LP and non-LP information. (c) After several training steps, g ∘ f is guided to learn LP intensively and gradually ignores non-LP. (d) At the end of training, the mapping converges to the LP data distribution, since the network has learned to ignore non-LP information.
For the classification sub-networks (g_cls and h_cls) and localization sub-networks (g_loc and h_loc), a fully convolutional network is employed, consisting of four 3 × 3, 256-channel convolutional layers with 'same' padding and PReLU [52] activations. Each sub-network is trained with a categorical cross-entropy loss [56] for classification and a smooth L1 loss [2] for 4-axis box coordinate regression. The experimental results are presented in the following sections.
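For illustration, a minimal sketch of one such head is given below; the output channel counts follow standard RPN-style heads and are assumptions rather than the exact configuration.

```python
import torch.nn as nn

def make_head(out_channels, num_convs=4, channels=256):
    """Sketch of a classification/localization sub-network head: four
    3x3, 256-channel conv + PReLU layers followed by a prediction conv.
    `out_channels` would be num_anchors * num_classes for g_cls/h_cls,
    or num_anchors * 4 for g_loc/h_loc (an assumption based on standard
    anchor-based heads)."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU()]
    layers.append(nn.Conv2d(channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)
```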

B. Datasets and Evaluation Metrics
We test our method on five LP detection benchmarks: AOLP [43], UFPR [33], PKU [45], CCPD [41], and the newly collected LPST-110K. The first four benchmarks were collected to address license plates only, while the last provides not only LP but also non-LP scene text. Since the non-LP detection network h requires non-LP data, we initially train the proposed model only on LPST-110K. To provide fairer comparisons, we then retrain g ∘ f on the existing datasets while freezing h.
AOLP [43] can be split into three categories: AC, LE and RP. The test sets of these subsets consist of 581, 757, and 611 images, respectively.
UFPR [33] images are partitioned into train, validation, and test splits. Training uses 50% of the images (1,800 images) and validation uses 20% (900 images). The remaining 1,800 images are used for testing.
CCPD [41] consists of 150K images for testing. Most images in this dataset are extremely distorted.
LPST-110K contains 9,795 images and their associated 110,000 scene text bounding boxes, which are divided into 5,795/4,000 images for training and testing, respectively. In addition, LP and non-LP instances consist of 29,891/29,078 and 21,065/29,966 bounding boxes (training/ testing), respectively.
Evaluation Metrics. For our proposed model, precision, recall, F-measure, and AP are utilized as evaluation protocols. For the AOLP, UFPR and CCPD benchmarks, we employ the precision and recall metrics that have been widely used in LP detection evaluation. Precision is defined as

Precision = T_p / (T_p + F_p),

where T_p and F_p are the numbers of correctly and incorrectly estimated bounding boxes, respectively. Precision is the ratio of correctly detected bounding boxes among all acquired bounding box candidates: the more non-GT bounding boxes the detection network produces as positives, the lower the precision. Recall is defined as

Recall = T_p / (T_p + F_n),

where F_n is the number of undetected ground truths. Recall is the ratio of correctly estimated bounding boxes among all ground truths: the more GT bounding boxes the detection network fails to detect, the lower the recall. The IoU is defined as

IoU = area(R_det ∩ R_gt) / area(R_det ∪ R_gt),

where R_det and R_gt are the detected bounding box and the ground-truth region, respectively. A detected bounding box is considered correct when its IoU with the ground-truth region exceeds 50% (IoU > 0.5).
In addition, we adopt the F-measure used in the PKU benchmark for LP detection evaluation, calculated as

F-measure = 2 · Precision · Recall / (Precision + Recall).

For LPST-110K, we adopt the average precision (AP) at IoU = .50:.05:.95 (the standard challenge metric) and AP at IoU = .75, AP_.75 (a strict LP detection metric).
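For reference, the metrics above can be computed as in the following minimal Python sketch; boxes use the (x1, y1, x2, y2) corner convention of Sec. III-B, and counts are assumed non-zero.

```python
def iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from TP/FP/FN counts
    (a detection counts as TP when IoU > 0.5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```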

C. Comparisons With State-of-the-Art Methods
For AOLP, PKU, UFPR, CCPD and LPST-110K, our proposed method significantly improves detection performance, including on challenging real-world images, as shown in Figs. 5 and 7. The results confirm that our method consistently enhances LP detection performance across datasets. For the AOLP dataset, Table III shows that our precision and recall values are nearly as accurate as those of recent methods; overall, our method outperforms the existing state-of-the-art methods on AOLP. In Table III, [59] has partially better results than ours (e.g., 100 vs. 99.71 precision on the AC subset). However, [59] achieves this by training on 450,000 synthetic images that are highly unrealistic and would not be found in a typical traffic scene; using such a large training set for a slight performance improvement requires excessive training time and is less efficient than our method in terms of hardware. More importantly, our approach yields better precision, which implies that it decreases false-positive errors regardless of non-LP. Table IV summarizes the detection improvement of our approach over the baseline on three datasets. Specifically, our method obtains the highest performance on UFPR (99.17%) and CCPD (96.1%), outperforming other state-of-the-art methods by more than 0.5% and 1.6%, respectively. On PKU, the performance is partially lower than that of another method [58] (e.g., 100 vs. 99.65 on the G4 subset); however, on all subsets except G4, and on the overall average, our method outperforms the others. In addition, on the more unconstrained and challenging UFPR and CCPD, our performance surpasses all other methods. Note that UFPR and CCPD are much more challenging than PKU, being more diverse and complex in both geometric and semantic terms. It is worth noting that the new method benefits from the proposed information loss because it prevents non-LP detection even in wild scenes. Table V reports the results on the newly collected LPST-110K. We observe the same pattern: our method non-trivially increases detection accuracy in both experiments, 1) targeting only LP and 2) targeting all scene texts. Our approach robustly improves performance regardless of the presence of non-LP, as shown in Figures 5-7.

D. Ablation Study
We perform an ablation study on the effect of the proposed information-theoretical loss and localization refinement module. The baseline's detections often include non-LP objects. In contrast, our approach improves detection performance because it provides LP-specific features in unconstrained scenes. Table V shows how much detection accuracy is improved by each proposed component. When adding the information-theoretical loss and the localization refinement module (LRM) to the baseline, LP detection performance is further improved by 0.42% and 0.48%, respectively. In particular, the GRL [50] is applied after the feature extraction network f, in front of both the LP and non-LP modules. Although the GRL was originally proposed to solve the domain discrimination problem, we obtain performance improvements with it here. Figures 5 and 6 show qualitative results. Consequently, all components improve LP detection performance noticeably and clearly ignore non-LP information.
To further investigate the effect of the proposed model, we evaluate detection on non-LP instances to verify that the information-theoretical loss indeed drives the avoidance of non-LP. The results are shown in the last column (non-LP) of Table V and Figures 5-6: non-LP precision and recall decrease by 17.1% and 16.8% compared to the baseline, confirming the suppression effect.
In addition, Figure 8 shows PR curves on LPST-110K with AP_.75, demonstrating that each of our components is more effective than the baseline. These results confirm that both modules are beneficial.

E. Model Analysis
We discuss several model analyses, including LP recognition results, an error study, and the impact of the additional network, in the following. 1) LP Recognition Results: The LP detection and recognition (LPDR) task aims at assessing the overall, end-to-end LPR system performance. For this task, we define a true positive LP detection-and-recognition as one in which 1) the LP has been precisely localized within the image with IoU > 0.5, and 2) all the characters in the LP have been precisely recognized. LPDR performance is also measured in terms of accuracy, as defined in the LP detection task.
For character recognition (CR), we utilize a CNN-LSTM encoder and decoder. The encoder's input is the output of the proposed detector. Since the area of an LP is usually very small relative to the input image, only seven lower convolutional layers with two 2 × 2 max-pooling operations are used in the encoder to extract features. The encoder network is followed by bidirectional LSTMs [70], each using 256 hidden units that explicitly control data flow. For the decoder, we employ an attention mechanism with GRU [71] and LSTMs. In the inference phase, the decoder predicts an individual text class y_k at step k until the last step of the scene text, where k indexes the predicted characters. Additionally, we show LPDR results on images from LPST-110K in Fig. 10. The AOLP [43] dataset is challenging because the LP angles include oblique samples in terms of distortion; on the other hand, in terms of resolution, all its images are relatively easy to recognize because they are of higher resolution than other datasets. Throughout the experiments, we compared our method with other state-of-the-art LPR methods. Overall, our method obtains the highest performance (97.36%/99.09%/98.63%) and outperforms the others on the LE and RP subsets.
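A minimal sketch of such a CNN-BiLSTM encoder is shown below; the layer counts and channel widths are illustrative assumptions (the text specifies seven convolutional layers with two 2 × 2 max-pools and 256 LSTM hidden units), and the attention decoder is omitted.

```python
import torch.nn as nn

class CRNNEncoder(nn.Module):
    """Sketch of the recognition encoder: a shallow CNN (a stand-in for
    the seven convolutional layers with two 2x2 max-pools) followed by a
    bidirectional LSTM with 256 hidden units."""
    def __init__(self, channels=64, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=channels, hidden_size=hidden,
                           bidirectional=True, batch_first=True)

    def forward(self, x):            # x: (B, 3, H, W) cropped LP image
        feat = self.cnn(x)           # (B, C, H/4, W/4)
        feat = feat.mean(dim=2)      # pool out the height: (B, C, W/4)
        seq = feat.permute(0, 2, 1)  # sequence along width: (B, W/4, C)
        out, _ = self.rnn(seq)       # (B, W/4, 2 * hidden)
        return out                   # fed to the attention decoder
```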
Samples in the PKU and UFPR datasets are far from the camera, causing resolution issues. However, they are almost invariant in terms of distortion because the capture environments are hardly affected by tilted LP angles or lighting. Under such conditions, the proposed method achieves competitive performance against most state-of-the-art LPR methods, as shown in Table VI. We note in particular the role of the localization refinement module: tiny LPs often appear in these datasets and are likely to be misclassified as non-plates because they contain minimal pixel information. Nevertheless, our method produces high-quality localizations that can be further adapted for LP, thereby reducing false-positive and false-negative errors. In Table VI, the last two rows (baseline and ours) show the results of our method.

Fig. 9. Error study on the PKU (first row), CCPD (second row), and LPST-110K (third and fourth rows) datasets. In the first column, green bounding boxes are ground-truth LP annotations. In the second column, red bounding boxes are our detection results. In the last column, red bounding boxes are false-positive errors and green bounding boxes are false-negative errors.
2) Error Study: We tested our approach on LPST-110K and four existing LP detection benchmarks, and showed that it surpasses existing detection methods with remarkable performance. However, even the best results on LPST-110K are far from saturated, suggesting that these unconstrained scenes remain a challenging frontier for future work. Figure 9 shows some failure cases, including some false recognition results. These results indicate that further progress is needed to improve detection performance. From Figure 9, it can be observed that the overall imaging conditions involve low-quality images collected in unconstrained environments. For example, the image in the first row contains uneven illumination at night, and the image in the second row is taken at a very tilted angle. The LP images in the third and fourth rows are captured at very low resolutions.
The probable causes of failure include low-quality images and severe interference. In the first row, false-positive errors occur on regions whose background and shape are very similar to an LP. In the second row, since the LP is very tilted and of low quality, it not only fails to be detected correctly but also induces a further false detection on a logo. Finally, the last two rows show false detections due to banners and occlusion. Considering these failure cases, most errors could be resolved with prior knowledge from text recognition information; otherwise, our proposed method performs close to human level.
3) Impact of Additional Network: In this section, we perform further experiments to analyze the performance of our proposed method. We compare the structure of our additional network h with other types of networks to demonstrate the efficiency of a dual network with different purposes. The objective of the LP detector is to detect as many LPs as accurately as possible; our ultimate goal is to make even the hard-positive LPs in unconstrained images recognizable. Table VII shows how detection performance depends on the structural design. The additional network h with a different objective shows the best performance among them. The existing method [4], which focuses too heavily on LPs, tends to ignore the characteristics of hard-positive LPs and does not even provide a chance for recognition (see the baseline). Most importantly, when a two-class object detector simultaneously detects both LP and non-LP, the results exhibit fairly high performance, implying that the two-class detector can detect LP quite accurately. Although this may work well for finding the right candidates for the target we want, it still causes too many errors and yields only comparable performance against our method (24.5%/21.1% and 20.2%/15.3% at IoU = .5, and 22.1%/21.1% and 9.3%/7.7% at IoU = .75). This confirms that the proposed method can effectively perform discriminative feature learning and filter out unnecessary candidates.

Fig. 10. Qualitative LPDR results of our proposed method. Green bounding boxes are ground-truth LP annotations and red bounding boxes are the results from our method.

F. Speed
The training speed is about 7.9 iterations/s, taking less than two days to reach convergence. In terms of inference, the proposed model shows a good accuracy-speed trade-off compared to other methods. It is designed for highly accurate LP detection, running at 14 FPS for an input scale of 1280 × 720. Though a little slower than the fastest method [41], it surpasses the accuracy of [41] by a large margin. Moreover, our speed could be further boosted with a larger batch size.
VI. CONCLUSION

In controlled environments, the performance of modern LP detectors is impressive but still limited. This study focuses on unconstrained real-world scenes, including scene text samples, and provides LPST-110K, a new benchmark of such real-world images for training and testing with detection annotations. Our experiments on this benchmark show that the performance of many emerging state-of-the-art detectors is not guaranteed in complex environments. To solve this problem, we use LPST-110K to develop two techniques for robust LP detection in these environments. The first is a novel information-theoretical learning scheme that takes advantage of three networks to exploit LP-oriented information. The second is a localization refinement that generalizes the bounding box regression network to complement ambiguous detection results. Extensive experiments on diverse benchmarks demonstrate the effectiveness of our method in accurately detecting challenging LPs. Compared with other contemporary approaches, this study is also helpful for downstream recognition.
Future work will address a number of challenging cases identified by this work, in particular the wide variation in how well combining text detection and text recognition improves license plate detection performance. Further research could investigate how to connect the text recognition results of a single image to license plate detection in a complementary way, and in turn develop a unified license plate detection and recognition framework.