A Multilayer Pyramid Network Based on Learning for Vehicle Logo Recognition

In this paper, we present a novel learning-based scheme for vehicle logo recognition (VLR). This scheme is termed Multilayer Pyramid Network Based on Learning (MLPNL) and is based on the principle that considering multiple resolutions is helpful for extracting valuable features that benefit the final recognition performance. The innovations of this scheme include (1) a multilayer pyramid network, with pixel difference matrices (PDMs) as its input and output and feature parameters mapping one PDM to another; (2) an objective function and a corresponding optimization method designed to facilitate the learning of the feature parameters of the proposed multilayer pyramid network; and (3) a multi-codebook-based encoding method that makes the best use of the features extracted from PDMs corresponding to different resolutions. Extensive experiments conducted with an open dataset, HFUT-VL, demonstrate that the proposed MLPNL scheme outperforms state-of-the-art handcrafted descriptors and non-deep-learning-based learning methods when fewer training samples are available. Experiments conducted with a benchmark dataset, XMU, demonstrate that MLPNL outperforms existing state-of-the-art VLR methods. Experiments conducted on both HFUT-VL and XMU demonstrate that MLPNL is faster than most deep-learning-based learning methods while maintaining nearly the same recognition rate. Code has been made available at: https://github.com/HFUT-CV/MLPNL.

I. INTRODUCTION

Intelligent transportation systems extract various vehicle attributes from surveillance images or videos, such as the license plate, logo, color, and model, for subsequent decisions and data analysis. License plate recognition is well known as one of the most effective means to identify vehicles and has been widely and successfully implemented. However, license plates can be easily removed, obscured or tampered with. Thus, recognizing a vehicle's logo, color and model is becoming increasingly important for vehicle identification. The vehicle logo, as a key indicator of the vehicle manufacturer, carries important information for vehicle identification and is difficult to counterfeit. Thus, vehicle logo recognition (VLR) has attracted increasing attention in recent years.
Many VLR methods have been proposed in the past decade. In the early studies, researchers usually designed handcrafted descriptors for the VLR task. However, handcrafted descriptors suffer from the following shortcomings: (1) the design of handcrafted descriptors requires strong prior knowledge, and such descriptors are heuristic in nature; and (2) the generalization ability of handcrafted descriptors is weak for complex object recognition tasks.
To overcome these shortcomings, researchers have tried to exploit learning methods. For example, some deep learning methods have been applied to VLR. Furthermore, researchers have focused on designing suitable neural network models for specific VLR tasks. However, most deep learning methods require a large number of training samples and extensive training time, and inevitably lead to a high computational burden. Therefore, some other researchers have instead tried to extract descriptors based on non-deep-learning-based learning methods, such as Local Quantized Patterns (LQP) [1], the Discriminant Face Descriptor (DFD) [2] and the Compact Binary Face Descriptor (CBFD) [3]. Compared with deep learning methods, non-deep-learning-based learning methods can learn effective parameters from a small number of training samples using less training time, and they can achieve promising recognition performance. So far, non-deep-learning-based learning methods have mainly been used for face recognition and have not yet been applied to VLR.
In this paper, we aim to design a suitable non-deep-learningbased learning method to solve the VLR problem. To this end, a novel method named Multilayer Pyramid Network Based on Learning (MLPNL) is proposed. MLPNL uses pixel difference matrices (PDMs) as the input data for training. The learned feature parameters are used to extract the feature information and encode the feature vector with an unsupervised learning method.
The MLPNL model can automatically learn robust features for VLR; thus, it can achieve higher recognition accuracy than handcrafted descriptors. Moreover, compared to a Convolutional Neural Network (CNN), substantially fewer feature parameters are learned in MLPNL, and fewer training samples are required, thereby accelerating the learning process. Experimental results obtained for the HFUT-VL dataset demonstrate that MLPNL can achieve higher recognition accuracy than most state-of-the-art handcrafted descriptors by at least 2% and non-deep-learning-based learning methods by at least 1% when few training samples are available (5 samples are used). Experimental results obtained for the benchmark XMU dataset demonstrate that MLPNL can significantly reduce the training time (by up to 80%) compared to other deep-learning-based methods while maintaining almost the same recognition accuracy.
The contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work to use a resolution-reduced multilayer pyramid in a learning network, where both the input and the output are PDMs and the parameters are learned by mapping one PDM to another. Thus, features of different resolutions can be obtained in different layers. (2) This is the first work to propose a multi-codebook-based classification method that uses binary matrices extracted through binarization of the PDMs as its input. Multiple codebooks are used to encode features extracted from PDMs of different resolutions to represent different characteristics, thus increasing the robustness of the method. (3) A valid objective function and a corresponding optimization method are designed for network parameter learning. The feature parameters are obtained by optimizing the objective function. Compared with other feature learning methods, our method converges more quickly and is more effective for VLR.

The remainder of this paper is organized as follows. Section II is a brief review of the related work on VLR. In Section III, the details of the proposed MLPNL method and its application for logo recognition are described. Section IV reports the experimental results, and Section V concludes the paper.

II. RELATED WORKS
VLR methods can be divided into two categories: handcrafted methods and deep-learning-based methods. In this section, we briefly review these two types of methods. VLR is also useful for vehicle make and model analysis; therefore, we also briefly review several vehicle make and model recognition (VMMR) methods related to VLR.

A. VLR Methods Based on Handcrafted Descriptors
Several popular handcrafted descriptors such as Scale-Invariant Feature Transform (SIFT) [4]-[6], Histogram of Oriented Gradients (HOG) [7], [8], Local Binary Patterns (LBP) [9], and their variations have been exploited for VLR tasks. Ou et al. [10] proposed a VLR framework that first uses an AdaBoost-based detector to obtain the regions of interest (ROIs) and then utilizes the SIFT descriptor for recognition. Chen et al. [11] extracted vehicle logo features using SIFT features combined with logistic regression (LR). Sun et al. [12] first adopted LBP features combined with the AdaBoost method to detect vehicle logos and then exploited HOG features combined with a support vector machine (SVM) classifier for recognition. Chen et al. [13] proposed to combine Pyramid HOG (PHOG) and Multi-scale Block Local Ternary Patterns (MB-LTP) to recognize vehicle logos. Yu et al. [14] proposed descriptors combining Oriented Edge Magnitudes (OEM) and overlapping LBP features for VLR.

B. VLR Methods Based on Deep Learning
Huang et al. [15] proposed a CNN model for VLR, which consists of two convolution layers, two pooling layers and two fully connected layers. In addition, a pretraining strategy is applied to accelerate the training procedure to render the VLR system suitable for real-world applications. Li and Hu [16] also proposed a method with a pretraining strategy. They used the Hadoop framework for data processing and trained a CNN model for VLR. Xia et al. [17] combined a CNN with the multitask learning (MTL) approach and used an adaptive weight training strategy to accelerate the convergence of the model. Soon et al. [18] presented a method that aims to automatically search for and optimize a CNN architecture for VLR. Huan et al. [19] used the Hough transform to achieve accurate vehicle logo detection based on the locations of a vehicle's logo and license plate. Then, vehicle logo classification was performed with deep belief networks (DBNs). Liu et al. [20] proposed a single shot feature pyramid detector (SSFPD) based on a reduced ResNetXt model for small-sized vehicle logo detection and classification. Yuxin and Peifeng [21] designed a highway-entrance VLR system based on VGGNet, in which a simplified VGGNet and a fine-tuned VGG16 were used to complete the task.

C. Vehicle Make and Model Recognition Methods Related to VLR
A logo is an important part of the vehicle. Thus, research on VLR also benefits other vehicle-related studies. Awan et al. [22] proposed a methodology for recognizing vehicle make and type by identifying logo images, which adopts a supervised learning process based on a maximum average correlation height (MACH) filter and ensures correct classification even under acute angular shifts. Tafazzoli and Frigui [23] presented a two-stage framework for VMMR, in which vehicle logos are used in the second stage to verify the recognition results obtained from the first stage. Sridevi et al. [24] performed vehicle make recognition based on VLR, in which the gray level co-occurrence matrix was extracted as a feature and a probabilistic neural network was used for classification. Lu and Huang [25] used the vehicle logo, grille and headlights to classify vehicles at the brand level; a well-normalized dense HOG feature was used for feature extraction, followed by intra-brand classification of vehicle models of the same brand.
Vehicle logos are usually very small and occupy only a small proportion of a vehicle image. Thus, in the field of VMMR, many studies ignore the vehicle logo and instead take the whole vehicle image or the frontal/rear image of the vehicle as input, which is then recognized based on traditional descriptor-based methods [26], [27] or deep-learning-based methods [28]-[30].

III. METHODOLOGY
The main idea of our methodology is illustrated in Fig. 1. First, we extract PDMs from the training images. These PDMs are input to our multilayer pyramid network. All layers of the pyramid are PDMs with different resolutions that are mapped to binary matrices through binarization. Then, we extract codebooks from the binary matrices using a clustering method. Lastly, a multi-codebook-based classification method is applied to solve the VLR problem. The details are described in the following subsections.

A. Design of the Multilayer Pyramid Network
The structure of the multilayer pyramid network is illustrated in Fig. 1. From this figure, we can observe that the network contains a number of layers (denoted by M). The first layer corresponds to the PDM extracted from the original images. The i-th layer (1 < i < M + 1), i.e., the output of the (i − 1)-th layer and the input to the (i + 1)-th layer, corresponds to a PDM extracted from the preceding layer. The last layer is the output of the penultimate layer. We use PDM_0 to denote the PDM extracted from the original image in the training dataset. PDM_0 is mapped to PDM_1, which is then mapped to PDM_2, and so on up to PDM_M, by the feature parameters as the data pass through the network. The PDMs generated in each layer are all useful for the subsequent mapping and classification processes. The size of each successive PDM decreases proportionally, forming a pyramid structure.

B. Extraction of the PDMs
The PDM extraction process is depicted in Fig. 2. First, the vehicle logo images are divided into multiple overlapping regions. Then, we extract a PDM from each region as follows. For each pixel in the region, a pixel difference vector (PDV) is extracted from the neighbors in a (2R + 1) × (2R + 1) area, where R is the neighborhood radius. The PDV consists of the differences between the center pixel and each neighbor pixel. If the region contains P pixels, then P PDVs are extracted; thus, the PDM can be expressed as (PDV_1, PDV_2, PDV_3, ..., PDV_P).
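As a concrete illustration, the PDV/PDM extraction described above can be sketched as follows. This is a minimal NumPy sketch: the `extract_pdm` name and the edge-padding border handling are our assumptions, since the paper does not specify how border pixels are treated.

```python
import numpy as np

def extract_pdm(region, R=1):
    """Extract a pixel difference matrix (PDM) from a grayscale region.

    For each pixel, the pixel difference vector (PDV) holds the differences
    between the (2R+1) x (2R+1) neighbours and the centre pixel, so each
    PDV has D = (2R+1)^2 - 1 entries. Border pixels are handled by edge
    padding (an assumption; the paper does not specify border handling).
    """
    region = np.asarray(region, dtype=np.float64)
    padded = np.pad(region, R, mode="edge")
    h, w = region.shape
    pdvs = []
    for y in range(h):
        for x in range(w):
            centre = padded[y + R, x + R]
            window = padded[y:y + 2 * R + 1, x:x + 2 * R + 1].ravel()
            # drop the centre entry, keep the D neighbour differences
            diffs = np.delete(window, (2 * R + 1) ** 2 // 2) - centre
            pdvs.append(diffs)
    return np.array(pdvs)  # shape: (P, D) with P = h * w
```

For a 3 × 3 region with R = 1, this yields a 9 × 8 PDM, matching P = (2L+1)² pixels and D = (2R+1)² − 1 neighbour differences per pixel.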
Consider an image dataset with C classes, where each class contains N images. In this case, C × N PDMs can be extracted for all regions with the same coordinates, which can be expressed as (PDM_11, PDM_12, ..., PDM_1N, PDM_21, ..., PDM_CN).

C. Learning Process of the Network

1) Design of an Objective Function and Its Optimization:
The process of feature parameter learning is essentially an optimization process. First, an objective function is designed; then, the feature parameters are learned by optimizing the objective function.
During the PDM generation process, PDVs are first calculated based on neighborhoods. We wish to find a suitable mapping transformation to reduce the dimensionality of each PDV. This mapping transformation is expressed in equation (1):

$$\widetilde{PDV} = PDV\,V, \tag{1}$$

where V is the mapping matrix. V transforms $PDV \in R^{D}$ into $\widetilde{PDV} \in R^{d}$, which is more compact and discriminative. For the entire matrix, the mapping transformation can be expressed as follows:

$$\widetilde{PDM} = PDM\,V. \tag{2}$$

To learn V, two important criteria are enforced: after the transformation, the within-class distance should be minimized, and the between-class distance should be maximized. Thus, we formulate the following optimization objective function:

$$\min J = \gamma_1 J_1 - \gamma_2 J_2, \tag{3}$$

where J_1 represents the within-class distance and J_2 represents the between-class distance:

$$J_1 = \sum_{c=1}^{C}\sum_{n=1}^{N} \left\| PDM_{cn} - M_c \right\|_F^2, \tag{4}$$

$$J_2 = \sum_{c=1}^{C} \left\| M_c - M \right\|_F^2. \tag{5}$$

C represents the number of classes. N represents the number of samples in each class. γ_1 and γ_2 are two weighting parameters that are used to balance the effects of J_1 and J_2, respectively. M_c is the mean of all PDMs extracted from the c-th class, and M is the mean of all PDMs extracted from the entire sample set:

$$M_c = \frac{1}{N}\sum_{n=1}^{N} PDM_{cn}, \qquad M = \frac{1}{CN}\sum_{c=1}^{C}\sum_{n=1}^{N} PDM_{cn}.$$

When PDM is replaced with $\widetilde{PDM}$ via equation (2), equations (3), (4), and (5) can be rewritten as follows:

$$\min_{V} J = \gamma_1 \sum_{c=1}^{C}\sum_{n=1}^{N} \left\| (PDM_{cn} - M_c)\,V \right\|_F^2 - \gamma_2 \sum_{c=1}^{C} \left\| (M_c - M)\,V \right\|_F^2, \tag{6}$$

where $\widetilde{M}_c = M_c V$ is the mean of the mapped PDMs of the c-th class and $\widetilde{M} = M V$ is the mean over the entire sample set. By optimizing the objective function in equation (6), the feature parameter matrix $V \in R^{D \times d}$ is learned. Using V, $PDM \in R^{P \times D}$ is transformed into $\widetilde{PDM} \in R^{P \times d}$, which is more compact and discriminative.
Each PDV plays a different role during the feature extraction process. To extract more valuable information, we want to give each PDV a different weight to enhance the useful information and reduce redundant information. In traditional handcrafted descriptors, an image filter is usually adopted for this purpose. For example, the Gabor filter is used in the Local Gabor Binary Patterns (LGBP) [31] method to select different spatial orientations and scale information from face images. However, designing a valid filter by hand is difficult. Therefore, we want to obtain an effective filter W through learning, with which to map $\widetilde{PDM} \in R^{P \times d}$ to $\overline{PDM} \in R^{p \times d}$:

$$\overline{PDM} = W^{\top}\,\widetilde{PDM}, \tag{9}$$

where $W \in R^{P \times p}$. When we replace $\widetilde{PDM}$ with PDM·V using equation (2), equation (9) can be rewritten as follows:

$$\overline{PDM} = W^{\top}\,PDM\,V. \tag{10}$$

When we apply the same constraints for the filter W, equation (6) can be rewritten as follows:

$$\min_{W,V} J = \gamma_1 \sum_{c=1}^{C}\sum_{n=1}^{N} \left\| W^{\top}(PDM_{cn} - M_c)\,V \right\|_F^2 - \gamma_2 \sum_{c=1}^{C} \left\| W^{\top}(M_c - M)\,V \right\|_F^2. \tag{11}$$

Through the optimization of the objective function in equation (11), a pair of feature parameter matrices {W, V} is learned. With {W, V}, $PDM \in R^{P \times D}$ is mapped to $\overline{PDM} \in R^{p \times d}$ (p < P, d < D), which is proven to contain more valuable information than the original PDM. Therefore, this is not only a dimension reduction process but also an information extraction process. Furthermore, we transform $\overline{PDM}$ into a binary matrix using equation (12) to increase the separability of the feature information and reduce the computational burden:

$$B = \operatorname{sgn}\!\left(\overline{PDM}\right), \tag{12}$$

where sgn(x) returns 1 when x ≥ 0 and 0 otherwise, applied element-wise.
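Under the assumption that the mapping in equations (9) and (10) is a plain linear filtering and that the binarization in equation (12) thresholds at zero, the per-region transformation can be sketched as follows (the function name is ours):

```python
import numpy as np

def map_and_binarize(pdm, W, V):
    """Apply a learned pair {W, V} to a PDM and binarize the result.

    pdm: (P, D) pixel difference matrix
    V:   (D, d) column-reduction matrix, as in equations (1)/(2)
    W:   (P, p) learned filter, as in equation (9)
    Returns the reduced PDM of shape (p, d) and its binary matrix B.
    The zero-threshold binarization rule is an assumption consistent
    with the sgn(x) function mentioned in the text.
    """
    reduced = W.T @ (pdm @ V)            # (p, d): filter rows, compress columns
    B = (reduced >= 0).astype(np.uint8)  # binarize with sgn(x)
    return reduced, B
```

With identity matrices for W and V, the PDM passes through unchanged and only the binarization takes effect, which is a convenient sanity check.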
Our aim is to make the variance of the binary matrix B as small as possible, such that the distribution of the learned information will be as even as possible; in this way, the feature bins are more compact and informative, which enhances the discriminative power and avoids the influence of individual extreme noise on the overall distribution of the data. Therefore, we propose the following objective function:

$$\min J_3 = \left\| B - \bar{B} \right\|_F^2, \tag{13}$$

where $\bar{B} \in R^{p \times d}$ is a mean matrix consisting of repeated column vectors of the mean of each column vector in B. However, equation (13) is an NP-hard problem due to the nonlinear sgn(x) function. Therefore, we relax equation (13) using the method presented in [32] as follows:

$$\min J_3 = \left\| \overline{PDM} - \overline{M} \right\|_F^2, \tag{14}$$

where $\overline{M}$ is a mean matrix consisting of repeated column vectors of the mean of each column vector in $\overline{PDM}$. When we combine the constraints in equations (11) and (14), the final objective function is obtained:

$$\min_{W,V} J = \gamma_1 \sum_{c=1}^{C}\sum_{n=1}^{N} \left\| W^{\top}(PDM_{cn} - M_c)\,V \right\|_F^2 - \gamma_2 \sum_{c=1}^{C} \left\| W^{\top}(M_c - M)\,V \right\|_F^2 + \left\| \overline{PDM} - \overline{M} \right\|_F^2. \tag{15}$$

To compute the extremum of J, we must optimize the objective function in (15) with a descending iterative algorithm. First, we compute the partial derivatives of J with respect to W and V in equations (16) and (17); these can be written in terms of two pairs of matrices, A_1 and A_2 for W (equations (18) and (19)) and Q_1 and Q_2 for V (equations (20) and (21)). Let ∂J/∂W = 0 and ∂J/∂V = 0. We obtain the following equations:

$$A_1 W = A_2 W \Lambda_W, \tag{22}$$

$$Q_1 V = Q_2 V \Lambda_V. \tag{23}$$

The closed-form solution for W is the matrix formed by the eigenvectors corresponding to the p largest eigenvalues of $A_2^{-1} A_1$ from equation (22). The closed-form solution for V is the matrix formed by the eigenvectors corresponding to the d largest eigenvalues of $Q_2^{-1} Q_1$ from equation (23). The process of learning the parameter matrices W and V is described in Algorithm 1.

Algorithm 1: Learning of the Feature Parameter Matrices {W, V}
Input: PDM_cn, c = 1, ..., C, n = 1, ..., N, where C is the number of classes and N is the number of samples in each class; PDM_cn ∈ R^{P×D}.
Output: W ∈ R^{P×p} and V ∈ R^{D×d}, where p is the column dimension of matrix W and d is the column dimension of matrix V.
Step 1 (Initialization): W = I, V = I, where I is the identity matrix. Calculate J with equation (15).
Step 2 (Optimization): do
  J_old = J. Calculate A_1 and A_2 using (18) and (19), respectively; solve (22) to update W. Calculate Q_1 and Q_2 using (20) and (21), respectively; solve (23) to update V. Update J using (15).
while J_old > J;
Step 3 (Output): Output the matrices W and V.
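The closed-form eigen-solution step described above can be sketched as follows (assuming A_2 and Q_2 are invertible; the `top_eigvecs` name is ours, and the same routine serves both the W and V updates):

```python
import numpy as np

def top_eigvecs(A1, A2, k):
    """Closed-form update used for both W and V: the eigenvectors of
    A2^{-1} A1 corresponding to the k largest eigenvalues.

    A1, A2 are the scatter-like matrices accumulated from the PDMs.
    Assumes A2 is invertible; the real parts are taken because
    numerically tiny imaginary components can appear.
    """
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(A2) @ A1)
    order = np.argsort(eigvals.real)[::-1]  # sort eigenvalues descending
    return eigvecs[:, order[:k]].real
```

For a diagonal A_1 and identity A_2, the selected columns are simply the axes of the k largest diagonal entries, which makes the routine easy to verify.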
2) Learning of the Network Parameters: As described in Section III.C.1), after building an objective function J_i for each layer of the pyramid network, we can obtain a pair of feature parameter matrices {W_i, V_i} for the i-th layer. Accordingly, to learn the feature parameters of the whole pyramid network, we propose the following objective function, where M is the number of layers:

$$\min J_{sum} = \sum_{i=1}^{M} J_i. \tag{24}$$

To optimize equation (24), a simple solution is proposed: if the objective function J_i of each layer is minimized, then the overall objective function J_sum will be minimized. Thus, we propose an optimization method for the whole pyramid network based on this idea. This method is described in Algorithm 2.

Algorithm 2: Learning of the Network Parameters
Step 1 (Initialization): i = 0. Calculate J_sum using equation (24).
Step 2 (Optimization): do
  J_sum_old = J_sum. Obtain the input data PDM_i. Calculate A_1 and A_2 using (18) and (19), respectively; solve (22) to obtain W_0^{i+1} and let W^{i+1} = W_0^{i+1}. Calculate Q_1 and Q_2 using (20) and (21), respectively; solve (23) to obtain V^{i+1}. Calculate J_{i+1} using (15) and PDM_{i+1} using (10). i++.
while J_sum_old > J_sum;
Step 3 (Output): Output the learned pairs {W_i, V_i}, i = 1, ..., M.
The above optimization process is the training process for the pyramid network. By applying Algorithm 2, we can learn M pairs {W, V} in the pyramid network. In this manner, PDMs with different resolutions are obtained in different network layers. When features are extracted from these PDMs, important edge and texture information is not lost, and more comprehensive and richer information can therefore be obtained for VLR. These PDMs are then mapped to binary matrices for further processing.
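The greedy layer-wise training can be sketched as below. The `learn_pair` placeholder stands in for the per-layer eigen-decomposition of Algorithm 1 and simply truncates identity matrices so the sketch runs; this is an assumption for illustration only, as are the default δ and σ values (taken from the parameter evaluation in Section IV).

```python
import numpy as np

def train_pyramid(pdm0, labels, num_layers, delta=5/6, sigma=3/4,
                  learn_pair=None):
    """Greedy layer-wise training of the pyramid network (sketch).

    Each layer learns a pair {W_i, V_i} by minimizing its own objective
    J_i, then maps the current PDMs to the next, proportionally smaller
    layer (p = P * delta, d = D * sigma).
    """
    if learn_pair is None:
        # placeholder for Algorithm 1: truncated identity matrices
        def learn_pair(pdms, labels, p, d):
            P, D = pdms[0].shape
            return np.eye(P)[:, :p], np.eye(D)[:, :d]

    pdms, params = list(pdm0), []
    for _ in range(num_layers):
        P, D = pdms[0].shape
        p, d = max(1, int(P * delta)), max(1, int(D * sigma))
        W, V = learn_pair(pdms, labels, p, d)
        params.append((W, V))
        pdms = [W.T @ (m @ V) for m in pdms]  # map to the next layer
    return params, pdms
```

Passing a real `learn_pair` built from the eigen-solution of Algorithm 1 would turn this sketch into the full training loop.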

D. Multi-Codebook-Based Vehicle Logo Classification
In this paper, we propose a multi-codebook-based classification method, in which codebooks extracted from each layer of MLPNL are used as the encoding method to express the extracted feature vectors, and then classifiers are used for their classification.
The whole feature extraction process is illustrated in Fig. 3 and can be summarized as follows.

1) Codebook Generation Process: First, we obtain a group of binary matrices (B_1, B_2, ..., B_M), which we later refer to as a binary matrix group, from each region through the MLPNL network using equation (12). M is the number of layers. The rows of each m-th layer binary matrix are used as training vectors to learn a codebook based on the K-means clustering method. Thus, M codebooks (codebook_1, codebook_2, ..., codebook_M) are learned, grouped together and referred to as a codebook group. Then, if each image is divided into H regions, we can obtain H codebook groups, and each codebook group contains M codebooks.
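The per-layer codebook learning can be sketched with a minimal K-means implementation (the number of clusters k, the iteration count and the random initialization are assumptions; the paper does not state its clustering settings):

```python
import numpy as np

def learn_codebook(binary_rows, k, iters=20, seed=0):
    """Learn one codebook from the rows of a layer's binary matrices
    via K-means (minimal Lloyd-style implementation)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(binary_rows, dtype=np.float64)
    # initialize centres with k distinct training rows
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each row to its nearest centre (squared L2 distance)
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres
```

In practice a library K-means (e.g., OpenCV's, given the paper's C++/OpenCV implementation) would replace this sketch; the point is that each layer's binary rows yield one codebook.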
2) Feature Vector Generation Process: When the test image is divided into H regions, we can obtain H binary matrix groups through the MLPNL network. Each binary matrix group contains M binary matrices (B_1, B_2, ..., B_M). We match binary matrix group h with codebook group h to obtain feature vector group h. In the h-th group, we match each row of the binary matrix B_m with codebook_m, and the nearest match in the codebook is adopted to represent this row. We count the occurrence frequency of each codebook entry and generate the histogram as a feature vector for B_m, denoted by Vec_m. Then, the Vec_m are concatenated to form the feature vector of this group, denoted by (Vec_1, Vec_2, ..., Vec_M). Last, the H feature vector groups are concatenated to form the final feature vector.
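The histogram encoding of one binary matrix against one codebook can be sketched as follows (the L2 nearest-match rule and the normalization of the histogram are our assumptions; the paper only states that occurrence frequencies are counted):

```python
import numpy as np

def encode(binary_matrix, codebook):
    """Encode one binary matrix against one codebook: each row is
    matched to its nearest codebook entry, and the normalized histogram
    of match counts is the feature vector Vec_m for that layer."""
    X = np.asarray(binary_matrix, dtype=np.float64)
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()
```

Concatenating `encode(B_m, codebook_m)` over the M layers of a group, and then over the H region groups, yields the final feature vector described above.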
To perform VLR once the image feature vectors have been obtained, we choose the collaborative representation-based classifier (CRC) because it uses similarities between classes to achieve a collaborative representation, unlike traditional classifiers that focus on discriminative categorization [33]; as a result, CRC runs much faster than other classifiers. The experiments reported in Section IV.B.5) show that the CRC classifier yields better results than other classifiers.
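A minimal sketch of a CRC-style classifier is given below. The ridge-regularized closed-form coding and the class-wise regularized-residual decision rule follow the common CRC formulation [34]; the regularization weight `lam` and the class structure are assumed values for illustration.

```python
import numpy as np

class CRC:
    """Collaborative representation-based classifier (CRC) sketch.

    Training vectors from all classes form a single dictionary; a test
    vector is coded over the whole dictionary with a closed-form ridge
    solution, and the class whose sub-dictionary gives the smallest
    regularized residual wins."""

    def __init__(self, lam=1e-2):
        self.lam = lam  # assumed regularization weight

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=np.float64).T  # (dim, n) dictionary
        self.labels = np.asarray(y)
        n = self.X.shape[1]
        # precompute the projection (X^T X + lam * I)^{-1} X^T
        self.P = np.linalg.inv(self.X.T @ self.X
                               + self.lam * np.eye(n)) @ self.X.T
        return self

    def predict_one(self, v):
        alpha = self.P @ v  # collaborative coding coefficients
        best, best_r = None, np.inf
        for c in np.unique(self.labels):
            mask = self.labels == c
            resid = np.linalg.norm(v - self.X[:, mask] @ alpha[mask])
            # regularized residual: residual scaled by coefficient energy
            r = resid / (np.linalg.norm(alpha[mask]) + 1e-12)
            if r < best_r:
                best, best_r = c, r
        return best
```

Because the projection matrix is precomputed once in `fit`, classifying a test vector costs only one matrix-vector product plus per-class residuals, which is the source of CRC's speed advantage noted above.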

IV. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed VLR method was implemented using Visual C++ 2017 and the OpenCV 3.2.0 library. We conducted comprehensive experiments on the HFUT-VL [14] and XMU [15] datasets to evaluate the performance of the proposed method in comparison with several state-of-the-art methods. The evaluation was performed on a machine with an Intel Core i7 CPU, 16 GB of memory and a GTX 1080 Ti GPU.

A. Dataset
HFUT-VL is a large-scale vehicle logo dataset constructed by our team in [14]. In this paper, we evaluate our method using the accurately located sub-dataset, which includes 80 categories of vehicle logo images with dimensions of 64 × 64, with 200 images in each category. Compared with most existing vehicle logo datasets, HFUT-VL contains more categories and a larger total number of samples. Thus, we evaluate the parameters and compare our method with state-of-the-art handcrafted descriptors, non-CNN-based learning methods and CNN-based learning methods on this dataset.
The XMU dataset [15] is a benchmark vehicle logo dataset that includes 10 categories, with each category containing 1000 training images and 150 test images with dimensions of 70 × 70. A sufficient number of samples is available in each category; thus, we compare our method with state-of-the-art vehicle logo recognition methods and CNN-based learning methods on this dataset.

B. Parameter Evaluation
To achieve the best performance, we adjusted several parameters through experiments based on the HFUT-VL dataset. The parameters adjusted included the number of network layers, the dimensions of W and V , the neighborhood radius and the region size. Further descriptions are provided in the following subsections.
1) Network Depth: The depth of the network influences the performance of our MLPNL method. To achieve the best results, experiments were designed to evaluate the impact of different numbers of network layers on the VLR results.
First, we set the number of pyramid network layers to 1. In this case, our method reduces to a traditional learning-based descriptor. As shown in Fig. 4(a), the performance with more than one network layer is better than that with a single layer, proving that the multilayer pyramid network performs better than a one-layer network; that is, the multilayer pyramid structure improves the recognition rate.
Then, we wish to determine how many layers our MLPNL network should contain to achieve the best performance. From Fig. 4(a), we can see that as the number of layers increases from 3 to 6, the performance of networks with PDM_0 dimensions of both 49 × 24 and 49 × 48 remains at almost the same level. Accordingly, as shown in Fig. 4(b), when the depth of the pyramid network is increased from 3 to 4, the gain in the recognition rate drops rapidly, and when the depth is increased from 4 to 5 and from 5 to 6, the gain is almost 0. This is because, for a given network, a higher layer number corresponds to a lower dimensionality of the feature matrix; a feature matrix with low dimensionality contains too little information and is therefore less effective in improving the recognition rate, so deeper networks are unnecessary. From Fig. 4(a) and Fig. 4(b), we can see that both 3 and 4 layers are good choices for PDM_0 dimensions of 49 × 24 and 49 × 48. However, considering the additional time complexity incurred by one more PDM, 3 is the better choice.
2) The Size of the Initial Feature Map: In addition to the network depth, the size of the initial feature map PDM_0 also influences the recognition rate. Suppose that the size of the initial feature map is P × D, which depends on the region radius L and the neighborhood radius R, i.e., P = (2L + 1) × (2L + 1) and D = (2R + 1) × (2R + 1) − 1. Thus, to evaluate the influence of the size of the initial feature map, we need only to evaluate the influence of L and R. Experiments are carried out with different L and R values, and the results are presented in Table I. As depicted in Table I, when L is increased from 2 to 3, the recognition rate increases, but when L is increased from 3 to 4 and then to 5, the recognition rate decreases, because for a given image size, once L exceeds a certain level, the performance of local feature descriptors degrades. Therefore, we set the value of L to 3. From Table I, we can also observe that when L = 3, R = 3 is the best choice.

3) The Decreasing Rate of the Feature Parameters: In our multilayer pyramid network, once the size of PDM_0 has been selected, the PDM sizes of the following layers are determined by the parameter matrices W and V. If we denote the input data for the i-th network layer by PDM_{i−1} ∈ R^{P×D} and the feature parameter matrices for this layer by W_i ∈ R^{P×p} and V_i ∈ R^{D×d}, then according to equation (10), the output data from this layer are represented by PDM_i ∈ R^{p×d}. The dimensions of PDM_i are proportionally reduced relative to those of PDM_{i−1}. Suppose that p = P · δ and d = D · σ, where δ and σ are proportionality coefficients. Therefore, the influence of δ and σ must be evaluated.
Experiments are carried out with different δ and σ values, and the results are shown in Fig. 5. When δ is fixed, the best performance is achieved when σ = 3/4. When σ = 3/4, the best performance is achieved when δ = 5/6. Thus, we select σ = 3/4 and δ = 5/6 as the final values.
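The size bookkeeping implied by the choices above (L = 3, R = 3, δ = 5/6, σ = 3/4) can be sketched as follows (the `pyramid_sizes` helper is ours; integer truncation of the reduced dimensions is an assumption):

```python
def pyramid_sizes(L, R, num_layers, delta=5/6, sigma=3/4):
    """Sizes of PDM_0 ... PDM_M given region radius L, neighbourhood
    radius R and the proportionality coefficients delta and sigma."""
    P = (2 * L + 1) ** 2        # pixels per region
    D = (2 * R + 1) ** 2 - 1    # neighbour differences per PDV
    sizes = [(P, D)]
    for _ in range(num_layers):
        P, D = int(P * delta), int(D * sigma)  # p = P*delta, d = D*sigma
        sizes.append((P, D))
    return sizes
```

For example, L = 3 with R = 3 gives the 49 × 48 initial feature map discussed in Section IV.B.1), and R = 2 gives the 49 × 24 variant.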

4) Level of Region Overlapping:
Overlapping regions have been shown to be effective for VLR in [14] because of the local anisotropy of vehicle logo images. When the region size is fixed, an increase in the overlap level increases the number of regions. On the one hand, when an image is divided into more regions, more valuable information is retained; thus, better final recognition results can be achieved. On the other hand, a high overlap level increases the similarity between regions; thus, more redundant information is also retained, which adversely affects the recognition results. Therefore, we chose the most suitable overlap level based on the experimental results. From Fig. 6, we can see that the best performance is achieved when the overlap level is L.
5) The Choice of Classifier: A suitable classifier is important for achieving good results. In this section, we present experiments conducted using common classifiers such as K-nearest neighbors (KNN), SVM, CRC [34], sparse representation classifier (SRC) [35], and large margin nearest neighbor (LMNN) [36] under different numbers of training images. The number of test images is 150 images per class. The experimental results are shown in Fig. 7.
From Fig. 7(a), we find that the CRC classifier yields better results than the other classifiers. Moreover, as shown in Fig. 7(b), the CRC classifier is much faster than the others. Therefore, when a small number of training samples is used, the CRC classifier achieves the best performance; thus, we selected the CRC classifier for our method.

C. Comparison With State-of-the-Art Methods

1) Results on the HFUT-VL Dataset: First, we compare our method with state-of-the-art handcrafted descriptors and non-deep-learning-based learning methods using a small number of training samples (5 images per class). Table II lists the recognition rates of our method and the other state-of-the-art methods with this small number of training samples. From the results in Table II, we can observe that our network with 3 layers achieves the best performance. Even when we set the number of layers in our network to 1, it still achieves better performance than the other state-of-the-art handcrafted descriptors and non-deep-learning-based methods with just 5 training images. This finding demonstrates that MLPNL can exploit a small number of training samples to achieve performance that matches or exceeds that of most handcrafted descriptors and non-deep-learning-based methods.
Then, we compare the performance of our method with that of deep-learning-based methods on the HFUT-VL dataset. The CNN models are all referenced from the GluonCV library, and the results are listed in Table III. When the training number is 100, the recognition rate of our method is higher than that of most deep-learning-based methods but slightly lower than those of VGG16_BN, DenseNet121 and SeNet154. However, in both cases (50 and 100 training images), the training time and testing time of our method are at least 7 and 4 times faster, respectively, than those of the deep-learning-based methods. Thus, our method is more suitable in situations where time is the primary consideration. Moreover, our method has lower requirements for the number of training samples than deep-learning-based methods and can still perform well when fewer training samples are available.

2) Results on the XMU Dataset:
The XMU dataset is an open and popular vehicle logo dataset for VLR and has recently been used to evaluate the performance of numerous VLR methods. Therefore, we also carry out experiments on the XMU dataset to compare our method with other VLR methods. Like other VLR methods, we use 1000 training samples per class and 150 testing samples per class. The experimental results are shown in Table IV. The recognition rates of the other VLR methods are taken directly from the published papers. In Table IV, the first 7 methods are state-of-the-art vehicle logo recognition methods, and the next 8 methods are deep-learning-based methods. The deep-learning-based methods are referenced from the Gluon library of MXNet without pre-training and are trained to achieve the best performance on the XMU dataset. To further demonstrate the advantage of our method, which learns faster while maintaining good performance, we also use the XMU dataset to train the MLPNL model.
From Table IV, we observe that the recognition rate of our method is 99.98%, better than that of the other VLR methods and matching the performance of the state-of-the-art CNN models. When the recognition rate of MLPNL reaches 99.98%, the training time is only 217 s with CUDA acceleration, which is faster than the other CNN models. MobileNet is a fast CNN model for classification; however, this model still requires 408 s to reach a comparable recognition rate.

[Table III: The experimental results of our method and some state-of-the-art deep-learning-based methods on the HFUT-VL dataset (number of training images = 50 and 100, denoted as NT50 and NT100, respectively).]

The parameter sizes of the compared models are listed in Table V; our method has a much smaller parameter size, which greatly saves computing and storage costs and improves the generalization ability of the network. For the HFUT-VL and XMU datasets, because of the difference in the number of object classes (80 for HFUT-VL and 10 for XMU), we adjust the output layer, which leads to a slight difference in parameter sizes. The parameter size differs greatly for CapsNet because CapsNet has only three layers; thus, a parameter change in the last layer considerably influences the whole parameter size. Our MLPNL method uses the CRC classifier, and the parameters of CRC are related to the number of object classes in the training set; thus, for the XMU dataset, the parameter size is larger than that for the HFUT-VL dataset.

V. CONCLUSION
In this paper, a multilayer pyramid network based on a learning method is proposed for VLR. In this network, multiple pairs of feature parameters are learned by optimizing the objective function and are used to obtain features of different resolutions. Then, through binarization and clustering, codebooks are learned for each layer. Next, a multi-codebook-based classification method is used for the final recognition process. The experimental results demonstrate that our method exhibits better performance than most state-of-the-art handcrafted descriptors, non-deep-learning-based learning methods and existing VLR methods, and is faster than most CNN models trained with large numbers of samples while maintaining almost the same recognition rate.
The disadvantage of MLPNL is that file reading requires too much time during the training period, which affects the performance of the whole algorithm. MLPNL can be further improved once this problem is solved.