Are We Ready for Accurate and Unbiased Fine-Grained Vehicle Classification in Realistic Environments?

Fine-grained vehicle classification from images, also known as Vehicle Make and Model Recognition (VMMR), has become an important research topic in the last years, with a growing number of scientific contributions in multiple application areas, such as autonomous vehicles, surveillance systems, traffic monitoring and management, among others. Recent techniques based on deep learning have proven to be very effective in addressing this problem. So effective that, based on the state-of-the-art results (above 95% accuracy), it would seem that the problem is practically solved. However, our main hypothesis is that the existing datasets to date have limited variability, which precludes good and unbiased generalisation of the models trained with them. In particular, it is observed that the test datasets are very similar in nature to those used for training and validation which makes these benchmarks prone to dataset bias and to overfitting. When these systems are tested with more challenging data or data from different datasets performance degrades considerably. In this paper, on the one hand, we evaluate state-of-the-art deep learning models to perform fine-grained vehicle classification and explore multiple training techniques, such as curriculum learning or weighted losses, to mitigate the bias between different makes and models and to assess the limits of current approaches. On the other hand, we analyse the existing datasets, present an additional dataset from a challenging scenario, and merge all the data into a cross-dataset that includes common samples and classes from the existing datasets. In this way, we can evaluate geographical, make and model biases, and performance and generalisation capabilities from a more realistic perspective. The obtained results suggest that we are still far from accurate and unbiased vehicle make and model recognition in realistic traffic and driving scenarios.


I. INTRODUCTION
Fine-grained vehicle classification consists in classifying vehicles according to make and model, and even differentiating between versions of a particular model (ultra-fine-grained classification). This task is especially useful in combination with other applications, for example alongside license plate recognition systems to detect whether a vehicle is driving with a fake number plate, or in a public car park to detect an attempted theft. In conjunction with keypoint detection methods [1], it is also possible to project the 3D structure of a known model, obtaining distance, size and perspective information from 2D images. Regarding number plate recognition systems, the plate information can be used to retrieve vehicle data and solve the vehicle classification task, but this approach is vulnerable to recognition errors and license plate swaps, and plate information is not always available. For this reason, a robust system able to classify make and model efficiently could be extremely useful.

FIGURE 1. Fine-grained vehicle classification from images has achieved above 95% accuracy on validation. Those same models experience a drop in performance when faced with different datasets, making them unsuitable for real-world applications.
There are three main problems when working with finegrained classification. First, multiplicity, i.e., the same model has different shapes and/or appearance depending on the year of manufacture (different versions of the model). Second, ambiguity, i.e., two models from different or the same manufacturers have similar appearance. Third, bias, i.e., distribution of makes and models is not representative of the actual study population. These issues make the problem of fine-grained vehicle classification a major challenge in which a correctly constructed dataset is of vital importance.
There are a considerable number of existing datasets for the task of fine-grained vehicle classification, which can be divided into two categories. First, specific datasets, created to solve a particular task or limited to a given scenario, such as surveillance [2]-[4]. They tend to be smaller, offering little flexibility and little generalisation potential. Second, general-purpose datasets, which aim to advance the state-of-the-art of classification and are intended to be multipurpose, as for example [5]-[8]. The difficulty of constructing a general dataset that accurately represents reality usually makes them biased, with poor variety of viewpoints, lighting or scenarios, making them less suitable for real-world applications.
Most work focuses on solving a specific problem or obtaining raw results, either on previous datasets or on a new general dataset. This has led to the current situation of performance saturation, with datasets such as CompCars [6] saturating at around 98% validation accuracy, which suggests that the problem of fine-grained classification is mostly solved. However, when we use these models in more challenging scenarios or analyse performance for each individual class, the results are not entirely satisfactory.
Dataset bias is not usually taken into account, yet it clearly materialises as a class imbalance problem. In datasets with hundreds of classes, one can report 95% average accuracy even if multiple classes report very low performance, because the number of samples for these classes is so low that it barely affects the overall results. We have empirically observed this behaviour when trying to use one of our models, which achieves state-of-the-art results in CompCars [9], in a more realistic scenario (see Fig. 1).
In line with the above statements, the aim of this paper is not to present a new method and compare it with previous approaches; the difference in performance between current methods on current datasets is practically negligible. Instead, we focus on the empirical assessment of current limitations and problems, proposing several solutions to address them. In particular, the main contributions of our work can be summarised as follows:
• We study the applicability of curriculum learning techniques to the fine-grained vehicle classification problem and evaluate their performance.
• We analyse the effect of bias on the per-class performance of fine-grained models and explore techniques, such as weighted loss, to mitigate its effects and improve performance and generalisation capabilities.
• We propose a test set built from the PREVENTION dataset [10] to externally evaluate performance and generalisation capabilities in realistic scenarios both for makers and models.
• We present a cross-dataset to mitigate biases, and assess the complexity and generalisation capabilities of existing datasets. This cross-dataset is publicly available.

The remainder of the paper is organised as follows. Section II briefly summarises the state-of-the-art and the most relevant datasets. The methodology and data augmentation techniques are presented in Section III. An extensive experimental evaluation is provided in Section IV. Conclusions and future work are finally discussed in Section V.

II. RELATED WORK
A. EXISTING DATASETS
Despite the existence of a significant amount of vehicle make and model classification research, most of the existing datasets are small or medium in size, with only a few large datasets publicly accessible. This has led researchers to work with their own datasets, which are typically small given the high cost of acquiring and labelling thousands of images. Because of this, it is extremely difficult to compare the different approaches, as each one uses a different dataset.
The Cars-196 dataset [5] is the first large-scale fine-grained vehicle classification dataset. It contains 16,185 images from multiple viewpoints of 196 classes of cars, with labels for make, model and year. Being the first one with a relevant size, many later works make use of it. Even so, it still has a limited number of images, the images have professional quality, making them far from real-world conditions, and a large number of vehicles are from the same year (2012), which implies poor diversity.
The CompCars dataset [6] is probably one of the most relevant vehicle datasets, proposing three different tasks: fine-grained classification, attribute prediction and car verification. It contains data from two scenarios, one of web nature and the other of surveillance nature. The web-nature data was collected from internet forums, websites and search engines, with a total of 136,727 images from 163 makers and 1,716 models with different viewpoints. The surveillance-nature data was collected from road surveillance cameras, with a total of 44,481 images, all of them from the frontal view. Focusing on the fine-grained classification task, the authors propose the use of a web-nature subset composed of 52,083 images of 431 different car models. Overall, the CompCars dataset is of reasonable quality. However, there is a considerable geographical bias, since most of the vehicle makes and models are specific to the China region, which can be a problem for applications in other regions. The images have professional quality, including even some renders, making them far from a real-world situation. The multiplicity problem has been ignored, grouping all versions of a model into one class, even though they have different appearances.
BoxCars [3] is a vehicle dataset focused on surveillance applications. The images were taken from surveillance cameras and, for each correctly detected vehicle, there are 3 images from different viewpoints. The dataset contains 21,250 different vehicles with a total of 63,750 images, 27 makes and 126 models. The authors also provide make, model, submodel and model year annotations, as well as 3D bounding box information. This is a robust dataset, with real-world quality and diversity of views, albeit with low image size and quality. It also suffers from geographical bias, as all the images were recorded in the city of Brno in the Czech Republic.
VMMR-db [7] is probably the most ambitious vehicle classification dataset ever created. It contains a total of 291,752 images of 9,170 different classes. These images were taken by different users and cameras, ensuring a great variety of views, lighting and quality, making it realistic. To build the dataset, the images were gathered from online vehicle selling web pages and automatically annotated using the title and description provided by the sellers. The authors provide a subset of 51 classes overlapping with CompCars and a subset of 3,036 classes containing all of those with more than 20 images. Unfortunately, because automatic annotation was used, the same vehicle model is split into several classes for different years, even when the specific model did not change between those years, although the dataset does not otherwise suffer from the multiplicity problem. In addition, although with less impact, it also suffers from a certain geographical bias.
The Frontal-103 dataset [8] is, to our knowledge, the most recently published vehicle dataset. It comprises a total of 65,433 frontal-view web-nature images from 103 makers and 1,759 models, tackling the multiplicity problem with a different class for each version of a model. Although Frontal-103 is promising, some shortcomings can be found. As in CompCars, the images have professional quality, with a non-negligible number of renders. Many of the images are very similar (almost repeated in some cases). A significant number of vehicles have been found to be mislabelled. No training/validation/test split is provided, which is important, as many of the images are very similar or repeated, so many of the images seen in training can also appear in validation and testing. In spite of all this, the dataset faces the problem of multiplicity in a competent and effective manner. Finally, as in all other cases, there is a considerable geographical bias: most of the car manufacturers are from China, which makes a model trained on this data hardly applicable in regions where such vehicles are absent.
To perform an independent evaluation of the different models and assess their generalisation capabilities, we have created a test set based on the PREVENTION dataset [10], which is designed for vehicle intention prediction and contains images from real driving scenarios. The viewpoint and nature of the images are very different from those found in most fine-grained datasets, which makes it suitable for assessing generalisation capabilities. The PREVENTION dataset has a total of 356 minutes of recordings covering a distance of 540 km. Images were obtained from two cameras (front and rear view). We manually selected a total of 2,685 vehicles, of which 1,452 are front-facing images and 1,233 rear-facing. A total of 33 different makers have been labelled. Of these 2,685 vehicles, 1,113 have been labelled at model level, 618 front-facing and 515 rear-facing, covering a total of 87 different models. The reasons why not all vehicles have a model label are the impossibility of identifying the model reliably or the lack of consensus among annotators.
A summary of all the datasets can be found in Table 1.

B. FINE-GRAINED VEHICLE CLASSIFICATION
Fine-grained vehicle classification is a widely explored task. Before CNNs became the standard, classification tasks relied on hand-crafted features. Some of the most remarkable works of the pre-CNN era focused on the inherent characteristics of the vehicles by modelling their geometry and appearance, as in [12]-[14]. Santos and Correia [14] proposed an automatic car recognition system composed of two recognition methods, both relying on the external features of the car: one makes use of the rear-view shape, dimensions and edges, while the other makes use of features of the back lights. Llorca et al. [13] also used rear-view images. They applied a license plate recognition module and a previously developed vehicle make recognition system [15] based on the logo to predict the car make and, after that, learned the geometry and appearance of rear car emblems to predict the model. In [12], Gu and Lee proposed a method to deal with severe pose variation. They presented a mirror morphing scheme exploiting the symmetry of cars to normalise an image of any orientation into a typical view.
Looking deeper into the existing works of the CNN era, different approaches have been taken to tackle the fine-grained vehicle classification problem, such as focusing on location, appearance and/or parts, working in 3D space, or using different networks, modules or training techniques, among others.
In the first group we have those that focus on location, appearance and/or parts [16]-[24]. In [16], Lin et al. proposed a novel end-to-end trained CNN architecture for fine-grained visual recognition called Bilinear CNNs. The idea is to have two networks that extract location- and appearance-related features and then combine them as a pooled outer product, obtaining localised feature interactions invariant to translation. They also proved that these bilinear features are highly redundant and can be reduced by an order of magnitude while keeping performance practically unaltered. They report results on Cars-196 and, although they do not surpass the state-of-the-art, they are close to it. In [17], Krause et al. proposed a method that, instead of using part annotations (as in their previous work [5]), generates parts using co-segmentation and alignment in combination with R-CNN. They show that this approach achieves state-of-the-art results on their dataset, outperforming methods that use part annotations during training. One interesting approach is the one taken by Fang et al. [18], in which they tackle the fine-grained vehicle recognition problem by locating discriminative parts where the differences are more evident. To do so, they propose a coarse-to-fine method that makes use of CNNs to extract feature maps and locate these discriminative regions. The feature maps are then used to detect refined regions and extract their features until there are no regions left. Then, all the features (global and local) are used together in a one-versus-all SVM classifier, obtaining state-of-the-art results on the CompCars surveillance subset. Following the discriminative region approach, Fu et al. [19] presented a novel framework that uses a recurrent attention CNN to recursively learn discriminative region attention and region-based feature representation at multiple scales, obtaining results similar to [17] on Cars-196, but without human-defined bounding boxes. In [20], Zhao et al. proposed a Diversified Visual Attention Network (DVAN) that is able to gather discriminative information using multiple attention canvases from which it extracts convolutional features. An LSTM recurrent unit is then used to learn the attentiveness and discrimination of these canvases. In [21], Tian et al. followed an approach similar to the one taken by Fang et al., also obtaining local and global features. They proposed an iterative discrimination CNN based on selective multi-convolutional region feature extraction. Two types of features are extracted (local and global) and then used to iteratively localise deep pivotal features and feed them to a fully-connected fusion layer. They report near state-of-the-art results on Cars-196 and on CompCars. In [22], Elkerdawy et al. proposed the use of a co-occurrence layer to discover parts in an unsupervised way, avoiding the use of part or 3D bounding box annotations. They report state-of-the-art results on BoxCars and competent results on CompCars. In [23], Du et al. proposed a novel method that adds new layers in each training step, exploiting information from the previous step, and a jigsaw puzzle generator to enhance the network input by forming images that contain information from different granularity levels. They report results on several fine-grained classification datasets, obtaining state-of-the-art results on Cars-196. Recently, Ding et al. [24] outperformed these results using enhanced feature representations and discriminative regions.
To do so they presented the Attention Pyramid Convolutional Neural Network (AP-CNN), consisting of two feature and attention pathways used to learn high-level semantic features and low-level detailed features. Following this, they use a ROI-guided strategy that refines features and eliminates background noise.
Among those that work in 3D space we have [3], [5], [25], [26]. One of the limitations of 2D recognition models is their limited ability to generalise across different viewpoints. In [5], Krause et al. upgraded two 2D methods to 3D, outperforming their 2D counterparts. To do so, they first estimate the 3D geometry of the object and then represent the appearance of local features and their locations in 3D space. In [25], Ramnath et al. proposed a method to recognise make and model from an arbitrary view. They first create a 3D hull from the image and then project 3D space curves and refine them using three-view curve matching. These 3D curves are then matched to 2D image curves using an alignment technique. Lin et al. [26] proposed to jointly optimise 3D model fitting and fine-grained classification. First, they use Deformable Part Models (DPM) to extract initial part locations. Second, they use regression techniques to estimate landmark locations. Then, they fit the 3D landmarks of a deformable model to the predicted 2D landmarks. With this information they extract part-based features and use them in an SVM classifier. Finally, they use the prediction to refine the landmark fitting. In [3], Sochor et al. proposed an enhanced input to a CNN. Instead of using the plain image, they obtain a 3D bounding box used to ''unpack'' the vehicle image, together with the shape and orientation, boosting performance both for classification and recognition. The main problems with 3D methods are their high complexity and the need for much denser labelling. If 3D information is not relevant, 2D methods are more efficient and provide, in general, better results.
Finally, there is a plethora of existing work that makes use of different networks, modules or training techniques [9], [27]-[32]. In [27], Anderson et al. used a modular approach combining pretrained networks with new untrained ones. In this way, the new modules learn features complementary to those of the pretrained ones. They used Cars-196 to prove their approach. Instead of a new network or training technique, Hu et al. [28] proposed the use of a Spatially Weighted Pooling (SWP) layer to improve the robustness and effectiveness of CNN feature representations. This novel pooling layer contains a predefined number of spatially weighted masks that are learnt to pool the extracted features in a discriminative way. They obtain state-of-the-art results on both Cars-196 and CompCars. Other approaches focus on the loss function instead of the CNN structure, as in [31], where Li et al. proposed a new regularisation term added to the cross-entropy loss. The resulting loss function, Dual Cross-Entropy Loss, helps alleviate the vanishing gradient problem and demonstrates good performance with small datasets. They use Cars-196 to prove their approach and obtain state-of-the-art results. In [9], Corrales et al. presented an end-to-end training methodology for fine-grained vehicle classification. By applying diverse techniques such as data augmentation, learning rate policies and fine-tuning strategies, they achieved state-of-the-art results on CompCars. In [32], Buzzelli et al. revisited CompCars, defining a new, more challenging and realistic train/test split, and propagated the existing type-level annotations to the whole dataset. They also designed and implemented three different methods: one that directly predicts make-model-year, a two-step approach that first predicts vehicle type and then make-model-year, and a multilabel approach that predicts both type and make-model-year. They show interesting results, with a new baseline that goes down from ∼90% to 61% accuracy, and achieve 70% accuracy with the two-step method.
As we have seen, there are multiple datasets and approaches to address fine-grained vehicle classification. However, there is a clear tendency to increase the complexity of the models to improve the overall results, neglecting other key aspects such as class imbalance and generalisation capability.
A summary of the CNN-era fine-grained vehicle classification approaches can be found in Table 2.

C. IMBALANCED CLASSES
One of the key problems when working with large classification datasets with a large number of classes is class imbalance. We have empirically found that models that perform well on average can have poor generalisation capabilities, reporting very poor results for under-represented classes. Typically, there are two re-balancing approaches to address this problem: one is re-sampling the data (over-sampling under-represented classes or under-sampling over-represented ones) and the other is to use weights to balance the training. In the case of re-sampling, over-sampling seeks to artificially increase the number of samples of under-represented classes (the dataset bias problem). The initial way to do this was to add repeated samples, at the cost of increasing the risk of overfitting. To prevent this, new samples can be either interpolated from existing samples [33], [34] or synthesised [35]-[37]. Although these new samples prevent overfitting, they can also be noisy, negatively affecting model performance. The other re-sampling technique, under-sampling, runs the risk of discarding relevant data, though it still seems preferable to over-sampling [38]-[40].
Regarding weight-based methods, a common approach is to use the inverse frequency of each class [41]-[43]. Another approach is to focus on the difficulty of each class as measured by its loss [44], to use cost-sensitive weighting [45], [46], or to use a meta-learning algorithm that learns to assign weights based on the gradients, like the one used by Ren et al. [47].
A technique that has recently gained special attention is focal loss [44]. Its authors proposed a modification of the standard cross-entropy loss, adding a new term that reduces the relative loss of well-classified samples and focuses training on the harder, misclassified ones.
Recently, Cui et al. [48] presented a novel framework to measure data overlap and compute the effective number of samples for each class. They then use a re-weighting scheme based on this effective number of samples to re-balance the loss, obtaining significant performance increases on long-tailed datasets.

III. METHODOLOGY
In order to tackle the fine-grained vehicle classification problem and evaluate generalisation capabilities, multiple experiments will be carried out. For this purpose, a variety of strategies and methods have been adopted. This section describes the different architectures used, data augmentation techniques, learning rate policies, curriculum learning methods and different loss weighting strategies.

A. ARCHITECTURES
Many years have passed since AlexNet [49]. During this time, CNNs have evolved and today there are countless different models, from VGG [50], the direct evolution of AlexNet, through Inception [51], [52], ResNet [53] or ResNeXt [54], to Google's EfficientNets [55]. In [56], Bianco et al. presented an in-depth analysis of the main Deep Neural Networks (DNNs) used for image recognition, reporting multiple performance indices. In this paper, we propose to use the ResNet50 and InceptionV3 models for two main reasons.
First, these two models have a good performance-to-complexity ratio, with a very efficient use of their parameters [56]. Second, they are perfectly capable of addressing the fine-grained vehicle classification problem, allowing us not only to obtain good overall performance, but also to study the impact of different learning techniques on per-class performance and generalisation, as well as to analyse the quality of the datasets.

B. DATA AUGMENTATION
It is widely accepted by the community that data augmentation is essential to improve model performance and prevent overfitting [57]. In our previous paper [9] we empirically proved the benefits of using data augmentation and tested various techniques:
• Horizontal Flip: a horizontal flip (over the y-axis) is performed on the image with a probability of 50%.
• Salt and Pepper: each pixel of the image is set to 0 or 255 with a probability of 2%.
• Blurring: a Gaussian blur is applied to the image with a random kernel size between 3 and 11 and a standard deviation of 6.
• Color Jittering: the image is converted to the HSV color space and the saturation and value channels are randomly and independently modified.
Our data augmentation strategy is applied to the training data in each epoch and proceeds as follows. First, we randomly apply the flipping operation to each image. Second, we randomly select one of the other data augmentation operations and apply it to the resulting image. Finally, we apply ImageNet [58] normalisation.
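For illustration, a minimal PyTorch/torchvision sketch of this pipeline follows. The probabilities and parameter ranges come from the description above; the class names are our own, and ColorJitter is used as a stand-in for the HSV saturation/value jittering.

    import random
    import numpy as np
    from PIL import Image
    from torchvision import transforms

    class SaltAndPepper:
        """Set each pixel to 0 or 255 with the given probability (2% total)."""
        def __init__(self, prob=0.02):
            self.prob = prob
        def __call__(self, img):
            arr = np.array(img)
            mask = np.random.rand(arr.shape[0], arr.shape[1])
            arr[mask < self.prob / 2] = 0          # "pepper" pixels
            arr[mask >= 1 - self.prob / 2] = 255   # "salt" pixels
            return Image.fromarray(arr)

    class RandomGaussianBlur:
        """Gaussian blur with a random odd kernel size between 3 and 11."""
        def __call__(self, img):
            k = random.choice([3, 5, 7, 9, 11])
            return transforms.GaussianBlur(kernel_size=k, sigma=6.0)(img)

    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),        # step 1: random flip
        transforms.RandomChoice([                      # step 2: one extra op
            SaltAndPepper(prob=0.02),
            RandomGaussianBlur(),
            transforms.ColorJitter(saturation=0.5, brightness=0.5),
        ]),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],  # step 3: ImageNet stats
                             std=[0.229, 0.224, 0.225]),
    ])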

C. LEARNING RATE POLICIES
As with data augmentation, multiple learning rate policies are extensively used by the community. After extensive experimental validation, we selected the following: keeping the learning rate constant (constant lr), and reducing it by an order of magnitude every n epochs in a stepped pattern (step-n). The initial learning rates that we use are 0.01 and 0.001, along with Stochastic Gradient Descent (SGD) with 0.9 momentum and 0.0001 weight decay.
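As a concrete example, the step-10 policy with the stated optimiser settings can be expressed in PyTorch as follows; the hyperparameters come from the text above, while the model choice and epoch count are illustrative.

    import torch
    from torchvision import models

    model = models.resnet50(num_classes=431)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    # step-10: divide the learning rate by 10 every 10 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    for epoch in range(50):
        # train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
        scheduler.step()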

D. CURRICULUM LEARNING
The fact that learning processes can be much more efficient when information is presented in an organised way, progressively expanding the different concepts and difficulty, rather than presented randomly, is an intuitive and reasonable idea that has not yet been sufficiently applied to the domain of deep learning. It is particularly interesting for the fine-grained vehicle classification problem due to the hierarchical structure of the data (makes → models). The idea was first proposed in 1993 by Elman et al. [59] and subsequently explored in 2009 by Bengio et al. [60], showing solid improvements in performance for multiple tasks. In this paper, we conduct a series of experiments to assess the feasibility and impact on per-class performance of two different curriculum learning techniques. The first consists in training on an easier, more general problem and then retraining for the desired task. In our case, it seems reasonable to first train the network to classify vehicle makers (general task) and, after that, refine the network to classify models (desired task). In our experiments we refer to this approach as incremental-learn. The second technique is to start training on an easier problem and gradually increase the difficulty at each epoch. For example, in a multi-class classification problem one starts with the easier classes and gradually adds the more difficult ones. We start with the fully connected layer initialised for all classes and show the model only a subset of the dataset (the classes with the best performance), gradually adding the rest of the classes. We apply two slightly different versions of this technique, adding 5 and 10 new classes every epoch respectively until all the classes are in use. After this, we continue the training for a few more epochs to ensure that the last classes added to the model are trained for more than one epoch. We refer to these techniques as progressive-5 and progressive-10 in our experiments.
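A minimal sketch of the progressive schedule is given below. It assumes an ImageFolder-style PyTorch dataset exposing a samples list of (path, label) pairs and a ranked_classes list ordered from best to worst performing; these names, like the training helper, are placeholders for illustration.

    from torch.utils.data import DataLoader, Subset

    def progressive_subset(dataset, ranked_classes, epoch, step=10):
        """Keep only samples whose label is among the first
        step * (epoch + 1) classes of the curriculum (progressive-10)."""
        active = set(ranked_classes[: step * (epoch + 1)])
        indices = [i for i, (_, label) in enumerate(dataset.samples)
                   if label in active]
        return Subset(dataset, indices)

    for epoch in range(50):
        subset = progressive_subset(train_set, ranked_classes, epoch, step=10)
        loader = DataLoader(subset, batch_size=64, shuffle=True)
        # train_one_epoch(model, loader, optimizer)  # hypothetical training step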

E. WEIGHTED LOSS FOR CLASS IMBALANCE
The class imbalance problem occurs when one or more of the classes present in the dataset have a weight (number of samples) several orders of magnitude below the rest of the classes. This often means that, in the training process, these classes are irrelevant during back-propagation, so that, although the overall performance of the model is apparently good, these particular classes perform well below average. When these classes appear in real world conditions, we have a bias problem in the dataset. This effect can be mitigated by using loss weights to favour under-represented classes or penalise over-represented ones.
We evaluate up to three different sets of weights. The first one, which we refer to as standard, is defined in Eq. 1:

$$w_i = \frac{1}{n_i} \quad\quad (1)$$

where $i$ represents the specific class and $n_i$ its number of training samples. This way, all weights are less than 1. However, when no weights are used (all equal to 1), the sum of the weights is the number of classes. We can therefore maintain the proportions by normalising the weights so that adding them together gives the number of classes $C$, i.e., $w_i' = C\, w_i / \sum_{j=1}^{C} w_j$. This is how the second set of weights is defined, as a modification of the standard technique in which the weights are normalised so that they add up to the number of classes. We refer to this set as standard normalised. For the third and last set we modify the weights with a non-linear function, as defined in Eq. 2. First, we calculate the percentage of representation of each class in the dataset and, after that, we apply the non-linear function $-\log(x)$:

$$w_i = -\log\left(\frac{n_i}{N}\right) \quad\quad (2)$$

where $i$ represents the class and $N$ is the total number of samples. As in the second set, these weights are normalised so that they add up to the number of classes. We refer to this set as log.

We also use focal loss and evaluate various values for $\alpha$ and $\gamma$. The definition of focal loss is given by Eq. 3:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \quad\quad (3)$$

where $p_t$ is given by the following equation:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$

where $y = 1$ means that the class has been correctly classified, and $p$ is the predicted class probability.
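The three weighting schemes and the focal loss can be sketched in PyTorch as follows. Eq. 1 is rendered here as inverse sample counts, consistent with the description above; the function names are ours, and a single scalar alpha is used for simplicity.

    import torch
    import torch.nn.functional as F

    def make_weights(class_counts, scheme="standard"):
        """class_counts: tensor with the number of training samples per class."""
        n = class_counts.float()
        num_classes = len(class_counts)
        if scheme == "standard":                 # Eq. 1: inverse sample count
            w = 1.0 / n
        elif scheme == "standard_normalised":    # weights sum to num_classes
            w = 1.0 / n
            w = w * num_classes / w.sum()
        elif scheme == "log":                    # Eq. 2, also normalised
            w = -torch.log(n / n.sum())
            w = w * num_classes / w.sum()
        return w

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Eq. 3: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
        log_pt = F.log_softmax(logits, dim=1).gather(
            1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

    # Weighted cross-entropy, e.g. with the log scheme:
    # criterion = torch.nn.CrossEntropyLoss(weight=make_weights(counts, "log"))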

IV. EXPERIMENTAL EVALUATION
A. CURRICULUM LEARNING
First, we aim to study the applicability of curriculum learning techniques to the fine-grained vehicle classification problem and evaluate their performance. All the experiments have been carried out using the fine-grained classification subset of CompCars, with 431 classes and 52,083 images. We have chosen CompCars, a widely used and well-known dataset, because of its large number of classes and images.

1) INCREMENTAL LEARNING
In Table 3 we compare the performance of a standard trained ResNet50 and InceptionV3 with their counterparts trained using incremental-learn (first we train an easier problem, makers, and then retrain for models). All these models have been trained for 50 epochs using a learning rate of 0.001 and a constant policy. We observe consistent performance, with a slight improvement for the incremental-learn technique. This indicates that the incremental-learn technique is working and, although the training time practically doubles, it can be useful to enhance generalisation. Given that the results obtained for InceptionV3 and ResNet50 are very similar, for the remainder of this section we will only show the results for ResNet50.
These tests alone do not allow us to properly interpret the results. As we have said, we first trained makers and after that models. How has the incremental-learn method affected per-class performance? A comparison of the per-class performance of the ResNet50 models can be seen in Fig. 3. In order to visualise the data more clearly, we decided to subtract the original per-class performance, thus obtaining results centred on zero (same performance), above zero (standard model performs better) and below zero (incremental-learn performs better). We also applied a colour coding with a threshold of 2.5% difference in performance to divide the classes into three groups: the group below −2.5%, in green (incremental-learn better); the group above 2.5% (standard model better); and the group in between (similar performance with both models). We can see that most of the values are at or very close to 0, with some outliers reaching differences of more than 10%. Analysing the data, both trainings are balanced, with practically the same number of gains and losses in performance, so we wondered whether these variations could be caused by the number of samples in the classes.
In Fig. 4 we can see the per-class difference in performance between the two models depending on the number of samples in each class. This gives us valuable information: there is a clear tendency to obtain similar results the more samples a class has, with the greatest differences concentrated in some of the classes with the fewest samples. We can see a homogeneous distribution between improving and worsening performance, so we can say that the variations are related to the number of samples; however, we think that the main reason for this behaviour is that the classes with fewer samples are more exposed to the random variations of each training.

2) PROGRESSIVE LEARNING
Continuing with the curriculum learning experiments, in Table 4 we compare the performance of two standard trained ResNet50 models (one with a constant 0.001 learning rate and the other with a step-10 policy and 0.01 as the initial learning rate) with a set of ResNet50 models trained using both progressive variants. All these models have been trained for 50/80 epochs (progressive-10 and progressive-5, respectively) using learning rates of 0.01/0.001 and a constant policy; none of them uses the 2-step fine-tuning technique.
Looking at the results, we can see several things. First, we have consistent results, with better performance for all the progressive-10 models when compared with the progressive-5 ones. When comparing the different ResNet50 runs, we can see that the 0.01 learning rate seems to work better. The progressive ResNet50 models match the performance of the standard one, with a slight improvement for the 0.001 learning rate progressive-10 model. Regarding training time, the progressive-10 ResNet50 models take less time than the standard ones while achieving similar results. With these results, the progressive technique looks like a good option, as it obtains equivalent performance in less time and could be a useful resource for adding new classes to an already trained model instead of training it again.
As we have been adding classes from best to worst performance, it is interesting to analyse per-class performance. In Fig. 5 we can see a comparison of per-class performance, and of per-class performance depending on the number of training samples, for the ResNet50 progressive-10 trainings and their standard counterparts. Once again, we have applied a threshold of 2.5% difference in performance to divide the classes into three groups. On top, we have the comparison of the models with 0.001 learning rate and on the bottom those with 0.01 learning rate. On the left side we have the per-class performance and on the right side the per-class performance by number of training samples. Focusing on per-class performance, we can see that the 0.01 learning rate models are more compact (fewer differences), which is consistent with the results (the 0.01 lr models obtain practically the same results, while the 0.001 ones show a greater gap). Focusing on the per-class performance by number of samples, we have the same behaviour and the expected pyramidal pattern, with fewer differences the more samples a class has.
Seeing these results, with virtually identical performance for the progressive and the standard methodologies, we wondered whether gradually increasing the difficulty of the classes is really helping or not. To test this, we trained 2 additional ResNet50 models using a 0.01 learning rate and the progressive-10 method, one with random class order and the other with inverse (decreasing difficulty) class order.
In Table 5 we can compare the progressive models trained with alternative class orders against the standard model and the progressive models with increasing difficulty. As can be seen, all the performances are practically the same, which refutes the theory that progressively increasing the difficulty improves performance. These results are somewhat counter-intuitive, as the data structure of the classification problem suggested a potential for improvement. Even so, although no significant performance gain is obtained, it has been shown that the models can be trained progressively, allowing new classes to be added to already trained networks and achieving equivalent performance with less computational resources, i.e., less time and energy spent on training processes.

FIGURE 5. Differences below −0.025 (green circles) mean better performance for the progressive method. Differences above 0.025 (red squares) mean better performance for the standard model. Values in between (yellow triangles) mean similar performance in both models.

B. FINE-GRAINED MODELS
In this section, we analyse the performance of fine-grained classification models, comparing them with the baseline results reported by their creators. It could be interesting to compare them with other state-of-the-art methods but, since our aim is not to obtain the best model, we consider that such a comparison would not provide relevant information. We will focus on the results obtained with CompCars, VMMR-db and Frontal-103 and their subsets.
For CompCars we have evaluated 2 subsets, one of makers and the other of models, with 73 and 431 classes respectively. For VMMR-db we have evaluated 3 subsets: one of makers, another of models built with the data provided, and another, called 3,040, built in the same way as the authors built their 3,036 subset (all the classes with more than 20 images). The number of classes is 43, 472 and 3,040 respectively. For Frontal-103 we have evaluated 3 subsets: makers, models and ultra-fine-grained. For clarity, Table 6 shows the different subsets with the number of classes and the total number of images.
All experiments were performed with a 70/30 train/val split. We trained both ResNet50 and InceptionV3 models with the step-10 policy and a 0.01 learning rate for 50 epochs. As InceptionV3 was the best performing option, we will only report its results. Table 7 shows the results of the InceptionV3 models for each of the subsets and compares them with the ones reported by their creators. As expected, the best results are obtained in the simplest task, classifying makers, followed by fine-grained models and finally ultra-fine-grained models. Analysing the makers results, we can see that the best performance is achieved with Frontal-103, as it is the easiest one, having only images from the front of the vehicles, followed by CompCars and finally VMMR-db, as it is the most complicated and extensive of the datasets. Focusing on the performance of fine-grained classification, it can be seen that this time the best performance is achieved by CompCars, as it has the fewest classes, followed by Frontal-103, which, although it has more classes than VMMR-db, is, as we have said, easier, having a single viewpoint. Finally, in the case of ultra-fine-grained classification, we can see a big difference between the results obtained by VMMR-db and Frontal-103. While Frontal-103 still achieves a good performance with 95.62% top1 accuracy, VMMR-db drops to 42.16%. As we have previously said, one of the key problems of the VMMR-db dataset is that the labelling contains a class for each year of the same model; therefore, the actual number of classes is much lower. If we take a look at the top5 accuracy, we can see an important leap to 91.58%. In [7], the authors explain this drop in performance by the increased difficulty of going deeper in the hierarchy. However, this explanation does not sufficiently hold: as we have seen with Frontal-103, although ultra-fine-grained classification does indeed have a higher level of difficulty, it can still achieve good performance. This shows that the year-based labelling of models in VMMR-db is not the most appropriate.
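For reference, the top1/top5 accuracies reported in Table 7 can be computed as in the following sketch (the function name is ours):

    import torch

    def topk_accuracy(logits, targets, k=1):
        """Fraction of samples whose true class is among the k highest logits."""
        topk = logits.topk(k, dim=1).indices             # (batch, k)
        hits = (topk == targets.unsqueeze(1)).any(dim=1)
        return hits.float().mean().item()

    # top1 = topk_accuracy(logits, targets, k=1)
    # top5 = topk_accuracy(logits, targets, k=5)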

C. WEIGHTED LOSSES
As we said in the introduction, most articles focus on reporting global results, trying to improve accuracy, without analysing per-class performance. It is of little use to have spectacular accuracy if a non-negligible number of classes have been somewhat ignored. In this section, we analyse the per-class performance of maker and model classification and explore techniques such as weighted loss to improve performance and generalisation capabilities. We use VMMR-db Makers and VMMR-db Models for training and the PREVENTION dataset to externally evaluate performance and generalisation capabilities in realistic scenarios. Fig. 6 shows a histogram of the per-class precision of the VMMR-db Makers subset. We can see that, even though the top1 accuracy is 97.34%, we still have one class performing below 10%. If we look at the results of VMMR-db Models in Fig. 7, we can see that this problem is considerably greater: even though the top1 accuracy is 94.46%, there is a considerable number of classes with poor performance.
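The per-class precision used in these histograms can be computed as in this short sketch (the array and function names are illustrative):

    import numpy as np

    def per_class_precision(y_true, y_pred, num_classes):
        """Precision of class c: correct predictions of c over all predictions of c."""
        precision = np.zeros(num_classes)
        for c in range(num_classes):
            predicted_c = y_pred == c
            if predicted_c.any():
                precision[c] = (y_true[predicted_c] == c).mean()
        return precision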

1) WHY IS RAW PRECISION NOT ENOUGH?
To address the performance problem in particular classes, we first checked the relationship between per-class performance and the number of samples of each class, and found that the problematic classes are among those with the fewest samples. Having verified that there is indeed a problem with under-represented classes, we employ various weighted loss techniques and focal loss to try to mitigate it. To evaluate the generalisation capability of the different solutions, in addition to the training performance on VMMR-db Makers and Models, we use the two test sets (Makers and Models) created from the PREVENTION dataset, with rear and front view images in real traffic situations. Of the 33 makers present in the Makers test set, 25 are present in VMMR-db, with a total of 1,523 images. Of the 87 models present in the Models test set, 50 are present in VMMR-db, with a total of 780 images.
As defined in Section III-E, we test three different weighting schemes and the focal loss.

2) WEIGHTED LOSSES FOR MAKER CLASSIFICATION
Table 8 shows the accuracy for the non-weighted, weighted and focal loss models trained with VMMR-db Makers and tested on the PREVENTION Makers dataset. Looking at these results, we can see that, in terms of accuracy in the training phase, the best performance is obtained by the model without weights, closely followed by the other approaches, with the standard normalised weights (standard 43) being the best performing of the weighted models. If we look at the test results, we can see that the normalised weights and focal loss outperform the weightless model, with standard 43 being the best again, followed by focal loss.

3) WEIGHTED LOSSES FOR MODEL CLASSIFICATION
Focusing on VMMR-db Models, we can see the accuracy comparison between using weights, not using them and focal loss in Table 9. Looking at these results, we observe a behaviour similar to that of VMMR-db Makers. The best performance in the training phase is again achieved by the weightless model, with the weighted models following closely behind. However, this time the best performing of the weighted models is the one trained with focal loss, even though it is the worst performing in validation. In the test results we can see that all the weighted models outperform the weightless one. It is worth noting the large drop in test performance compared to makers. This is most likely due, on the one hand, to the increased difficulty and, on the other hand, to the smaller number of samples in the Models test set, which makes it more biased.

4) PER-CLASS PERFORMANCE ANALYSIS (MAKERS)
But again, we are focusing only on raw performance; it is particularly interesting to look at per-class performance. Fig. 8 shows a comparison of per-class performance for each of the previous models trained with VMMR-db Makers. We can appreciate the effect of the weighted models, all of which have solved the poorly performing class problem of the weightless model. Apart from that, the results are pretty much the same, with standard 43 being the best of the weighted models. Fig. 9 shows a comparison of per-class test performance for each of the previous models trained with VMMR-db Makers and tested on the PREVENTION Makers test set. It can be seen that none of the models has classes below 10% precision and that performance is fairly homogeneous, with standard 43 being the best one, showing solid performance compared with the rest of the models; even though it has one more class with performance below 20%, it also shows a noticeable improvement in the 20-60% range when compared with the weightless model. As a result, it achieves an improvement of almost 2% for front images, 3.18% for rear images and almost 2.5% on all images.

5) PER-CLASS PERFORMANCE ANALYSIS (MODELS)
Continuing with the fine-grained results, Fig. 10 shows a comparison of per-class performance for each of the previous models trained with VMMR-db Models. At first glance we can see that the results are very similar, which makes sense as the performance is almost identical. We can see that the per-class precision distribution is pretty balanced, with all models compensating better performance in one section with worse performance in another.
With these results, it may seem that the use of weights is not justified, as almost identical results have been achieved and there is no clear benefit in terms of the number of poorly performing classes. However, if we look at the test results we can see a considerable improvement, with an increase of more than 2% for front images, almost 2% for rear images and 1.8% on all images. Fig. 11 shows a comparison of per-class test performance for each of the previous models trained with VMMR-db Models and tested on the PREVENTION Models test set. We can see equivalent or better performance for all the weighted models in terms of the number of classes with precision greater than 0.8. In the 0.2 to 0.8 range the results are fairly balanced, with the weightless model having more classes above 0.7 but the focal loss model having fewer in the 0.2 to 0.4 range.
Regarding the poor performing classes, the number of classes below 0.1 is worrying but is practically the same regardless of the model. It is important to remember that the number of images of the Models test set is half that of the Makers test set, making the results more susceptible to variability. However, the results are promising, with a clear improvement in overall test performance, and results that point to an improvement in the generalisation ability of weighted models.
With these results, the use of weights is justified, at least in part. The weighted models achieve results comparable to the weightless ones, both for makers and models, while improving test performance. The claim that weighted models help to reduce the number of poorly performing classes is eclipsed by the worrying number of them when testing on models. However, the results point to an improvement in the generalisation capabilities of the models, as test performance improves by 2.49% and 1.8%. As previously stated, we believe that the models test results have a lot to do with the test dataset. It is necessary to conduct further experimentation and build a more extensive and adequate test dataset to properly evaluate fine-grained performance.

D. COMPLEXITY AND GENERALISATION CAPABILITIES
To be considered a quality dataset, a dataset must capture the real world as reliably as possible, with as few deviations and biases as possible. Therefore, a good dataset will result in models with better generalisation capabilities. As previously mentioned, most datasets are either designed to solve a specific problem, so they are biased, or they are intended for general use. A general-purpose dataset should capture the world in a reliable way but, in practice, most of them end up conditioned by their collection process.
Thus, we have decided to build a cross-dataset composed of the common classes between CompCars, VMMR-db and Frontal-103. In this way, we will be able to evaluate the complexity and generalisation capabilities of the models trained with each dataset by performing cross tests. Additionally, we will also test the models with the test set extracted from the PREVENTION dataset.
We have built two sets, one of makers and the other of models. The Fusion-Makers set has 27 different manufacturers and a total of 265,833 images: 28,960 from CompCars, 198,644 from VMMR-db and 38,229 from Frontal-103. The Fusion-Models set has 75 different vehicle models and a total of 101,335 images: 13,211 from CompCars, 72,142 from VMMR-db and 15,982 from Frontal-103. It may seem curious, or even a mistake, that the number of images in the models set is much lower than in the makers set. The reason is that the makers set is much less strict, allowing models from the same manufacturer that are not present in the three source datasets to be included. In contrast, in the case of models, the requirement that a particular model has to be present in all three datasets brings the total number of classes and images down considerably. Additionally, we have had to group some source classes into a single target class, e.g., different equipment levels that were considered separate classes, such as BMW 320 and BMW 325, were grouped as BMW 3 Series. As mentioned in the introduction, the correspondence between classes will be publicly available. A summary of the different Fusion sets can be seen in Table 10.
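Conceptually, the grouping of source classes into Fusion classes is a simple lookup, sketched below; the string keys shown are illustrative examples from the text, not entries of the released correspondence file.

    # Map (source dataset, source class) pairs to a Fusion class. The entries
    # below are illustrative; the full correspondence is released with the
    # cross-dataset.
    CLASS_MAP = {
        ("VMMR-db", "BMW 320"): "BMW 3 Series",
        ("VMMR-db", "BMW 325"): "BMW 3 Series",
        ("CompCars", "BMW 3 Series"): "BMW 3 Series",
    }

    def fuse_label(source: str, label: str):
        """Return the Fusion class for a source class, or None if not shared."""
        return CLASS_MAP.get((source, label))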
To perform these experiments we used the InceptionV3 architecture. First, we analyse makers performance. Table 11 shows the results of the cross-tests with the different makers sets. We can see that the best performing model for all the sets is the one trained with the full Fusion dataset, followed by the model trained on the tested set. This demonstrates that the joint use of the datasets brings more variety, resulting in better generalisation capabilities and mitigating the impact of dataset bias. The only model, other than the Fusion one, that is capable of obtaining reasonable results on the other datasets is the one trained with VMMR-db. It is important to notice that VMMR-db represents almost 75% of the Fusion dataset, which could partly justify the good results when testing with Fusion, but not its good performance on the rest of the subsets. When we look at the other two, both CompCars and Frontal-103 obtain poor performance when tested with Fusion. CompCars seems to work well when tested with Frontal-103, outperforming the VMMR-db model, which tells us that the two datasets are very similar. Frontal-103 could probably obtain good results on CompCars as well, but it is strongly conditioned by having only frontal images, hence its poor performance.

Fig. 12 shows some examples of the top3 predicted classes of the InceptionV3 Fusion-Makers model on the PREVENTION Makers test set. The first row shows correctly predicted front view images, the middle row correctly predicted rear view images, and the bottom row misclassified images from both views. We can see that the model has practically total confidence in the correctly predicted makers (the mean confidence for the correct predictions is 97.78%). This is not the case for the misclassified ones, which have much lower confidences, with the exception of the Suzuki predicted as Mitsubishi (the mean confidence for the wrong predictions is 68.05%).

Table 12 shows the results of the cross-tests with the Fusion-Models dataset. Once again, the best performing model is the one trained with Fusion. We observe the same differences for the rest of the tests, but this time with much lower performances than with makers. This may be due to the increase in the number of classes making the problem more complex. Taking a look at the PREVENTION Models test results, we have the same order in performance (Fusion, VMMR-db, CompCars and Frontal-103), with 78.47% accuracy for the Fusion model. As expected, the performances are lower than with makers, as it is a more complex problem. However, it should be noted that in this case the Fusion model has a larger performance gap compared to the other models, which supports the importance of having a good dataset that allows better generalisation. Lastly, it is worth mentioning the very poor rear-view performance of the Frontal-103 model (4.94%).

Fig. 13 shows some examples of the top3 predicted classes of the InceptionV3 Fusion-Models model on the PREVENTION Models test set. The first row shows correctly predicted front view images, the middle row correctly predicted rear view images, and the bottom row misclassified images from both views. We can see that the model has practically total confidence in the correctly predicted models (the mean confidence for the correct predictions is 95.68%). This is not the case for the misclassified ones, which have much lower confidences (the mean confidence for the wrong predictions is 69.91%).
Compared to the makers results, the average confidence has gone down for correct predictions (−2.1%) and up for incorrect ones (+1.86%). This is perfectly normal, as it is a more complex problem. In any case, the differences are minimal.
With these results it is clear that of the three datasets, the most complex and the one with the greatest generalisation capabilities is VMMR-db. CompCars is the next, with a significant step down in cross-performance, finally followed by Frontal-103, which is the simplest of all, and is strongly conditioned by having exclusively frontal images. It is also clear that the joint use of the three datasets improves the generalisation capabilities, obtaining reasonably good results in the external test performed with the PREVENTION test set both for makers and models.
The results obtained using the existing datasets suggest that fine-grained classification can be addressed. However, cross-testing makes it clear that, with the exception of VMMR-db, the datasets are highly biased, and even VMMR-db, although it performs better, is also biased. It is only when performing these cross-tests and using an external test set that we realise that the problem is not completely solved, and not only for a complex problem such as fine-grained classification, but also for a simpler one such as maker classification. It is necessary to create a sufficiently large and varied dataset, with images of multiple origins, qualities and viewpoints, to be able to tackle the classification problem satisfactorily.

V. CONCLUSION AND FUTURE WORK
This paper presents an empirical evaluation of different training methods and approaches for fine-grained vehicle classification as well as an analysis and comparison of the most relevant datasets.
We have analysed the strengths and shortcomings of datasets like CompCars [6], VMMR-db [7] and Frontal-103 [8] and used them in a series of experiments.
In the first place, we have explored different curriculum learning techniques with the CompCars dataset, such as incremental-learn (first training an easier problem, makers, and then retraining for a harder one, models) and progressive-learn (starting with the easiest, best performing classes and gradually adding the hardest, worst performing ones). The results show a slight improvement in overall performance for incremental-learn, with similar gains/losses in per-class performance and a clear relation between per-class performance and the number of samples. For progressive-learn we have a very similar behaviour, with virtually the same performance and the same per-class differences. The progressive-learn results made us question whether the technique was working as expected, so we performed additional tests, one with decreasing difficulty and another with random order, obtaining identical results. With these results, curriculum learning techniques show a lack of improvement in performance, making it difficult to justify their use as a mechanism for improving learning. However, progressive-learn has proven useful as a tool for adding classes to already trained models without having to train from scratch again.
After this, we evaluated the results obtained with different subsets of CompCars (makers and models), VMMR-db (makers, models and 3,040) and Frontal-103 (makers, models and ultra-fine-grained). As expected, the best results are obtained in the easiest task (makers), with Frontal-103 in first place (as it is the easiest, having only frontal images), followed by CompCars and finally VMMR-db (as it is the most difficult and extensive of the datasets). For the fine-grained problem (models) the best performance is achieved by CompCars (fewer classes), followed by Frontal-103 which, although it has more classes, is easier, and finally VMMR-db. Finally, the ultra-fine-grained problem (models and generation) showed a huge difference between Frontal-103 and VMMR-db, with the first one still achieving remarkable performance and the second one falling below 45% top1 accuracy while its top5 remains above 90%. This shows the poor class construction of VMMR-db and confirms that, even though ultra-fine-grained classification is more challenging, it can still be tackled if the dataset is properly constructed.
Continuing with the experiments, we evaluated the impact of using weighted losses. To do so, we used various weighting schemes and focal loss, showing that the best results are obtained with the normalised standard weights, with practically identical results to those obtained without weights, but with a significant improvement when testing on a new database. Our aim in this part of the article was to analyse the results beyond the raw performance. For this purpose, we analysed per-class performance, showing a clear improvement over the weightless model when working with makers. In the case of models the improvement was not so evident, with more classes over 80% accuracy and similar results in the 20-80% range, but no conclusive results for the poorly performing ones. While the use of weights has proven to improve generalisation capabilities, we cannot claim the same for reducing the number of poorly performing classes. Further experiments and a more extensive and adequate test set are needed to properly evaluate fine-grained performance.
Finally, we wanted to analyse the complexity and generalisation capabilities of the existing datasets. To evaluate these characteristics we built a cross-dataset (Fusion) composed of the common classes between CompCars, VMMR-db and Frontal-103 and performed a series of cross tests. The results show that the best performing model is the one trained with Fusion, both for makers and models, outperforming all the other models in the cross-tests. Regarding the PREVENTION external test set, the Fusion models achieve good results, showing strong generalisation capabilities both for makers and models. Of the three datasets, VMMR-db is the most complex, with CompCars and Frontal-103 being very similar, but Frontal-103 heavily penalised for having exclusively frontal images. These results show that, when using the existing datasets on their own, one might think that the fine-grained classification problem is solved. However, cross-testing reveals the shortcomings of the existing datasets, showing a different reality. The problem does not seem to be solved, not only for a complex task like fine-grained classification, but also for an easier one like maker classification. It is necessary to create a sufficiently large and varied dataset, with images of multiple origins, qualities and viewpoints, to be able to tackle both maker classification and fine-grained classification satisfactorily.
As future work, we plan to create an extensive dataset, with images of diverse nature, makes and models from different geographical regions, different resolutions, image qualities and viewpoints, with an adequate class hierarchy, enabling the development of more general and unbiased systems capable of performing fine-grained vehicle recognition in multiple, realistic environments.