A Super-Learner Ensemble of Deep Networks for Vehicle-Type Classification

Automatic vehicle-type classification plays an essential role in the development of efficient Intelligent Transportation Systems (ITS). In this paper, a super-learner ensemble is proposed for the vehicle-type classification problem. A densely connected single-split super learner is utilized to exploit the strengths and diminish the weaknesses of the individual base learners ResNet50, Xception, and DenseNet. The super learner aims to learn fusion weights in a data-adaptive manner to obtain the optimal combination of the base learners. The proposed method is simple, robust, and enhances the discrimination capabilities among similar-looking classes without requiring any hand-crafted features or logical reasoning. The proposed method is evaluated using two of the most challenging publicly available traffic surveillance datasets: the MIOvision Traffic Camera Dataset (MIO-TCD) and the Beijing Institute of Technology's (BIT) vehicle classification dataset. Three variants of the super-learner ensemble (RXD-CV-CW, RXD-CV-CW-NCW, and Augmented-RXD) were examined on the MIO-TCD dataset with variations in applying class weights and data augmentation during training. RXD-CV-CW-NCW and Augmented-RXD share the third place among the published state-of-the-art methods reported in the MIO-TCD classification challenge. Both of these variants achieved an overall accuracy of 97.94% and a Cohen Kappa score of 96.78%. Augmented-RXD also generalizes to the classes in common between the two datasets without degrading its performance on the MIO-TCD dataset. In addition, the super-learner variants that we trained on the BIT-Vehicle dataset images achieved overall accuracies of up to 97.62%.


I. INTRODUCTION
Developing intelligent traffic surveillance systems (ITSS) has become an important research area because such systems provide innovative tools to improve transportation safety, efficiency, and driver satisfaction. Automatic vehicle-type classification plays an essential role in ITSS as it has various applications, such as Electronic Toll Collection (ETC), traffic control, intelligent parking systems, and traffic flow analysis.
The associate editor coordinating the review of this manuscript and approving it for publication was Amr Tolba.
As opposed to using intrusive installations of radars, loop detectors, or road tubes for traffic data acquisition, recent advances in machine learning give a significant advantage to vision-based vehicle detection and classification methods. Automatic vehicle-type classification is a challenging problem, particularly when the images are captured by traffic surveillance cameras. Traffic surveillance images are usually low-resolution and subject to different illumination, occlusion, and weather conditions. In addition, vehicle types exhibit many inter- and intra-class similarities. Although several vehicle datasets are currently publicly available, not all of them are suitable for training traffic surveillance methods. Some datasets are targeted at autonomous driving with images taken by on-board cameras [1]-[3]. Other datasets contain high-resolution images taken by non-surveillance cameras and are typically used for fine-grained vehicle analysis [4], [5]. The Beijing Institute of Technology's (BIT)-Vehicle Dataset [6] contains 9,850 high-quality top-frontal view images that were captured by surveillance cameras. The dataset poses many challenges, such as various lighting conditions, background confusion, and a variety of vehicle models and colors. The CompCars Dataset [5] is another surveillance dataset, which contains 44,481 images. Although Yang et al. [5] used the CompCars dataset to prove the effectiveness of deep convolutional networks in classifying many car models, the dataset contains only frontal-view images taken in daylight and clear weather. Furthermore, it focuses on the fine-grained model categorization of cars, mini-vans, and pickup trucks, excluding large trucks, buses, motorcycles, and pedestrians. The MIOvision Traffic Camera Dataset (MIO-TCD) [7] is the largest traffic surveillance dataset available to date.
The classification dataset consists of 648,959 low-resolution images, divided into 11 categories: Articulated Truck, Bicycle, Bus, Car, Motorcycle, Non-Motorized Vehicle, Pedestrian, Pickup Truck, Single-Unit Truck, Work Van, and Background. The images were captured at different time periods during the day and under different weather conditions. The captured images contain vehicles in diverse orientations. The classification task using the MIO-TCD is extremely challenging. This is due to the highly imbalanced nature of the dataset, the inter-class similarity between the categories that have similar visual characteristics, and the heavy compression artifacts in some images.
Ensembles of artificial neural networks have gained popularity in many image classification and localization applications due to their exceptional adaptive prediction performance [8], [9]. Ensembles combine several baseline models that have different architectures to improve the stability and predictive capability of the model. The performance of the individual base learners depends mostly on the data dimensionality, the model hypothesis, and the bias-variance trade-offs of the model. Consequently, it is infeasible to know beforehand which learner would attain the best performance given a specific prediction problem and a particular dataset. Ensembles can effectively harness the complementary strengths of the different base learners, as some base learners might have a weak overall prediction but can be effective at discriminating specific subclasses. Different merging strategies were reported in the literature, such as majority voting, unweighted averaging, and Bayesian voting. However, these methods are vulnerable to weak learners, sensitive to over-confident learners, and may lead to information loss.
The super learner is a loss-based supervised-learning ensemble framework that finds the optimal combination of a group of prediction algorithms by minimizing the cross-validated risk [9]. This is achieved by optimizing the weights of the base learners on the validation set in an adaptive manner. The super learner can be viewed as an extra 1 × 1 convolution layer that is stacked on the outputs of the base learners and trained on the validation set.
The main contributions of this paper can be summarized as follows:
• We present a super-learner ensemble model for vehicle-type classification in surveillance frames. The super learner consists of a fully-connected layer added to the fused outputs of three base learners: ResNet50 [10], Xception [11], and DenseNet [12].
• The different networks were trained and tested using two of the most challenging and largest publicly available traffic surveillance datasets: the MIO-TCD and the BIT-Vehicle datasets.
• While our method is simple, easy to train, and does not include any hand-crafted features or logical reasoning components, the experimental results demonstrate its effectiveness. In terms of the overall evaluation metrics, the ensemble performs better than each of the base learners and is on a level comparable to the state-of-the-art methods.
The rest of this paper is organized as follows: Section II provides an overview of the related work. The technical details and the framework of the proposed system are presented in Section III. Experimental results of the proposed system and comparisons to existing algorithms are reported in Section IV. Finally, concluding remarks are summarized in Section V.

II. RELATED WORK
A. TRADITIONAL VEHICLE-TYPE CLASSIFICATION METHODS
Traditional vehicle-type classification models integrate different types of sensors and image-processing methods that incorporate essential hand-crafted features depending on the application context and the granularity of the required classification. Cho et al. [13] applied a Kalman filter to fuse radar and LIDAR systems for object detection and classification. They switched between two motion models for tracking pedestrians, bicyclists, and cars. Thakoor and Bhanu [14] proposed a feature called structural signature to classify vehicles into sedan, pickup truck, and SUV/minivan from their rear-view videos observed on highways. They used support vector machines (SVM) for the classification. Kafai and Bhanu [15] presented another rear-view based classification method using the spatial information among the landmarks of the vehicle (e.g., taillights and license plates) and a Hybrid Dynamic Bayesian Network (HDBN) classifier with multiple time slices corresponding to multiple video frames. The main limitation of the methods described in [14] and [15] was that they could not differentiate between SUV and minivan because these two vehicle categories look similar from the rear view. Theagarajan et al. [16] were able to discriminate between SUV and minivan from the rear view. They presented a method to compute the Visual Rear Ground Clearance of a vehicle from its rear-view video and classify it into one of two classes, namely Low Visual Rear Ground Clearance Vehicles and High Visual Rear Ground Clearance Vehicles.

B. VEHICLE-TYPE CLASSIFICATION USING CONVOLUTIONAL NEURAL NETWORKS AND DEEP LEARNING
Image classification started to shift towards convolutional neural networks after Krizhevsky et al. [17] achieved unprecedented performance in the ImageNet LSVRC (ILSVRC-2010) competition [18]. Dong et al. [6] used a pre-trained convolutional neural network together with multi-task learning to classify vehicles into Bus, Microbus, Minivan, Sedan, SUV, and Truck from vehicle frontal-view images. They introduced Sparse Laplacian Filter Learning (SLFL), an unsupervised learning method, to learn the filter bank of the convolutional layer. They used their BIT-Vehicle dataset, which includes 9,850 high-resolution vehicle frontal-view images. Khaled et al. [19] used the BIT dataset to study the effect of color and spatial resolutions of the vehicle images on the classification results of a variety of classification methods. Huval et al. [20] used the OverFeat [21] architecture along with a mask detector to detect vehicles and highway lanes in real time. Wang et al. [22] used a CNN together with Fisher feature encoding algorithms for vehicle-type classification. The datasets used in the above-mentioned approaches did not contain enough samples to represent real-world traffic surveillance images.

C. VEHICLE-TYPE CLASSIFICATION USING THE MIO-TCD DATASET
As emphasized in Section I, the MIO-TCD dataset is one of the largest datasets prepared for traffic surveillance purposes. The MIO-TCD traffic surveillance challenge was introduced in conjunction with CVPR 2017. Several ensemble methods were designed to address the MIO-TCD Classification Challenge. Kim and Lim [23] implemented a bagging system by training several CNN models with several random subsets of the MIO-TCD dataset. To compensate for the imbalanced data distribution, they applied weighted voting that depends on the error rate of each class. Lee and Chung [24] proposed an ensemble method that combines local and global expert networks. The local expert networks were all GoogLeNet, and they were trained using subsets of the dataset depending on the aspect ratio and the size of the input images. The global expert networks comprised three convolutional nets (AlexNet, GoogLeNet, and ResNet18) that were trained on the entire dataset. At test time, the local experts are selected using a gating function and the network outputs are combined using a softmax layer. Jung et al. [25] proposed an ensemble model that they called Joint Fine-tuning with DropCNN, which enabled them to train several ResNets simultaneously. Theagarajan et al. [8] proposed an ensemble of three ResNet models. A weighted loss function was applied to handle the imbalanced distribution of the dataset. They also implemented patch-based logical reasoning to address the genuine dual-class misclassification problem. To address the imbalanced data challenge, Liu et al. [26] proposed a method that integrates deep neural networks with balanced sampling in two stages: data augmentation with balanced sampling and an ensemble of convolutional neural networks trained on the augmented data. Their method was able to enhance the mean precision of all categories while preserving high overall accuracy. Later, Liu et al. [27] proposed a method that applied generative adversarial nets (GANs) for data augmentation. Their proposed approach consists of three stages: training several GANs on the original dataset to generate adversarial samples for the rare classes, training an ensemble of CNN models with different architectures on the original imbalanced dataset, and finally refining the ensemble model on the augmented dataset after filtering out the low-quality adversarial samples. This increased the mean performance of some categories while maintaining high overall accuracy. Although deep model-based methods can achieve very promising performance, a number of challenges remain, such as distinguishing similar-looking vehicles, unbalanced datasets, false detections, and small vehicles [7].

III. PROPOSED WORK
Each model has its strengths and weaknesses. The aim of ensemble learning is to balance the strengths and weaknesses of multiple models, leading to better classification decisions in general. Our proposed method for vehicle-type classification is a stacking ensemble of three deep neural networks inspired by the super-learner ensemble method proposed by Ju et al. in [9]. As opposed to the weighted-average ensemble presented in [8], which uses pre-set fusion weights, our proposed super learner aims to learn the fusion weights in a data-adaptive manner. The proposed ensemble is a cross-validation ensemble framework that learns a non-linear fusion function that can better exploit the individual base learners' strengths and reduce their weaknesses, and hence enhance the discrimination capabilities among similar-looking vehicles. The proposed network architecture is shown in Figure 1.

A. BASE LEARNERS
We used three powerful deep convolutional neural network models as the base learners: ResNet50 [10], Xception [11], and DenseNet [12]. ResNet introduced a residual learning framework to ease the training of deep networks. It reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. DenseNet introduced several advantages: it alleviates the vanishing gradient problem, strengthens feature propagation, improves feature reuse, and substantially reduces the number of parameters. Xception introduced a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depth-wise separable convolutions. Compared to Inception V3 [28], Xception achieved performance gains by using the model parameters more efficiently. These three models proved to be the best individually-performing networks on the MIO-TCD as reported in [7]. Because they are three different powerful networks, the super learner has the opportunity to exploit the strengths of each of them. Each of the three models takes 224 × 224 RGB input images and has an 11-output softmax layer corresponding to the 11 categories of the MIO-TCD classification dataset.

B. THE SUPER-LEARNER ENSEMBLE
The proposed super learner was designed to attain a non-linear fusion function of the outputs of the base learners in order to enhance its discrimination capabilities considering the imbalanced nature of the MIO-TCD classification dataset and its inter-class similarities. Therefore, instead of applying a linear stacking of the base learners, stacking on the logit scale, or just stacking a 1 × 1 convolution layer on the output of the base learners as explained in [9], we added a fully-connected layer with ReLU activation units between the merged output of the base learners and the output softmax layer. The fully connected layer consists of 33 ReLU activation units.
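The fusion head described above can be illustrated with a minimal forward-pass sketch in plain Python. It only mirrors the stated layer sizes (three 11-way softmax outputs concatenated into 33 values, a 33-unit ReLU layer, and an 11-way softmax output). The helper names, the random weight initialization, and the synthetic base-learner outputs are assumptions made for illustration; they are not the trained weights or the Keras implementation used in the experiments.

```python
import math
import random

NUM_BASE_LEARNERS = 3   # ResNet50, Xception, DenseNet
NUM_CLASSES = 11        # MIO-TCD categories
HIDDEN_UNITS = 33       # ReLU units stated in the text


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Fully connected layer: weights[j] is the weight row of output unit j."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def super_learner_forward(base_outputs, w1, b1, w2, b2):
    """Concatenate base-learner outputs, apply ReLU layer, then softmax."""
    merged = [p for out in base_outputs for p in out]       # 3 x 11 = 33 values
    hidden = [max(0.0, h) for h in dense(merged, w1, b1)]   # ReLU activation
    return softmax(dense(hidden, w2, b2))


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained weights, used only to make the sketch runnable."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = random_layer(HIDDEN_UNITS, NUM_BASE_LEARNERS * NUM_CLASSES, rng)
w2, b2 = random_layer(NUM_CLASSES, HIDDEN_UNITS, rng)

# Synthetic stand-ins for the softmax outputs of the three base learners.
base_outputs = [softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_CLASSES)])
                for _ in range(NUM_BASE_LEARNERS)]
fused = super_learner_forward(base_outputs, w1, b1, w2, b2)
```

In practice the two weight matrices would be learned by minimizing the loss on the set-aside validation set, while the base learners stay fixed.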

C. CROSS-VALIDATION
Wolpert introduced the idea of stacking in [29]. As an extension of stacking, van der Laan et al. introduced the super learner in [30] as a cross-validation based stacking. It combines the base learners by optimizing the v-fold cross-validated loss to compute the optimal ensemble weight vector. V-fold cross-validation is best suited to small datasets. It has been applied to a variety of topics, such as survival analysis [31], clinical trials [32], and mortality prediction [33]. For large classification datasets, optimizing the v-fold cross-validated loss would require a prohibitive amount of time. Instead, we applied the single-split super learner, in which only the set-aside validation set is used to train the super-learner ensemble, in addition to its original purpose of assessing and tuning the base learners. Therefore, the weights of the super learner are calculated by minimizing the single-split cross-validated loss as suggested in [9]. Ju et al. [34] showed the success of the single-split super learner on three large healthcare databases.

D. DATA AUGMENTATION
We performed some of our experiments with data augmentation. In those experiments, we used the images of the Sedan and SUV classes of the BIT-Vehicle dataset to augment the Car class of the MIO-TCD dataset. We also augmented the MIO-TCD's Bus class with the Bus images of the BIT-Vehicle dataset.

A. DATASETS
We performed our experiments on two large traffic surveillance datasets: the MIO-TCD dataset and the BIT-Vehicle dataset. We applied our super-learner variants and compared them to other methods on the MIO-TCD dataset first, and then extended the idea to the BIT-Vehicle dataset.
The MIO-TCD Dataset is highly imbalanced. The size of each class is shown in Table 1.
Four metrics are used for the evaluation of the MIO-TCD classification challenge. The first one is the overall accuracy Acc, which is defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of true positive images regardless of their category, TN is the number of true negative images, FP is the number of false positive images, and FN is the number of false negative images. Dominating categories such as 'Car' and 'Background' have a strong influence on the accuracy metric. The other three metrics, which are the mean recall (mRec), the mean precision (mPre), and the Cohen Kappa Score (Kappa) [35], account for this imbalance. The mean precision and mean recall are defined as follows:

$$\mathrm{mPre} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{Pre}_n, \qquad \mathrm{mRec} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{Rec}_n$$

where N is the number of categories, $\mathrm{Pre}_n = TP_n/(TP_n + FP_n)$ is the precision of category n, and $\mathrm{Rec}_n = TP_n/(TP_n + FN_n)$ is its recall. The Cohen Kappa Score measures the agreement between two annotators: the first annotator is a method under evaluation and the second annotator is the ground truth. It is defined as follows:

$$\mathrm{Kappa} = \frac{\mathrm{Acc} - P_e}{1 - P_e}$$

where $P_e$ is the probability of agreement when the two annotators assign random labels. It is a good measure for both multi-class and imbalanced class problems. It basically measures how much better a specific classifier is performing than a classifier that guesses randomly according to the frequency of each class. That said, there is controversy surrounding Cohen Kappa due to the difficulty in interpreting indices of agreement. Stein et al. [36] applied the Bradley-Terry model, suggesting that it may serve as an extension to Kappa that can provide more detail on the strength and direction of disagreement. Pontius et al. [37] suggested that it is conceptually simpler and more informative to evaluate quantity and allocation disagreement between items. The final ranking of the MIO-TCD classification methods is calculated by taking the average of the ranks of the four metrics: Rank(Acc), Rank(mPre), Rank(mRec), and Rank(Kappa).
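As an illustration of how the four metrics relate to one another, the following sketch computes them from a confusion matrix in plain Python. It assumes the usual multi-class conventions: accuracy is the fraction of correctly classified images, and the chance agreement P_e is obtained from the row and column marginals of the confusion matrix. The function name and the toy two-class matrix are hypothetical and are not taken from the MIO-TCD results.

```python
def evaluation_metrics(cm):
    """Compute Acc, mPre, mRec and Kappa.

    cm[i][j] is the number of images whose true class is i and whose
    predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    true_counts = [sum(cm[i]) for i in range(n)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums

    acc = correct / total
    # Per-class recall and precision; empty classes contribute 0 by convention.
    recalls = [cm[i][i] / true_counts[i] if true_counts[i] else 0.0
               for i in range(n)]
    precisions = [cm[i][i] / pred_counts[i] if pred_counts[i] else 0.0
                  for i in range(n)]

    # Chance agreement: both "annotators" label at random with their own
    # class frequencies.
    p_e = sum(t * p for t, p in zip(true_counts, pred_counts)) / (total ** 2)
    kappa = (acc - p_e) / (1.0 - p_e)

    return {
        "acc": acc,
        "mPre": sum(precisions) / n,
        "mRec": sum(recalls) / n,
        "kappa": kappa,
    }


# Toy two-class example (hypothetical counts).
toy_cm = [[8, 2],
          [1, 9]]
toy_metrics = evaluation_metrics(toy_cm)
```

For the toy matrix, 17 of the 20 images are correct and the chance agreement is 0.5, so the Kappa score is (0.85 - 0.5) / (1 - 0.5) = 0.7.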
Then, during each training epoch, a randomly cropped 224 × 224 patch from each input image is extracted and used for training.

C. EXPERIMENTAL SETUP
We performed our experiments on an ASUS ROG STRIX with an Intel Core i7-6700HQ CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1060 GPU with 6 GB of GPU memory. Keras with the TensorFlow backend was utilized in the experiments.
The training set of the MIO-TCD dataset was split into 80% data for training and 20% data for validation. In addition to validating the base learners, the validation set was used for training the super-learner ensemble. The base learners were all initialized with ImageNet pre-trained weights.
To handle the imbalanced nature of the MIO-TCD dataset, we used the class-weighted categorical cross-entropy loss function for most of the training epochs of the base learners, and for training one of the super-learner ensemble variants. We set the class weights such that the weight of each class is equal to the total number of training images divided by the number of images of that class.
There is no separate testing set for the BIT-Vehicle dataset, so we randomly split the data into 60% for training, 20% for validation, and 20% for testing. The random splits were stratified so that each split maintains the same proportion of vehicles per category as the original dataset. Table 2 shows the size of each category in the three splits of the BIT-Vehicle dataset.
As suggested in [7], we used the Adam [38] optimizer with a learning rate of 10⁻³. However, the learning rate was reduced in the later epochs of training the base learners.
During each epoch, the data is randomly shuffled. We adjusted the training batch size used for each of the individual models as well as the ensemble model so that the data fit within the 6 GB of GPU memory.
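The class-weighting rule stated for the MIO-TCD training (weight of a class = total number of training images divided by the number of images of that class) can be sketched as follows. The class counts below are hypothetical placeholders, not the values of Table 1, and the helper names are illustrative only.

```python
import math


def compute_class_weights(class_counts):
    """Weight of each class = total number of images / images in that class."""
    total = sum(class_counts.values())
    return {name: total / count for name, count in class_counts.items()}


def weighted_cross_entropy(probs, true_class, weights):
    """Class-weighted categorical cross-entropy for a single sample.

    probs maps each class name to its predicted probability.
    """
    return -weights[true_class] * math.log(probs[true_class])


# Hypothetical, heavily imbalanced counts (NOT the MIO-TCD figures).
example_counts = {"car": 900, "bicycle": 100}
example_weights = compute_class_weights(example_counts)

# The same prediction error costs more on the rare class.
probs = {"car": 0.5, "bicycle": 0.5}
loss_on_car = weighted_cross_entropy(probs, "car", example_weights)
loss_on_bicycle = weighted_cross_entropy(probs, "bicycle", example_weights)
```

With these placeholder counts the rare class receives a weight of 10 and the dominant class a weight of about 1.11, so mistakes on rare classes dominate the loss. This is in line with the observation reported later that class weighting raises recall on rare classes at the expense of mean precision.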
To avoid overfitting, we applied early stopping. The training is stopped if the validation loss does not improve after 5 consecutive training epochs.
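The stopping rule above can be written as a short routine. This is a minimal sketch of patience-based early stopping with a patience of 5 epochs; the function name and the sequence of validation losses are made up for illustration, and the actual experiments may have relied on the equivalent callback of the training framework.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value (0.60) is reached at
# epoch 3 and is not improved during the next 5 epochs.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.65, 0.63, 0.50]
stop_epoch = early_stopping_epoch(losses, patience=5)
```

In this example training stops at epoch 8, so the lower loss that would have appeared at epoch 9 is never reached; this is the usual trade-off of a fixed patience window.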

D. EXPERIMENTAL RESULTS ON THE MIO-TCD DATASET
We fine-tuned the ResNet50, DenseNet121, and Xception networks on the MIO-TCD dataset until they reached testing accuracies of 97.13%, 97.51%, and 97.54%, respectively. Fig. 3 shows the confusion matrices of each of the three base learners evaluated on the testing set. Subsequently, we trained the super learner using the validation set for one epoch with the class weights applied. We call this method "RXD-CV-CW". Then, we trained the super learner for more epochs without applying the class weights. We call this method "RXD-CV-CW-NCW".
To get more accurate predictions on the testing set, we applied the standard 10-crop method [17] for evaluating the 3 base learners as well as the super-learner methods. Therefore, after resizing each test image such that the shorter side is 256 pixels, we extracted 10 patches which are the central crop, the four corners, and their horizontal flips and averaged the predictions made by each model. Fig. 4 shows the confusion matrices of ''RXD-CV-CW'' and ''RXD-CV-CW-NCW'' evaluated on the testing set.
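The 10-crop evaluation described above can be sketched as a small geometry routine. The code computes the resized image size (shorter side 256), the ten 224 × 224 crop boxes (center, four corners, and their horizontal flips), and the average of the per-crop predictions. The function names are hypothetical, the image loading and the model call are omitted, and rounding of the resized dimensions is an assumption.

```python
def resized_size(width, height, shorter_side=256):
    """Scale the image so that its shorter side equals `shorter_side`."""
    scale = shorter_side / min(width, height)
    return round(width * scale), round(height * scale)


def ten_crop_boxes(width, height, crop=224):
    """Return 10 tuples (left, top, flipped) describing the crops.

    The first five entries are the center crop and the four corner crops;
    the last five are the same boxes with a horizontal flip applied.
    """
    right = width - crop
    bottom = height - crop
    center = (right // 2, bottom // 2)
    corners = [(0, 0), (right, 0), (0, bottom), (right, bottom)]
    boxes = [center] + corners
    return ([(left, top, False) for left, top in boxes]
            + [(left, top, True) for left, top in boxes])


def average_predictions(predictions):
    """Average a list of per-crop probability vectors class by class."""
    count = len(predictions)
    return [sum(column) / count for column in zip(*predictions)]


# Example: a hypothetical 640 x 480 test image.
new_w, new_h = resized_size(640, 480)
crops = ten_crop_boxes(new_w, new_h)
mean_probs = average_predictions([[0.2, 0.8], [0.4, 0.6]])
```

Each crop would be passed through the model under evaluation, and the ten resulting probability vectors would be averaged with `average_predictions` before taking the arg-max.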
Although both proposed super-learner methods were trained only on the images of the validation set for just a few epochs, they attained high accuracy on the testing set images. Table 3-A demonstrates that, with the exception of the mean precision of "RXD-CV-CW", both of the proposed super-learner methods achieved better evaluation-metric scores than the base learners. Tables 4 and 5 show the recall and precision scores of our base learners compared to those of our super-learner methods. Although the recall and precision scores of some of the base learners for a few individual classes outperform those of the super-learner ensembles, Table 3 demonstrates that the super learners achieve better scores in the four overall performance metrics. This supports the statement made in the introduction that ensembles can effectively harness the complementary strengths of the different base learners: although a base learner might have a weak overall prediction, it can be effective at discriminating specific subclasses.
Using the class-weighted loss function in "RXD-CV-CW" resulted in an improvement in the recall scores of some rare-sample classes, such as Bicycle, Work Van, Single-Unit Truck, Motorcycle, and Non-Motorized Vehicle. However, using the class-weighted loss function resulted in a relatively low mean precision score. On the other hand, training "RXD-CV-CW-NCW" for a few epochs with the un-weighted loss function considerably increased the mean precision, the overall accuracy, and the Cohen Kappa score. Fig. 5 presents samples of the different testing images that were correctly classified by either of the proposed super-learner methods or misclassified by both. The MIO-TCD dataset contains many challenging images. Due to the blurry nature, the low resolution, and the compression artifacts in some images, they are hard to classify even for humans. Although our super-learner methods were robust in accurately predicting the classes of many challenging images of the MIO-TCD dataset, they still fail in classifying some images, as shown in the column of the suspected misclassified images in Fig. 5. Table 6 lists the evaluation results of the proposed super learners vs. state-of-the-art methods that participated in the MIO-TCD classification challenge. "RXD-CV-CW" achieved the best classification accuracy for the Bicycle (91.59%) and Work Van (91.74%) classes. "RXD-CV-CW-NCW" achieved the second-best overall accuracy (97.94%) and Cohen Kappa score (96.78%). "RXD-CV-CW-NCW" comes at the third rank after the methods of [25] and [8], which got the first- and second-best mean precision scores, respectively. Our mean precision score is relatively lower than those achieved in [25] and [8].

E. SUPER-LEARNER ENSEMBLES VS. WEIGHTED-AVERAGE ENSEMBLE
We compare the performance of the super-learner ensemble with the performance of a simple weighted-average ensemble of the base learners. The three base learners were combined using weighted prediction vectors. We used the same weighting approach as [8]. So, the weight vectors were the average of the precision and recall of each individual class as follows:

$$W_{in} = \frac{\mathrm{Pre}_{in} + \mathrm{Rec}_{in}}{2}$$

where i refers to the base learner, n refers to the class index, $\mathrm{Pre}_{in} = TP_{in}/(TP_{in} + FP_{in})$ and $\mathrm{Rec}_{in} = TP_{in}/(TP_{in} + FN_{in})$. The weights for each network are obtained by evaluating the precision and recall scores of that network on the validation set. The final prediction is then calculated by averaging $W_1 X_1$, $W_2 X_2$, and $W_3 X_3$, which are the weighted predictions of the three base learners. We called this network "RXD Weighted-Average Ensemble". Table 6 demonstrates that the "RXD-CV-CW-NCW" ensemble achieves better scores than the "RXD Weighted-Average Ensemble" in all the performance metrics except for the mean precision score. As a result, "RXD-CV-CW-NCW" achieves a better average rank than that of the "RXD Weighted-Average Ensemble".
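The weighted-average baseline can be sketched in a few lines of plain Python. The sketch assumes that each weight vector is multiplied element-wise with the corresponding prediction vector before averaging; the precision, recall, and prediction values are hypothetical two-class placeholders, not measurements from the experiments.

```python
def weight_vector(precisions, recalls):
    """Per-class weight = average of that class's precision and recall."""
    return [(p + r) / 2.0 for p, r in zip(precisions, recalls)]


def weighted_average_ensemble(weight_vectors, predictions):
    """Average the element-wise products W_i * X_i over the base learners.

    Returns the index of the winning class and the fused score vector.
    """
    num_classes = len(predictions[0])
    fused = [0.0] * num_classes
    for w, x in zip(weight_vectors, predictions):
        for n in range(num_classes):
            fused[n] += w[n] * x[n]
    fused = [value / len(predictions) for value in fused]
    return fused.index(max(fused)), fused


# Hypothetical validation-set precision/recall for three base learners
# on a two-class toy problem.
w_1 = weight_vector([0.9, 0.6], [0.7, 0.8])   # [0.8, 0.7]
w_2 = weight_vector([0.5, 0.9], [0.7, 0.9])   # [0.6, 0.9]
w_3 = weight_vector([1.0, 0.4], [1.0, 0.6])   # [1.0, 0.5]

# Hypothetical softmax outputs of the three base learners for one image.
x_1, x_2, x_3 = [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]

label, scores = weighted_average_ensemble([w_1, w_2, w_3], [x_1, x_2, x_3])
```

Unlike the super learner, the weights here are fixed once from validation statistics and combine the predictions linearly, which is the main design difference being compared in this subsection.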

F. ENSEMBLES WITH DIFFERENT FUSION METHODS
In the proposed super-learner ensembles "RXD-CV-CW" and "RXD-CV-CW-NCW", we used a simple concatenation of the outputs of the individual base learners. In [39] and [40], T. Akilan et al. explored fusion approaches other than concatenation that can improve classification accuracy. We examined the use of the product fusion and max fusion approaches that were introduced in [39]. We call them "RXD Multiplication Super Learner" and "RXD Max Super Learner". The results presented in Table 6 reveal that the product and max fusion approaches excel in the recall or the precision scores of some of the individual classes, and the "RXD Multiplication Super Learner" achieves the highest recorded mean recall score. However, their overall ranks are low compared to the other ensemble methods listed in Table 6.
As the vehicle locations in the BIT-Vehicle dataset are pre-annotated, we did not have to apply the 10-crop method for testing. We just cropped the vehicle object at the pre-annotated location, resized the cropped object image so that the shorter side is 256 pixels, and then made the prediction based on the center-cropped 224 × 224 patch. Table 7 shows the confusion matrix of the 2,014 images that were randomly selected as the test sample from the BIT-Vehicle dataset and how they were classified into the MIO-TCD classes without training on the BIT-Vehicle dataset. Despite the obvious differences between the two datasets, the results were reasonable. None of the 2,014 BIT-Vehicle test images were misclassified as the Bicycle, Motorcycle, Non-Motorized Vehicle, or Pedestrian classes of the MIO-TCD dataset. This is a sensible result because none of these classes exist or have equivalent classes in the BIT-Vehicle dataset. Around 68% of the Truck test samples were classified either as Articulated Truck or Single-Unit Truck (40.6% and 27.3%, respectively). The remaining 30.9% were misclassified as Bus.
This is an expected result due to the similarity among the three classes from the frontal view, knowing that some of the BIT-Vehicle images show only the vehicle front (or a partial view of the vehicle front) without showing the body of the vehicle. On the other hand, 29.5% of the Bus BIT-Vehicle test samples were correctly classified as Bus, while 54.5% were classified as either Articulated Truck or Single-Unit Truck. Of the Microbus class, 46.3% were classified as Bus and 28.8% were classified as Work Van; these are the two classes that are most similar to the Microbus class, which does not exist in the MIO-TCD dataset. Of the Minivan class, 45.8% were classified as Single-Unit Truck, which is the class most similar to Minivan. Only 6.8% of the SUV test images were correctly classified as the equivalent Car class, but 30.8% of the SUV test images were classified as Work Van, which is a similar-looking class. As for the Sedan class, only 15.3% were correctly classified as Car. Fig. 6 shows examples of the classification results of "RXD-CV-CW-NCW" on the test set of the BIT-Vehicle dataset.
In the following experiment, we considered augmenting the MIO-TCD training set with some training samples of the Sedan, SUV, and Bus classes from the BIT-Vehicle dataset. We chose these three BIT classes because they are the classes that map unambiguously to equivalent classes in the MIO-TCD dataset, namely the MIO-TCD Car and Bus classes. The Sedan, SUV, and Bus classes comprise 78.3% of the BIT-Vehicle dataset. We fine-tuned the three base learners as well as RXD-CV-CW-NCW using the augmented training set. Table 3-B shows the evaluation metrics of the three base learners as well as the super learner after augmentation. We called the resulting super learner Augmented-RXD. Fig. 7 shows the confusion matrix of the Augmented-RXD super learner as evaluated on the MIO-TCD testing set. Table 6 shows that this augmented super learner achieved metric scores as good as those of the un-augmented super learner "RXD-CV-CW-NCW", and both of them share the third rank together with the ensemble of [23]. Compared to "RXD-CV-CW-NCW", the mean precision of the Augmented-RXD super learner decreased by 0.83%, while the mean recall increased by 0.23%. We again tested the augmented super learner on the BIT-Vehicle testing set, and the confusion matrix is shown in Table 8.
After augmentation, the super learner was able to classify close to 100% of the Sedan and SUV testing images into the MIO-TCD's Car class: 1,184 out of the 1,185 Sedan images were correctly classified, and the remaining Sedan image was classified as Background. However, this incorrect prediction is most likely due to an erroneous annotation of an image that should have been annotated as background. Furthermore, 96.4% of the Bus images were classified correctly. However, since only the Sedan, SUV, and Bus classes were used for augmentation and fine-tuning, the augmented super learner seems to have learned that these high-resolution frontal-view images should only be one of these three classes. This may explain why, out of the 2,014 BIT test images, 1,803 images were classified as Car (89.5%) and 192 images were classified as Bus (9.5%). So, as far as the MIO-TCD dataset is concerned, the data augmentation is not justified because it did not improve the performance on the MIO-TCD dataset. It just helped the model to generalize well to the Bus, Sedan, and SUV images of the BIT-Vehicle dataset.

H. EXPERIMENTAL RESULTS ON THE BIT-VEHICLE DATASET
Finally, we performed customized training on the 6 vehicle classes of the BIT-vehicle dataset. Consequently, the softmax output layers of the base learners and the super learner became 6-unit layers. Similarly, we revised the number of units of the ReLU fully connected layer between the concatenated outputs of the base learners and the softmax output layer of the super learner to be (n × 6), where n is the number of base learners.
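To make the layer sizes concrete, the following is a minimal pure-Python sketch of a forward pass through a super-learner head with the dimensions described above: n concatenated 6-way softmax outputs, a ReLU fully connected layer of n × 6 units, and a 6-unit softmax output. It is only an illustration of the shapes. The weights are random and untrained, all function names are hypothetical, and any further connections of the actual densely connected architecture are not modeled.

```python
import math
import random

NUM_CLASSES = 6   # vehicle classes in the BIT-vehicle dataset
N_BASE = 3        # n: number of base learners (3 for an RXD-style ensemble)

def dense(x, weights, biases):
    """Fully connected layer: one output unit per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

def random_layer(n_in, n_out, rng):
    """Hypothetical untrained layer: small random weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def super_learner_forward(base_outputs, hidden_layer, output_layer):
    # Concatenate the softmax outputs of the base learners: n * 6 values.
    x = [p for probs in base_outputs for p in probs]
    h = relu(dense(x, *hidden_layer))        # (n x 6)-unit ReLU layer
    return softmax(dense(h, *output_layer))  # 6-unit softmax output

rng = random.Random(0)
hidden_units = N_BASE * NUM_CLASSES
hidden_layer = random_layer(N_BASE * NUM_CLASSES, hidden_units, rng)
output_layer = random_layer(hidden_units, NUM_CLASSES, rng)

# Stand-ins for the base learners' softmax outputs on a single image.
base_outputs = [softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_CLASSES)])
                for _ in range(N_BASE)]
fused = super_learner_forward(base_outputs, hidden_layer, output_layer)
print(len(fused), sum(fused))
```

With n = 3 the hidden layer has 18 units, and with n = 2 (Xception and DenseNet only) it would have 12.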
In [6], the original paper of the BIT-vehicle dataset, Dong et al. used a large number of unlabeled vehicle images from the BIT-vehicle dataset to learn the filters of the network using unsupervised pre-training. Subsequently, they trained the softmax output layer with 200 randomly selected samples from each vehicle category. They also kept 200 samples from each vehicle category for testing.
Since we follow a supervised learning approach, we used most of the dataset images to train the base learners. Section IV-C explains how we split the BIT-Vehicle dataset into training, validation, and testing sets. We used the training set (60% of the dataset images) to train the base learners. Then, the validation set (20% of the dataset images) was used to train the super learner. The performance of the base learners and the super learner was evaluated using the testing set, which comprises the remaining 20% of the images.
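The 60/20/20 protocol can be sketched as follows. This is a plain shuffled split under stated assumptions: the function name, the seed, and the use of simple random shuffling are illustrative only, and the exact procedure of Section IV-C (for example, whether the split is stratified per class) is not reproduced here.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle items and split them into 60% / 20% / 20% partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 60 // 100   # integer arithmetic avoids float rounding
    n_val = len(items) * 20 // 100
    train = items[:n_train]                    # trains the base learners
    val = items[n_train:n_train + n_val]       # trains the super learner
    test = items[n_train + n_val:]             # held out for evaluation
    return train, val, test

# Hypothetical image identifiers standing in for dataset images.
train_set, val_set, test_set = split_60_20_20(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

Keeping the three partitions disjoint matters here: the super learner is fitted on base-learner outputs for images that the base learners did not see during training, and the final scores come from images that neither stage has seen.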
Although the common evaluation metric of the BIT-vehicle dataset in the literature is accuracy, we also evaluated the mean recall, mean precision, and Cohen Kappa scores to make the results more indicative and more comprehensive. We fine-tuned the ResNet50, DenseNet121, and Xception base learners on the BIT-vehicle dataset until they reached testing accuracies of 95.93%, 97.02%, and 97.73%, respectively. In the first BIT super learner, we combined the outputs of the 3 base learners. We called this super learner 'BIT-RXD'. As shown in Table 9, BIT-RXD achieved better mean precision and mean recall scores than the base learners. However, it had the same Cohen Kappa and accuracy scores as the Xception model. We noticed that the scores of the ResNet base learner were relatively low compared to those of DenseNet and Xception. Therefore, we trained another super learner that ensembles the outputs of Xception and DenseNet only. We called it 'BIT-XD'. As shown in Table 9, BIT-XD achieved better scores in the four metrics not only compared to the base learners but also compared to BIT-RXD. Tables 10 and 11 show the recall and precision scores of the base learners and super learners. Although the base learners are better than the super learners in the recall or precision scores of some classes, the improvement that the super learners achieved in the recall and precision scores of the remaining classes resulted in better mean recall and mean precision scores.
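For reference, the four metrics can all be derived from a confusion matrix. The sketch below uses the standard definitions of accuracy, macro-averaged recall and precision, and Cohen's Kappa. The 3-class matrix is a made-up toy example, not data from this paper, and the function name is illustrative.

```python
def classification_metrics(cm):
    """Compute metrics from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class occurs at least once as a true label and as a
    prediction (otherwise the per-class ratios would divide by zero).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    accuracy = sum(cm[i][i] for i in range(k)) / total
    mean_recall = sum(cm[i][i] / true_counts[i] for i in range(k)) / k
    mean_precision = sum(cm[i][i] / pred_counts[i] for i in range(k)) / k

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    chance = sum(true_counts[i] * pred_counts[i] for i in range(k)) / total ** 2
    kappa = (accuracy - chance) / (1.0 - chance)
    return accuracy, mean_recall, mean_precision, kappa

# Toy confusion matrix (hypothetical values).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
accuracy, mean_recall, mean_precision, kappa = classification_metrics(toy_cm)
print(accuracy, mean_recall, mean_precision, kappa)
```

Because mean recall and mean precision weight every class equally, they can improve even when accuracy and Kappa stay unchanged, which matches the behavior reported for BIT-RXD relative to Xception.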
The confusion matrices of BIT-RXD and BIT-XD are shown in Fig. 8. Examples of the classification results of BIT-RXD and BIT-XD are shown in Fig. 9, which clearly illustrates the inter-class similarities between the categories of the BIT-vehicle dataset and how each of the super learners dealt with them. BIT-RXD achieved 100% accuracy in the Bus class, whereas BIT-XD misclassified only 1 Bus image as Microbus, as shown in the top left image of Fig. 9. BIT-XD achieved equal or better accuracies in the remaining classes.

V. CONCLUSION
In this paper, a super-learner ensemble of deep networks for vehicle classification in traffic surveillance images was proposed. We introduced a densely connected single-split super learner and applied variants of it to two of the largest and most challenging publicly available traffic surveillance datasets, the MIO-TCD dataset and the BIT-vehicle dataset. Although our method is simple and easy to train, and does not rely on any handcrafted features or logical reasoning, it achieved results comparable to those of the state-of-the-art methods that were designed for the two datasets. Three variants of the super learner ensemble, RXD-CV-CW, RXD-CV-CW-NCW, and Augmented-RXD, were examined on the MIO-TCD dataset with variations in applying class weights and data augmentation during training. RXD-CV-CW-NCW and Augmented-RXD share the third place among the published state-of-the-art methods reported in the MIO-TCD classification challenge. The applied data augmentation did not yield a significant performance improvement on the MIO-TCD dataset. However, it helped the network to generalize well to the Bus, Sedan, and SUV images of the BIT-vehicle dataset.
In addition, the super-learner variants that we trained on the BIT-Vehicle dataset images performed very well and achieved overall accuracies of up to 97.62%.
In future work, we will consider extending this approach to the MIO-TCD localization challenge as well.