Deep Convolutional Neural Network Ensembles Using ECOC

Deep neural networks have enhanced the performance of decision making systems in many applications, including image understanding, and further gains can be achieved by constructing ensembles. However, designing an ensemble of deep networks is often not very beneficial, since the time needed to train the networks is generally very high or the performance gain obtained is not very significant. In this paper, we analyse an error correcting output coding (ECOC) framework for constructing ensembles of deep networks and propose different design strategies to address the accuracy-complexity trade-off. We carry out an extensive comparative study between the introduced ECOC designs and state-of-the-art ensemble techniques such as ensemble averaging and gradient boosting decision trees. Furthermore, we propose a fusion technique that is shown to achieve the highest classification performance.


I. Introduction
The improved generalisation capability of ensembles, compared to their constituent classifiers (also known as the base classifiers), has long been established theoretically and experimentally. This capability is attributed to the complementarity of the individual classifiers in an ensemble, which jointly offer an error correcting mechanism, and is manifested in low prediction bias and variance [1]-[4]. The combination rules employed by these systems can be as simple as taking a vote between the base classifiers, or more complex, where the classifiers are trained to compensate for the weaknesses of each other.
Several fusion techniques such as averaging, majority voting [5], bagging [6], stacking [7], random forests [8], error correcting output coding [9], [10] and their variants have been widely used in traditional machine learning. However, extensions of some of these approaches to deep learning (DL) systems have been deemed inefficient and challenging, due to the computational complexity associated with the training of deep networks, as well as the difficulty in securing diversity among the base classifiers. Therefore, most of the state-of-the-art DL ensembles are either formed of simple averaging (or voting) frameworks comprising only a small number of base classifiers [11]-[15], or weak decision tree ensembles based on boosting deep features that have been already extracted [16]-[19].
Averaging ensembles are composed of base classifiers which are mainly obtained by modifying various DL elements such as the types of network architectures and their parameters, data augmentation techniques, and learning meta-parameters. An example is DeepFace [11], where Taigman et al. construct a face verification system of 7 deep networks and achieve 97.35% accuracy, compared to 97.0% obtained using a single face verification network. In another work, Szegedy et al. [12] increase the accuracy from 40% (single network) to 43.9% by averaging 6 GoogLeNet networks in the ILSVRC 2015 detection challenge. Yet another example is the winner of the PlantCLEF2017 competition [20], which is formed of 12 networks that are trained with an emphasis on complementarity and achieved a top-1 accuracy of 88.5% in classifying 10,000 different plant species. Similarly, Gessert et al. [15] employ multi-resolution EfficientNets [21] for skin lesion classification based on an ensemble of 15 deep networks, where the area under the curve (AUC) is increased from an average of 94 per classifier to 95.4.
Despite the performance gain achieved by the deep averaging ensembles, their main drawback is the increased time complexity, which scales linearly with the addition of each base classifier network. In the literature, gradient boosting decision tree (GBDT) methods are proposed to address this shortcoming, by operating on the deep features obtained from one base network (contrary to generating many deep networks as base classifiers) and constructing a sequential ensemble of trees which are trained to correct each other's errors, using these features.
There are three commonly used GBDT variations in the literature: extreme gradient boosting (XGBoost) [22], light gradient boosting machine (LightGBM) [23], and categorical boosting (CatBoost) [24]. As an example of XGBoost, Pang et al. [18] propose a subcellular localisation method by integrating a convolutional neural network (CNN) and XGBoost, where the CNN acts as a feature extractor and XGBoost acts as a classifier to identify the protein subcellular localisation. In another work, Torres-Barrán et al. [25] study the application of XGBoost to global and local wind energy prediction and solar radiation problems, exploiting gradient boosting regression methods. As for LightGBM, Ju et al. [19] overcome the limitation of the single-convolution model in predicting the wind power by integrating the LightGBM algorithm to improve the robustness and accuracy of the forecasting.
Although GBDT is a powerful ensemble technique, the major disadvantages are its inability to deal with a high number of classes and the high number of hyper-parameters that need to be tuned to obtain the desired performance. It is important to note that the improved time complexity obtained with respect to the averaging ensembles is at the expense of a reduced ensemble performance. In this article, we address the drawbacks of deep averaging ensembles (time complexity) and GBDT (accuracy), and propose an efficient DL framework based on error correcting output coding (ECOC).
ECOC, borrowed originally from communication theory [26], [27], is a multi-class classification ensemble, in which a given multi-class problem is decomposed into several two-class problems, whose simpler decision boundaries are then combined to give the final, more complex decision boundary. The errors of the base classifiers that implement the two-class decision boundaries are corrected to a certain degree [28]. Several data-dependent and independent approaches can be used for guiding the decomposition process [29], [30]. In [31], it has been theoretically and experimentally demonstrated that ECOC frameworks formed using random class splits obtain close to Bayes performance, if there are infinitely many such splits, and the associated base classifiers achieve good accuracy. In practice, the performance reaches the optimum very rapidly [32] as the number of the classifiers increases. The superiority of this method, which we refer to as randECOC, over the rest of the data independent and dependent ECOC approaches is demonstrated in [28], [31], [33].
Although ECOC has been commonly employed in traditional machine learning applications [34]-[37], to date, to the best of our knowledge, its potential as a method of constructing deep convolutional neural network ensembles has neither been exploited nor analysed. The only work addressing ECOC in DL research is [38], where it is utilised for the adversarial robustness of the networks.
In this work, by operating on the base network features, we propose and analyse efficient implementation strategies for randECOCs. We investigate three different design procedures: i) the straightforward approach of training the base classifiers independently, ii) multi-task learning (MTL) for faster training, and iii) MTL with embedded error correction. It is expected that the selection of the most appropriate design procedure will be carried out by the user, depending on the specific requirements of an application, and the time complexity versus accuracy trade-off.
We show that for all the proposed design techniques, randECOC almost always surpasses the GBDT performance at comparable time complexity, when MTL based implementation strategies are considered. When compared with averaging ensembles, a degradation in performance has been noted, due to the end-to-end training nature of averaging ensembles as opposed to the feature-based training of randECOC. However, for the users who have enough resources to accommodate averaging ensembles, we propose combining randECOC with averaging, and show that this setup yields the highest accuracy in all scenarios.
The main contributions of this study can be summarised as follows:
• We propose three different designs for randECOC ensembles to be used with convolutional deep neural networks and analyse these approaches in terms of accuracy and time complexity trade-off, using several different deep network architectures and four different datasets.
Section II provides background information on the state-of-the-art deep ensemble classification techniques as well as the ECOC framework. In Section III, different ECOC training strategies using features extracted by deep convolutional neural networks are presented. This is followed by their experimental analysis in Section IV and a discussion of the results obtained. Finally, conclusions of this study are presented in Section V.

II. Background
In the literature, averaging and majority voting are the most commonly used classifier combination approaches, where the ensemble output is calculated based on the (weighted) average of the base classifier outputs or their most-voted prediction. Bagging [6] is a special case of majority voting for which the base classifiers are trained on different versions [42] of the same data obtained by resampling. More complex combination rules include methods such as boosting [43], where the classifiers are trained sequentially to compensate for the weaknesses of those already selected, and stacking [44], where the outputs of all classifiers are fed into a new model to generate the final prediction. Another commonly used ensemble technique is random forests [8], which are composed of multiple decision trees trained on bootstrapped training data with an additional step of feature-bootstrapping to allow for a random selection (with replacement) of features at each tree node. The final decision is based on the (weighted) average of the outputs or the majority vote of the individual tree decisions.
The fusion rules most commonly applied in the state-of-the-art deep learning ensembles are based on averaging or majority voting. These ensembles consist of a small number of deep neural network architectures as base classifiers, which differ from each other in terms of the data augmentation techniques used during training and/or network architectures and/or learning parameters (such as learning rates, training and validation set partitions, weight initialisation and data batches). Due to the costly training of these ensembles, they are typically composed of only a handful of base classifiers.
The second most common combination strategy, gradient boosting decision trees (GBDT), overcomes the time complexity of the averaging/voting ensembles of deep neural networks by extracting the bottleneck features of one base network and using them for training a sequence of decision trees. However, the gain in time complexity of this approach is compromised by reduced accuracy, especially for a high number of classes. Moreover, the method requires a high number of hyper-parameters to be tuned to obtain the desired performance.
In Section II-A, we analyse in detail three state-of-the-art variants of the GBDT method found in the literature, namely extreme gradient boosting (XGBoost) [22], light gradient boosting machine (LightGBM) [23] and categorical boosting (CatBoost) [24]. Later, in Section II-B, we provide the background for error correcting output coding (ECOC) ensembles, on which we build our novel design strategies for ensembles of deep learning networks, presented in Section III.

A. Gradient Boosting Decision Trees (GBDT)
Gradient boosting is a machine learning technique for regression and classification problems that creates an ensemble of weak prediction models to achieve powerful prediction. When decision trees are used as the base classifiers, the method is referred to as gradient boosting decision trees (GBDT).
Unlike random forests, where the decision trees are constructed in parallel prior to combination, GBDT employs a boosting approach, in which each tree is sequentially trained with the aim of correcting the error produced by its predecessor. In particular, every tree is trained to learn the residual between the desired output and the output of the previous tree, using gradient descent. The most important parameter in GBDT is the number of base classifiers, which controls the model complexity. The most recent and efficient GBDT methods developed are XGBoost [22], LightGBM [23], and CatBoost [24]. These algorithms differ from each other in terms of the mechanism used for splitting the tree nodes.
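The residual-fitting idea can be illustrated with a minimal sketch, assuming a squared-error loss (for which the negative gradient is exactly the residual), a single scalar feature, and depth-one trees (stumps). The function names, the toy data, and the learning rate are illustrative assumptions; this is not the split-finding procedure of XGBoost, LightGBM, or CatBoost.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error on the residuals."""
    best = None
    # Exclude the largest value so that the right-hand leaf is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Sequentially fit stumps, each one to the residual left by the current ensemble."""
    base = sum(ys) / len(ys)          # initial constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data (hypothetical): a three-level step function of one feature.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
base, stumps, fitted = boost(xs, ys)
print(round(mse(ys, fitted), 6))
```

The number of trees plays the role of the model-complexity parameter mentioned above: with this toy data, the training error after 20 stumps is lower than after a single stump.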
Extreme gradient boosting (XGBoost) [22] is a highly extensible tool mainly designed to overcome the overfitting limitations of the traditional gradient boosting methods. It uses pre-sorted and histogram-based algorithms for computing the best split, which continues until the maximum level, predefined by the "max_depth" hyper-parameter, is reached. Once at the maximum level, the splits are pruned backwards until there is no positive gain.
Light gradient boosting machine (LightGBM), proposed and developed by Microsoft [23], uses gradient-based one-side sampling (GOSS) to filter out data instances on the basis of their contribution to the gradient of the loss function. The best split is obtained by using all of the instances with large gradients and a random sample of instances with small gradients to maintain a balance between the training data reduction and accuracy. LightGBM uses a leaf-wise tree growth mechanism which allows the growth of an imbalanced tree.
Categorical boosting (CatBoost) [24] focuses on categorical features by using minimal variance sampling (MVS), which is a weighted sampling method at the tree level. Unlike LightGBM, CatBoost grows balanced trees, which makes this method less prone to overfitting, and uses combinations of categorical features as additional categorical features to capture high-order dependencies. As it is infeasible to process all of the possible combinations, CatBoost avoids the exponential growth in the number of feature combinations by constructing the combinations in a greedy way.

B. Error Correcting Output Coding (ECOC)
Error Correcting Output Coding (ECOC) is a generic ensemble classification framework designed for multi-class classification problems [28], where the aim is to decompose a given multi-class problem into several two-class problems. The final decision boundary is formed by combining the boundaries of the base classifiers trained on these simple decompositions while providing a scope for error correction.
The way the decomposition is carried out in ECOC is defined by a design code matrix. Accordingly, a base classifier may be assigned the task of separating a particular class from all of the others, or learning a random dichotomy of the classes. The commonly used ensemble approaches such as one-vs-one or one-vs-all can therefore be considered as special types of ECOC systems.
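Let us consider a problem with $K$ classes $c_i$, $i = 1, \ldots, K$, and $L$ base classifiers $h_j$, $j = 1, \ldots, L$. The decomposition is specified by a code matrix $M$ of size $K \times L$, in which the entry $M_{ij} \in \{-1, +1\}$ is the target label of class $c_i$ for base classifier $h_j$; the $i$-th row $M_i$ is the codeword of class $c_i$, and the $j$-th column defines the two-class problem learned by $h_j$. The design (encoding) of the code matrix can be carried out in several ways. These include problem-independent approaches such as one-vs-one or one-vs-all [28], or problem-dependent methodologies where the aim is to split the classes in the given data domain meaningfully [33], [45].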
In decision making (testing), firstly, a given test instance $x$ is classified by each base classifier to obtain the output vector $Y = [y_1, \ldots, y_L]$, where $y_j$ is the hard or soft output of the classifier $h_j$ for $x$. Then, the distance between $Y$ and the codeword $M_i$ of class $c_i$, $\forall i$, is computed using a metric $d(\cdot, \cdot)$ such as Hamming, Manhattan or Euclidean distance. The class $c^*$ associated with the minimum distance is chosen as the predicted class, such that

$$c^* = \arg\min_{c_i,\ i = 1, \ldots, K} d(Y, M_i). \qquad (1)$$

While choosing the closest codeword during the target prediction, the system is able to correct some of the base classifiers' mistakes. Specifically, up to $\lfloor (e - 1)/2 \rfloor$ base classifier errors can be corrected if Hamming distance (HD) is chosen as the distance metric, and the minimum HD between any pair of codewords is $e$.
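The decoding rule described above (assign the class whose codeword is nearest to the output vector) can be sketched as follows. The 4-class, 7-classifier code matrix is a hypothetical example chosen so that every pair of codewords differs in exactly 4 positions; the function names are likewise illustrative.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def min_pairwise_distance(code_matrix):
    """Minimum Hamming distance e between any pair of codewords (rows)."""
    return min(hamming(a, b) for a, b in combinations(code_matrix, 2))


def correctable_errors(code_matrix):
    """Number of base classifier errors that Hamming decoding is sure to correct."""
    return (min_pairwise_distance(code_matrix) - 1) // 2


def decode(outputs, code_matrix):
    """Threshold soft outputs at zero, then return the index of the nearest codeword."""
    hard = [1 if y >= 0 else -1 for y in outputs]
    distances = [hamming(hard, codeword) for codeword in code_matrix]
    return distances.index(min(distances))


# Hypothetical code matrix: rows are codewords, columns are two-class problems.
CODE = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]

# Outputs for a sample of class 2 in which the first base classifier is wrong.
noisy_outputs = [+0.8, -0.9, +0.7, +0.6, -0.8, -0.7, +0.9]
predicted = decode(noisy_outputs, CODE)
print(min_pairwise_distance(CODE), correctable_errors(CODE), predicted)
```

With a minimum distance of 4, a single wrong base classifier leaves the output at distance 1 from the true codeword and at least 3 from every other one, so the correct class is still recovered.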
Although the encoding and decoding of ECOC matrices are open research problems, it is important to note that randomly generated ECOC matrices (randECOC) have been shown to reach Bayes performance when used with a large enough number of base classifiers, each exhibiting close to Bayes accuracy [31]. In practice, it has been experimentally demonstrated in [32] that for problems involving ∼10 classes, randECOCs of length 20-30 would be enough to converge to optimum performance, whereas this number would grow to 200-300 when the number of classes is ∼100.
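A random code matrix of the kind discussed above could be generated as in the sketch below. The text does not specify how degenerate draws are handled, so the two rejection rules used here (discard columns that place all classes on one side, and redraw if two classes receive the same codeword) are assumptions, as are the function name and the seed.

```python
import random


def random_code_matrix(n_classes, n_classifiers, seed=0):
    """Draw a K x L matrix with entries in {-1, +1}, one column per two-class problem."""
    rng = random.Random(seed)
    while True:
        columns = []
        while len(columns) < n_classifiers:
            column = [rng.choice((-1, 1)) for _ in range(n_classes)]
            if len(set(column)) == 2:      # assumption: skip columns that do not split the classes
                columns.append(column)
        rows = [tuple(column[i] for column in columns) for i in range(n_classes)]
        if len(set(rows)) == n_classes:    # assumption: every class needs a distinct codeword
            return [list(row) for row in rows]


# A 10-class problem with 30 base classifiers, within the 20-30 range quoted above.
M = random_code_matrix(n_classes=10, n_classifiers=30)
print(len(M), len(M[0]))
```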

III. Design Strategies for randECOC Using CNNs
Under the assumption of unconstrained computational resources, the optimal strategy to achieve the highest prediction performance using randECOC would be to train each base classifier independently. End-to-end training of these classifiers, each of which is initialised with random weights, would help increase the diversity between classifiers and enforce independence, which is a key element in achieving close-to-Bayes performance [31], [32]. However, this procedure would suffer from similar time complexity drawbacks as in averaging ensembles and be impractical in real-life applications.
For this reason, in this section, we propose and analyse different design strategies for randECOC matrices, which address the shortcomings of time complexity associated with averaging ensembles, while still achieving better performance than their time efficient alternative, GBDT. In the design strategies presented in Section III-A through III-C, we propose to initially train a multi-class base network to obtain the bottleneck features (as opposed to end-to-end training), and build three implementation techniques with varying accuracy vs time complexity trade-off on these features.
Specifically, after presenting the straightforward approach to designing randECOC ensembles with base classifiers trained independently using bottleneck features in Section III-A, we propose a more time-efficient implementation strategy based on multi-task learning (MTL) in Section III-B. Then, in Section III-C, the MTL based strategy is further improved with the incorporation of an error-correcting mechanism as a separate layer of the network. This strategy aims to couple the base classifier training to the classification problem, as opposed to training the base classifiers only to be in agreement with the encoding matrix. Few studies exist that learn or modify the ECOC matrix after the training of the base classifiers, for their joint optimisation [46]-[48]. For all three designs, the base networks are assumed to be convolutional neural networks (CNNs).

A. Independent Learning of Base Classifiers
In this approach, the base classifiers are trained one by one and independently according to a given randECOC matrix, using the deep features extracted from the bottleneck layer of a base network. A schematic diagram illustrating an example of independently trained base classifier networks is given in Figure 1. Here, we propose to design the base classifiers as shallow networks, whose outputs are then combined for an error-correcting randECOC decoding to give the final output. In other words, after extracting the output vector Y(x) for a given test sample x from all shallow networks, the prediction is carried out in a separate decoding step, where x is assigned the class with the closest codeword to Y(x) (see Equation 1).
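Training "according to a given randECOC matrix" amounts to relabelling the multi-class training labels column by column. The sketch below shows that relabelling step only, assuming integer class labels and a small hypothetical 3-class, 4-classifier code matrix; the function name is illustrative and the shallow networks themselves are omitted.

```python
def dichotomy_targets(labels, code_matrix):
    """For each base classifier j, map every class label y to the target code_matrix[y][j]."""
    n_classifiers = len(code_matrix[0])
    return [[code_matrix[y][j] for y in labels] for j in range(n_classifiers)]


# Hypothetical code matrix: 3 classes (rows), 4 base classifiers (columns).
CODE = [
    [+1, -1, +1, -1],
    [-1, +1, +1, -1],
    [-1, -1, -1, +1],
]
labels = [0, 2, 1, 1, 0]                 # multi-class labels of five training samples

targets = dichotomy_targets(labels, CODE)
print(targets[0])                        # two-class targets seen by base classifier 0
```

Each inner list is the target vector for one independently trained base classifier. Read per sample instead of per classifier, the same targets form that sample's class codeword.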

B. Multi-task Learning of Base Classifiers
In order to achieve close-to-Bayes accuracy, the number of base classifiers required for a randECOC ensemble should increase with the number of classes. Although all independent tasks can potentially be trained in parallel as proposed in Section III-A, this framework might be unattractive under the assumption of limited resources, despite the performance gain promised.
To address this, we consider the idea of simultaneous training of the base classifiers by employing an MTL based strategy, where the classifiers are trained to learn multiple labels, i.e. the desired base classifier outputs, at the same time. Although this method can only approximate the performance of the independently trained base classifiers, it is important from the point of view of accuracy versus time complexity trade-off.
In this approach, we have a single MTL network comprising several layers shared among all base classifiers, with L output nodes, as opposed to L independent networks. In other words, while training the independent classifiers sequentially would mean the repetition of the randECOC procedure L times, training all classifiers at the same time via MTL would imply carrying out this step only once. Hence, the training time of the MTL network is approximately L times lower than that of independent sequential training. An illustration of an example MTL network is presented in Figure 2.
The prediction is carried out in the same way as in Section III-A, where ECOC decoding is executed as the second step following the extraction of classifier outputs in the first step. Note that we propose this network should also include a small number of shallow, classifier specific layers to allow for diversity. As a further advantage of the MTL network, it should be noted that the sharing of the base network and the subsequent layers is expected to reduce overfitting, as observed in the literature [49], since the nodes in the shared layers are constrained to work for multiple classifiers.

C. Multi-task Learning with Embedding
Despite its advantages in terms of speed and reduced overfitting, the MTL network described in Section III-B is suboptimal in the sense that the second step of the prediction, namely ECOC decoding, is carried out separately from the network training. In other words, while the base classifiers are enforced to learn the dichotomies (two-class problems) indicated by the randECOC matrix, they are not enforced to reveal the desired multi-class label.
In order to address this issue, we propose to extend the MTL network with a K-node output layer, with weights set from the randECOC codewords and the output nodes representing the original classes. This layer not only enforces the final, multi-class decision on the outputs of the two-class base classifiers, but also includes the ECOC decoding inherently. The proposed framework is illustrated in Figure 3 with an example architecture. It is referred to as "MTL w/ embedding" in the remainder of this paper.
It is worth mentioning that the randECOC matrix is not learned here but is pre-set. In some earlier work, the matrix was modified during or after the training of the base classifiers, with the goal of reducing this decoupling between the encoding and base classifier training stages [46]-[48].
Let us assume that the nodes corresponding to the base classifiers $h_j$, $j = 1, \ldots, L$, are connected to the output nodes $o_i$, $i = 1, \ldots, K$, with the preset ECOC matrix weights $w_{ij} = M_{ij}$. For a given input $x$, each output node $o_i$ represents the score for class $c_i$, such that

$$o_i(x) = \sum_{j=1}^{L} w_{ij} h_j(x). \qquad (2)$$

Note that the maximum value of $o_i(x)$ is $L$ when all the base classifier outputs are in agreement with their associated bits of the codeword for that class (targets), while the minimum is $-L$ when all base classifier outputs are wrong. In other words,

$$-L \le o_i(x) \le L. \qquad (3)$$

The loss function used to train the network is designed with two goals: 1) To maximise the output of the correct class, $o_c$; 2) To match the output vector $h(x)$ to the predetermined codeword $w_c$, so as to benefit from the ECOC framework. Therefore, given a sample of class $c$ and groundtruth $T = [t_1, \ldots, t_K]$ (one-hot encoded vector where $t_c = 1$ for only the correct class and zero elsewhere), we use the loss function given in Equation 4. We ignore $o_i$, $i \neq c$, because maximising $o_c$ is equivalent to minimising the other class outputs, thanks to the design of the ECOC matrix.
With ternary ECOC where there are zeros in the code matrix, the maximum output value of L is not attainable for o c , hence L should be replaced with the number of non-zeros in a codeword.
To train the network, we use stochastic backpropagation, starting with the weights of the base classifiers $h_j$, as the ECOC matrix weights are fixed. This requires the partial derivative of the combined loss function with respect to $h_j(x)$. For the final prediction, the class $c_i$ that has the maximum $o_i(x)$ (equivalently, minimum distance to the base classifier outputs $h(x)$) is chosen as the predicted class.
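The equivalence between the maximum score and the minimum distance can be made explicit with a short derivation. It assumes binary codewords with $M_{ij} \in \{-1, +1\}$ (no zero entries) and the score $o_i(x) = \sum_j M_{ij} h_j(x)$ produced by the fixed-weight output layer; $d_H$ denotes the Hamming distance.

```latex
% Hard outputs, h_j(x) \in \{-1,+1\}: each agreeing bit contributes +1 to the
% score and each disagreeing bit contributes -1.
\begin{align}
o_i(x) &= \bigl(L - d_H(h(x), M_i)\bigr) - d_H(h(x), M_i) = L - 2\, d_H(h(x), M_i), \\
d_H(h(x), M_i) &= \tfrac{1}{2}\bigl(L - o_i(x)\bigr).
\end{align}
% Soft outputs, h_j(x) \in [-1,1]: expand the squared Euclidean distance and
% use \|M_i\|^2 = L for binary codewords.
\begin{align}
\|h(x) - M_i\|^2 &= \|h(x)\|^2 - 2\, o_i(x) + L.
\end{align}
```

In both cases the terms other than $o_i(x)$ do not depend on the class index $i$, so choosing the largest $o_i(x)$ selects the same class as choosing the nearest codeword. For ternary codes the norm $\|M_i\|^2$ varies with $i$, and the identity no longer holds in this form.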

IV. Experimental Analysis and Results
To evaluate the effectiveness of the proposed randECOC techniques and compare their efficiency in terms of time complexity and accuracy with the state-of-the-art ensemble methods, we conduct various experiments using well-known deep architectures and multi-class datasets. Specifically, the comparative studies are performed on: 1) Simple averaging ensemble; 2) Gradient boosting decision trees (GBDTs): XGBoost, LightGBM, and CatBoost; 3) randECOC ensembles: Independent learning, MTL, and MTL with embedding. After carrying out the comparisons, we combine randECOC and GBDT approaches with ensemble averaging, i.e. we generate ensembles of randECOC and GBDT ensembles and analyse their performance. The purpose of this experiment is to measure the highest possible prediction accuracy, for scenarios where the available resources (computational resources including processing power, time and storage) are not a limiting factor for the user. In Section IV-A, the details of the datasets used in the experiments are presented and in Section IV-B, various base network architectures utilised in this study are described. This is followed by the details of the experimental setup in Section IV-C, and a thorough discussion of the results in Section IV-D.

A. Datasets
We carry out the experimental analyses on four state-of-the-art multi-class classification problems based on digit classification and object recognition using images. In all tasks, each image contains a single object on an unconstrained background.
• CIFAR-10 [39]: This dataset consists of 60,000 (32 × 32) images belonging to 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), and is divided into 50,000 images to be used for training and 10,000 for testing.
• CIFAR-100 [39]: Similar to the CIFAR-10 dataset, CIFAR-100 consists of 50,000 training images and 10,000 testing images. There are 100 classes in this dataset, grouped into 20 super-classes. Each image comes with a "fine label", which is the class label, and a "coarse label", which is the super-class to which it belongs. In our study, we make use of the fine labels.

B. Base Network Architectures
Inception-V3, proposed by Google in [50], is a widely-used image recognition model. It consists of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, dropouts, and fully connected layers. Batch normalisation is used extensively throughout the model and applied to activation inputs. The network is 48 layers deep with a total of 23.8 million parameters.
Xception [51] is an extreme version and an extension of the Inception [12] architecture, which replaces the standard inception modules with depth-wise separable convolutions. The network is 71 layers deep with a total of 22.9 million parameters.
The final architecture, SENet [52], introduces the squeeze-and-excitation block that adaptively re-calibrates the channel-wise feature responses by modelling the interdependencies between channels to automatically acquire the importance of each feature channel. SENet is the winner of the ILSVRC 2017 classification challenge.

C. Experimental Setting
All the base networks employed in this study are pre-trained on the ImageNet dataset [53]. The inputs to the ensemble systems are obtained by forward passing the datasets through the fine-tuned networks and extracting the features of the last pooling layer (2048 neurons). Note that for a fair comparison, no extra randomness, such as data augmentation, has been applied during training or feature extraction.
The averaging ensembles are obtained by training 5 base networks with different random weight initialisations, as this is a typical number employed in these ensembles to respect time and computing power constraints. For randECOC using independently trained base classifiers, we train L classifiers with a simple multi-layer perceptron architecture, where L is set to 30, 300, and 30 for the CIFAR-10, CIFAR-100, and SVHN datasets, respectively. The architecture, which is depicted in Figure 1, consists of three fully connected layers with (500, 50, 10) units, each followed by a rectified linear unit (ReLU) activation function and a dropout layer. Finally, each output layer has one neuron that is associated with a hyperbolic tangent (tanh) activation function and a mean square error (MSE) loss function.
The randECOC framework using MTL is composed of two shared fully connected layers with (500, 50) units, each of which is followed by a ReLU activation function and a dropout layer. For each classifier, there are some specific layers: a dropout layer, a fully connected layer with 10 units, a ReLU activation function, a fully connected layer with one unit, and a tanh function. The output units are concatenated to form one layer with L units defining the output layer. Similar to randECOC with independently trained base classifiers, the MSE loss function is used here. The network structure of this framework is as given in Figure 2.
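To give a feel for how much computation the shared layers save, the sketch below counts trainable parameters for the two architectures described above, for L = 30 and 2048-dimensional input features. Counting one bias per unit and using the parameter count as a rough proxy for training cost are assumptions of this sketch, not measurements from the experiments.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(sizes):
    """Total parameters of a stack of fully connected layers with the given widths."""
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


FEATURES = 2048   # size of the extracted feature vector
L = 30            # number of base classifiers used for CIFAR-10 and SVHN

# Independent learning: L separate networks, 2048 -> 500 -> 50 -> 10 -> 1.
independent = L * mlp_params([FEATURES, 500, 50, 10, 1])

# MTL: one shared trunk 2048 -> 500 -> 50, plus L small heads 50 -> 10 -> 1.
shared = mlp_params([FEATURES, 500, 50])
heads = L * mlp_params([50, 10, 1])
mtl = shared + heads

ratio = independent / mtl
print(independent, mtl, round(ratio, 2))
```

Under these assumptions the independent design has roughly 31.5 million parameters against about 1.07 million for the MTL network, a ratio close to L, because almost all parameters sit in the first fully connected layer that MTL shares.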
The randECOC using MTL w/ embedding, as given in Figure 3, mimics the setup of the randECOC framework using MTL, with an additional layer to include ECOC codewords as weights, for which the learning rate is set to zero. Note that as the random weight initialisation affects the training of all randECOC frameworks, we report the mean and the standard deviation of the testing accuracy from 5 independent runs.
All the networks, including the base networks, are optimised using the RMSProp optimiser with a learning rate of $3 \times 10^{-4}$, a squared gradient decay factor of 0.99, and a batch size of 64 images per training iteration for the base networks and 512 images for the randECOC experiments. The implementation is performed using the Deep Learning Toolbox and MatConvNet [54] within MATLAB, with a single NVIDIA GeForce GTX 1080 Ti 11GB graphics processing unit (GPU).
The GBDT frameworks are implemented using the official XGBoost, LightGBM and CatBoost Python packages on Google Colaboratory with the provided free Tesla K80 11GB GPU. In our experiments, we fine-tune the most vital hyper-parameter for these frameworks, which is the number of iterations, corresponding to the number of created trees. The highest validation accuracy has been obtained using 60 iterations for the CIFAR-10 and SVHN datasets and 300 iterations for CIFAR-100 in all of the employed gradient boosting methods. The rest of the hyper-parameters are set to the default values suggested by the corresponding authors.
For the set of experiments where the randECOC and GBDT frameworks are combined with ensemble averaging, we train 5 base networks, each of which is initialised by random weights, separately for all network architectures and dataset combinations. For each network architecture, we first evaluate the ensemble averaging performance with the 5 networks. Then, GBDT and randECOC approaches are applied to the features extracted from each base network, resulting in 5 ensembles in each case. The ECOC matrices used in all 5 networks are kept the same. Finally, ensemble averaging is applied to the 5 GBDT and 5 randECOC ensembles to obtain the final prediction.
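The final averaging step can be sketched as follows, assuming that each of the 5 ensembles emits one score per class for a test sample (for randECOC this could be the class scores or negated codeword distances; the exact quantity averaged is not stated here, so this is an assumption). The scores below are made up for illustration, and a majority vote is included only for contrast.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def average_prediction(score_sets):
    """Average per-class scores over ensemble members, then pick the best class."""
    n_members = len(score_sets)
    averaged = [sum(column) / n_members for column in zip(*score_sets)]
    return argmax(averaged)


def majority_vote(score_sets):
    """Each member votes for its own top class; the most frequent vote wins."""
    votes = Counter(argmax(scores) for scores in score_sets)
    return votes.most_common(1)[0][0]


# Hypothetical per-class scores from 5 ensemble members for one sample, 3 classes.
scores = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
    [0.40, 0.50, 0.10],
]

print(average_prediction(scores), majority_vote(scores))
```

In this made-up case three members narrowly prefer class 1, while the two confident members pull the averaged score towards class 0, so the two fusion rules disagree. Averaging uses the confidence of each member, which voting discards.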
To further validate our results, we carry out a final set of experiments with the PlantVillage dataset which is a large, crowd-sourced dataset of real-life diseased plant images.

D. Results
We report the result of a comparison of the evaluated frameworks in Section IV-D1 and the performance analysis of combinatory approaches in Section IV-D2.
1) Comparison of the Ensemble Frameworks: The performance of the 3 ensemble frameworks, together with their corresponding time complexity while using three base networks, is shown in Tables II, III, and IV for the CIFAR-10, CIFAR-100 and SVHN datasets, respectively. The performance is gauged in terms of classification accuracy, and the time complexity is measured as the training and test time spent over and above the time required by the base network. Note that, while the hardware is slightly different for the GBDT and ECOC frameworks, their time complexities are both accepted as small and no strict comparison is made between the two in terms of time.
Ensemble Averaging: As expected, the averaging ensemble achieves the highest accuracy for all datasets and base networks, with best accuracies of 97.69%, 89.91% and 97.75% for the CIFAR-10, CIFAR-100 and SVHN datasets, respectively. Despite surpassing the base network by a relatively high margin, this ensemble is very costly in terms of the time and resources required. Specifically, the training times are of the order of thousands of minutes, i.e. several days, a budget that is often not available to researchers; this provides the motivation for this work.
Gradient Boosting Decision Trees: Among the GBDT frameworks, none outperforms the others on all datasets and, more importantly, all three variations degrade the base network performance for at least one network architecture and dataset combination. This is a very important finding, providing clear evidence in support of the perceived instability and inconsistency of this technique, especially when dealing with a high number of classes. Specifically, for the CIFAR-10 and SVHN datasets, XGBoost appears as the best performing algorithm, as shown in Tables II and IV, respectively. It improves the testing accuracy of the base networks at the expense of minimal additional training time, with improvements of 0.83%, 0.30%, and 0.40% on CIFAR-10 and 0.89%, 0.20%, and 0.19% on SVHN. However, this method deteriorates the base network accuracy for all network types on CIFAR-100. For this dataset, the only GBDT improvement over the base network performance is achieved when using Xception as the architecture and employing LightGBM or CatBoost. This is in line with the theoretical underpinning of the inability of these methods to cope with a high number of classes [55].
randECOC Ensembles: We see that all the variants of the randECOC framework improve the testing accuracy over the base networks and the best GBDT approach¹, in almost all of our experiments. Despite some drop in the performance in comparison to the averaging ensembles, a much faster training

Among the MTL based randECOC ensembles, MTL w/ embedding always performs at least as well as MTL, with similar time complexity. The independent learning approach obtains the highest accuracy, but only by a slight margin over MTL w/ embedding and with substantially more additional training time (more than 20 times as much in all scenarios).
The strength of the MTL based randECOC approaches over GBDTs is most evident when dealing with a high number of classes. As shown in Table III for CIFAR-100, MTL w/ embedding improves the accuracy by 1.65%, 2.49%, and 0.55% over the base networks, and outperforms the best GBDT approach (LightGBM in this case) by 1.98%, 1.46%, and 1.35% for the three network architectures. Note also that the training time of LightGBM for this problem is greater than that of MTL w/ embedding.
2) Combinatory Approach - Ensemble Averaging of GBDT and randECOC Ensembles: As an important outcome of the comparative experiments presented in Section IV-D1, the averaging ensembles tend to achieve the highest accuracy for all base network and dataset combinations, at the cost of increased computational complexity. Under the assumption of adequate computational resources, we aim to further improve this accuracy by assisting the averaging process with GBDT and randECOC, as explained in Section IV-C.
The results of these experiments are provided in Table V. It can be observed that the GBDT+averaging approaches outperform the baseline averaging ensemble by only a slight margin, while the randECOC+averaging methods provide a higher performance improvement, ranging from 0.05 up to 2 percentage points, with the highest improvement observed for the CIFAR-100 dataset.
Although the best accuracies are obtained from randECOC using independent classifiers, the MTL based approaches follow closely, revealing better accuracy than GBDTs in all scenarios but one (Inception-V3 with SVHN), where the difference in performance with the best GBDT framework (XGBoost) is as small as 0.02%. The consistency in the improvement in accuracy, not only over the base network but also over the baseline averaging ensemble and the GBDT+averaging ensemble, renders randECOC+averaging the best performing classifier combination technique in the literature.
We would like to underline the fact that the GBDT and randECOC frameworks operate on the features extracted by the base networks; hence training the combinatory approach with these frameworks takes little additional time. For instance, training 5 randECOC ensembles on top of the 5 base networks takes only 21 minutes for the CIFAR-10 dataset, while training the 5 base networks takes 3160 minutes. The additional time thus corresponds to an overhead of about 0.66% (21/3160).
3) Experiments with Real-Life PlantVillage Dataset: Experiments on the PlantVillage dataset [41] are done using 5-fold cross-validation due to the lack of a designated test set. The experiments are conducted only for the GBDT and randECOC approaches using the Xception network, due to its favourable performance-size ratio, and the combinatory approach is applied on only one fold, due to large computational requirements.
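As a reminder of how a 5-fold protocol partitions a dataset without a designated test set, the sketch below builds train and test index lists for each fold. It is a plain shuffled split written for illustration; whether the actual experiments used stratification or a particular random seed is not stated, so `k_fold_indices`, `seed`, and the sample count are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k folds and return (train, test) index pairs.

    Every sample appears in exactly one test fold; fold sizes differ by at
    most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# Toy usage with 23 samples (illustrative size only).
splits = k_fold_indices(23, k=5, seed=0)
fold_sizes = [len(test) for _, test in splits]
```

The reported accuracy for each method would then be obtained by training on each `train` list and evaluating on the matching `test` list.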
Results on this dataset are shown in Table VI, where it can be observed that, while all performances are very close, all randECOC variants achieve higher accuracy than all GBDT variants in all folds. Moreover, the combinatory approach of randECOC achieves state-of-the-art results (99.81%) on this dataset, while Mohanty et al., Too et al., and KC et al. reported 99.34%, 99.75%, and 98.34%, respectively [56]-[58].

This is the first study to date which comprehensively analyses ECOC in relation to deep learning research, while proposing novel strategies to focus on the accuracy-complexity trade-off. The current state-of-the-art deep ensemble techniques in the literature are constructed either by averaging the outputs of multiple realisations of a deep network architecture, obtained by randomising or changing some of its constituent elements, or by employing gradient boosting decision trees (GBDT) on the features extracted from one fully trained network. Despite all its advantages in terms of performance gain, the increased time complexity that the averaging ensembles incur, which is shown to be in the order of days and weeks for problems involving a high number of classes, may make this method computationally infeasible or inefficient for users with limited resources. Even though GBDTs address this inefficiency, they have been shown to be unstable in terms of the improvement they offer over the base networks. In our experiments, we have shown that there exists no GBDT method which provides consistent improvement over the base accuracy for all architectures and datasets. Addressing the drawbacks of GBDTs, we have proposed and analysed three ECOC-based design techniques, which provide a reliable and stable improvement over the base network performance as well as the performance of GBDT under all settings. Moreover, two of the proposed designs achieve time complexity benefits similar to GBDTs.
The proposed design techniques are based on independent learning, multi-task learning (MTL) and multi-task learning with embedding (MTL w/ embedding). It has been shown that MTL w/ embedding always provides an accuracy equal to or greater than that of MTL, and both methods have a time complexity comparable to that of GBDTs. Independent learning provides the best performance among the ECOC based methods. However, the performance gain over the MTL based methods is marginal and comes with a time complexity trade-off, though this complexity is still much less than that of averaging. Therefore, for problems to be tackled with limited computational resources, we suggest that employing ECOC methods is the best strategy, with the choice of method to be made by the user depending on the specific requirements of the problem: MTL w/ embedding for the fastest training, or independent learning for relatively slower training but marginally better performance.
To offer solutions for scenarios where the available resources are not a limiting factor for the user, we have conducted experiments with simple averaging ensembles of GBDT and ECOC frameworks, and shown that the combinatory framework built using any of the ECOC methodologies achieves the best performance among all methods, at the expense of negligible additional training time.
In conclusion, the ECOC framework, either alone or in combination with the averaging methodology, appears to provide the most efficient ensemble learning approach. In the future, the feasibility of end-to-end training of the proposed design strategies using the ECOC framework will be explored for cases where time and space complexity are not a restriction.
and a pre-designed code matrix $M$ of size $K \times L$, as illustrated in Table I for $K = 4$ and $L = 5$. A particular element $M_{ij} \in \{+1, -1\}$ indicates the desired label for class $c_i$ to be used in training the base classifier $h_j$. For instance, in Table I, the base classifier $h_1$ is assigned the task of separating instances belonging to classes $c_1$ and $c_2$ from instances belonging to classes $c_3$ and $c_4$. The classes $c_1$ and $c_2$ are re-labelled with label $+1$, while $c_3$ and $c_4$ are re-labelled with label $-1$, to reflect this two-class problem.
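The re-labelling described above can be sketched in a few lines of Python. Since the entries of Table I are not reproduced in this text, the matrix below is hypothetical: only its first column follows the description of $h_1$ (classes $c_1$, $c_2$ against $c_3$, $c_4$), and the remaining columns are illustrative. The Hamming-distance decoder is a common ECOC decoding choice and is shown here as an assumption, not necessarily the decoding rule used in the proposed designs.

```python
# Hypothetical 4 x 5 code matrix (rows: classes c1..c4, columns: h1..h5).
# Only column 0 is taken from the text; the other columns are illustrative.
M = [
    [+1, +1, +1, +1, -1],  # c1
    [+1, -1, -1, -1, +1],  # c2
    [-1, +1, -1, -1, -1],  # c3
    [-1, -1, +1, -1, -1],  # c4
]


def relabel(labels, code_matrix, j):
    """Map multi-class labels (0-based class indices) to the +1/-1 targets
    used to train base classifier j."""
    return [code_matrix[y][j] for y in labels]


def hamming_decode(outputs, code_matrix):
    """Return the class whose codeword differs from the base classifier
    outputs (+1/-1 decisions) in the fewest positions."""
    distances = [
        sum(1 for m, o in zip(row, outputs) if m != o) for row in code_matrix
    ]
    return min(range(len(distances)), key=distances.__getitem__)


# Targets for the first base classifier: c1, c2 -> +1 and c3, c4 -> -1.
targets_h1 = relabel([0, 1, 2, 3, 0], M, 0)

# Codeword of c1 with the last base classifier decision flipped.
decoded = hamming_decode([+1, +1, +1, +1, +1], M)
```

In the second example the flipped decision still leaves the output closest to the codeword of $c_1$, which illustrates the error-correcting behaviour that motivates the framework.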

Fig. 1. An independent base classifier architecture with a 3-hidden layer shallow network, consisting of fully connected layers followed by rectified linear units, one for each base classifier of the ECOC ensemble. The input comprises the features extracted by the bottleneck layer of a trained base network.

Fig. 2. Multi-task learning architecture, with two shared modules and one classifier specific module. All layers are fully connected networks with rectified linear units.

Fig. 3. Multi-label architecture with embedded ECOC decoding, including two shared modules and one classifier specific module. The base classifier output layer is followed by the ECOC embedding layer with fixed weights. The output $o_i$ corresponds to the score of class $c_i$.
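One way to picture an embedding layer with fixed weights is as a linear map whose weights are the entries of the code matrix, so that each class score is the inner product between the class codeword and the base classifier outputs. The sketch below works under that assumption, which is not confirmed by the caption; the matrix values, the function name `ecoc_embedding`, and the example outputs are hypothetical.

```python
# Hypothetical 4 x 5 code matrix (rows: classes, columns: base classifiers).
M = [
    [+1, +1, +1, +1, -1],
    [+1, -1, -1, -1, +1],
    [-1, +1, -1, -1, -1],
    [-1, -1, +1, -1, -1],
]


def ecoc_embedding(h, code_matrix):
    """Compute class scores o_i = sum_j M[i][j] * h[j].

    h holds the real-valued base classifier outputs, assumed to lie in
    [-1, 1]; the weights of this layer are fixed to the code matrix.
    """
    return [sum(m * h_j for m, h_j in zip(row, h)) for row in code_matrix]


# Toy base classifier outputs that mostly agree with the first codeword.
h = [0.9, 0.8, 0.7, 0.2, -0.6]
scores = ecoc_embedding(h, M)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

Because the map is linear and fixed, gradients can flow through it to the base classifier outputs without adding trainable parameters, which is one plausible reason for embedding the decoding step in the network.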

• We perform an empirical comparison of the randECOC ensembles and state-of-the-art ensemble methods for deep learning, i.e. ensemble averaging and GBDTs, and show that the proposed MTL strategies provide the best time complexity versus accuracy trade-off.
• We propose a hybrid approach, combining randECOC strategies and ensemble averaging, to achieve state-of-the-art classification performance for all network and dataset combinations.
The article is structured as follows.

TABLE I. A sample ECOC matrix for a 4-class classification problem with 5 base classifiers.

TABLE II. Comparison of the results obtained on the CIFAR-10 dataset using the Inception-V3, Xception, and SENet architectures as base networks. The best results obtained in each group are shown in bold and the performance decreases compared to the base networks are shown underlined. The numbers in parentheses show the performance change compared to the base network.

TABLE III. Comparisons on the CIFAR-100 dataset using the Inception-V3, Xception, and SENet architectures as base networks. The best results obtained in each group are shown in bold.

TABLE IV. Comparisons on the SVHN dataset using the Inception-V3, Xception, and SENet architectures as base networks. The best results obtained in each group are shown in bold.
et al., Too et al., and KC et al. reported 99.34%, 99.75%, and 98.34%, respectively [56]-[58]. This is the first study to date which comprehensively analyses ECOC in relation to deep learning research, while proposing novel strategies to focus on the accuracy-complexity trade-off. The current state-of-the-art deep ensemble techniques in the literature are constructed either by averaging the outputs of multiple realisations of a deep network architecture, obtained by randomising or changing some of its constituent elements,

TABLE V. Test accuracies for the combinatory methods. The best result corresponding to each dataset and base network is shown in bold.

TABLE VI. 5-fold cross validation and the combinatory approach on the PlantVillage dataset using the Xception base network. The best results obtained in each group are shown in bold.