Building Ensemble of Deep Networks: Convolutional Networks and Transformers

This paper presents a study on an automated system for image classification, which is based on the fusion of various deep learning methods. The study explores how to create an ensemble of different Convolutional Neural Network (CNN) models and transformer topologies that are fine-tuned on several datasets to leverage their diversity. The research question addressed in this work is whether different optimization algorithms can help in developing robust and efficient machine learning systems to be used in different domains for classification purposes. To do that, we introduce novel Adam variants. We employed these new approaches, coupled with several CNN topologies, for building an ensemble of classifiers that outperforms both other Adam-based methods and stochastic gradient descent. Additionally, the study combines the ensemble of CNNs with an ensemble of transformers based on different topologies, such as Deit, Vit, Swin, and Coat. To the best of our knowledge, this is the first work in which an in-depth study of a set of transformers and convolutional neural networks in a large set of small/medium-sized images is carried out. The experiments performed on several datasets demonstrate that the combination of such different models results in a substantial performance improvement in all tested problems. All resources are available at https://github.com/LorisNanni.


I. INTRODUCTION
The use of Convolutional Neural Networks (CNN) has revolutionized the field of image recognition, achieving impressive results in a wide range of applications [20].
Researchers have explored various ways to improve the architecture of these networks, including the combination of specialized layers to form new topologies. However, as neural networks become deeper, they become susceptible to problems such as vanishing gradients and difficulties with optimization. In addition to enhancing the architecture of CNNs, finding robust and stable optimization algorithms is equally important to maximize performance [16]. Traditional optimization methods such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD) have been widely used, with Adam being the more popular method due to its ability to improve the network's search for a solution. This has led to the development of many Adam-based variants. In this study, we propose new Adam-based variants and compare them to other optimization methods. We also explore the potential of transformers for building ensembles and compare different transformer topologies. Our experiments show that ensembles based on the combination of different approaches outperform state-of-the-art results in all tested problems. Despite the complexity of the proposed system, it requires minimal parameter tuning and works well in various problems without the need for specific pre-processing or optimization for each dataset. All the tested code and datasets are available for reproducibility.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
The main contributions of this article are as follows:
i) We propose a set of new Adam-based variants useful during the training phase of the models.
ii) We define ensembles of different CNN models, trained with these new Adam-based variants, and different transformer topologies. Compared with existing machine learning methods, our ensembles provide competitive results in different domains.
iii) We provide an empirical evaluation of ensembles trained with standard and new Adam-based variants. Evaluating the performance on several metrics and datasets, we demonstrate that adopting different optimization algorithms and different network models is beneficial for ensembles.
iv) We introduce a testbed for evaluating Transformers ensembles, leveraging publicly accessible datasets, and showcasing a baseline ensemble. As a result, other researchers can readily compare their ensembles with our reported baseline, under the condition that they employ the same transformers with identical hyperparameters.
The remainder of the paper is structured as follows: Section II reports related works on the topic analyzed in this work. Section III introduces some basic notions about transformers, CNNs, the Adam approaches proposed in this work, as well as the datasets used for experiments. Section IV describes the experimental environments, the testing protocols, and the performance indicators; moreover, we suggest and discuss a set of experiments to evaluate our ensembles. In Section V, we discuss the value of our work providing more insights about the usefulness of the proposed approach. Section VI concludes this work and provides some further perspective on this research.

II. RELATED WORK

A. TRANSFORMERS IN IMAGE CLASSIFICATION
In image classification, Convolutional Neural Networks (CNNs) have been state-of-the-art for many years, and models such as ResNet [20] or EfficientNet [48] are still considered baseline approaches in many works. However, the recent introduction of transformers by Vaswani et al. [47] has changed this trend. Transformer models have achieved disruptive performance in the natural language processing field, becoming the leading approach [6], [12], [28]. The improvement is attributed to the multi-head self-attention. After the encoding of words in tokens, the self-attention is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) V \tag{1}$$
where $Q \in \mathbb{R}^{N \times D_k}$ is the query matrix, $K \in \mathbb{R}^{M \times D_k}$ is the key matrix, and $V \in \mathbb{R}^{M \times D_v}$ is the value matrix. $D_k$ is the dimension of the keys, with $N$ and $M$ denoting the lengths of queries and keys. The multi-head self-attention uses $H$ different heads computing the attention, as in Eq. 1, and concatenating the output.
In image classification, Dosovitskiy et al. [14] introduced the Vision-Transformer (ViT) as the first pure transformer model to achieve results equivalent or superior to CNNs. ViT utilizes the encoder from Vaswani's original work [47] and stacks a multilayer perceptron to classify features extracted by the encoder. In this approach, the input tokens are image patches encoded in a latent space. However, training the ViT network is challenging due to its lack of inductive biases compared to CNNs [14], [46]. Moreover, the network struggles to aggregate local information. To address these challenges, ViT requires pre-training on a dataset with millions of images, namely the JFT-300M dataset.
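As an illustration of the self-attention operation of Eq. 1, the following minimal sketch computes scaled dot-product attention for a single head in plain Python. The function names and the toy matrices are illustrative assumptions, not code from the released repository; a real model would use batched tensor operations and learned projections for Q, K, and V.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(D_k)) V.

    Q is N x D_k, K is M x D_k, V is M x D_v (lists of lists).
    Returns an N x D_v matrix.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(D_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Toy example with made-up numbers: N = 2 queries, M = 3 keys, D_k = D_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Because the softmax weights are non-negative and sum to one, every output row lies between the smallest and largest value rows; a multi-head layer would run H such computations on different projections and concatenate the results.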
A solution to ViT data issues is proposed in [46]. Their approach involves distilling knowledge acquired from a convolutional neural network, combined with actual ground-truth values, to enable transformers to learn the inductive bias from the distillation. Additionally, the authors introduced a ''distillation token'' to enhance the distillation process and lead to better performance with simpler training. The model is named Data-efficient image Transformer (DeiT).
Swin [29], one of the top transformer models inspired by ViT, utilizes a hierarchical window structure to address scale-variation issues. The attention is computed within the local window, resulting in improved performance and execution time. The partitioning is gradually shifted along the network hierarchy to maintain interaction among different windows. The authors also demonstrated the effectiveness of Swin as a backbone in various vision tasks, including semantic segmentation.
Transformers have different characteristics compared to CNNs. According to [38], CNN models make decisions mainly based on texture, while ViT weights the shape more. Therefore, some authors have considered merging the two. An example is CoAtNet [10], which merges depthwise convolution with self-attention by stacking them vertically. With this approach, the model leverages the inductive bias of CNNs together with the capacity and attention mechanism of Transformers. CoAtNet may be considered a hybrid approach since it adopts a convolutional structure while inserting self-attention from [47].

B. ENSEMBLE OF IMAGE TRANSFORMERS
Vision transformers are gaining more attention for their ability to recognize long-range dependencies and their capacity [19], [27]. Despite the impressive results obtained by deep learning models, their performance can be further boosted through an ensemble of different models [26] or the same model with different weights [31], [40]. Each model of the ensemble can learn a representation that concentrates on particular aspects or representations of images, improving the final prediction, as demonstrated by the work of Savelli et al. [40] that improved small lesion detection using an ensemble of CNNs trained on different views of the same lesion. Kyathanahally et al. [26] used an ensemble of six CNNs to improve plankton classification by averaging the predictions of the components, following the demonstration by d'Ascoli et al. [18] that an average ensemble can reduce overfitting.
Despite the impact that vision transformers have had on image classification, only a few works have exploited their benefits in an ensemble approach [25], [35], [41]. In [21] the authors compared a vision transformer ensemble with a CNN ensemble for brain tumors. However, they found that the ensemble of vision transformer models performed worse than the convolutional ensemble. In [41] the authors proposed a model that exploits both ViT and convolutions, named Cvit, and created an ensemble of two models trained on the frequency domain and time domain to classify movement from EMG signals. The authors compared the proposed solution only with convolutional baselines and ViT. Recent work from Kyathanahally et al. [25] showed astonishing results in image classification among different datasets with an ensemble of three DeiT models. In [35], the authors achieve enhanced ensemble performance for polyp segmentation by combining diverse transformer models (including HarDNet-MSEG, Polyp-PVT, and HSNet) through distinct training techniques and introducing a novel mask averaging method, resulting in superior segmentation results across multiple datasets. ETECADx [1] is an AI-based computer-aided diagnosis (CAD) framework for early breast cancer detection, which combines convolutional neural networks and the self-attention mechanism of the vision transformer encoder. Another study [2] introduces a masked face recognition system, combining two Convolutional Neural Network (CNN) models and two Transformer models, which achieve better accuracy when ensembled, outperforming other models in recognizing masked faces.
In our work, we propose to combine different transformers and convolutional models, demonstrating their superiority on various datasets. Each model contributes to the final prediction through the weighted sum rule. Thus, the vision Transformers and CNNs are trained separately and their outputs are merged together. Similarly to [2] and [25], we find that Transformer models, exploiting the attention mechanism, boost the ensemble results when merged with CNNs.

C. ADAM OPTIMIZERS
A fundamental role in deep learning is played by optimizers. An optimizer in machine learning is a mathematical algorithm or technique that adjusts the internal parameters of a model to minimize or maximize a specified objective function, improving the model's performance on a given task. One of the most used optimizers is Adam [23]. It is used for gradient-based optimization of stochastic objective functions. It is a combination of two other optimization algorithms, namely, Adagrad [17] and RMSprop [23]. The key idea behind Adam is to adaptively estimate the first and second moments of the gradients in order to perform effective optimization.
Some authors proposed variations of Adam, creating new optimizers. An example is DiffGrad [16], proposed in 2019. It relies on the assumption that a decrease in the rate of gradient variations suggests the presence of a global minimum. The objective is to generate substantial strides when the gradient is undergoing large changes while taking smaller steps when the gradient is changing more gradually. Overall, DiffGrad has shown promising results in experiments on various image classification and object detection tasks, often outperforming Adam and other popular optimization methods in terms of both speed and accuracy.
In [37], three more Adam variations were proposed, namely DGrad, Cos and Exp. DGrad is a modified version of DiffGrad which considers the absolute difference between the current and the moving average of the element-wise squares of the parameter gradients. In this way, it is more robust to fluctuations in the difference between the gradients.
Cos is a modification of DGrad that incorporates a learning rate that varies in a cyclical manner [43], leading to an improvement in classification accuracy and typically requiring fewer iterations.
Exp is a modification of DGrad that includes two simple element-wise operations: product and exponential. The purpose of the Exp formulation is to mitigate the effect of large variations in the gradient but also to allow the function to converge for small values.
BAS-ADAM [22] is an improved version of the Beetle Antennae Search (BAS) algorithm, enhancing convergence behavior and avoiding local minima by adaptively adjusting step sizes using the ADAM update rule, resulting in faster convergence and efficient optimization of non-convex functions compared to Particle Swarm Optimization (PSO) and the original BAS algorithm.
A recent Adam variation is AngularGrad [39], which exploits the direction/angle of consecutive gradients to adjust the learning rate. Thanks to angle direction, the optimization becomes smoother while keeping a good trade-off between speed and performance.
In this paper, three more Adam variations are proposed. Their objective is to search the solution space differently and vary the prediction made by the models. Changing the optimization creates models suitable for an ensemble.

III. MATERIALS AND METHODS
In this section, we describe the different components of the proposed ensemble and we detail the new Adam variants proposed in this study.

A. CONVOLUTIONAL NEURAL NETWORKS
CNNs are a type of deep neural network that was specifically designed for image classification, computer vision, and other related applications, such as medical image analysis [9], face identification, and object recognition, among others. CNNs are designed to operate in a similar way to the human brain by perceiving visual information [24]. Recently, the combination of these models in ensembles has been proven to be beneficial in terms of performance (see for instance, [8], [36], [40]).
Convolution involves sliding a small filter (also known as a kernel) over the input data, which is typically a two-dimensional grid of pixels in the case of images. The element-wise multiplication is performed between the values of the given filter (learnable weights) and the corresponding values in the input data. The resulting products are summed up to produce a single value. This process is repeated by sliding the filter over the entire input, with a certain stride, to produce a new output matrix called a feature map.
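The sliding-window operation described above can be sketched in a few lines of plain Python. This is only an illustrative example: the function name, the toy image, and the hand-written edge kernel are assumptions, and no padding is applied, whereas the layers of the tested networks use learned kernels and optimized tensor routines.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map.

    Both arguments are lists of rows. At each position the kernel and the
    underlying image window are multiplied element-wise and summed.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    feature_map = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map

# Toy 4 x 4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written kernel that responds to a dark-to-bright horizontal transition.
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)
```

In this example the feature map is non-zero only in the column where the window straddles the edge, which is the kind of simple pattern detection that the first convolutional layers tend to learn.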
The convolution operation allows the neural network to detect patterns or features present in the input data. These patterns can be as simple as edges or corners in the case of images. As the network learns through training, it adjusts the weights of the filters to capture increasingly complex and abstract features, such as textures, shapes, or object parts. Convolutional layers are typically stacked in CNN architectures, and each layer learns to recognize different levels of features. This hierarchical feature extraction enables CNNs to understand the content of images in a way that is analogous to how the human visual system processes information [7].
In our experiments, various models pre-trained on ImageNet are tested and combined. The last layers of each model are modified to fit the number of classes of the target problem without freezing the weights of the previous layers. The models evaluated include ResNet50 [20], which is about 8 times deeper than VGGNet [20] and uses residual layers and global average pooling layers instead of fully connected layers, and EfficientNetB0 [45], which is designed for mobile devices and uses a multi-objective network search that optimizes accuracy.

B. VISION TRANSFORMERS
In our experiments, we utilized advanced transformer models for image classification. Due to the significant amount of data required for training transformers from scratch, we adopted models already pre-trained on ImageNet, which were then adapted to the task at hand through fine-tuning. These pre-trained models were obtained from the Timm library, including DeiT-Base with a patch size of 16, ViT-Base with a patch size of 16, Swin-Base with a patch size of 4, and CoAtNet with a continuous log-coordinate relative position bias removed. It is worth noting that the ViT implementation in Timm was pre-trained on ImageNet-21k, whereas the others were not.
For each model, we adjusted the last layer to align the output with the number of categories in each tested dataset. To prevent overfitting, we retained a validation split with a 0.25 split ratio from each training set and resized the images to 224 × 224 to match the required input dimension.

C. TRAINING AND TEST PHASES
During the training phase, each CNN is trained by adopting an optimization algorithm that is chosen at random among the ones available. The training process includes 20 epochs with a mini-batch size of 30 patterns and a learning rate of 0.001. Data augmentation is applied by flipping and rescaling the images, but only if the size of the training set is less than 5000 images. Otherwise, no data augmentation is performed.
Transformers are trained following the pipeline adopted for the DeiT model in [25], replicating the procedure for all four models. Specifically, we trained the models using the AdamW optimizer with cosine annealing [30], and set a low learning rate of $10^{-4}$ coupled with a weight decay of 0.03 to preserve the learned network. We set the batch size to 32.
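To give an idea of how a cosine annealing schedule shapes the learning rate, the sketch below evaluates the standard single-cycle cosine formula starting from the $10^{-4}$ rate mentioned above. The total number of epochs (20) and the minimum rate (0) are assumptions made only for the illustration, since they are not stated for the transformer pipeline, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Learning rate at `epoch` for a single cosine cycle from lr_max to lr_min."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Assumed horizon of 20 epochs, used here only for illustration.
total_epochs = 20
schedule = [cosine_annealing_lr(e, total_epochs) for e in range(total_epochs + 1)]
```

The rate starts at its maximum, decays slowly at first, drops fastest around the middle of training, and flattens out again near the end, which keeps the late updates small and helps preserve the pre-trained weights.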
During the test phase, the outputs of all the models are combined to compute the overall prediction. Thus, each model of the ensemble contributes to the final prediction. The output of the models is not automatically mapped into a probability distribution. Thus, we applied a softmax function to the score of the last layer before creating the ensemble prediction.
As a loss function, we used the standard cross-entropy, as opposed to [25], which used the weighted cross-entropy. If the validation F1 score did not improve for five consecutive epochs, we decreased the learning rate to $10^{-5}$ and then to $10^{-6}$ to improve convergence. The model with the highest validation F1 score during training was saved. During training, data were randomly flipped or rotated. To create the ensemble, we repeated the training procedure ten times for each model and each dataset.
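The test-phase fusion described above (softmax on the last-layer scores, followed by a sum over the models) can be sketched as follows. The function names and the logits are made up for the example; the sketch uses an unweighted sum and is not taken from the released code.

```python
import math

def softmax(scores):
    """Map raw last-layer scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sum_rule(model_scores):
    """Fuse several models by summing their softmax outputs.

    `model_scores` holds one list of raw class scores per model for a single
    image. Returns the fused score vector and the index of the winning class.
    """
    n_classes = len(model_scores[0])
    fused = [0.0] * n_classes
    for scores in model_scores:
        for c, p in enumerate(softmax(scores)):
            fused[c] += p
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted

# Made-up raw scores of three models for one image and three classes.
logits = [
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
    [1.2, 1.0, 0.2],
]
fused, predicted = sum_rule(logits)
```

Here two of the three models individually prefer class 0, yet the fused decision is class 1 because the second model is much more confident; the softmax step is what makes the scores of different networks comparable before they are added.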
In Figure 1, we show how the training and test phases are organized. This can be considered as a testbed for ensembles that we use for our experiments, combining different CNN and transformer topologies on seven publicly available datasets. However, this structure is general and can be used to combine any number of topologies.

D. DATASETS
We assess the proposed ensembles by adopting several image classification benchmarks. Table 1 reports some information for each dataset: a short name, the number of classes and samples, the testing protocol, and the original reference. For the testing protocols, we adopt the following abbreviations:
• xCV indicates that an x-fold cross-validation has been adopted (e.g., 10CV means that a 10-fold cross-validation has been used);
• X:Z indicates the fractions of the dataset used for training and testing respectively.
The adopted datasets contain information from very different domains, in particular:
• WHOI consists of images taken from Woods Hole Harbor water using Imaging FlowCytobot;
• ZooScan is composed of grayscale images that were acquired from the Bay of Villefranche-sur-mer using the Zooscan technology. The images were automatically cropped before classification to remove any artifacts caused by manual segmentation.
• Kaggle dataset is a subset of a larger dataset obtained using the ISIIS technology in the Straits of Florida and was used for the National Data Science Bowl 2015 competition.
• Zoolake is a dataset of microscopic plankton taxa images collected from the dual-magnification Scripps Plankton Camera in Lake Greifensee.The dataset was built using images acquired from wild plankton from 2018 to 2020.
• BG (Breast Grading Carcinoma) dataset contains images of size 1280 × 960 pixels and is divided into three classes representing grades 1-3 of invasive ductal carcinoma of the breast.
• Deng contains images of pests commonly found on plants between Europe and Central Asia.It was created by collecting images from several online sources such as Insert Images, IPM images, Dave's Garden, and Mendeley Data.
• VIR comprises a total of 1500 Transmission Electron Microscopy images, each with a size of 41 × 41 pixels, of various virus types classified into fifteen distinct categories.
• TEM (Transmission Electron Microscopy) contains annotated transmission electron microscopy images (size of 1376 × 1032 or 2048 × 2048 pixels, depending on the electron microscope used to capture them) of 14 virus classes along with extracted image patches centered on virus particles.

E. NEW OPTIMIZATION METHODS
In this section, we introduce three new optimization methods.
To better understand the proposed optimizers, it is worth recalling the Adam optimization algorithm and DGrad [37].
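124966 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The update rule for Adam can be written as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{2}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{3}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{4}$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{5}$$
$$\beta_1 = 0.9; \quad \beta_2 = 0.999; \quad \alpha = 0.001 \tag{6}$$
where $m_t$ is the first moment (mean) of the gradients up to time step $t$; $v_t$ is the second moment (uncentered variance) of the gradients up to time step $t$; $g_t$ is the gradient at time step $t$; $\beta_1$ and $\beta_2$ are the decay rates for the first and second moments, respectively; $\alpha$ is the learning rate; $\epsilon$ is a small constant added for numerical stability; $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the first and second moments, respectively; $\theta_t$ is the current set of parameters at time step $t$.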
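For readers who prefer code, the standard Adam update recalled above can be sketched in plain Python as follows. The function name, the value of eps (1e-8, a common default that is not stated in the text), and the toy objective f(x) = x^2 are assumptions made for the illustration; the new variants proposed in this paper additionally scale the step by a weighting factor, which this sketch does not include.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update to a list of parameters.

    `t` is the 1-based step index. Returns the updated parameters together
    with the updated first and second moment estimates.
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second moment
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy problem (assumption): minimise f(x) = x^2, whose gradient is 2x.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Thanks to the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason why Adam needs little tuning across problems.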
DGrad [37] is a variation of Adam inspired by DiffGrad [16]. It considers the absolute difference between the current and the moving average of the element-wise squares of the gradients ($avg_t$), see Eqs. (7)-(9). In this way, it is more robust to fluctuations in the difference between the gradients. For updating the parameters, Eq. (10) is applied using the definition of the weighting factor reported in Eq. (9).
In the following paragraphs, the novel optimizers proposed in this paper are described in detail.

1) HYPERBOLIC OPTIMIZER (HYP)
The first proposed optimizer is Hyperbolic (Hyp). Its behavior is similar to that of Exp [37], but it does not mitigate the effects of large variations in the gradient. Indeed, while the function described in Exp takes low values for large gradient differences, this approach simply calculates parameters as follows: With $a = 10$, $b = 2/3$, and $c = 3/2$ we obtain the plot reported in Figure 2. The calculation of the final weighting factor of the learning rate is: then, the weighting factor of Eq. (12) is used in Eq. (10).

2) MIND OPTIMIZER
The second novel optimizer is called MinD and exploits two of the best optimizers in different ways. At every iteration, it calculates $\xi_{t1}$ using DGrad and $\xi_{t2}$ using Exp. Then, the effective $\xi_t$ is chosen by applying: then, Eq. (13) is applied in Eq. (10).

3) ANGULAR INJECTION OPTIMIZER (AI)
The Angular Injection (AI) optimizer is based on AngularGrad [39] and injection [15]. It generates a score to control the step size based on the gradient angular information of previous iterations. AI takes into account the information from the angle/direction of the gradient vector instead of just its magnitude. To exploit the change of gradients during the optimization steps, an angular coefficient was introduced (Eq. (14)): $A_{\min} = \min(A_t, A_{t-1})$ (15) where $a = 10$, $b = 2/3$, and $c = 3/2$. The weighting factor is: then, Eq. (17) is applied in Eq. (23).
In order to utilize the curvature information during optimization, the curvature information guided (weighted) second-order momentum is injected into first-order momentum: The value delta in Eq. (21) represents the short-term difference of the parameters; the weighting factor $\xi_t$ in Eq. (23) is the one defined in Eq. (17); $\beta_1$ and $\beta_2$ are initialized as in Adam.
In this approach:
• in the first iteration, we use standard Adam;
• in the second iteration, we fix $\xi_t = 1$ in Eq. (23) and use $avg_t$ and $avgsq_t$ calculated using the gradients obtained by standard Adam in the first iteration (i.e., using Eqs. (2) and (3));
• from the third iteration onward, we proceed as in the method explained above.

F. WILCOXON SIGNED-RANK TEST
The Wilcoxon signed-rank test, introduced by Frank Wilcoxon in 1945 (the related rank-sum test was later generalized by Mann and Whitney in 1947 [32]), is a statistical test designed for comparing paired data samples obtained from individual evaluations. Unlike parametric tests, this non-parametric test does not rely on assumptions about the underlying distribution of the data, such as a normal distribution. Instead, the Wilcoxon signed-rank test takes into account both the magnitudes and signs of the differences between paired observations. This test serves as a non-parametric counterpart to the paired Student's t-test and is particularly useful when the population data does not follow a normal distribution. Its primary purpose is to assess whether two related paired samples are drawn from the same distribution. By analyzing the ranks assigned to the differences between the paired observations, the Wilcoxon signed-rank test provides a reliable means of evaluating the null hypothesis.
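A minimal sketch of the test, written in plain Python for small samples, may help to make the ranking procedure concrete. The function name and the paired accuracies below are made up for illustration and are not results from this paper; the exact p-value is obtained by enumerating all sign patterns, which is only practical for a handful of pairs (statistical packages provide faster implementations).

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Zero differences are discarded and tied absolute differences receive
    average ranks. Returns (W_plus, W_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis all 2^n sign patterns are equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat:
            extreme += 1
    return w_plus, w_minus, extreme / 2 ** n

# Made-up accuracies (in percent) of two methods on eight datasets.
method_a = [95, 90, 92, 88, 97, 91, 85, 93]
method_b = [93, 89, 88, 87, 90, 85, 80, 84]
w_plus, w_minus, p_value = wilcoxon_signed_rank(method_a, method_b)
```

In this toy case the first method is better on every dataset, so only the two most extreme sign patterns out of 256 are at least as extreme as the observed one, and the null hypothesis would be rejected at the usual 0.05 level.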

IV. EXPERIMENTAL ANALYSIS
In this section, we report the results of the experimental evaluation, considering different performance indicators for comparing the approaches. Moreover, validation of the superiority (using p-value) of one method over the others is provided by the Wilcoxon signed-rank test. All the experiments were run on a Windows Server 2019 machine with an Intel Core i9-10920X CPU (3.5 GHz), 256 GB of RAM, and an Nvidia Titan RTX GPU (24 GB, 1350 MHz). The code was developed in Matlab 2022a and PyTorch.
Since CNNs require input images at a fixed size, we apply two different strategies for resizing plankton images. The first is sqr, in which the process of square resizing involves first padding the image to achieve a square dimension and subsequently resizing it to match the CNN input size. The second one is padding only (pad), in which the image is directly adjusted to match the CNN input size, without the intermediate square resizing step. It is important to note that the square resizing step becomes necessary only in specific scenarios where the original image dimensions exceed the dimensions of the CNN input size, prompting the image to undergo resizing. Padding is performed by adding white pixels to plankton images. Since we propose ensembles, half of the nets (in each ensemble) use sqr and the other half pad.
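The two resizing strategies can be illustrated with the following sketch on grayscale images stored as lists of rows. Several details are assumptions made for the example and are not specified in the text: the content is centred on the white canvas, nearest-neighbour interpolation is used for resizing, and the pad strategy is shown only for images that already fit within the target size. All function names are hypothetical.

```python
WHITE = 255

def pad_to(image, height, width, fill=WHITE):
    """Centre `image` on a height x width canvas filled with `fill` pixels."""
    h, w = len(image), len(image[0])
    top, left = (height - h) // 2, (width - w) // 2
    canvas = [[fill] * width for _ in range(height)]
    for r in range(h):
        for c in range(w):
            canvas[top + r][left + c] = image[r][c]
    return canvas

def resize_nearest(image, size):
    """Nearest-neighbour resize of `image` to size x size."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def sqr(image, size):
    """Pad to a square with white pixels, then resize to the network input size."""
    side = max(len(image), len(image[0]))
    return resize_nearest(pad_to(image, side, side), size)

def pad(image, size):
    """Pad directly to the network input size (image assumed not larger than it)."""
    return pad_to(image, size, size)

# Toy 2 x 4 "plankton" image and an assumed 8 x 8 network input.
plankton = [[0, 10, 20, 30],
            [40, 50, 60, 70]]
squared = sqr(plankton, 8)
padded = pad(plankton, 8)
```

With sqr the object is magnified to fill the input while keeping its aspect ratio, whereas with pad it keeps its original pixel size and is surrounded by white, so the two halves of an ensemble see the same specimen at different scales.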
Our experiments involved a large number of training sessions applied to multiple topologies. As is widely known, the behavior of the training loss should be carefully analyzed to check that the training phase was properly run and to measure the model convergence towards the desired task. In our context, a key element that should be highlighted is the influence of the optimizer on the behavior of the loss function during training. Figure 3 reports the loss while training a ResNet50 network on the Deng dataset for 20 epochs: in (a) the Adam optimizer was used, while in (b) and (c) the proposed Hyp and AI methods were employed, respectively. As can be seen, the novel optimizers lead to a more homogeneous and faster convergence with respect to the Adam optimizer.
In all the experiments, each CNN and transformer network was trained several times using the standard cross-entropy loss. Each training leads to a network instance. Ensembles are created by composing several instances and/or topologies. In the following, we denote with A+B the composition of networks A and B by sum rule. In the case of CNNs, each network is trained seven times, so A+B means that both networks A and B were trained seven times on the same dataset, and the resulting 14 instances are combined by sum rule.
In the following, results labeled with SGD refer to the output of the fusion of 14 stand-alone CNNs trained using stochastic gradient descent and combined by the average rule. This ensemble is built in this way to ensure that it is comparable with the ensembles obtained by Hyp+MinD (ensembles of size fourteen).

A. ENSEMBLES OF CONVOLUTIONAL NEURAL NETWORKS
We first ran a set of experiments to get the performance of the ensemble of CNNs when using different Adam variants during the training phase. This was also useful for a comparison with the original methods that are reported as baselines. The results are reported in Table 2. In particular, we compare several Adam variants with the original Adam and SGD; for stand-alone Adam variant approaches, there are two values in each cell of the table:
• Average accuracy of seven stand-alone CNNs trained with the given optimization method;
• Fusion by average rule of seven stand-alone CNNs trained with the given optimization method.
From the results in Table 2 it can be noticed that the performance of the ensembles with the fusion by average rule is always better than the average accuracy of the stand-alone networks, providing another piece of evidence that the adoption of the ensemble is beneficial for the performance of the system. In Table 3, we report the performance on the plankton datasets of the most interesting Adam variants. The Adam variants reported in Table 3 outperform those not reported with a p-value of 0.001. However, all the Adam variants shown in Table 3 have similar performance. Moreover, in Tables 2 and 3 it can be noticed that Hyp+MinD outperforms both ensembles based on Adam and SGD with a p-value of 0.0001. Hyp+MinD+AI outperforms Hyp+MinD with a p-value of 0.005.

B. ENSEMBLES OF TRANSFORMERS
The second set of experiments focuses on transformers; this was useful to get the performance of the ensembles and to compare them with the stand-alone original methods. We also tested transformer models trained with Adam variants, but we noticed no particular gains; therefore, for the sake of space, we have not reported them.
In Tables 4 and 5 we report the results obtained by the transformers. The following methods are reported in the tables, where x represents the number of combined models:
• D(x), sum rule among x Deit, that is, x instances of the Deit transformer trained on the same dataset and combined by sum rule;
• S(x), sum rule among x Swin;
• V(x), sum rule among x Vit;
• C(x), sum rule among x Coat;
• CNNs is the ensemble of Hyp+MinD+AI coupled with both ResNet50 and EffNetB0;
• (D+V+S+C)(x), sum rule among x Deit, x Swin, x Vit and x Coat;
• CNNs+(D+V+S+C)(x), sum rule among CNNs, x Deit, x Swin, x Vit, and x Coat. Before the fusion, the scores of each topology are normalized by the number of networks in the given ensemble, so the weight of CNNs is equal to that of C, S, V, and D.
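The normalization used in the last fusion of the list above can be sketched as follows: the summed scores of each topology are divided by its number of networks before the final sum, so that a large group does not dominate a small one. The function names and the probability vectors are made up for the example and are not results from the paper.

```python
def fuse_normalized(groups):
    """Sum rule with per-topology normalization.

    `groups` maps a topology name to the list of per-network probability
    vectors for one image. Each topology contributes the mean of its
    networks, hence all topologies carry the same weight.
    """
    n_classes = len(next(iter(groups.values()))[0])
    fused = [0.0] * n_classes
    for nets in groups.values():
        for c in range(n_classes):
            fused[c] += sum(p[c] for p in nets) / len(nets)
    return fused

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up example: four CNN instances and a single DeiT instance, two classes.
groups = {
    "CNNs": [[0.6, 0.4]] * 4,
    "D": [[0.3, 0.7]],
}
fused = fuse_normalized(groups)
normalized_prediction = argmax(fused)

# Plain (unnormalized) sum over all five networks, for comparison.
plain = [sum(p[c] for nets in groups.values() for p in nets) for c in range(2)]
plain_prediction = argmax(plain)
```

In this toy case the plain sum follows the four CNN instances and selects class 0, whereas the normalized fusion gives the transformer the same weight as the whole CNN group and selects class 1.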
Interestingly, there is no winner among stand-alone transformers and among the ensembles of transformers. However, it is interesting to notice once again that each set of 10 transformers outperforms its stand-alone transformer with a p-value of 0.002 (e.g., D(10) outperforms D(1) with a p-value of 0.002). On the other hand, combining different transformer topologies does not seem as useful as in the CNN case. For instance, (D+V+S+C)(2) behaves similarly to V(10) (sets of similar size); (D+V+S+C)(10) outperforms (D+V+S+C)(2) with a p-value of 0.002, but this comparison is not fair given the different sizes of the two ensembles. Moreover, (D+V+S+C)(10) outperforms CNNs with a p-value of 0.002.

C. ENSEMBLES OF CNNS AND TRANSFORMERS
In the last rows of Tables 4 and 5, we report the results obtained by ensembles of CNNs and transformers. The small gain from ensembles of different topologies is probably due to the saturation of the models' performance on the datasets considered. In Table 6, we compare the best-performing approaches, using the error under the ROC curve as the performance indicator. The fusion of CNNs and transformers outperforms both the CNNs alone and the transformers alone with a p-value of 0.01.
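As a rough illustration of this indicator, the sketch below computes the error under the ROC curve for a two-class problem, assuming it is defined as 100 × (1 − AUC). That definition, the function names, and the toy data are assumptions made for this example; a multi-class problem would additionally require an averaging scheme, which is not shown.

```python
from typing import Sequence


def auc_binary(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def error_under_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Error under the ROC curve in percent, assumed here to be 100 * (1 - AUC)."""
    return 100.0 * (1.0 - auc_binary(labels, scores))


# Toy data: 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(error_under_roc(labels, scores))  # 25.0
```

Lower values are better under this definition, which is consistent with the tables reporting the indicator as an error.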
We also tried to increase the size of the ensemble named "CNNs" by adding the DenseNet201 and MobileNetV2 topologies. The results are reported in Tables 7 and 8. To limit computational time, we ran this test only on the ZooLake and Deng datasets. Interestingly, while these additional CNN topologies improve the performance of the CNNs ensemble, the performance of the fusion of CNNs with the transformer ensemble remains similar. This suggests that these approaches may have reached a performance plateau.

D. COMPARISON WITH SOTA AND ELAPSED TIME
We compared our approach against state-of-the-art (SOTA) approaches reported in the literature. Our proposed ensemble outperformed the SOTA on many of the tested datasets. As the datasets we considered have been used in hundreds of articles, we report only a few articles in which SOTA results were obtained on those datasets, to avoid creating huge tables. In addition, we report only articles adopting the same test protocol we used; we found other articles reporting better results, but they use different protocols, which would make the comparison unfair. Table 9 reports these results. For [42] we report the results obtained after 100 epochs, for consistency with our test protocol.
It is clear that our approach is not suitable for problems with strong computational constraints. We trained a large set of networks and achieved very good performance; however, we did not tune any hyperparameters to optimize performance on specific datasets, in order to avoid overfitting and preserve the generality of the proposed approach. Nevertheless, with current GPUs, even ensembles of more than 10 networks can classify dozens of images per second. In Table 10 we report the inference time, which is higher when vision transformers are used; however, considering that these are able to analyze hundreds of images in a second, this is a reasonable time for many applications.

V. DISCUSSION ON ADAM VARIANTS
In this work, we have developed several optimization methods that address the problem of finding a good minimum in different ways, so it is useful to combine them in an ensemble. The proposed approaches consider the absolute difference between the current value and the moving average of the squares of the parameter gradients. This makes them more robust than Adam and DiffGrad to fluctuations in the difference between the gradients of different iterations.
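For context, the DiffGrad baseline mentioned above scales the Adam step by a sigmoid of the absolute change between consecutive gradients. The following minimal sketch shows only that coefficient for a single scalar parameter; it is not one of the proposed variants, and the helper names and default values are hypothetical.

```python
import math


def diffgrad_coefficient(g_prev: float, g_curr: float) -> float:
    """Sigmoid of the absolute gradient change, as used by DiffGrad.
    Equals 0.5 when the gradient does not change and approaches 1
    as the change between iterations grows."""
    return 1.0 / (1.0 + math.exp(-abs(g_prev - g_curr)))


def scaled_step(lr: float, m_hat: float, v_hat: float, xi: float,
                eps: float = 1e-8) -> float:
    """Adam-style step for one parameter, multiplied by the coefficient xi.
    m_hat and v_hat are the bias-corrected first and second moments."""
    return lr * xi * m_hat / (math.sqrt(v_hat) + eps)


# Unchanged gradient: the step is halved relative to plain Adam.
xi_flat = diffgrad_coefficient(0.3, 0.3)
# Large gradient change: the step is close to the plain Adam step.
xi_steep = diffgrad_coefficient(0.0, 2.0)
print(xi_flat, xi_steep)
print(scaled_step(lr=1e-3, m_hat=0.1, v_hat=0.01, xi=xi_flat))
```

Because the coefficient depends on a single pair of consecutive gradients, it reacts directly to iteration-to-iteration fluctuations, which is the sensitivity the proposed approaches aim to reduce.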
As detailed in the literature (e.g., [15]), an ideal parameter optimization method should follow the rules depicted in Figure 4, where is the parameters tensor, delta is the difference of the parameters between two training iterations (see Eq. 21), and g are the gradients. In the flat region (S1), an ideal optimizer should take a large step in order to escape from the flat area. In the so-called "large gradient, small curvature" area (S2), a large step size is important for faster convergence. In the "steep and narrow valley" (S3), a minimum is found. In this area both the gradients and delta are large, so a small step size is required to find the minimum and reduce oscillations.
Taking this into consideration, the three proposed optimizers have the following properties:
• the Hyp optimizer is based on DGrad/DiffGrad, so the idea is that the gradient changes more gradually near the minimum (see the S3 area when approaching the minimum). DGrad/DiffGrad are based on a sigmoid function that squashes every value between 0.5 and 1. Hyp is instead based on a function that squashes every value between 0 and 1 (see Eq. (12)), so it takes larger steps in the S2 area and smaller steps near the minimum. The main drawback of this approach is that when gradient variations are close to zero the parameters are updated only by a small amount, so this approach could converge more slowly than

FIGURE 1. Example of an ensemble. Each CNN model is trained using an optimization algorithm that is randomly chosen a priori. During the test phase, the predictions are combined to compute the outcome of the ensemble.

FIGURE 2. Plot of lr_t for a = 10, b = 2/3, and c = 3/2. The y-axis shows the lr_t value; the x-axis shows the ag_t value.

FIGURE 4. A common situation in optimization that illustrates the significance of adaptive parameter updates in the optimization process [49].

TABLE 1. Description of the datasets used in this study.

TABLE 6. Error under the ROC curve (in %).

TABLE 8. Error under the ROC curve (in %), using more CNN topologies.