Data Augmentation and Transfer Learning for Brain Tumor Detection in Magnetic Resonance Imaging

The exponential growth of deep learning networks has allowed us to tackle complex tasks, even in fields as complicated as medicine. However, these models require a large corpus of data to generalize well and reach high performance. Data augmentation methods are therefore widely used to train networks with small data sets, and they are vital in medicine because access to data is limited. A clear example is magnetic resonance imaging of cancer-related pathologies. In this vein, we compare the effect of several conventional data augmentation schemes on the ResNet50 network for brain tumor detection. In addition, we include our own strategy based on principal component analysis. The network was trained both from scratch and with transfer learning from the ImageNet dataset. The investigation achieved an F1 detection score of 92.34%, obtained with the ResNet50 network using the proposed method together with transfer learning. In addition, the Kruskal-Wallis test showed that the proposed method differs from the other conventional methods at a significance level of 0.05.


I. INTRODUCTION
Since the end of the 20th century and the beginning of the 21st century, we have witnessed the new industrial revolution, the second informatics revolution [1]. The emerging developments and technological advances have allowed us to create increasingly powerful tools with incredible performances in different areas, where medicine could not be the exception [2]. Advances range from simple tasks to tasks so complex that they were usually performed by professionals or experts [2], [3]. These advances are largely thanks to artificial intelligence (AI), one of the most awaited paradigms since several decades ago [4]. It is perhaps a little challenging to define artificial intelligence since many authors establish intelligence as the ability to generate a response to a stimulus or achieve a goal in a specific environment [5]. In this sense, artificial intelligence can range from the most straightforward systems to highly complex processes that resemble the cognitive processes performed by the human brain [6], [7]. The latter is the desired approach, where deep learning (DL) has managed to address some of these processes, even surpassing human performance in some tasks [8]-[10].
The associate editor coordinating the review of this manuscript and approving it for publication was Genoveffa Tortora.
Moreover, DL is one of the fastest-growing topics in recent years, arousing interest in various research areas that rely on manual, extensive, or tedious processes, such as medicine [11], [12]. Besides, DL artificial neural networks have advantages that make them even more attractive. For example, DL networks do not require prior feature extraction and can be used directly on the raw data [13]. Moreover, despite the complexity of the tasks, the network model is usually governed by a few mathematical expressions [14]. While the model becomes complex because of the number of layers that constitute it, the forte of deep learning is task performance and not statistical or mathematical inference, i.e., for most researchers, DL models can be treated as black boxes that need input data and labels to replicate tasks (Supervised Learning) [15]. Examples of such tasks are detection, classification, segmentation, and prediction, which are among the most common applications in medicine [16], [17].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Unfortunately, not everything is so simple in DL. One of the main challenges is having a large amount of data to efficiently train AI models in different tasks. This constraint is the most critical point, especially in medicine, where access to images is limited by cost or by the small number of study subjects. Besides, the images require the authorization of the subjects, who may refuse even if their data are anonymized [18]. Consequently, strategies to synthesize or augment data are becoming increasingly common in studies with small data sets [19].
On the other hand, artificial intelligence (AI) has taken a vital role in radiology, where it automatically addresses tasks such as lesion detection or image quantification [18]. Moreover, due to its high effectiveness and reduced processing times, AI (especially deep learning) has been widely applied to cancer pathologies such as brain tumors, one of the most recurrent diseases worldwide [20]. Brain tumors are a type of cancer manifested by excessive and uncontrolled growth of abnormally functioning cells [21]. The damage to the cells is generated by different factors that range from genetic ones (which increase the risk of suffering from this pathology) to external factors such as chemical substances or exposure to high radiation sources [22]. In general, tumors are classified into heterogeneous neoplasms that include differentiable lesions (e.g., meningiomas) or highly invasive and poorly differentiable lesions such as multiform gliomas [23]. Glioma has the highest mortality rate among brain tumors, manifesting with increased progression of the pathology. Statistics show that glioma accounts for almost 80% of malignant tumors [24], with a 5-year survival rate of less than 21% in people older than 40 years [25]. However, early detection significantly improves these figures [26]. Fortunately, great efforts are invested today to address this and other targets related to brain cancer [27]. Research is conducted using different tools, including DL, a field that has become very popular in radiology in recent years.
For example, DL research related to brain cancer published so far in 2021 includes radiation therapy planning of head and neck cancer patients [28], automatic diagnosis of brain tumors [29], detection and classification of brain tumors [30]-[33], diagnostic feasibility assessment with DL networks [34], detection of brain metastases [35], prediction of survival in patients with infiltrating gliomas [36], the prognosis of glioblastoma multiforme [37], analysis for diagnostic biomarkers of glioma [38], segmentation of brain tumors [39]-[42], segmentation in dosimetry in organs at risk [43], and denoising to improve quality in subjective imaging [44].
Recent research shows promising findings and results, covering many applications in favor of brain tumor detection and treatment. However, despite the good results highlighted by the authors, few investigations are valid in the real clinical context due to serious limitations. Mainly, the authors highlight the limited access to data or the small amount of data for training the models, which prevents the generalization of the results. For example, Olin et al. state that the models used were trained with small data sets for head and neck patients, limited to a study of no more than 800 scans [28]. Likewise, Jayachandran et al. have only 775 patients with glioblastomas [34], and Amemiya et al. work with 127 patients, stating that the data set is small and that performance would improve with more data [35]. For their part, Tandel et al. are limited to 130 patients with brain tumors; however, they mitigate this drawback by using transfer learning and augmenting the data with image scaling and rotation [30]. Similarly, Jiang et al. increase the number of images through flipping, scaling, and smoothing [39], and Wang et al. use rotation, flipping, image warping, and color (contrast) change through a gamma function [41]. In general, most authors worked with a small data set but did not implement any data augmentation strategy or transfer learning. Examples are Menze, Al-Saffar, Khairandish, Islam, Song, Poel, Yan, and Wong et al. [29], [31], [32], [36]-[38], [43], [44].
The presented literature clearly shows the need to augment the number of training data due to the limited available data sets. Moreover, the few studies that use data augmentation do so without reporting which strategy is more efficient. Therefore, in this work, we explore the effect of different data augmentation strategies on the performance of the ResNet50 network for brain tumor detection in magnetic resonance images.
This research work offers the following novel contributions:
• A review of conventional data augmentation methods is presented.
• A new data augmentation method based on principal component analysis is proposed.
• A comparative framework between different data augmentation methods is proposed.
• The effect of transfer learning on the performance of the convolutional neural network ResNet50 is compared.
• The results are evaluated through the non-parametric Kruskal Wallis test, based on the distribution of means of the data.
• A comparison between the activation maps of the ResNet50 layers is presented, using centered kernel alignment as the similarity coefficient.

A. DATASET
The investigation was based on The Cancer Genome Atlas Low-Grade Glioma (TCGA-LGG) database [45], [46]. The set has 110 participants and three types of image acquisition sequences, with fluid-attenuated inversion recovery (FLAIR) imaging being the sequence of choice for data augmentation. The images are axial slices of size 256 × 256 in uint8 format, i.e., images with 8-bit unsigned integer data.

B. DATA PREPROCESSING
The images were only reformatted and normalized, leaving the intensity values on the 0 to 1 scale in float32 format. Deep learning methods are generally designed to work on the raw data [47]- [49]; therefore, no further preprocessing was performed on the images.
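The normalization step can be sketched as follows. This is a minimal illustration that assumes a fixed division by 255 (the maximum uint8 value); the text does not state whether the scaling was fixed or computed per image, and the function name `normalize_slice` is hypothetical.

```python
import numpy as np

def normalize_slice(img_uint8: np.ndarray) -> np.ndarray:
    """Rescale an 8-bit image to float32 values in [0, 1].

    Assumption: fixed scaling by 255, not per-image min-max scaling.
    """
    if img_uint8.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    return img_uint8.astype(np.float32) / 255.0

# Hypothetical 256 x 256 slice standing in for a FLAIR image.
demo = np.random.default_rng(0).integers(0, 256, size=(256, 256), dtype=np.uint8)
scaled = normalize_slice(demo)
```

A fixed divisor keeps the intensity relationship between slices unchanged, whereas per-image scaling would stretch each slice to the full range.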

C. CONVOLUTIONAL NEURAL NETWORK
There are many deep learning neural networks, and this approach is of great interest since greater depth allows the network to perform more complex tasks [50]. However, the increase in depth poses two main problems. First, deeper networks have a larger number of trainable parameters and hence need a larger dataset to reach high performance; second, depth limits training due to vanishing gradients [51]. In this research, the different data augmentation strategies are used to address the first drawback, and the ResNet50 network was chosen to address the second one [52]. The network is described in detail in Appendix A.

D. DATA AUGMENTATION
The inherent need for large amounts of data in deep learning networks has encouraged the development of many strategies, ranging from simple geometric transformations to complex images composed of mosaics. Among the most commonly used techniques [19], [53] are the following basic techniques:
• Translation [54], [55].
• Random deletion of frames [66].
The methods are classified as basic and/or deformable and represent about 86% of the data augmentation methods applied in medical imaging for deep learning [53]. Each method listed is described in detail in Appendix B.

1) PCA-BASED AUGMENTATION (PROPOSED METHOD)
Principal component analysis (PCA) is generally used to reduce the dimensions of a data set or even eliminate noise if it is used as an encoder-decoder [67]-[69]. The method takes a series of samples or observations and creates new components generated as linear combinations of the original variables. The components are generated hierarchically, and each component represents a percentage of the variability of the data, where the first component $z_1$ has the largest percentage, and each new component has a smaller percentage than the previous one. Mathematically, the first principal component has the form expressed in Equation (1) or (2).
In other words, let $X$ be an observation of $m$ variables, i.e., $X \in \mathbb{R}^m$. The observation can be represented from a smaller number of latent variables $Z$, as shown in Equation (3).
where $Z$ is the vector of $n$ principal components, with $n < m$. In other words, $Z^{t}$ equals $(z_1, z_2, z_3, \ldots, z_n)$, with each principal component $z_i$ being a linear combination of the original $m$ variables and $W$ the matrix of the coefficients of these linear combinations, which are calculated under the following considerations: for the first component $z_1$, the variance must be maximized subject to the constraint of Equation (5).
Subsequent components are calculated under the same reasoning, considering that the new components must be orthogonal to the previous ones, i.e., the $i$-th component must fulfill the restriction of Equation (6).
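In the usual PCA notation, the description above corresponds to the expressions below. They are a sketch under two assumptions: $w_i$ denotes the $i$-th column of $W$, and $X$ is taken as centered. The numbering and exact arrangement of the original equations may differ.

```latex
\begin{aligned}
z_i &= w_i^{\top} X = \sum_{k=1}^{m} w_{ki}\, x_k, \qquad Z = W^{\top} X, \\
w_1 &= \arg\max_{w}\ \operatorname{Var}\!\left(w^{\top} X\right)
      \quad \text{subject to} \quad w^{\top} w = 1, \\
w_i^{\top} w_j &= 0 \quad \text{for all } j < i .
\end{aligned}
```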
For the case of images, the reasoning is the same. However, each image represents an observation $X$ and each pixel of the image a different feature. Thus, for an image of 128 × 128 there would be 16,384 features. Therefore, it is possible to represent the pixels of an image with a smaller number of features while preserving most of the variability of the images. The transformation by PCA to several latent variables is reversible, i.e., the original variables can be obtained from the principal components, and the greater the number of components taken, the greater the similarity in the reconstruction of the original variables. Generally, the reconstruction of images with a smaller number of components is used to eliminate noise because the components associated with such noise are eliminated [67]. The process is based on finding the projections of the components on the original centered space, as shown in Equation (7).
where $X$ is the observation with its $m$ variables centered with respect to their means $\mu_i$ ($i = 1, \ldots, m$), for the case of several observations. As mentioned above, PCA can be used to eliminate noise by taking a smaller number of principal components for the reconstruction. In other words, eliminating the last principal components would preserve most of the explained variance and, therefore, the fundamental essence of the image would be preserved. Following this line of reasoning, altering the principal components with random noise would imply the partial modification of the image, preserving its primary attributes, i.e., it could be possible to generate new images from a reference image.
Based on the above considerations, it was proposed to generate images as follows: The original images were flattened to vectors of 16,384 features. The features were projected into a latent space of lower dimensionality through PCA. Each vector was multiplied pointwise (Hadamard product) by a random noise vector $V_r$ with the same dimensions but with values from a threshold $t$ to 1 ($V_r \in [t, 1]$). The threshold was determined as $t = 1 - \mathrm{noise}_R$, where $\mathrm{noise}_R$ is the proportion of noise added. For example, if $\mathrm{noise}_R$ equals zero, the values of $V_r$ would be constrained to 1, implying that the latent variables would not be altered. Finally, the modified latent variables $Z$ were used to generate the new features through the inverse transformation. The process is exemplified in Figure 1.
Mathematically, the model of the new images would be given by the expression of Equation (8).
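The generation steps described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the noise vector $V_r$ is drawn from a uniform distribution on $[t, 1]$ (the text gives only the range), PCA is fitted on the whole set of flattened images through a singular value decomposition, and the function name `pca_augment` and its parameters are hypothetical.

```python
import numpy as np

def pca_augment(images, n_components, noise_ratio, rng=None):
    """Generate one perturbed copy of each image via noisy PCA latents.

    images: array of shape (n_images, height, width), values in [0, 1].
    noise_ratio: proportion of noise; latent variables are scaled by
        factors drawn uniformly (assumption) from [1 - noise_ratio, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = images.shape
    X = images.reshape(n, h * w).astype(np.float64)  # flatten to feature vectors
    mu = X.mean(axis=0)                              # per-pixel mean
    Xc = X - mu                                      # centered observations
    # PCA through SVD: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                          # (features, components)
    Z = Xc @ W                                       # latent variables
    t = 1.0 - noise_ratio                            # lower bound of the noise
    Vr = rng.uniform(t, 1.0, size=Z.shape)           # random vector in [t, 1]
    Z_new = Z * Vr                                   # Hadamard product
    X_new = Z_new @ W.T + mu                         # inverse transformation
    return X_new.reshape(n, h, w)
```

With `noise_ratio=0` every entry of the noise vector equals 1, so the output reduces to the ordinary PCA reconstruction, which matches the limiting case described in the text.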

E. TRANSFER LEARNING
As mentioned above, DL networks need a large amount of training data due to their high number of parameters. In this sense, transfer learning is another widely used method to initialize the model weights, avoiding initialization from zeros or random distributions. The process consists of taking a network and training it with an extensive database, allowing the filters to take the weights that create the complex activation maps associated with that dataset. Generally, if the database is large enough, the network learns the task with a high degree of generalization. This attribute can be retained for an equivalent task with another dataset, and the network would generate good results if subsequently trained with the new data, even if the data are scarce [70], [71]. Accordingly, the network was trained with each data augmentation method both from scratch and with transfer learning, using the ResNet50 network and the ImageNet natural image database [72].

F. LOSS FUNCTION
Although many loss functions exist, cross-entropy remains one of the most reported and used for two-class classification [73], as is the case here. Precisely, the function measures the difference between two probability distributions, calculating the entropy associated with each class or element. The concept can be applied to images, taking each pixel as one of two distribution elements (e.g., healthy tissue and tumor) [74]. The binary cross-entropy ($L_{BCE}$) is defined mathematically as shown in Equation (10).
where $y$ is the actual data set and $\hat{y}$ is the predicted set.
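As a numerical illustration, the standard binary cross-entropy, $L_{BCE} = -\frac{1}{N}\sum_{i}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$, can be computed as in the sketch below. Averaging over the $N$ elements and clipping the predictions with a small `eps` to avoid $\log 0$ are assumptions of this sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions so the logarithms stay finite (assumption of this sketch).
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    per_element = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_element))

# An uninformative prediction of 0.5 for every element gives log(2).
uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
```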

G. EVALUATION METRICS
As an important part of an objective comparison of the models used, our approach was based on five evaluation metrics: accuracy, sensitivity, specificity, F1 score, and precision. The metrics are expressed as shown in Equations (11) through (15) [75]-[77].
The above metrics are expressed in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In addition, for this specific case, the metrics represent the following observations:
Accuracy: The ability of a network to correctly classify the different classes, i.e., tumor and non-tumor.
Sensitivity: The ability of a network to correctly classify actual tumors.
Specificity: The ability of a network to correctly classify real non-tumor images.
F1 Score: The ability to correctly identify the different classes in proportion to the number of classes.
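The five metrics follow directly from the confusion-matrix counts. The sketch below uses their standard definitions; the function name and the example counts are hypothetical, and all denominators are assumed to be nonzero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion-matrix counts.

    Assumes every denominator is nonzero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the tumor class
    specificity = tn / (tn + fp)   # recall on the non-tumor class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for illustration only.
scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```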

H. STATISTICAL ANALYSIS
The Kruskal-Wallis test was used for statistical comparison between groups. The test evaluates whether two or more samples belong to the same distribution based on the medians of these samples. Its null hypothesis is that all samples come from the same distribution. A p-value below 0.05 therefore leads to rejecting the null hypothesis, and a statistically significant difference is established between the groups tested. Note that the significance level of 0.05 can take a lower or higher value. However, this value is the most accepted since it represents only a 5% risk of concluding that there is a difference when there is none [78]. The method, assuming k groups with n observations, defines the H statistic given by the mathematical expression of Equation (16).
where $n_i$ is the number of observations in the $i$-th group, $N$ is the total number of observations across all groups, $r_{ij}$ is the rank, among all observations, of the $j$-th observation in the $i$-th group, and $k$ is the number of groups [79], [80].
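The statistic can be illustrated with the commonly used rank-sum form, $H = \frac{12}{N(N+1)}\sum_{i=1}^{k}\frac{R_i^2}{n_i} - 3(N+1)$, where $R_i$ is the sum of the ranks in group $i$. The sketch below assumes this form without tie correction, which may differ in presentation from Equation (16); the function names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=np.float64)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def kruskal_h(*groups):
    """Kruskal-Wallis H in rank-sum form, without tie correction (assumption)."""
    sizes = [len(g) for g in groups]
    pooled = np.concatenate([np.asarray(g, dtype=np.float64) for g in groups])
    ranks = average_ranks(pooled)
    n_total = ranks.size
    acc, start = 0.0, 0
    for n_i in sizes:
        rank_sum = ranks[start:start + n_i].sum()  # R_i for this group
        acc += rank_sum ** 2 / n_i
        start += n_i
    return 12.0 / (n_total * (n_total + 1)) * acc - 3.0 * (n_total + 1)

# Three fully separated hypothetical groups.
h_value = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

In practice, `scipy.stats.kruskal` returns both the statistic (with tie correction) and the p-value obtained from the chi-square approximation with $k - 1$ degrees of freedom.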

I. CENTERED KERNEL ALIGNMENT (CKA)
Finally, to understand the behavior of neural networks as a function of different layers, the similarity between layers was included through the centered kernel alignment method [81]. Particularly, CKA takes two feature maps as inputs and calculates the normalized similarity, as shown in Equation (18).
where, K and L are the similarity matrices of any two feature maps (see Equation (A.1)), HSIC is the Hilbert-Schmidt independence criterion for similarity based on the dot product [81], [82].
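A sketch of CKA with a linear (dot-product) kernel is given below. It assumes that each feature map is flattened to a matrix with one row per input sample, that $K = XX^{\top}$ and $L = YY^{\top}$, and that HSIC is estimated as $\operatorname{tr}(KHLH)/(n-1)^2$ with the centering matrix $H$; these choices follow the common formulation and are not confirmed by the text.

```python
import numpy as np

def hsic(K, L):
    """Empirical Hilbert-Schmidt independence criterion for Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def linear_cka(X, Y):
    """Linear CKA between activations X (n, p) and Y (n, q) of the same n inputs."""
    K = X @ X.T   # dot-product similarity between samples in layer 1
    L = Y @ Y.T   # dot-product similarity between samples in layer 2
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

# Hypothetical activations: 10 samples, 5 and 8 features.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(10, 5))
acts_b = rng.normal(size=(10, 8))
similarity = linear_cka(acts_a, acts_b)
```

The coefficient equals 1 when the two representations differ only by an isotropic scaling or an orthogonal transformation, which is what makes it suitable for comparing layers of different widths.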

J. EXPERIMENTAL DESIGN
The different data augmentation strategies were compared by training the ResNet50 network with the TCGA-LGG database. The data were normalized and split into training and validation data. The ResNet50 network was trained both from scratch and with transfer learning from the network pretrained on the ImageNet database. Each training was executed with the k-folds method using 10 folds. The network was trained with the binary cross-entropy loss function.
In addition, the performance of the network during training was validated with the accuracy metric. Subsequently, the network was evaluated with the F1 score, accuracy, sensitivity, specificity, and precision metrics through the test data.
It is worth noting that the network was run an average of 40 times under the following hyperparameters:
- Loss function: binary cross-entropy
- Number of epochs: 50
- Optimizer: Adadelta
- Batch size: 10
- Initialization of weights: Glorot uniform
- Bias initialization: Zeros
Finally, the different configurations were compared through the Kruskal-Wallis test, where the p-value between these configurations was calculated to establish statistically significant differences. In addition, the similarity matrices generated by the centered kernel alignment method were also compared. The ResNet50 architecture was modeled with the Keras and TensorFlow libraries in the Python programming language. The execution was performed on the Colab platform configured with 25 GB of RAM and a Tesla T4 GPU.
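The 10-fold scheme can be sketched as an index generator. The sketch assumes that samples are shuffled individually before being split; whether the original split was made per slice or per patient is not stated, and the function name and sample count are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # shuffle sample indices once
    folds = np.array_split(order, n_folds)    # nearly equal-sized folds
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )
        yield train_idx, val_idx

# Hypothetical data set of 103 samples.
splits = list(k_fold_indices(103, n_folds=10))
```

Each sample appears in exactly one validation fold, so every run of the network is validated on data it has not seen during that run.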

III. RESULTS
Initially, the images were generated through the proposed method based on principal component analysis (PCA). The reconstruction of the images with different noise ratios is illustrated in Figure 2. The results clearly show that the images retained part of their spatial characteristics, even for the case with a high percentage of noise (0.9 Noise equivalent to 90%). Consequently, we used the images with 90% random noise for data augmentation in this work.
As mentioned above, the network was trained under the different data augmentation methods, with and without transferring the weight values (transfer learning). Therefore, the results shown below cover both cases. It should be clarified that the tables and spider graphs are shown in percentage values, while the box-and-whisker and training figures are given in their fractional equivalents, i.e., with values from 0 to 1. Table 1 shows the maximum values achieved by the data augmentation methods, ordered from highest to lowest by F1 score. The results show that the proposed PCA-based method achieved the maximum values in both cases, i.e., without and with transfer learning. Similarly, Figure 3 shows the results of Table 1, where it can be observed that some methods presented similar behaviors. For example, random frame removal, overlapping, and noise addition had similar values in all five metrics. On the other hand, the proposed method is highly effective in both cases, i.e., without and with transfer learning, generating the largest pentagons.
Additionally, it is worth noting that the trend in the scoring order of data augmentation methods was partially preserved, i.e., the proposed method generated the best results in the two cases. Similarly, noise addition and superposition maintained their positions, being the worst-performing strategies. In fact, the results show a maximum variation of up to 3 positions, where Flip, Distortion, and Random frame deletion, moved up three positions for the case with transfer learning. Figure 4 presents the distributions of the 40 runs for each data augmentation method. Results were generated with the test data for the F1 score, accuracy, sensitivity, and specificity metrics. The distributions presented scores above 0.5, and it is even observed that the limits of the distributions reached values close to 1, demonstrating the effectiveness of the network combined with the data augmentation strategies. Additionally, the figure shows that the proposed method presented a compact distribution with the interquartile ranges with a more significant upward trend than the other methods. The behavior of the proposed method was maintained in  Similarly, Figure 5 presents the distributions of the 40 runs for each data augmentation method, but with learning transfer. In particular, the distributions generated with the learning transfer presented an upward shift, i.e., the results improved in all four metrics. Additionally, the figure shows that the proposed method presents the best distribution. Therefore, the proposed method is more likely to obtain a network with better performance.
The proposed method yielded the best scores; therefore, only the training curves and similarity matrices for the ResNet50 network with the PCA-based method are shown below. Additionally, the results with and without transfer learning are included. Figure 6 shows the training of the ResNet50 network with data augmentation by PCA, where Figure 6a and Figure 6b present the results when training from scratch and Figure 6c and Figure 6d with transfer learning. The results show similar behavior, i.e., progressive growth of model accuracy and decreasing losses as a function of epochs. Also, it is worth noting that the error bands are small, which implies homogeneous training across the different model runs. On the other hand, the main difference between training from scratch and training with transfer learning is that, in the first case, the network did not reach values as high as in the second case. In other words, transfer learning allowed for higher accuracy and reduced loss. In addition, the training and validation curves did not present significant differences, indicating reduced overfitting, as can be deduced from the results in Table 1. Figure 7 shows the similarity between the 190 ResNet50 network layers for training from scratch and with transfer learning. Additionally, Figure 7 presents the similarity between layers for the network trained with the ImageNet data, with the activation maps generated by the same dataset; in other words, Figure 7c is the reference similarity matrix. The reference matrix has little similarity between the farthest layers, i.e., between the first and the last layers. On the contrary, the closest layers present coefficients with similar values. In the case of transfer learning training (Figure 7b), the pattern is preserved in the matrix; however, the similarity increases between layers that are far from each other.
Finally, training from scratch (Figure 7a) essentially loses the pattern of the reference matrix; however, the layers at the extremes still have reduced similarity. Table 2 and Table 3 show the p-values of the Kruskal-Wallis test. Table 2 shows the results between the data augmentation methods for the two cases: from scratch and with transfer learning. For the first case, a larger number of pairs of methods have p-values above the significance level (greater than 0.05, highlighted in bold), indicating that the methods come from the same distribution, i.e., there is no difference between them. On the other hand, without transfer learning, the proposed method only had a p-value above the significance level with the Cropping method. Similarly, with transfer learning, the proposed method did not have p-values above the significance level with any method, i.e., the proposed method is statistically different from the others.
Finally, Table 3 shows no p-value above the significance level, i.e., for all methods there are statistically significant differences between training from scratch and training with transfer learning, showing the high effectiveness of weight transfer.

IV. DISCUSSION
This paper presents a robust experimental framework for evaluating different data augmentation methods in brain tumor detection with magnetic resonance imaging. The study was based on 12 different data augmentation methods, including a new image generation method based on principal component analysis (PCA). Additionally, a comparison between training from scratch and training with transfer learning is presented. The results showed the high effectiveness of the proposed method, which achieved a maximum F1 score of 92.34% and outperformed the other evaluated methods. Additionally, all data augmentation methods were executed in 40 runs to generate the distributions of model behaviors, with a better distribution observed for the proposed method both with and without transfer learning. The scores of the distributions were subjected to the Kruskal-Wallis non-parametric test, which indicated that the proposed method is statistically different at a significance level of 0.05, supporting its effectiveness relative to the other conventional methods.
Although the results are promising, the work has some limitations or concepts that were not addressed in this article and would be interesting to explore as future work. For example, data augmentation was explored for each strategy individually; however, in some research, data augmentation is used by combining two or more strategies, creating a larger amount of data from the same reference image.
On the other hand, our focus was on 1.5 Tesla FLAIR images and, therefore, the results can be extrapolated only to this type of image. Future work needs to explore other types of sequences, such as T2, T1, with contrast agents or proton density, and even images generated by scanners of higher field strength (e.g., 7 Tesla). In this same sense, the study focused on the ResNet50 network since it is one of the most reported and efficient networks for detection tasks. However, it is necessary to implement the strategies on other convolutional networks to generalize the results obtained in this work.
The proposed method showed promising results; however, the technique was used with a noise percentage of 90% on the principal components, and the noise ratio was not treated as a variable in the training of the models.
In addition, since the images are subject to noise randomness, it is possible to generate several images from one. Therefore, it would be possible to explore the performance of the networks by augmenting the same image several times with this strategy. Finally, the study was performed on a single dataset, presenting homogeneity in the data, implying biased results towards that dataset.

V. CONCLUSION
An experimental framework for detecting brain tumors in magnetic resonance images was proposed, comparing 12 data augmentation methods with a new method based on principal component analysis. The generated images retained part of the spatial features, which allowed the ResNet50 network to be trained to an F1 score of 92.34%. The network, together with the proposed method, proved to be statistically different from conventional methods at a significance level of 0.05, supporting the high effectiveness of the model. On the other hand, it was also possible to establish that transfer learning presents better results, generating significantly better models than models trained from scratch.

A. CONVOLUTIONAL NEURAL NETWORK RESNET50V2
The ResNet50v2 network consists mainly of convolutional layers, which use a convolutional operator. The operator, also known as filter or kernel, processes the image generating feature maps that are in turn used by the subsequent convolutional layers. The maps are patterns or abstractions that generally lack statistical inference but are the fundamental basis of the network to arrive at the desired task (e.g., detection) [83]. The ResNet50 network uses a total of 50 convolutional layers (see Figure 8); each map is established by the same mathematical model described by Equation (A.1).
where $k^{(l)}_{ij}$ represents the $j$-th kernel of the $l$-th layer, $*$ is the convolutional operation between the kernel and the input feature map, which corresponds to the previous convolutional layer's output and has a depth of $i$ feature maps, $b^{(l)}_j$ is the bias associated with the convolutional operation with the $j$-th kernel, and $\phi^{(l)}$ is the activation function of that layer [84], [85].
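The definitions above are consistent with a convolutional layer of the following form. This is a sketch in which the symbol $x^{(l)}_j$ for the $j$-th output feature map and the symbol $k$ for the kernel are assumptions, so the notation may differ from Equation (A.1).

```latex
x^{(l)}_{j} = \phi^{(l)}\!\left( \sum_{i} x^{(l-1)}_{i} * k^{(l)}_{ij} + b^{(l)}_{j} \right)
```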
Additionally, the network is based on the concept of residual connection or mapping. In particular, the connection creates trajectories parallel to the convolutional layer sequences, allowing smooth transmission of the gradient through the layers and preventing the gradient value from being zero. Furthermore, the connection forces the network to learn the residual mapping f (x) − x, which is easier to train if the ideal residual mapping is the identity function f (x) = x (see Figure 8c) [86]. The convolutional layers are connected through such connections every three layers.
On the other hand, although the ResNet50 network receives its name because it comprises 50 convolutional layers (see Figure 8b), it has four types of layers apart from the convolutional layers, residual mapping input, and output. The additional layers are activation, pooling, batch normalization, and padding. In general, the ResNet50 consists of the 190 layers shown in Figure 9, and this is the network implemented in this research. The additional layers are described in the following sections, except for the padding layer because it simply pads the images with zeros to recover the original size lost after the convolutional layers.

1) ACTIVATION FUNCTION
In the mathematical models of the previous section, the activation function was defined and denoted by the Greek letter ϕ. The function is one of the fundamental elements of neural networks since it lets the artificial neuron emulate the activation of a biological one. The operation is a nonlinear relationship between the weighted input and the neuron's output, and it can vary depending on the design. In this study, we used the ReLU function [87] since it allows faster training than other functions while remaining nonlinear [88].
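For reference, ReLU passes positive inputs unchanged and sets negative inputs to zero. A minimal NumPy sketch (the function name `relu` is ours):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

activated = relu(np.array([-2.0, 0.0, 3.0]))
```

The function is piecewise linear, so its gradient is either 0 or 1, which is one reason training tends to be faster than with saturating activations.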

2) POOLING
In the convolution process, small changes in the input image generate small changes in the feature maps. Pooling layers were therefore devised to endow the convolutional layers with some translational invariance. Generally, the process calculates the maximum (or the average, as the case may be) value over patches of a feature map and uses it to create a downsampled (pooled) feature map. In this sense, pooling reduces the size of the feature maps, simplifies the model, and reduces the computational burden [89], [90].
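The downsampling step can be illustrated with non-overlapping patches. The sketch below assumes a 2-D feature map and a square patch whose stride equals its size; `pool2d` is a hypothetical helper, not the layer used in the network.

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Downsample a 2-D map over non-overlapping size x size patches."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size          # drop rows/cols that do not fill a patch
    patches = fmap[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
pooled_max = pool2d(fmap, mode="max")    # [[5, 7], [13, 15]]
pooled_avg = pool2d(fmap, mode="avg")    # [[2.5, 4.5], [10.5, 12.5]]
```

A 4 × 4 map becomes 2 × 2, so each pooling layer with this setting divides the number of values per map by four.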

3) BATCH NORMALIZATION
Batch normalization was devised to mitigate the problem of internal covariate shift, produced by the change in the internal distribution of each feature map and by the random initialization of the weights. The effect limits the learning rate, but it can be reduced by shifting the distribution toward a standard normal distribution, i.e., one with mean 0 and standard deviation 1, as shown in Equation (19). The normalized values are then adjusted during training toward an optimal distribution through a linear transformation, as shown in Equation (20). The parameters γ and β are learned by the model and generate the new distribution, which improves the model performance [91]. The process also smooths the gradient flow and acts as a regularization layer [92]. Therefore, no additional regularization method was used in the implemented network.
In Equations (19) and (20), $\mu_B$ and $\sigma_B^2$ represent the batch mean and variance, respectively, and ε is a stabilization coefficient used to prevent the denominator from taking the value of 0.
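The two steps, normalization followed by the learned linear transformation, can be sketched as follows. This assumes the standard training-time formulation computed over the batch axis; the function name `batch_norm`, the default `eps`, and the example values are illustrative choices, not the settings used in this work.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then rescale and shift."""
    mu = x.mean(axis=0)                      # batch mean
    var = x.var(axis=0)                      # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # mean 0, standard deviation close to 1
    return gamma * x_hat + beta              # learned linear transformation

batch = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])   # 3 samples, 2 features
normalized = batch_norm(batch, gamma=1.0, beta=0.0)
rescaled = batch_norm(batch, gamma=2.0, beta=3.0)
```

With γ = 1 and β = 0 the output is simply the standardized batch; other values let the network recover whatever mean and spread suit the following layer.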

B. DATA AUGMENTATION METHODS
This section describes how the main data augmentation methods work. In addition, the mathematical model governing each method is included.

1) TRANSLATION
As mentioned above, the most common and simple methods are geometric transformations. The first of these is translation, in which, as its name suggests, the image is shifted so that the relative positions between pixels are preserved but their original positions are not. Mathematically, this operation is described by Equations (A.4) and (A.5).
Here, the column vector is the translation vector. It is worth noting that, although Equation (A.6) is shown for two dimensions (the x and y axes), the transformation can be applied in n dimensions, where the translation vector is a column vector of dimension n [54], [55]. The above process is illustrated in Figure 10.
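A translation by whole pixels can be sketched with array slicing. The helper `translate` is hypothetical, and filling the vacated pixels with a constant is an assumption, since other border strategies are equally possible.

```python
import numpy as np

def translate(image, tx, ty, fill=0):
    """Shift by tx columns (x axis) and ty rows (y axis); vacated pixels get `fill`."""
    h, w = image.shape[:2]
    tx = int(np.clip(tx, -w, w))
    ty = int(np.clip(ty, -h, h))
    out = np.full_like(image, fill)
    # Matching source and destination windows that stay inside the image.
    src_x, dst_x = slice(max(0, -tx), min(w, w - tx)), slice(max(0, tx), min(w, w + tx))
    src_y, dst_y = slice(max(0, -ty), min(h, h - ty)), slice(max(0, ty), min(h, h + ty))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

img = np.arange(1, 10).reshape(3, 3)
shifted_right = translate(img, tx=1, ty=0)
shifted_down_left = translate(img, tx=-1, ty=1)
```

Every surviving pixel keeps its neighbors, so relative positions are preserved while absolute positions change by the translation vector (tx, ty).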

2) ROTATION
Rotation-based data augmentation is performed by rotating the image with respect to its original position. Similar to translation, rotation retains the same relative position of the pixels but in a new coordinate axis system. Mathematically, the transformation is given by Equation (A.7).
Here, the rotation matrix is a square matrix in which the angle θ is the rotation of the image about the origin [55], [56]. Rotations can take any angle, with multiples of 90° being the most used in square images. In addition, rotations are generally taken about the image center rather than the origin, as illustrated in Figure 11.
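The difference between rotating about the origin and about the image center can be seen on pixel coordinates. The sketch assumes the standard counterclockwise 2 × 2 rotation matrix; `rotation_matrix` and `rotate_points` are hypothetical names.

```python
import numpy as np

def rotation_matrix(theta_deg):
    """Standard 2-D counterclockwise rotation matrix (assumed form)."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def rotate_points(points, theta_deg, center=(0.0, 0.0)):
    """Rotate (x, y) points by theta about `center`: shift, rotate, shift back."""
    c = np.asarray(center, dtype=float)
    p = np.asarray(points, dtype=float)
    return (p - c) @ rotation_matrix(theta_deg).T + c

about_origin = rotate_points([[1, 0]], 90)                   # close to (0, 1)
about_center = rotate_points([[2, 1]], 90, center=(1, 1))    # close to (1, 2)

# For multiples of 90 degrees no interpolation is needed; NumPy provides this directly.
quarter_turn = np.rot90(np.array([[1, 2], [3, 4]]))
```

Rotating about the center amounts to translating the center to the origin, applying the rotation matrix, and translating back, which is why it is often written as a product of three transformations.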

3) FLIP
Image flipping is another geometric transformation, in which the position of the pixels is inverted with respect to one of the two axes (in the case of two-dimensional data). The mathematical model is governed by Equation (A.8), and the process can be seen in Figure 12.
Here, $x_{\max}$ and $y_{\max}$ are the last positions reached by the image pixels on the respective axes.
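With zero-based indices, a flip maps each coordinate to its mirror position, x' = x_max − x or y' = y_max − y, which is our reading of the model above. A short sketch with a hypothetical helper `flip`:

```python
import numpy as np

def flip(image, axis="x"):
    """Mirror pixel positions: x' = x_max - x (axis='x') or y' = y_max - y (axis='y')."""
    if axis == "x":
        return image[:, ::-1]
    return image[::-1, :]

img = np.array([[1, 2, 3],
                [4, 5, 6]])
flipped_x = flip(img, "x")    # columns reversed
flipped_y = flip(img, "y")    # rows reversed
x_max = img.shape[1] - 1
```

Applying the same flip twice returns the original image, so each axis contributes exactly one new sample per image.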

4) RESIZING
Resizing or rescaling consists of assigning the new positions in proportion to a scale factor, which may be the same for each axis or have different proportions. In particular, the change of scale can be interpreted as zoom in (scale factor > 1) or zoom out (scale factor < 1). Mathematically, the resizing is expressed as shown in Equation (A.9), where $f_x$ and $f_y$ are the scale factors for the x and y axes, respectively. Figure 13 shows two examples of image resizing.
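A resize with independent scale factors per axis can be sketched with an inverse mapping, where each output pixel looks up the input position x'/f_x, y'/f_y. Truncating that position to the nearest lower pixel is a simplifying assumption made here; `resize_nearest` is a hypothetical name.

```python
import numpy as np

def resize_nearest(image, fx, fy):
    """Resize by scale factors fx (x axis) and fy (y axis) via inverse mapping."""
    h, w = image.shape[:2]
    new_h, new_w = int(round(h * fy)), int(round(w * fx))
    # Output pixel (x', y') takes the value of the input pixel at (x'/fx, y'/fy).
    src_y = np.minimum((np.arange(new_h) / fy).astype(int), h - 1)
    src_x = np.minimum((np.arange(new_w) / fx).astype(int), w - 1)
    return image[src_y[:, None], src_x[None, :]]

small = np.array([[1, 2],
                  [3, 4]])
zoom_in = resize_nearest(small, fx=2, fy=2)                              # 4 x 4
zoom_out = resize_nearest(np.arange(16).reshape(4, 4), fx=0.5, fy=0.5)   # 2 x 2
```

Using different values for `fx` and `fy` stretches the image along one axis only, which changes its aspect ratio.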

5) DISTORTION
Distortion shifts pixels to new positions that follow some function. This strategy can even be a combination of one or several translations, rotations, and resizings. For example, the distortion of Equation (A.10) contains the rotation, resizing, and translation processes, in that order. It should be noted that Equation (A.10) represents the transformations in homogeneous coordinates [57]. Figure 14 shows an example of distortion on an axial image of the brain.
Geometric transformations in discrete space (such as digital images) can generate new positions that do not correspond to an integer pixel. Consequently, the transformations are combined with interpolation methods to find the intensity levels that correspond to discrete pixel positions. For example, Figure 15 shows resizing by a factor of 2.5 applied to a 3 × 3 image. The process would assign new positions to the initial pixels. However, these positions would not correspond to discrete positions and, in addition, there would be intermediate pixels without an assigned value. Interpolation therefore becomes necessary to determine the intensity levels at the discrete positions and the intermediate pixels, with linear and cubic interpolation being the most used [58]-[60].
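Both ideas, composing transformations in homogeneous coordinates and interpolating at non-integer positions, can be combined in one short sketch. It assumes 2-D grayscale images, bilinear (linear) interpolation, and edge replication outside the image; the helper names are hypothetical and the matrix is only our reading of the composition order described above.

```python
import numpy as np

def homogeneous(theta_deg=0.0, fx=1.0, fy=1.0, tx=0.0, ty=0.0):
    """3x3 matrix applying rotation, then resizing, then translation."""
    t = np.deg2rad(theta_deg)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0, 0.0, 1.0]])
    S = np.diag([fx, fy, 1.0])
    T = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    return T @ S @ R    # the rightmost matrix acts on the point first

def bilinear_sample(image, x, y):
    """Intensity at real-valued (x, y), interpolated from the 4 surrounding pixels."""
    h, w = image.shape
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)   # replicate the border
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x1]
    bottom = (1 - ax) * image[y1, x0] + ax * image[y1, x1]
    return (1 - ay) * top + ay * bottom

def warp(image, M, out_shape):
    """Map every output pixel back through the inverse of M and interpolate there."""
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(M) @ pts
    return bilinear_sample(image, src[0], src[1]).reshape(out_shape)

img = np.array([[0.0, 2.0],
                [4.0, 6.0]])
doubled = warp(img, homogeneous(fx=2, fy=2), (3, 3))
moved_point = homogeneous(theta_deg=90, fx=2, fy=3, tx=1) @ np.array([1.0, 0.0, 1.0])
```

Working backward from output pixels guarantees that every discrete output position receives a value, which avoids the unassigned intermediate pixels that a forward mapping would leave.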

6) CROPPING
Image blending through cropped regions is a rarely implemented strategy. The process involves taking (cropping) elements from several images with the same features and assembling them into a new image, like a mosaic [19], [61]. For example, Figure 16 shows the composition of a new image from the regions of 9 different images with the same features (axial images).
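A mosaic of this kind can be sketched by copying each tile of a regular grid from a different source image. The grid layout, the requirement that all images share one size, and the name `mosaic` are assumptions made for illustration.

```python
import numpy as np

def mosaic(images, grid=(3, 3)):
    """Compose one image whose (row, col) tile is taken from a different source image."""
    rows, cols = grid
    h, w = images[0].shape[:2]
    th, tw = h // rows, w // cols                      # tile height and width
    out = np.empty_like(images[0][:th * rows, :tw * cols])
    for n in range(rows * cols):
        r, c = divmod(n, cols)
        region = (slice(r * th, (r + 1) * th), slice(c * tw, (c + 1) * tw))
        out[region] = images[n % len(images)][region]  # same region, different image
    return out

# Nine constant 6x6 "images" make it easy to see which source fills which tile.
sources = [np.full((6, 6), k) for k in range(9)]
composed = mosaic(sources)
```

Because every tile keeps its original location, anatomical structures stay roughly aligned when the source images share the same view.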

7) IMAGE OVERLAY
Another way of blending images is overlaying, as shown in Figure 17. The process consists of taking two images of the same size and summing them element-wise after multiplying them by an attenuation factor [19].
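One common way to realize such an overlay is a weighted sum in which the two weights add up to 1. That convex combination is an assumption on our part, as are the name `overlay` and the parameter `alpha`.

```python
import numpy as np

def overlay(img_a, img_b, alpha=0.5):
    """Element-wise blend: alpha * img_a + (1 - alpha) * img_b."""
    if img_a.shape != img_b.shape:
        raise ValueError("both images must have the same size")
    return alpha * img_a.astype(float) + (1.0 - alpha) * img_b.astype(float)

a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 200, dtype=np.uint8)
blended = overlay(a, b, alpha=0.25)    # 0.25 * 100 + 0.75 * 200 = 175
```

Keeping the weights summing to 1 keeps the result within the intensity range of the inputs, so no clipping is needed.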

8) NOISE INJECTION
Noise injection consists of adding to the image a matrix of the same size filled with random values, usually drawn from a normal (Gaussian) distribution [62]. The process can help networks learn more robust functions by removing or hiding some image information, as illustrated in Figure 18 [19].
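A minimal sketch of Gaussian noise injection follows. The 8-bit intensity range used for clipping, the default `sigma`, and the name `add_gaussian_noise` are illustrative assumptions.

```python
import numpy as np

def add_gaussian_noise(image, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise and clip to the assumed 0..255 intensity range."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(float) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255)

img = np.full((8, 8), 128, dtype=np.uint8)
noisy = add_gaussian_noise(img, sigma=10.0, seed=0)
```

Passing a fixed `seed` makes the augmentation reproducible, which is convenient when comparing methods across repeated training runs.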

9) COLOR SPACE
Generally, images are stored as arrays of three channels with the same dimensions. The channels represent each of the intensity levels that make up the RGB image, i.e., the intensities of red, green, and blue. Therefore, it is possible to change the image's color while preserving its spatial characteristics, as shown in Figure 19. The process is known as color space shift and can be performed in any number of spaces since each space combines the original channels in different proportions.
Color spaces can even be created by assigning a combination of the three RGB channels to each intensity level; that is, a grayscale image (one channel) can be converted to an RGB space (three channels) [63]. The variety of color spaces is so extensive that even PCA-based developments can be found, such as that of Krizhevsky et al. [93].
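Since each space combines the original channels in different proportions, a linear color space change can be written as a matrix applied to every pixel. The sketch below uses a channel swap and the ITU-R BT.601 luma weights as examples; the helper names are hypothetical, and replicating the gray channel is only the simplest possible grayscale-to-RGB mapping.

```python
import numpy as np

def mix_channels(rgb, matrix):
    """Apply a linear channel-mixing matrix to every pixel of an (H, W, 3) image."""
    return rgb.astype(float) @ np.asarray(matrix, dtype=float).T

def gray_to_rgb(gray):
    """Turn a one-channel image into three identical channels."""
    return np.stack([gray, gray, gray], axis=-1)

SWAP_RB = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]    # exchange the red and blue channels
LUMA = [[0.299, 0.587, 0.114]]                 # BT.601 weights: RGB -> one gray channel

red = np.zeros((2, 2, 3))
red[..., 0] = 1.0
now_blue = mix_channels(red, SWAP_RB)
white_luma = mix_channels(np.ones((1, 1, 3)), LUMA)
three_channel = gray_to_rgb(np.zeros((4, 5)))
```

The spatial layout is untouched in every case; only the per-pixel channel values change.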

10) LINEAR FILTERS
Linear filters are another of the most used strategies. Generally, filters are used to sharpen or blur the image, as illustrated in Figure 20. The method consists of sliding the filter across the whole image to obtain the values of the new image [64]. This process is known as convolution and is, in fact, the fundamental basis of convolutional neural networks [65].
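The sliding operation can be sketched directly. The two 3 × 3 kernels are common textbook choices for blurring and sharpening rather than the ones used for Figure 20, and replicating the border pixels is an assumption; `filter2d` is a hypothetical helper that expects odd-sized kernels.

```python
import numpy as np

def filter2d(image, kernel):
    """Convolve a 2-D image with an odd-sized kernel, keeping the image size."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    flipped = kernel[::-1, ::-1]               # convolution flips the kernel
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * flipped)
    return out

BLUR = np.full((3, 3), 1.0 / 9.0)              # box (mean) filter
SHARPEN = np.array([[0.0, -1.0, 0.0],
                    [-1.0, 5.0, -1.0],
                    [0.0, -1.0, 0.0]])

impulse = np.zeros((5, 5))
impulse[2, 2] = 9.0
blurred = filter2d(impulse, BLUR)              # the single bright pixel spreads over 3 x 3
```

Both kernels sum to 1, so flat regions keep their intensity and only edges and fine detail are altered.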

11) RANDOM DELETION OF FRAMES
Deletion is a strategy inspired by regularization based on neuron dropout. The process randomly eliminates regions of the image, preventing the neurons from learning part of the information, as illustrated in Figure 21 [66].
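Random deletion can be sketched by blanking one randomly placed rectangle. The maximum size of the rectangle, the fill value, and the name `random_erase` are illustrative assumptions.

```python
import numpy as np

def random_erase(image, max_frac=0.3, fill=0, seed=None):
    """Overwrite one random rectangle (at most max_frac of each side) with `fill`."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    eh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))   # erased height
    ew = int(rng.integers(1, max(2, int(w * max_frac) + 1)))   # erased width
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out = image.copy()                                         # keep the input intact
    out[top:top + eh, left:left + ew] = fill
    return out

img = np.ones((10, 10))
erased = random_erase(img, max_frac=0.3, seed=0)
```

Drawing a new rectangle for every training sample hides a different part of the image each time, so the network cannot rely on any single region.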