Convolutional Neural Network-Based Natural Image and MRI Classification Using Gaussian Activated Parametric (GAP) Layer

We propose a novel Gaussian activated parametric (GAP) layer for deep neural networks, specifically for CNNs. This layer normalizes the feature vector using a Gaussian filter-based un-sharpening technique. The goal of the proposed method is to normalize and activate the initial and intermediate feature layers of a deep CNN so that the customized layer makes the features more distinguishable and the layers are smoothly tuned for target-domain classification. Our experiments show that using the proposed layer for normalization gives results almost similar to those of batch normalization, and in a few cases slightly better results at the cost of higher training time. To demonstrate the proposed layer, we use a four-layer encoder-based network as the base architecture for classification, in which the first two normalization layers are either GAP or BN layers while the remaining ones are BN layers. Classification with two GAP normalization layers is better on bulkier datasets such as ADNI MRI (3D CNN accuracy: 93.58% vs. 91.89%) and CIFAR-10 (2D CNN accuracy: 75.21% vs. 75.11%), whereas replacing only the first BN layer with GAP is better on a smaller dataset such as the 5-animals dataset (2D CNN accuracy: 62.92% vs. 58.48%). Therefore, we suggest single-GAP-layer normalization for smaller datasets and two GAP normalization layers for larger datasets. The proposed method also produced better results than the cross-channel normalization-based AlexNet network under scratch training.


I. INTRODUCTION
Deep neural networks (DNNs) have been the dark horse of machine learning (ML) and deep learning since the 1989 success of LeNet-5, an early convolutional neural network (CNN), for handwriting recognition [1], [2]. The massive success of DNNs stems from their capacity to accommodate a large number of trainable model parameters, which contributes to accurate feature extraction for pattern recognition, image classification as on ImageNet using AlexNet [3], GoogleNet [4], and ResNet [5], object recognition (R-CNN [6], [7]), scene segmentation (SegNet [8], [9]), and other tasks that are tedious for human perception. The DNN commonly used for image classification or recognition is the CNN, with the convolution filter as the key detector of features ranging from primary-level features such as edges, colors, corners, and lines up to higher-level features such as texture, pattern, and shape for class identification [10]-[12], [31]. Hence, a CNN is basically an image feature extractor. In a CNN, the weights of the convolution filters are the key parameters to train, and they determine how a particular filter works. Besides the convolution filters, many other learnable layers also participate in the weight update during training via backpropagation, so that they all work jointly to produce the final down-sampled features with class-label properties. The training algorithm, i.e., the objective function for optimization, and the initialization technique are the key components that escort the cross-entropy loss to a minimum, at which point training can be stopped, i.e., the weights no longer change and the model/network is technically said to have achieved convergence. Achieving optimal convergence is our primary goal; however, merely reaching the lowest minimum does not guarantee a high success rate on test (unseen) samples, i.e., the test error can be high even with a low training error. This familiar problem in ML is called overfitting or generalization error, and it is the most serious issue in any DNN architecture. Many regularization techniques, along with dropout, early stopping, random sampling, etc., have been introduced to reduce this generalization error. However, despite being state of the art on several benchmark datasets, numerous ML algorithms are still not fully understood and work as black boxes in many tasks, as evidenced by the fact that many standard DNNs fail to generalize [13], [36], [37].
(The associate editor coordinating the review of this manuscript and approving it for publication was Felix Albu.)
Besides, any NN trained from scratch is affected by randomness in the parameter initialization, which brings some sort of disparity into the distribution of each layer's inputs during training and makes it extremely difficult to train networks with saturating nonlinearities; this is considered a covariate shift [14]. This phenomenon, specifically called internal covariate shift, is something of a curse for both convergence and generality. To overcome this, batch normalization (BN) was introduced to minimize the effect of the covariate shift. It also sped up the training process and coped with higher learning rates without exploding gradients. BN uses a layer-wise whitening technique, i.e., zero mean and unit variance, for normalization and decorrelation, with only two extra parameters per activation: one for scaling and the other for shifting. It helps to reduce training time and preserves the interpretation capacity of the network [15]. This may be because the convolution weights are updated so quickly that they move toward convergence faster while the validation accuracy still lags far behind. BN was mainly designed for speedy training and for reducing internal covariate shift; moreover, its claimed regularization effect was said to eliminate the need for dropout. For this reason, BN is also claimed to reduce the generalization error, as dropout does. At the same time, however, we observed a prominent overfitting problem with the use of BN in our base architecture. Along with BN, many recently introduced activation functions for classification are designed with learnable parameters for optimal model fitting with little overfitting risk [16]. Similar recent work on designing normalization layers that avoid the minibatch mean in DNNs includes filter response normalization (FRN) [38], group normalization (GN) [39], and layer normalization (LN) [40].
None of these methods operates over the batch dimension, which avoids minibatch dependency when calculating the scaling factor; they use only activation-map channel statistics. However, none of them uses a filtering function such as a Gaussian filter or other image filters for a sharpening effect in the layer, as our proposed method does. We were therefore interested in designing a normalization layer with activation parameters to address these issues of overfitting and covariate shift. For the experiments, we use a four-layer encoder-based network as the base architecture, with a normalization layer in each encoder. AlexNet with local response normalization [3], which is similar in architecture to our base model, is also trained from scratch on all datasets to compare classification results. The proposed method is explained mathematically via matrix equivalency with the CNN normalization layer. We use three parameters per activation in the proposed layer: one for scaling the original feature, one for scaling the masked feature, and the last as an offset or bias to shift the output. Our contributions in this paper are:
i. We propose a novel Gaussian filter-based normalization layer to integrate with deep neural networks, which we call the Gaussian activated parametric (GAP) layer. Also, instead of minibatch-averaged scaling, we propose same-channel mean scaling for normalization within the layer.
ii. A comparative analysis of the proposed GAP layer as an alternative to BN in the base architecture, studying feature extraction, histogram analysis, the internal covariance problem via a correlation test, and overfitting, along with a comparison of the final classification results.

II. BACKGROUND AND MOTIVATION
It has been observed that the convolutional filter weights in the initial layers of a CNN tend to mimic some image filters once the network is fully trained. In Krizhevsky et al. [3], the trained convolutional weights/parameters were mostly edge detectors and color filters. The edge filters could detect horizontal, vertical, and diagonal edges; moreover, these filters are translationally invariant in nature and so work spatially across all input images. Kaiming et al. [16] report that the filters of the first convolutional layers were mostly Gabor-like filters such as edge or texture detectors, and their results after full training showed that both the positive and negative responses of the filters matter. With these previous results pertaining to basic image filters, and negative responses being equally important during training, we were interested in designing a layer for early feature detection and quick imitation. Instead of manually initializing different filters, we wanted a filter that smooths the overall result, not just single features such as edges, colors, or blobs. Our experiments showed that edge-detection filters such as Sobel, Prewitt, and Roberts [17]-[20] work unidirectionally without bringing feature variance, and the classification results were also poor. Hence, we need a universal filter that boosts the learned features and can work marginally and cleverly for all types of feature attributes (edges, blobs, colors, corners, etc.). For this, we selected the Gaussian filter. The Gaussian filter can be used as an average smoothing function along with the normalized input. However, using Gaussian smoothing together with batch normalization seems inappropriate, so to normalize the features we use filter/channel mean normalization.
In this normalization, the minibatch properties are not entertained; instead, the mean features from all the filter weights (i.e., convolution weights in all channels) are normalized channel-wise for each batch dimension, as described in detail in Section IV. The obtained mean feature vectors thus work only on their own channel images, without mixing minibatch properties. In doing this, the injection of noise is reduced, and only single-image properties are summed up. Since we use the proposed layer as a feature enhancer, it follows the convolution layer so as to operate on the activated image output.
In an experiment [21] to understand the effect of randomness in NNs, a Gaussian function was used to add noise to the input and bring non-uniformity to the signal by adding random pixels to the source image. This destroys the relationship of the training pixels with their labels. To make the experiment even more random, the image pixels were re-shuffled and re-sampled from a Gaussian distribution. However, the NN was not heavily affected by this and was still able to fit the test set, thanks to the stochastic gradient descent algorithm [22] used during training. In our case, we use the Gaussian function not as a noise generator but as the mask generator for the un-sharpening process; the work on channel normalization [41] motivated us to do so. We aim to obtain good training convergence without destroying the correlation property between the input and output of each layer. Instead of performing Gaussian smoothing on the whole image, we use the Gaussian kernel to generate a mask whose mean and variance match the original image kernel. The kernel size is determined by the size of the preceding convolution filter kernel, to avoid mixing filter properties during the smoothing process.

III. GAUSSIAN FILTER AND UN-SHARPENING PROCESS
A Gaussian filter [23] is a spatially weighted average filter that works as a point-spread function for any image-pixel distribution. It operates as a non-uniform low-pass filter, giving a higher weight to the central pixel and producing a normal distribution of pixel weights, and its rotationally symmetric nature provides directional unbiasedness for image morphological operations. This property of the Gaussian filter also helps to marginally preserve edges and brightness while smoothing image attributes through averaging. However, the input standard deviation (SD) σ affects how strongly the image is blurred. T. Lindeberg [24] derived the optimal discretized Gaussian kernel and proved that Gabor functions can look very much like Gaussian derivatives. In order to operate on the CNN feature matrix, we need a discrete approximation of the kernel. This is done as in equation (1), where i, j, and k represent the matrix row, column, and depth, and σ² is the input variance, equal to 1 for the standard Gaussian output:

G(i, j, k) = (1 / (√(2π) σ)³) · exp(−(i² + j² + k²) / (2σ²))    (1)

The outputs for 3 × 3 × 3 and 5 × 5 × 5 matrices are plotted in Figure 1. The term 1/√(2π) is a normalization constant that comes from the fact that the integral over the exponential function is not unity. Convolution with a Gaussian kernel is a linear operation [23], so in order to bring non-linearity into the system, we need to add a non-linear function such as Leaky ReLU for rectification. A 3D kernel is used in designing the GAP layer for both 2D and 3D CNNs: in a 2D CNN the filter operates on the mean of all convolution-based images, whereas in a 3D CNN the mean is taken over the whole volume itself. In both cases, however, the learnable coefficient parameters are updated separately, acknowledging the mean effect from the channel filters on the final output.
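Equation (1) can be realized in a few lines; the following is an illustrative Python/NumPy sketch (the paper's implementation is in MATLAB, and the function name here is ours):

```python
import numpy as np

def gaussian_kernel_3d(size=3, sigma=1.0):
    # i, j, k index row, column, and depth, centred at zero, as in
    # equation (1); sigma is the input SD (1 for the standard Gaussian).
    half = size // 2
    i, j, k = np.mgrid[-half:half + 1, -half:half + 1, -half:half + 1]
    g = np.exp(-(i ** 2 + j ** 2 + k ** 2) / (2.0 * sigma ** 2))
    return g / (np.sqrt(2.0 * np.pi) * sigma) ** 3  # normalization constant

kernel = gaussian_kernel_3d(3)  # the 3 x 3 x 3 case plotted in Figure 1
```

Note that, as stated above, the discrete weights do not sum exactly to unity; a unit-sum smoothing kernel can be obtained by dividing by `kernel.sum()`.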
The un-sharpening process involves three steps: i) blurring of the image, i.e., correlation of the input image volume with the Gaussian kernel; ii) subtraction of the blurred version from the original input image volume to generate the masking kernel; iii) adding the masked version back to the original input. Figure 3 illustrates the operations inside the layer in detail.
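The three steps can be sketched as follows (a Python/SciPy sketch; the function name and the unit-sum box kernel are our own illustrative choices, not the paper's exact kernel):

```python
import numpy as np
from scipy.ndimage import correlate

def unsharpen(x, kernel):
    blurred = correlate(x, kernel, mode='nearest')  # i)  blur the input volume
    mask = x - blurred                              # ii) original minus blurred
    return x + mask                                 # iii) add the mask back

# A unit-sum smoothing kernel leaves flat regions untouched,
# so only edges and textures are boosted by the mask.
k = np.ones((3, 3, 3)) / 27.0
vol = np.random.rand(8, 8, 8)
sharp = unsharpen(vol, k)
```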

IV. PROPOSED GAUSSIAN ACTIVATED PARAMETRIC (GAP) NORMALIZATION LAYER

A. ARCHITECTURE AND TRAINING
TABLE 1. The base architecture used for testing the proposed method against BN for classification. Note that the 2D and 3D architectures differ, with different activation sizes. Here, the 'g1g2b3b4' architecture indicates GAP as the 1st and 2nd normalization layers (g1, g2) and BN as the 3rd and 4th (b3, b4); similarly, b1b2b3b4 means all BN, and so on. The selection of hyperparameters and activation functions is based on our previous work [12].

Let us consider the output from the first convolution layer of the base architecture (see Table 1), which acts as input for
the proposed Gaussian activated parametric (g1) layer. This input can be represented as a 4D array for a 2D CNN (a 5D array for a 3D CNN) of size X = [227, 227, 32, 64] = [image_row, image_col, channel_size, minibatch_size]. Here the input X contains 2D images of size 227 × 227 from each of the 32 filter outputs, i.e., a 227 × 227 × 32 volume per image holding 32 differently activated versions of the same image, and the last (4th) dimension indexes the minibatch. Each minibatch entry is a different image from the various classes, so in total there are 64 different input images, giving 105.53M pixels/weights as the batch input for forward propagation through the first GAP layer. Similarly, after pooling, the output size is reduced to 113 × 113, so the output of the 2nd convolution layer is 113 × 113 × 64 × 64 (64 filters in the 2nd convolution), giving a total of 52.3M weight inputs for the 2nd GAP layer. To make this clearer, the input X to the 1st GAP layer can be represented as the block matrix in equation (2), where X^n_b represents the activated image of the n-th filter in the b-th minibatch sample:

X = [X^1_1  X^2_1  ...  X^32_1;
     X^1_2  X^2_2  ...  X^32_2;
     ...
     X^1_64 X^2_64 ...  X^32_64]    (2)

The mean value Xm is calculated over the 3rd dimension, i.e., over the number of filters or channels (for a 3D CNN it is calculated over the 4th dimension), and can be represented as the column matrix

Xm = [mean(X^1_1, ..., X^32_1);
      mean(X^1_2, ..., X^32_2);
      ...
      mean(X^1_64, ..., X^32_64)]    (3)

where the column matrix on the right contains the channel-mean image of each minibatch sample. The dimension of Xm is thus [227, 227, 1, 64], which gives the averaged value of the 'same image' activated 'differently' by the convolution filters. It is an empirical mean over all channels, so it contains 64 different mean images per batch, updated during training and not used post-training, hence an empirical mean. Similarly, the standard deviation Xs = [227, 227, 1, 64] is also calculated. Now we re-center and re-scale X using Xm and Xs as Xf = (X − Xm) / Xs, which produces the mean-centered output scaled by 1/Xs.
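The channel-wise statistics described above can be sketched as follows (Python/NumPy; the paper's code is MATLAB, and eps is our own guard against division by zero):

```python
import numpy as np

def channel_mean_normalize(x, eps=1e-5):
    # x: [rows, cols, channels, minibatch].  Statistics are taken over the
    # 3rd (channel) dimension only, so each minibatch sample is normalized
    # by its own channel mean/SD and no minibatch properties are mixed in.
    xm = x.mean(axis=2, keepdims=True)       # Xm: [rows, cols, 1, minibatch]
    xs = x.std(axis=2, keepdims=True) + eps  # Xs, with a small guard term
    return (x - xm) / xs                     # Xf = (X - Xm) / Xs

x = np.random.randn(8, 8, 32, 4)             # toy stand-in for [227, 227, 32, 64]
xf = channel_mean_normalize(x)
```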
Please note that all arithmetic operations on these matrices are element-wise operations between identically sized arrays; hence the size of Xf = [227, 227, 32, 64]. The first entry of Xm in equation (3) is the mean of the activated outputs of the first training image, i.e., of X^1_1, X^2_1, ..., X^32_1, and so on. Hence the variance is nominally very small, i.e., in general, var(X_b) ≈ var(X^n_b).

In all the training graphs, training accuracy reaches convergence (100%) faster with 100% BN, whereas the validation curve fails to follow the training curve for a long time, meaning the weight-update process ends sooner and generalizes worse. With 50% BN and 75% BN, the training curve converges later, and so does validation. This lets the network update its weights more slowly (at the same learning rate) and hence may reduce overfitting (although we have a clear case of overfitting, given the large gap between training and validation accuracy in all cases; better architectures might change the result, but here we use our base architecture to test the proposed idea). The CNN with GAP layers tends to reach 100% accuracy slowly, so validation accuracy is addressed for longer, whereas with BN layers the training accuracy shoots up quickly, causing a larger gap (coincidentally, we named the proposed layer GAP) between the training and validation curves during the early stages of training. Hence, the overfitting problem is still not completely tackled by batch normalization; even with a low learning rate and a higher learn-rate drop factor, the problem prevails. Accordingly, we want training convergence that generalizes better with the proposed GAP layer. Our results suggest the testing result remains the same or slightly improves in most cases, and feature extraction is also enhanced.
In BN, once training is finished, each BN layer has n trained means and variances per activation stored in the trained network, which are later used to normalize the input during prediction. In our method, however, the layer does not store the trained mean and SD; instead, during prediction, the convolution kernel outputs are passed through the layer and the empirical mean for each image is calculated from its channel outputs, so no pretrained mean and SD are needed.
Channel normalization [41] standardizes each channel in a CNN independently for every training example and scales and shifts the result with a learnable scalar. It is comparable to instance normalization [42], and to BN for a single training sample. In our case, Xf is the normalized un-sharpened version of the original minibatch input, i.e., the shifted and scaled version of the input X. α and β work as separate channel-standardizing factors for the input and the un-sharpened version, respectively; the trained values of α, β, and γ scale each channel separately. Xf is then spatially correlated with a Gaussian kernel, i.e., filtered with a kernel whose size matches the preceding convolution filter (here 3 × 3 × 3 for the first GAP layer), as in equation (4):

Xg = h ∘ Xf    (4)

where h represents the Gaussian filter weights of the 3D kernel, calculated from equation (1) with i, j, and k running from 1 to 3 (or 5). For the second GAP normalization layer, i.e., the g2 layer in the g1g2b3b4 architecture, the Gaussian kernel is obtained by convolving the discrete Gaussian kernel with itself, giving a second-order filter response different from that of the first normalization layer. Note that we use a 3D kernel in all GAP layers to accommodate normalization across the activation map (the 3rd/channel dimension), and the filtering kernel moves throughout the whole Xf vector.

FIGURE 3. Schematic representation of the proposed layer, along with input and output histograms for comparison. The input signal is represented as a ramp signal to demonstrate the edge-detection process; in our experiments, however, the input X to the layer is the activated image matrix from the preceding convolution layer. The input passes through the normalization unit to produce a scaled and shifted version of X with a narrow range of feature values. The Gaussian smoothing function then transforms the feature vector X^n in a weighted-average fashion to produce a smoothed version of the images, X_g. The difference of X and X_g produces a masking vector X_mask, which is added back to the original X to produce Z. The learnable parameters α, β, and γ scale X, scale X_mask, and provide the offset, respectively.

FIGURE. Visualization of the first GAP layer following its convolution layer. Note the difference between the batch and GAP outputs with respect to the respective convolution layer: the color is heavily changed by batch normalization due to the insertion of its batch properties, but the Gaussian output remains the same, without any sharp change in filter color; instead, the color is slightly mixed with similar colors, hence a smoothing process is done here.
In equation (4), the symbol '∘' represents the correlation operation, which is similar to the spatial filtering operation in image processing. Mathematically, correlation is the same as convolution in the time domain, except that the kernel is not reversed before the multiplication. The idea is to filter all images in a replicated manner so that all 32 × 64 images are covered; the blurring or smoothing effect works differently for each image X^1_1, X^2_1, ..., X^32_1 in the first batch, and so on. The obtained Gaussian images can be represented in a matrix as in equation (4), where each Xg^n_b represents an image after Gaussian filtering. The applied Gaussian kernel has the same size as the preceding convolution kernel to perform linear correlation. Hence, our initial aim of combining normalization and activation into a single unit proved impracticable, so we need an extra non-linear (activation) function to further support classification.
FIGURE 6. Classification results on different datasets for comparison, along with validation accuracy, test accuracy, and min/max 95% CI error as in Table 2. The validation and testing accuracies were calculated on the same sets under identical training and testing conditions, to avoid any bias.

The mask is obtained as Xmask = X − Xg, which subtracts the Gaussian-filtered signal from the original version for each channel output, without inheriting batch properties. Finally, we forward the output of the layer as

Z = α · X + β · Xmask + γ    (5)

Here, α, β, and γ are learnable parameters, each of size [1, 1, N], which have unique values for each of the N filters and
help to optimize the output values. α, β, and γ are initially selected between 0 and 1 and act as scaling coefficients that control the gradient output during the backpropagation weight update. The second parametric term in equation (5), β, is initially less than 1; however, as the weights update during backpropagation, the value tends toward β > 1, in which case the layer acts as a high-boost filter emphasizing the contribution of unsharp masking, whereas when β < 1 the contribution of the sharpening mask is reduced (please see Figure 3). The other learnable parameters α and γ are also updated during backpropagation: α works as a scaling coefficient making the output commensurate with (β · Xmask), and γ works as a bias with no effect on the layer's gradient loss. The MATLAB code implementation, along with supporting code, is included in the Appendix.
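Assembling equations (2) through (5), the forward pass of the layer might be sketched as follows (a Python/SciPy sketch; the actual implementation in the Appendix is MATLAB, and eps is our own numerical safeguard):

```python
import numpy as np
from scipy.ndimage import correlate

def gap_forward(x, kernel, alpha, beta, gamma, eps=1e-5):
    # x: [rows, cols, channels, minibatch]; alpha, beta, gamma carry one
    # learnable value per channel, broadcast as shape [1, 1, channels, 1].
    xm = x.mean(axis=2, keepdims=True)          # channel mean, as in eq. (3)
    xs = x.std(axis=2, keepdims=True) + eps
    xf = (x - xm) / xs                          # re-centred, re-scaled input
    xg = np.empty_like(xf)
    for b in range(x.shape[3]):                 # correlate each sample, eq. (4)
        xg[..., b] = correlate(xf[..., b], kernel, mode='nearest')
    x_mask = x - xg                             # un-sharpening mask
    return alpha * x + beta * x_mask + gamma    # eq. (5)

k = np.ones((3, 3, 3)) / 27.0                   # stand-in smoothing kernel
x = np.random.randn(6, 6, 8, 2)
alpha = np.ones((1, 1, 8, 1))
beta = np.zeros((1, 1, 8, 1))                   # beta = 0: output reduces to X
gamma = np.zeros((1, 1, 8, 1))
z = gap_forward(x, k, alpha, beta, gamma)
```

Setting β = 0 and γ = 0 with α = 1 reduces the output to X, which makes a handy sanity check for the implementation.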
Stochastic gradient descent (SGD) is the training algorithm used to update the learnable parameters (weights, biases, offsets, coefficients), with values computed on a mini-batch instead of the whole training set at once as in the standard algorithm [22], [25]. The loss is calculated on a mini-batch, and the network's parameters are updated along the negative gradient of the loss at each iteration, as in equation (6):

w_l^(t+1) = w_l^t − α_l · ∂L/∂w_l^t + r · (w_l^t − w_l^(t−1))    (6)

Here, the weight (or bias or offset) w_l^(t+1) in layer l at iteration t + 1 is updated from the weights of the previous iteration, w_l^t. α_l is the learning-rate hyperparameter for the parameters of layer l, kept at a value > 0 and initially 0.001 in our experiments. Since we use SGD with momentum, the oscillation of the parameter weight updates is reduced and the path of steepest descent toward the optimal value is found; for this, the hyperparameter r, known as the rate of momentum, is set to 0.95. The negative term represents the gradient of the loss function in the layer, updated by back-propagation after every epoch.
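One update of equation (6) can be sketched as follows (Python; the toy quadratic objective is our own illustration):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.95):
    # One update of equation (6): the velocity accumulates past steps
    # (rate of momentum r = 0.95) and the parameter moves along the
    # negative gradient scaled by the learning rate alpha_l = 0.001.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy run on f(w) = w^2 (gradient 2w): the iterate descends toward 0.
w, v = 5.0, 0.0
for _ in range(500):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
```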
Backpropagation calculates the derivative of the loss with respect to (w.r.t.) all trainable parameters in the CNN. The layer gradients of the loss w.r.t. the input X, i.e., dl/dX, and w.r.t. the other parameters α, β, and γ are updated as follows. In our experiments, it was found that the 'k' factor, when used, produces small gradient values causing a vanishing-gradient problem, hence k was selected to be 1. Hence, for the GAP layer with 32 activations and a minibatch size of 64, we define the gradient loss as

dl/dα = Σ (dl/dZ) ⊙ X,   dl/dβ = Σ (dl/dZ) ⊙ Xmask,   dl/dγ = Σ (dl/dZ),   dl/dX = (dl/dZ) ⊙ (α + β · k),

where the sums run over the spatial and minibatch dimensions for each of the 32 filters.

As shown in the input and output histograms of Figures 7.1(a)-(c) and 7.3(a)-(c), with BN the frequency of weight values is concentrated around the mean in the input, whereas in the output the weights around the mean are reduced. This drastic change in distribution is also visible in the mean response plots 7.1(b) and 7.3(b), with correlation coefficients of 90.46 and 84.73, respectively. However, the output pixel/weight distribution is not completely changed when using GAP normalization: the output histograms using GAP layers, 7.2(c) and 7.4(c), follow the input patterns of 7.2(a) and 7.4(a), respectively. The mean response plots also show very high correlation with the input filter means, with correlation values of 94.6 and 91.3, respectively. Please see the Appendix for the code implementation.
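The parameter gradients of equation (5) above can be sketched in Python (our reading of the update, with k = 1; the paper's implementation is MATLAB):

```python
import numpy as np

def gap_backward(dz, x, x_mask, alpha, beta, k=1.0):
    # Gradients of Z = alpha*X + beta*Xmask + gamma (equation (5)).
    # Per-channel parameter gradients are summed over the spatial and
    # minibatch dimensions; with k = 1 the mask term passes the upstream
    # gradient straight through (our reading of the paper's choice).
    d_alpha = (dz * x).sum(axis=(0, 1, 3), keepdims=True)
    d_beta = (dz * x_mask).sum(axis=(0, 1, 3), keepdims=True)
    d_gamma = dz.sum(axis=(0, 1, 3), keepdims=True)
    dx = dz * (alpha + beta * k)    # dl/dX
    return dx, d_alpha, d_beta, d_gamma

x = np.random.randn(4, 4, 32, 8)
x_mask = np.random.randn(4, 4, 32, 8)
dz = np.ones_like(x)                       # upstream gradient dl/dZ
alpha = 0.5 * np.ones((1, 1, 32, 1))
beta = 0.25 * np.ones((1, 1, 32, 1))
dx, d_alpha, d_beta, d_gamma = gap_backward(dz, x, x_mask, alpha, beta)
```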
Training in minibatches has a significant effect on convergence time: a larger minibatch size makes training quicker by reducing the number of iterations per epoch, but it also affects training accuracy. We also train the network in minibatches, i.e., the entropy loss is calculated on minibatch input; however, the normalization process is not batch normalization when the normalization layer used is GAP. In normalization, the whitening process is generally preferred, where the input is linearly transformed to zero mean and unit variance, i.e., the input has a fixed distribution, which is also considered to remove the ill effects of internal covariate shift. The activation process, by contrast, is like a filtering process in which the activation function determines the weight output. Similarly, in our method, the Gaussian filter works as an activation function along with learned parameters to create a normalized mask.

FIGURE 8. Correlation values between the input and output of the normalization layers for all test images in the 5-animals dataset. In Layer 2, the BN (b1) layer produces a correlation of around 90% for all test sets, whereas the GAP layer (g1) has a slightly higher correlation than b1. In the second normalization layer (Layer 6), the BN layer (b2) produces output drastically decorrelated from its input, over a wide range for all test sets, from 94% down to as low as 22%; the output of g2, however, is not highly decorrelated from its input, remaining at around 90% correlation. Input X is the output of the preceding convolution layer, and output Z is the output of the normalization layer. A very low layer correlation value means the layer has decorrelated the features; however, perfect correlation is also useless.
This mask works as an additional extracted feature (β · Xmask) alongside the original input, so that the original signal is slightly boosted by its filter mean responses. When linearly added to the original input, it adds value according to the mask. If only the masked value were used, we would lose the entire input image property; hence, the mask is added to the original input to bring a calculated variance without losing the linear property of the input.
A few hyperparameters, such as the initial learning rate and the learn-rate drop factor, affect the training time and learning ratio; in the longer run, however, the results are not significantly different, and late convergence only slightly affects the testing result. Fully connected layers (FCLs) act as a single-layer feed-forward network with all parameters connected from input to output. Because of this, FCLs are blamed for causing overfitting in the network, and potential regularization techniques such as dropout are used between them [26], [30]. So, to reduce the number of parameters and eventually overfitting, we reduced the number of FCLs from 4 to 3 in our base architecture [12].

B. CLASSIFICATION PERFORMANCE AND DISCUSSION
We performed several classification experiments using the base architecture with different normalization layers in the first two encoder parts. Our main goal was to compare classification results when the first two BN layers are replaced with GAP layers. Benchmark 2D datasets were used for this purpose: CIFAR-10 [10], Caltech-102 [27], 5-animals, and MRI images from OASIS [28] for 2D classification. Among these, the 5-animals and MRI image sets were prepared privately and are made available in a public repository, whereas the others are already publicly available. The 5-animals dataset consists of around 700 images per class of five animals, viz. tiger, lion, dog, cat, and fox. The OASIS MRI set consists of a total of 5220 images in four classes, categorized by the participants' clinical dementia rating (CDR) level. Details of the participants are described in our previous work on 2D CNNs [29]. To test 3D CNN performance, we used 3D MRI volumes obtained from the ADNI database (http://adni.loni.usc.edu/).

VOLUME 9, 2021

TABLE 2. Detailed experimental results using different normalization techniques in the same base architecture shown in Table 1. Bold numbers signify the best performance. The training environment was identical for all natural-image datasets, and similarly for the medical MRI. The AlexNet model was obtained using the MATLAB deep learning tool. For a given dataset, the training, validation, and testing materials were also identical, so the results are not biased in any way. Accuracy represents the percentage of correctly classified samples during prediction, whereas average test recall and precision are calculated by taking the mean of the class-wise recall and precision. The 95% CI error represents the error with a 95% confidence score; for a score above 95%, only the min-error value is calculated, and for a score below 95%, only the max-error value.
To compare the results on bulky and small training samples, we prepared two datasets for the 3D CNN: MRI_baseline, the bulkier one with 988 MRI samples, and MRI_small with 187 MRI samples. Each MRI belongs to one of three classes, viz. AD (Alzheimer's disease), MCI (mild cognitive impairment), and NC (normal controls). Details of these medical databases can be found in our previous work [12]. Table 2 shows the detailed experimental conditions and the obtained results. Overall, using the proposed layer for normalization gives results almost similar to those of batch normalization, and in a few cases slightly better results at higher training time. Classification is better with a double GAP normalization layer on a bulkier dataset such as MRI_baseline, with an overall test accuracy of 93.58% against 91.89% using 100% batch normalization. Similarly, CIFAR-10 test accuracy improved from 75.11% to 75.21% by replacing the BN layers in the first and second encoders with GAP normalization layers. Meanwhile, replacing only the first BN layer with GAP is better on a smaller dataset such as the 5-animals dataset (2D CNN accuracy: 62.92% vs. 58.48%), which supports the use of the proposed method for normalization and activation.

C. FEATURE VISUALIZATION AND ANALYSIS

D. CORRELATION AND GENERALIZATION
Correlation measures the similarity between two signals. The correlation coefficient of two random variables measures the linear dependency between the input feature matrix (X) and the output (Z) over N scalar observations, expressed as the Pearson correlation coefficient r:

r = Σ_{i=1}^{N} (x_i − x̄)(z_i − z̄) / sqrt( Σ_{i=1}^{N} (x_i − x̄)² · Σ_{i=1}^{N} (z_i − z̄)² )

The correlation coefficient is also related to the covariance 'cov' between any two vector matrices. As BN is stated to reduce the internal covariate shift in its layer, we measured the covariance between the output and input of this layer using the Pearson correlation coefficient r. Here, X is the 3D feature matrix generated from the preceding convolution layer, and Z is the output from the normalization layer with the same dimensions as X. The value of r is a single value over all filters for a single image, hence correlating the average characteristics of the activation layer. (Note that for the filter-wise response we have plotted the mean response plot, as in Figures 7.1(b) to 7.4(b).) Perfect correlation implies an identical result between input and output without any variance, in which case the layer becomes useless; however, a very low correlation is also dangerous, as it brings very high variability and shift between layer input and output, which makes the layer susceptible to problems such as vanishing gradients. So it is still unclear whether a high correlation is good or not; in our case, we expected a higher correlation value than the BN result, which turned out to be true.

FIGURE 10. Comparison of feature detection heatmaps using various visualization algorithms for natural images. The techniques used to generate heat maps on the test images, in successive order, are LIME [33], Occlusion [34], and Grad-CAM [35]. Overall, AlexNet [3] has a narrow heat-map area, i.e., region of influence for classification, whereas the heat-map areas of g1g2b3b4 and g1b2b3b4 are wider and more accurate than the one using BN only, i.e., b1b2b3b4. This signifies the better feature detection achieved using GAP normalization.
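As a concrete illustration of the single per-image r value described above (a minimal NumPy sketch, not our MATLAB implementation), both the input and output feature tensors are flattened before computing the Pearson coefficient:

```python
import numpy as np

def layer_correlation(x, z):
    """Pearson correlation r between a layer's input feature tensor x
    and its normalized output z (same shape). Both tensors are
    flattened, so r is a single scalar per image, summarizing all
    filter channels at once."""
    x = np.asarray(x, dtype=np.float64).ravel()
    z = np.asarray(z, dtype=np.float64).ravel()
    xc, zc = x - x.mean(), z - z.mean()
    return float((xc @ zc) / np.sqrt((xc @ xc) * (zc @ zc)))

feat = np.random.rand(8, 8, 16)  # hypothetical H x W x C feature map
# an affine rescaling of the input correlates perfectly (r = 1)
print(layer_correlation(feat, 2.0 * feat - 0.5))
```

A normalization layer whose output keeps r close to 1 preserves the relative structure of its input, which is the behavior we observed for GAP.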
Overfitting is one of the major adversities in ML: it brings a disparity between the test performance and the training performance of a trained network. It can be detected when the error on the training set is very low, but the error on unseen similar data predicted by the network is large. It means the network has 'memorized well' but has 'not learned well'. It is largely the number of parameters that decides the fate of the network regarding overfitting. If the number of parameters in the network is much smaller than the total number of points in the training set, then there is little or no chance of overfitting, which means that increasing the number of parameters of the training network increases the chance of overfitting. To check the generalization error, we computed the range of prediction error on N test samples for a confidence score over 95% against the standard test error (STE) = 1 − accuracy. This error-margin test is also called the Wald test; it gives the minimum and maximum error within a 95% confidence interval, as shown in Table 2 and plotted in Figure 6 along with the accuracy graph. Besides, we plotted the T-SNE projection for the test images and examined the distribution of errors, as shown in Figure 9.

V. CONCLUSION
To conclude, we have experimented with our proposed idea of the GAP layer as a normalization layer in 2D and 3D CNN architectures. Our experiments show that the use of GAP layers produces results similar to or slightly better than BN in the case of the bulkier 3D dataset and the lighter 2D dataset. We studied the phenomenon of overfitting via the training and validation graphs, normalization-layer covariance via the mean response and correlation plots, and feature representation via T-SNE and histogram plots. To summarize, we have listed the findings below:
1. With the b1b2b3b4 architecture, the training accuracy shoots up quickly, indicating an overfitting condition, as the validation accuracy during mid-training was still too low, causing a large gap between training and validation accuracy. Hence, with the GAP layer, we delayed the overly fast convergence of weights during training (please see Figure 2).
2. The weights of convolutional filters in early layers seem to change abruptly from the convolution to the normalization process during BN (please see Figure 4); this might suggest that the feature property of the input image becomes highly uncorrelated with the input after normalization. Consequently, the correlation coefficient between the input and output of BN is quite low. With GAP, however, the filter weights change only slightly and the input-output correlation value is higher (please see Figures 7 and 8), indicating less distortion of the image property.
3. Due to the minibatch mean value used for scaling in BN, the output from the normalization layer is scaled with the minibatch mean, i.e., the minibatch properties of the images are mixed, causing a stronger squeezing of the feature values (please see the histogram plot in Figure 7). This might have brought the higher variability in the BN output discussed in point 2.
On the other hand, in GAP normalization the scaling mean coefficient is calculated from the activated channels of the same image, i.e., equivalent to minibatch = 1, so only properties of the same image are mixed, without spoiling its feature attributes.
4. The scaling mean and variance are calculated empirically, as in point 3, for each input image, so there is no need to pass trained mean and variance values during the testing phase, as in BN. Also, BN's error is higher for small batch sizes, due to imprecise batch statistics estimation.
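To make the per-image (minibatch = 1) scaling of points 3 and 4 concrete, the sketch below shows one plausible reading of Gaussian-filter-based un-sharpening followed by per-image normalization. It is an illustrative NumPy approximation, not the published MATLAB implementation (see C1 in the Appendix); the unsharp gain, Gaussian sigma, and the exact scaling formula are assumptions made for illustration only:

```python
import numpy as np

def _gauss_kernel(sigma):
    """1D Gaussian kernel truncated at 3 sigma (assumed truncation)."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def _blur_spatial(x, sigma):
    """Separable Gaussian blur over the two spatial axes of an H x W x C map."""
    g = _gauss_kernel(sigma)
    conv = lambda v: np.convolve(v, g, mode="same")
    x = np.apply_along_axis(conv, 0, x)
    return np.apply_along_axis(conv, 1, x)

def gap_like_normalize(x, sigma=1.0, gain=0.6, eps=1e-5):
    """Illustrative per-image normalization in the spirit of the GAP layer.
    Steps (assumed, not the exact published formulation):
      1) un-sharpen each channel: x + gain * (x - blur(x));
      2) scale by the mean/std computed over this single image only
         (equivalent to minibatch = 1), so no running statistics need
         to be carried over to the testing phase."""
    sharp = x + gain * (x - _blur_spatial(x, sigma))
    mu, sd = sharp.mean(), sharp.std()
    return (sharp - mu) / (sd + eps)

feat = np.random.rand(8, 8, 16)      # hypothetical H x W x C feature map
out = gap_like_normalize(feat)
print(out.shape)                     # same shape, zero-mean per image
```

Because mean and standard deviation are recomputed from each image, no batch statistics are estimated, which is why such a scheme is insensitive to batch size, unlike BN.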
5. The activation region for decision-making is wider and more accurate in most cases using g1g2b3b4 or g1b2b3b4 than with the BN-based b1b2b3b4 and AlexNet (please see Figure 10).
6. In most of the experiments the result is slightly better (Table 2), which can also be visualized via the T-SNE projection (see Figure 9).
Our proposed layer is not itself a replacement for batch normalization (BN), since it is beneficial only when used in the first one or two convolution layers. Thus, it works as a good alternative to batch normalization in the early layers, although not in the later ones. Moreover, another serious drawback of our idea is the filter function itself, which consumes a lot of time to perform the 3D filtering operation; because of this, our method requires 2 (1 GAP) to 5 (2 GAPs) times longer training to generate a trained model. The more layers and training samples used, the longer the delay in training. To overcome this, performing a weight-wise operation such as convolution or BN inside the layer itself, without an external filter function, might reduce the operation time. To do so, we need a more precise in-network algorithm with parameters for the Gaussian filter operation. This will be our future work. We hope our work will help researchers working in the field of DNN achieve better results in their application tasks.

APPENDIX
MATLAB implementation code is presented as C1, C2, and C3. The confusion matrices of performance (as in Table 2) are also presented. The prepared dataset can be downloaded from https://drive.google.com/drive/folders/1G1fsK2VxaHkvtJJfvpiB3rMpiqCcdkB2?usp=sharing
C1: MATLAB code implementation for the GAP layer used after a convolutional layer as an alternative to the batch normalization layer.