On the Dynamics and Feasibility of Transferred Inference for Diagnosis of Invasive Ductal Carcinoma: A Perspective

It is generally observed that increasing the number of convolutional layers in generic image classification pipelines is detrimental to model performance in terms of validation accuracy and loss. Apart from vanilla CNNs, state-of-the-art (SOTA) architectures such as ResNet50 (and its variants) show that, through the use of skip-connections, deeper architectures can attain higher performance metrics. However, most evaluation metrics converge on a log scale as depth increases, with a diminishing gradient of the metrics' curves. Given these two contrasting observations, in this paper we implement various vanilla and SOTA CNNs for the diagnosis of one of the most common forms of breast cancer, invasive ductal carcinoma (IDC), to examine and understand the feasibility of implementing SOTA CNNs with transferred weights when juxtaposed with vanilla CNNs (and LeNet-5) of varying configurations, in terms of their performance metrics and other parameters. We pursue the dual objective of studying behavioural aspects of avant-garde CNN models (specifically, VGG16, VGG19, ResNet50, ResNet50V2, MobileNetV2, and DenseNet121) and of proper diagnosis of IDC, using intermediate neural activations to critically evaluate and theorize about the performance of the different models. We find that, among all the models, only VGG16, VGG19, LeNet-5, and a vanilla CNN selected through an optimization procedure attained the best metrics, which were shared amongst them.


I. INTRODUCTION
Deep Convolutional Neural Networks (CNNs) have interesting properties pertaining to the scalability of their feature-capturing abilities. Generally, the depth of a deep CNN is decided by the number of features, and the two are directly proportional to one another. With their natural tendency to capture features at all levels, i.e., low, medium, and high [1], CNNs have been put to great use in various applications [2]-[6], including medical applications [2], [7]-[12], [128]. One dataset significant and relevant to our discussion in this paper is ImageNet [13], which is used in the annually hosted ImageNet Large Scale Visual Recognition Challenge (ILSVRC) for both object detection (correct localization of all objects present in an image) and object recognition (accurate identification of the existence of objects in an image). ImageNet is considered a standard benchmark for all SOTA models of object detection [14]-[17] and recognition. In this paper, we employ the techniques of transfer learning [18], [20], [21] for transferred inference of IDC. There are many categorizations of transfer learning as given by [20], such as instance-based, mapping-based, network-based, and adversarial-based. Our implementation of transferred weights is a network-based approach where SOTA networks are pre-trained on ImageNet over a plethora of images. We re-use these pre-trained architectures barring the last few layers (and thus fine-tune the transferred model for our application) and compare them with fully in-house-trained vanilla CNNs to see how transfer learning affects model performance in the specific case of the detection of IDC for prognosis. Fig. 1 describes how we use ImageNet-pretrained SOTA models for transfer learning of feature-extraction weights. (The associate editor coordinating the review of this manuscript and approving it for publication was Santosh Kumar.)
Generally, when deep networks converge, their accuracy, loss, and other performance metrics also saturate. However, as observed in [22]-[24], the level of this asymptotic saturation degrades as the number of layers is increased. This phenomenon is not observed in ResNet [22] owing to the utilization of skip-connections between layers. We investigate this phenomenon further in our implementations of vanilla CNNs in Section V and see which parameters most affect this degradation and through which adjustments of specific parameters it can be minimized. Many researchers perform such comparative studies on datasets such as CIFAR10 [25], MNIST [26], etc., but working on these datasets only helps us understand model performance in isolation, not how models might perform on real-world, application-based datasets. Keeping that in mind, we perform our experiments on clinical medical data to achieve the two-fold objective of understanding the dynamics and feasibility of transfer learning for several CNN models, along with the creation of reliable models for the prediction of IDC.
Breast cancer (BCa) encompasses several diseases and involves the uncontrolled division of cells in the breast tissue. Around 80% of BCa cases are identified as IDC [27], which is also referred to as infiltrating ductal carcinoma, since the terms invasive and infiltrating refer to the cancerous cells breaking out of their ducts or glands of origin to invade new spaces or new breast tissue. Less common types of IDC are medullary ductal carcinoma (MDCa), mucinous ductal carcinoma (MDCb), tubular ductal carcinoma (TDC), and papillary carcinoma (PC). MDCa comprises only 3-5% of all BCa cases and is visible through X-ray imaging or mammograms. MDCb, also called colloid carcinoma, is the condition where cancerous cells secrete mucin (mucus lines the inner surface of organs of the digestive tract, liver, lungs, etc.), which surrounds the BCa cells. The mucin associates with these cells and eventually they form a tumour. However, the prognosis of pure MDCb is better than that of other forms of IDC.
TDC comprises 2% of IDC cases and has an excellent prognosis compared to other cases of IDC. The tumours formed by TDC appear tube-like when studied under a microscope. PC accounts for 0.5% of the total IDC cases [28]; the cells in the PC condition exhibit finger-like (papillary, made of papules) projections, and PC is more prominently observed in postmenopausal women over the age of 60. The cases of MDCa, MDCb, TDC, and PC are viewed as histological classifications of the more general IDC; only a quarter of all cases of IDC are histologically categorized based on the BCa cell shape, size, and arrangement. IDC is also categorized into four major molecular subtypes: luminal A (HER2-/HR+), luminal B (HER2+/HR+), HER2-enriched (HR-/HER2+), and basal-like (HR-/HER2-). Clinical approximations for molecular subtyping or categorizing types of BCa are often not crisp, a major reason being a noticeable overlap between different molecular subtypes [27], [29].

FIGURE 1. We make use of network-based transfer learning by using a portion of a CNN fully trained on the ImageNet dataset [13]. This portion comprises only the convolutional layers, while the fully-connected dense layers are learnt in-house. Two vectors of 128 units are sequentially connected in the transferred model, with a binary output layer at the right-most side.
The use of more recent deep CNNs has been found to be superior to traditional approaches, mainly those which involve the extraction of handcrafted features from the images, over which machine learning models like random forests are applied (this is also discussed in [47]). Neural networks have been found to perform better not only in computer vision tasks but also in other applications like speech recognition, reinforcement learning, and generative modelling. When many samples are present, checking every sample for IDC is a time-consuming and difficult task for a pathologist, which is where deep CNNs give a significant advantage: they can generalize features better owing to large amounts of data, and they save time by providing near-instantaneous predictions.
We list our contributions in this paper as follows:
• We implement avant-garde CNNs, namely VGG16, VGG19, ResNet50, ResNet50V2, MobileNetV2, and DenseNet121. Along with these, we implement various traditional CNNs and LeNet-5 [26], varying many different parameters to gather results and choose the single best architecture among them.
• We critically analyse the performance of all the models and study the nature of their predictions in the context of the influence of transfer learning for inference, and additionally, the influence of tune-able parameters in traditional CNNs on their performance metrics.
• Through this process, we display the important parameterizations to use, along with the extent of feasibility of transfer learning, while creating a model for effective diagnosis of IDC through classification.

The rest of the paper is organized as follows. Section II (Related Work) describes the related work, divided into the three techniques used most for the detection of BCa. In Section III (Methodology and Materials) we describe the nature of the data and the techniques used in this paper, such as CNNs (and their architectures), transfer learning, etc. In Section IV (Evaluation Strategy) we briefly define all the metrics used for the evaluation of the performance of the models, and how we choose the best traditional CNN model for further consideration. Section V (Results) contains all the results in terms of performance metrics, intermediate neural activations of chosen traditional CNN models, etc. In Section VI (Discussion) we analyse the performance of each model on each metric and examine the effect of transferred weights for inference. Finally, in Section VII (Concluding Remarks and Future Directions) we conclude the paper's findings and lay out the basis of work that can be done in this domain in the future.

VOLUME 10, 2022

II. RELATED WORK
Machine learning and deep learning approaches have been vastly employed to solve various medical problems [30]-[37]. More specific use-cases are gene selection for the classification and diagnosis of cancer [38], [39], prediction of COVID-19 [40], [41], and detection of BCa through spider-inspired optimization [42]. Machine learning and deep learning methods are also used in non-medical fields [43]-[46]. We divide this section into three broadly employed approaches for the detection of BCa, namely WSI segmentation-based, Region of Interest (ROI)-based, and unsupervised deep learning-based approaches.

A. WSI-BASED SEGMENTATION APPROACHES
Most deep learning-based computer vision methods applied to the detection of BCa/IDC (also referred to as digital pathology) involve whole slide images (WSIs) [47]-[49]. Cruz-Roa et al. [47] segmented the WSIs into various mini-regions, similar to what we do in this paper, and compared the performance of deep learning workflows with SOTA handcrafted-feature methods, namely Gray Histogram (GH) [50], Fuzzy Color Histogram (FCH) [51], HSV Color Histogram (HSVCH) [52], RGB (red, green, blue) Histogram (RGBH) [51], Haralick features [53], graph-based features [53], MPEG7 Edge Histogram (M7Edge) [54], Local Binary Partition Histogram [55], and JPEG Coefficient Histogram [52]. They employed a very small, simplistic CNN architecture with two convolutional layers and a final fully connected dense layer. It was noticed that the CNN performed best based on balanced accuracy (BAC) and F1-score (71.8% and 84.2%), an improvement of 6% and 4% respectively over the next best handcrafted feature. The authors of [49] used patch-based processing of the WSIs for detection of metastatic BCa through SOTA deep CNNs, namely GoogLeNet [56], AlexNet [57], VGG16 [58], and FaceNet [59], and it was found that GoogLeNet and VGG16 attained the maximum patch-based performance. Post classification, a tumour-existence probability heatmap was generated, which was used for computations of slide-based classification and lesion-based detection probabilities. Two interesting aspects of the work in [49] were the enrichment of the training set through the inclusion of extra lymph node image data, so as to help the models not misclassify such regions as BCa, and, to reduce computational costs, the segmentation of the WSIs by a threshold method that involved conversion of the image channels from RGB to HSV, application of Otsu's algorithm [60], and combination of the H and S mask images to obtain the final masks.
Janowczyk and Madabhushi [48] made use of deep learning approaches for seven different digital pathology tasks; one of these tasks was the correct segmentation of IDC from WSIs of breast tissue. The WSIs were divided into many mini-patches (similar to our approach), which were resized to 32 × 32 and rotated for oversampling to address the problem of class imbalance. Using AlexNet with dropout and downsized patches, their model achieved an F1-score of 75.7% with a BAC of 84.23%, outperforming the results obtained by [47], who considered patches of size 50 × 50. However, it was realised in [48] that using dropout did not improve results on the test set.
Exploring the depth-wise separable convolution methodology in CNNs, Alghodhaifi et al. [61] compared the performance of a standard CNN against a depth-wise separable CNN for the diagnosis of IDC through 50 × 50 patches extracted from a total of 162 WSIs. Depth-wise separable CNNs work by applying convolution to each channel separately (in this case, there are only three channels: red, green, and blue) and then combining the resulting output channels through point-wise convolution. It was noticed in [61] that the standard CNN performed marginally better in terms of specificity, F1-score, and accuracy; however, the precision and sensitivity scores for both models were nearly the same. Interestingly, they found that the application of Gaussian noise had contrasting effects on the two models: the accuracy of the depth-wise separable CNN diminished by more than half (85.9% vs. 33.4%) while the standard CNN retained a similar accuracy (87.1% vs. 77.4%). Using network-based transfer learning principles, Celik et al. [62] used two pre-trained SOTA CNNs closely related to those in our implementations, namely DenseNet161 [63] and ResNet50, for the detection of IDC over patches of WSIs. They employed the one-cycle policy [64], in which a tiny learning rate is chosen initially and is incremented after every mini-batch; this increment continues until a proper learning rate is found and the loss begins to diverge. The main drawback of [62] is that they do not mention on which images or dataset the models were pre-trained. This can be crucial for the intelligibility of a model's outputs and behaviour. Moreover, we notice that the literature contains seldom any comparison-based analysis of the performance of numerous SOTA CNNs that make use of transfer learning in the field of BCa detection.
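To make the decomposition concrete, the following is a minimal NumPy sketch (our own illustration, not code from [61]) of a depth-wise separable convolution: each channel is convolved with its own kernel, and a 1 × 1 point-wise convolution then mixes the channels. For a k × k kernel, C input channels, and C_out output channels, this needs k·k·C + C·C_out weights instead of the k·k·C·C_out of a standard convolution.

```python
import numpy as np

def depthwise_separable_conv(x, depth_kernels, point_weights):
    """x: (H, W, C) image; depth_kernels: (k, k, C), one kernel per channel;
    point_weights: (C, C_out) for the 1x1 point-wise mixing step."""
    H, W, C = x.shape
    k = depth_kernels.shape[0]
    oh, ow = H - k + 1, W - k + 1
    # Depth-wise step: convolve each channel independently with its own kernel.
    depth_out = np.zeros((oh, ow, C))
    for c in range(C):
        for i in range(oh):
            for j in range(ow):
                depth_out[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * depth_kernels[:, :, c])
    # Point-wise step: 1x1 convolution combining the per-channel outputs.
    return depth_out @ point_weights
```

For a 3 × 3 kernel on RGB input with 8 output channels, this is 3·3·3 + 3·8 = 51 weights versus 3·3·3·8 = 216 for the standard convolution.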

B. REGION OF INTEREST (ROI)-BASED APPROACHES
Subclinical diagnosis of BCa on whole images of full-field digital mammography (FFDM) through deep learning techniques is a challenging task, since the region of interest (ROI, where the BCa can be detected) is very small in comparison to the dimensions of the original FFDM image. To curb this issue, Shen et al. [65] pretrained a fully convolutional classifier on local patch-based WSIs embedded with annotations to incorporate ROI information. This pretrained classifier's weights were leveraged to initialize training of the same classifier on whole FFDM images, improving the detection of BCa without the need for ROI annotations. They employed two SOTA CNN classifier designs that are also used in our paper, namely VGG16 [58] and ResNet [22]. Dundar et al. [66] distinguished usual ductal hyperplasia (UDH) from atypical ductal hyperplasia (ADH) and ductal carcinoma in situ (DCIS) over WSIs (with manually identified ROIs) through multiple-instance learning, making use of the large-margin principle [67], [68].
Tackling the issues of automatic localization of ROIs for BCa from WSIs and classification of five different diagnostic varieties of ductal proliferation, Gecer et al. [69] used Fully Convolutional Networks (FCNs) [70] for semantic segmentation of the WSIs to obtain ROIs at four different levels of magnification. They showed that many redundant features are eliminated as features are extracted from lower to higher magnifications. A deeper FCN was used for the classification of WSIs into five different diagnostic ductal proliferations, namely non-proliferative changes, proliferative changes, IDC, ADH, and DCIS. The rationale behind using a deeper CNN for this task was to extract more features per WSI, owing to the visually similar proliferations. The performance of their model on the five-class classification task was not satisfactory (achieving an accuracy of 39.04%); so, as their last contribution, they showed that fusing the ROI and classifier outputs for WSI-level diagnosis helped improve accuracy. In a more traditional manner of extracting features from digital mammography imaging, Yengec Tasdemir et al. [71] detected abnormal areas in a mammography through features extracted by Histogram of Oriented Gradients (HOG) [72] and Haralick features [73] to detect ROIs for the presence of BCa. The mammography was segmented into smaller ROIs of size 73 × 68 and then transformed with a two-dimensional Discrete Wavelet Transform (2D-DWT) for multi-resolution decomposition of the ROIs [74]. On this 2D-DWT, Haralick and HOG features were extracted, followed by a feature-selection stage before classification by random forest, support vector machine (SVM), and AdaBoost.

C. UNSUPERVISED DEEP LEARNING-BASED APPROACHES
More recently, researchers have looked into unsupervised deep learning methods for the detection of BCa and of components of histopathology tissue [75]-[79]. The authors of [75] made use of FusionNet [80], a form of Convolutional Autoencoder (CAE) that uses very long skip-connections between the encoder and decoder subnets to generate images, similar to what generative models in machine learning do [81]. As done predominantly elsewhere, they used patches of WSIs for the detection of IDC by training only the encoder network of the FusionNet and running a softmax classifier to obtain binary outputs. Autoencoders are used for pre-training deep learning models but are also very useful for mapping high-dimensional data into a latent space, thus acting as powerful feature extractors. This feature-extraction property is exploited by CAEs for image-retrieval tasks. Considering tabular data for BCa risk prediction, Belciug et al. [82] compared the performance of supervised and unsupervised deep learning approaches, namely Multilayer Perceptron (MLP), Radial Basis Function (RBF) networks, and Probabilistic Neural Networks (PNN) as supervised networks, and Kohonen's self-organizing map (SOM) [83] as the unsupervised network. The SOM performed as well as its supervised counterparts and outperformed PNN by a 5% difference in testing accuracy. It was noticed that the p-value between the average proportions of correct classifications (through the z-test) was higher than the significance level (p > 0.05) for RBF vs. SOM, indicating no statistically significant difference in their positive classifications. The p-value was lower than the significance level (p < 0.05) for SOM vs. MLP and SOM vs. PNN, meaning that a statistically significant difference did exist in their positive classifications. This suggests that unsupervised deep learning methods can perform similarly to their supervised neural network counterparts. Self-supervised approaches have also been employed, as done by Xu et al. [79] through the use of stacked sparse autoencoders (SSAE) for automatic detection of nuclei in breast histopathology. The SSAE framework outperformed other techniques such as Expectation Maximization (EM) [84], Blue Ratio (BR) thresholding [85], and Colour Deconvolution (CD) [86] in both qualitative and quantitative terms.

III. METHODOLOGY AND MATERIALS
In this section, we describe the data used for our experiments and the preprocessing techniques applied on them to bring them into suitable form. Further, we briefly explain the architecture of the models used in our experiments and finally we present a formal explanation of network-based transfer learning employed in our approach.

A. DATASET
We make use of 162 WSIs collected by [47] and [48], scanned at 40x magnification. For our experiments, as mentioned earlier, instead of taking the WSIs whole, we use a sliding-window technique and extract 277,524 patches of dimensions 50 × 50, each characterized by a binary attribute indicating the presence of IDC. The binary class distribution is given in Table 1. We use a 9:1 train-test split ratio. After an initial screening of these patches, we noticed that the presence of IDC was associated with darker shades of pink, i.e., tending towards purple. To understand this better, we plot a flattened colour histogram over the three channels for normal and IDC patches, as shown in Fig. 2. The x-axis contains the bin index (we take 256 bins) and the y-axis depicts the number of pixels. Since each component (R, G, and B, for red, green, and blue) has intensities varying in [0, 255], we suitably take 256 bins to account for each intensity count. It is noticed from Fig. 2 that the channel intensity distributions of IDC patches differ visibly from those of normal patches, consistent with their darker, purple-tinted appearance.
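The flattened colour histogram of Fig. 2 can be reproduced with a few lines of NumPy; this is a sketch under the assumption that patches are 50 × 50 × 3 uint8 RGB arrays (the function name is ours):

```python
import numpy as np

def flattened_colour_histogram(patch, bins=256):
    """Return a (3, bins) array of per-channel pixel counts for an RGB patch,
    one 256-bin histogram per colour component, as plotted in Fig. 2."""
    return np.stack([
        np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ])
```

Overlaying the three per-channel curves for a normal patch and an IDC patch makes the shift towards darker, purple-tinted intensities visible.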

B. IMAGE AUGMENTATION
Deep learning models require an abundance of data to train properly, and image datasets of such scales are usually too space-intensive to maintain or transport across applications. Image augmentation is a technique applied to the base dataset to diversify the input images in terms of count and quality [87], [88]. This is achieved in various ways, such as whitening transforms [89], rotations, shifts, shearing, zooming, rescaling, etc. The augmentation parameters used in our implementations are given in Table 2. Rescaling multiplies the pixel values of the images by the given argument after all other transformations are applied. The shear range represents the shear intensity, i.e., the shear angle in the counter-clockwise direction in degrees. The zoom range is the upper limit of a range from which random values are sampled to zoom the image. Horizontal flip randomly flips the images horizontally. Only rescaling is applied to the testing set.
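As a sketch of how these parameters act on a single patch (following the semantics described above; the helper below is our own, simplified illustration — shear is omitted, and zoom is approximated by nearest-neighbour resampling):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch, rescale=1 / 255.0, zoom_range=0.2, horizontal_flip=True):
    """Apply a subset of the Table 2 augmentations to one patch.
    Rescaling is applied last, after the other transformations."""
    x = patch.astype(np.float32)
    # Random horizontal flip with probability 0.5.
    if horizontal_flip and rng.random() < 0.5:
        x = x[:, ::-1, :]
    # Random zoom factor sampled within +/- zoom_range of 1.
    z = 1.0 + rng.uniform(-zoom_range, zoom_range)
    h, w = x.shape[:2]
    # Nearest-neighbour zoom: sample a grid scaled by 1/z around the centre.
    rows = np.clip(((np.arange(h) - h / 2) / z + h / 2).astype(int), 0, h - 1)
    cols = np.clip(((np.arange(w) - w / 2) / z + w / 2).astype(int), 0, w - 1)
    x = x[np.ix_(rows, cols)]
    # Rescale pixel intensities into [0, 1].
    return x * rescale
```

The output patch keeps the 50 × 50 × 3 shape, so it can be fed straight into any of the networks described below.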

C. CNN AND ACTIVATIONS
In this subsection we give a brief overview of the working of a CNN and how we use different CNNs in the implementation.
CNNs, introduced by [26], have proven to be the backbone of modern deep computer vision technologies such as face detection [90], [91], action recognition [92], [93], scene labelling [94], etc. The convolution operation, most popularly used in signal processing, between two functions p(t) and q(t) can be defined as

(p ∗ q)(t) = ∫ p(τ) q(t − τ) dτ.

This operation is performed on pixel values by the various convolutional layers to extract features through the use of 2D matrices known as kernels or filters. This convolution step preserves the spatial relationships and representations in the image. The size of the feature maps is reduced by approximately 75% in the next step of max-pooling (for a 2 × 2 pooling window with stride 2), which retains only the maximum of the convolved values within a fixed sliding window. After a series of combinations of convolutional and max-pooling layers, a flattened vector is finally obtained, which is fed into an Artificial Neural Network (ANN) acting as a feed-forward network that learns to output the correct classes. This step is usually called the full connection (FC). Fig. 3 pictorially depicts the methodology used for detecting IDC in patches of a WSI. As seen from Fig. 3, we extract the intermediate activations of different convolutional and max-pooling layers to better understand the features detected by subsequent layers and to interpret how inputs are transformed. Due to the existence of three channels, we visualize these activations channel-wise, independently, and plot the inputs decomposed into the different learned filters of the layers. Further, class activation maps can be generated through Global Average Pooling (GAP) [95], which takes the spatial average of the feature maps of all units of the final convolutional layer, whose weighted sum gives the final activation maps (see Appendix D). Applying class activations is not feasible in our setup, since we automatically classify 50 × 50 regions of a WSI, which would amount to a very low-resolution class activation map.
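The two core operations can be written out explicitly; the following minimal NumPy sketch (our own, for illustration) implements a valid single-channel convolution and a 2 × 2 max-pooling that shrinks the feature-map area by roughly 75%:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D convolution (cross-correlation, as implemented in CNN
    libraries) of a single-channel image x with one kernel."""
    k = kernel.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * kernel)
    return out

def max_pool(x, size=2):
    """size x size max-pooling with stride = size: keeps the maximum in each
    window, reducing the feature-map area by ~75% when size = 2."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))
```

The intermediate arrays returned by `conv2d` and `max_pool` are exactly the per-filter activations we visualize channel-wise in Section V.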
Further, we discuss the transfer learning methodology used in our implementations.

D. WEIGHT TRANSFER
Deep learning frameworks require a lot of data to train effectively; hence, fetching sufficient data can be a tedious prospect. This problem is largely solved in the literature and in real-world applications through the use of readily available weights to initialize or kick-start the training of a CNN. The features learned by the successive layers of a CNN for one task may generalize to a different task. We follow the nomenclature of [96] and [97].
Let there be a domain D given by D = {𝒳, P(X)}, where 𝒳 is the feature space and P(X) is the marginal probability distribution over a particular learning sample X = {x_1, x_2, ..., x_n} ∈ 𝒳. A task T is given by T = {Y, f(x)}, where Y is the label space containing the targets and f(x) is the target predictive function, which may be written as a conditional probability f(x) = P(y|x). Over the course of training, the parameters of f(x) are adjusted to minimize the distance between the outputs of the predictive function and the true labels. The predictive function f(x) is learned from tuples (x_i, y_i), where x_i ∈ X and y_i ∈ Y. Finally, before defining transfer learning, we introduce two subscripts, a (for the application, or target, task) and s (for the source task).
Transfer learning may then be defined as follows: given a learning task T_a with domain D_a, we can use a source domain D_s with a well-defined task T_s. Through latent knowledge transfer from D_s and T_s, an attempt is made at improving the predictive function f_a(·) (which is a component of the learning task T_a), where D_a ≠ D_s or T_a ≠ T_s. If we denote the sizes of domains D_a and D_s by n_a and n_s respectively, then usually n_s ≫ n_a. For every SOTA CNN used in our implementation (except LeNet-5), we use ImageNet as the source domain, applying network-based transfer learning to re-use the weights of pre-trained models for all successive convolutional layers. We do this by freezing the parameters of the convolutional part of the networks and learning only the parameters of the fully connected (FC) layers. This process is shown in Fig. 1.

IV. EVALUATION STRATEGY
In this section we describe all the terminology associated with our evaluation strategy for the models. For a binary classification task, we have cases of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A TP is a correctly classified positive, i.e., in our case, a correctly classified case of IDC. Similarly, a TN is a correctly classified negative, an FP a falsely classified positive, and an FN a falsely classified negative. Based on these four terms, we define precision, sensitivity (or recall), specificity, F1-score, and balanced accuracy; these metrics are widely used in the literature for classification tasks. Precision P is the ratio of TP to all labels predicted as positive, given by (1):

P = TP / (TP + FP). (1)

P answers to what extent the model correctly classifies positive cases. Sensitivity S_n (or recall) is the ratio of TP to the number of actual positives, given by (2):

S_n = TP / (TP + FN). (2)

S_n measures how many correct predictions of positive cases were made out of the total positive cases. Specificity S_p can be seen as the opposite of S_n, because it measures the correctly labelled negatives (TN) out of the total population of actual negatives:

S_p = TN / (TN + FP). (3)

The F1-score F combines P and S_n as their harmonic mean, given by (4):

F = 2 · P · S_n / (P + S_n). (4)

In this paper, we use two different types of accuracy metrics: regular accuracy (RAC) and balanced accuracy (BAC). RAC is used when we describe the test-set validation accuracy of different models. However, once a confusion matrix of classifications is generated for each model, we calculate BAC, which better represents model performance.
BAC is required when there is high class imbalance and, for binary classification tasks, can be expressed as

BAC = (S_n + S_p) / 2, (5)

while RAC can be expressed by (6) as

RAC = (TP + TN) / (TP + TN + FP + FN). (6)

Finally, we use the Matthews Correlation Coefficient (MCC) [98] for in-depth analysis of each model. MCC (also known as the phi coefficient) lies in the range [−1, 1], where −1 and 1 respectively mean total disagreement between observation and prediction, and perfect prediction. A value of 0 indicates that the model is as efficient as a random classifier. Most importantly, it is a balanced metric, meaning that class imbalance does not perturb the ease of its interpretation. Mathematically,

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). (7)

A binary cross-entropy (BCE) loss is calculated during the training of all the models. This BCE loss is taken into consideration when we calculate an optimization function (described later in this section) and also by the neural network itself for the adjustment of weights and biases. BCE is expressed mathematically as

H(v) = −(1/n) Σ_{i=1}^{n} [y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i))]. (8)

In (8), the distribution of data labels is given by y, making p(y_i) the model's prediction on data label i. The true data distribution is represented by v, with n being the total number of samples. Given (8), we can now construct an optimization function used to select the best trained-from-scratch traditional CNN. In our experiments, we train fifteen CNNs, varying parameters such as the number of layers, neurons, regularizations, etc., which we describe further in Section V. As mentioned earlier, to determine how feasible transfer learning is in our application, we must compare it against some baselines, and hence we use vanilla CNNs for this comparison. Selecting a 'best' CNN can be tricky due to three metrics that all play a pivotal role in describing performance, namely validation accuracy (or RAC), validation BCE loss, and training time.
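For reference, the confusion-matrix metrics of this section can be computed directly; the helper below is our own sketch (the formulas are the standard definitions the text describes):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Metrics of Section IV from raw confusion-matrix counts."""
    p   = tp / (tp + fp)                      # precision
    sn  = tp / (tp + fn)                      # sensitivity / recall
    sp  = tn / (tn + fp)                      # specificity
    f1  = 2 * p * sn / (p + sn)               # F1-score (harmonic mean of P, Sn)
    bac = (sn + sp) / 2                       # balanced accuracy
    rac = (tp + tn) / (tp + tn + fp + fn)     # regular accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Matthews corr. coeff.
    return {"P": p, "Sn": sn, "Sp": sp, "F1": f1,
            "BAC": bac, "RAC": rac, "MCC": mcc}
```

On a heavily imbalanced test set, BAC and MCC diverge from RAC, which is precisely why we report all three.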
Here, validation refers to the calculation of metrics on the validation or test set (we use the terms validation set and test set interchangeably in this paper, although strictly their meanings are not exactly the same). Ideally, it is desirable to maximize RAC, minimize BCE loss, and minimize training time, as we do in (9). Given a classifier model M_{θ_i;ϕ_i} with parameters θ_i and implementation information ϕ_i, we denote a set C = {M_{θ_1;ϕ_1}, M_{θ_2;ϕ_2}, ..., M_{θ_i;ϕ_i}, ..., M_{θ_15;ϕ_15}} that contains all the traditional CNN models used for experimentation. The implementation information ϕ_i can be thought of as an m-tuple, where m is the number of hyper-parameters (and other architectural information) that we vary over all our experiments. The cardinality and elements of this m-tuple are clearly shown in Section V. Now, if we denote max(x) and min(x) by ψ(x) and ω(x) respectively, the optimization function O(M_{θ_i;ϕ_i}) is given by (9). In (9), α(·) denotes the validation RAC, τ(·) denotes the training time, and H_{M_{θ_i;ϕ_i}}(v) denotes the BCE loss for a given model M_{θ_i;ϕ_i}. The objective is to maximize O(M_{θ_i;ϕ_i}), i.e., argmax over M_{θ_i;ϕ_i} of O(M_{θ_i;ϕ_i}). This procedure yields a single model M_{θ_i;ϕ_i} that we regard as the 'best' vanilla CNN to compare with the SOTA implementations. Maximizing O(M_{θ_i;ϕ_i}) transforms (9) as given in (10). It is important to note that we had to normalize the values of the function τ(M_{θ_i;ϕ_i}) because of the huge difference in the scale of values it yields compared to α(M_{θ_i;ϕ_i}) and H_{M_{θ_i;ϕ_i}}(v), the latter two being restricted to the range [0, 1]. Typically, τ(M_{θ_i;ϕ_i}) yields values in units of seconds (s), which, due to hardware-related limitations, can never lie in [0, 1].
Thus, we apply a normalized $\tau(M_{\theta_i;\phi_i})$ in our final optimization function, this function being denoted by (11). The normalization function N(x) is defined by (12). Using (8) and (12) in (11), we get (13). We remark that the range of $O(M_{\theta_i;\phi_i})$ is [0, ∞).
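The exact closed forms of (9)–(13) appear in the original typeset equations. Purely as an illustrative sketch, the selection procedure can be mimicked with a min-max normalization of the training times and a hypothetical composite score that rewards RAC while penalizing loss and normalized time; the score function and all candidate values below are invented for demonstration and are not the paper's (13):

```python
def minmax(values):
    """Min-max normalization, one common choice for a normalizer like N(x)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical candidates: name -> (validation RAC, validation BCE loss, training time in s)
candidates = {
    "CNN_1": (0.80, 0.45, 300.0),
    "CNN_2": (0.83, 0.40, 250.0),
    "CNN_3": (0.78, 0.50, 100.0),
}

names = list(candidates)
times_n = minmax([candidates[m][2] for m in names])

def score(rac, loss, t_norm):
    # hypothetical stand-in for O: reward RAC, penalize loss and normalized time
    return rac / (loss + t_norm + 1e-9)

scores = {m: score(candidates[m][0], candidates[m][1], t)
          for m, t in zip(names, times_n)}
c_best = max(scores, key=scores.get)  # argmax over the candidate set C
```

Under this toy score, the fastest model wins despite its slightly lower RAC, which shows why the relative weighting of the three quantities inside the real optimization function matters.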

V. RESULTS AND DISCUSSION
In this section, we look at the implementation details of all fifteen vanilla CNNs, the SOTA models, and LeNet-5, along with the results they achieve.¹ As discussed in Section IV, we also calculate important metrics such as precision P, sensitivity S_n, specificity S_p, F1-score F, RAC and BAC. Moreover, through selection of the best vanilla CNN (further denoted as C_best) using the optimization function $O(M_{\theta_i;\phi_i})$, we compare the performance of transferred inference of the SOTA models against C_best and LeNet-5, and also inspect the intermediate neural activations of the latter two models. Further, we remark that only select SOTA models (out of the pool of all models shown in Table 3) were implementable, due to certain models having a fixed minimum input dimension size. Since our patches were of dimensions 50 × 50, not all SOTA models (as we had originally planned) could be implemented.
The fifteen traditional CNN models $M_{\theta_i;\phi_i}$ are characterized by implementation information $\phi_i$, which is given by an m-tuple described in Table 4. The number of neurons in each AL is taken to be 128. Tuples such as (128, 64, 32) in FD represent the number of filters used in each successive CL; this description follows for stride S as well. However, tuples in KS represent the square roots of the sizes of the kernels used in each successive CL. For example, a KS of (9, 3) represents a kernel of size 9 × 9 in the first CL and 3 × 3 in the second. This description follows for pooling layer sizes PS as well. LVBCEL and MVA are the minimum loss and maximum accuracy encountered at any epoch. Let there be vectors A = {A_1, A_2, A_3, ..., A_15} and L = {L_1, L_2, L_3, ..., L_15} which store the RAC (accuracy) A_j and BCE loss L_j for each epoch j ∈ [1, 15]. Then, LVBCEL is defined as ω(L) and MVA as ψ(A). We noticed that 15 epochs were enough for proper convergence of these vanilla CNNs. Lastly, TT has the SI unit of seconds (s) and is represented by $\tau(M_{\theta_i;\phi_i})$ in (13), shown at the bottom of the next page. We present the results of the vanilla CNNs in Table 5, in which we find C_best through the maximum value of $O(M_{\theta_i;\phi_i})$. Descriptions of L1, L2, BN and DO (the regularizations) are given in Appendix A, Appendix B, and Appendix C. In Table 5, a tick mark represents the use of the corresponding regularization, and a cross represents that the regularization was not used. These regularizations were placed randomly (with regard to their position in the network) for all the models.

¹The experiments were performed on a 64-bit workstation with 4 GB RAM and an Intel i5-4460 @ 3.20 GHz processor, running Windows 10 Home. Python 3 was used as the programming language for experimentation.
According to Table 5 and Fig. 4, CNN 11 can be regarded as C_best since it attains the highest value of the optimization function O; hence C_best ← $M_{\theta_{11};\phi_{11}}$. The architecture of C_best is visually represented in Fig. 4. Additionally, for our experiments, we remark that a batch size of 32 images was used and the activation function for each FC layer was taken to be the rectified linear unit (ReLU) [104], except for the last layer, which had sigmoid activation for the vanilla CNN experiments and softmax for the SOTA models. To give a better idea of the number of parameters each architecture learns relative to the other models, we specify the total number of parameters along with the counts of trainable and non-trainable parameters for each model. The number of parameters is calculated at each CL and summed. If a CL has n filters of size p × q with bias b and c input channels, then the number of trainable parameters at this CL can be calculated as (n × p × q × c) + b. In the case of FC layers, the adjustable weight matrices along with the biases are taken as parameters. We remark that for all models except LeNet-5 and C_best, through network-based transfer learning, only the FC layers' parameters are learned, with all the CL parameters frozen. Through this dual nature of experimentation it becomes possible to understand the feasibility of transfer learning in our application, as a comparison can be made between the models with transferred weights and those without. Table 5 describes the composition of the FC layers of all the models.
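The per-layer parameter counting described above can be checked with a few lines of code; the helper names are ours, and the counting assumes the usual convention of one bias per filter (so b = n) and one bias per output neuron in FC layers:

```python
def conv_params(n_filters, p, q, channels):
    """Trainable parameters of a conv layer: (n x p x q x c) + b,
    with one bias per filter (b = n_filters)."""
    return n_filters * p * q * channels + n_filters

def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix entries plus one bias per output."""
    return n_in * n_out + n_out
```

For instance, LeNet-5's first convolutional layer (6 filters of 5 × 5 over a single channel) gives 6 × 5 × 5 × 1 + 6 = 156 trainable parameters.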
We freeze certain layers (which contain a certain number of parameters); these layers are pre-trained on ImageNet data. In Table 7, the models which are pre-trained have a nonzero number of non-trainable parameters. We calculate a ratio to quantify the proportion of parameters that we freeze. Metrics of training accuracy (denoted T_R A), testing accuracy (T_E A), training loss (T_R L) and testing loss (T_E L) are calculated in Table 8 over 15 epochs for each model. From Table 8, we notice that transferred inference can have a diminishing effect on T_E A, since the pre-trained SOTA models VGG16, VGG19, ResNet50, MobileNetV2, DenseNet121 and ResNet50V2 attain maximum T_E A of 78.9%, 74.6%, 71.6%, 77.9%, 77.3% and 74.7% respectively over 15 epochs. On the other hand, LeNet-5 and C_best attain T_E A maxima as high as 81.1% and 83.7% respectively. For minimum T_E L we have 0.433 and 0.367 for LeNet-5 and C_best, the lowest among all the models. This is the first evidence that training only the FC component of the SOTA models while keeping ImageNet weights for all convolutions is not a competitive approach when compared to training smaller CNNs from scratch. It may be possible to obtain better performance from the SOTA models by freezing fewer parameters and letting them be learned. However, the biggest drawback of doing this is the computationally intensive nature of such training-from-scratch procedures for all the SOTA models, making possession of advanced hardware a necessity. The high number of parameters to be learned, as seen in Table 7, prevented us from testing the efficacy of all the SOTA models with an F/T ratio of 0. Another striking difference noticed in Table 8 is the general trend of T_E A and T_E L for all models having F/T ratio > 0 versus the improving trend of LeNet-5 and C_best on these metrics.
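The frozen-to-trainable (F/T) bookkeeping described above can be sketched as follows. In Keras-style frameworks, freezing would amount to setting `trainable = False` on the convolutional layers; the helper below only illustrates the ratio computation, and the layer tuples in the example are hypothetical:

```python
def ft_ratio(layers):
    """F/T ratio: frozen (non-trainable) over trainable parameter counts.

    `layers` is a list of (param_count, frozen) tuples; the frozen flag
    stands in for a layer being marked non-trainable in a real framework.
    An F/T ratio of 0 means nothing is frozen (full training from scratch).
    """
    frozen = sum(p for p, f in layers if f)
    trainable = sum(p for p, f in layers if not f)
    return frozen / trainable

# hypothetical pre-trained model: frozen conv base, trainable FC head
example = [(14_714_688, True), (262_272, False)]
```

A large F/T ratio, as in the example, marks the network-based transfer-learning setting where only the small FC head is adjusted.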
There is no improvement in T_E A and T_E L for the pre-trained models, indicating that adjustment of the weights and biases of the FC components hardly makes any difference for the same features detected by the lower convolutional operations. Due to the frozen weights and biases of all convolutions, there is no improvement or change in the higher-level features detected by the final layers. Remarkably, had LeNet-5 and C_best been trained for more epochs, their maximum T_E A and minimum T_E L may have been even better. To visualize the higher-level features detected by LeNet-5 and C_best, we plot their intermediate neural activations for all convolutional layers in Fig. 6 and Fig. 7 respectively.
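A minimal sketch of how such intermediate activations can be collected, using a naive single-channel convolution followed by ReLU; the kernels and input below are toy values, not the trained filters of LeNet-5 or C_best:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def conv2d(img, kernel):
    """Naive 'valid'-mode 2-D cross-correlation of one channel with one kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def intermediate_activations(img, kernels):
    """Apply each kernel in sequence, recording the ReLU activation per layer."""
    acts, x = [], img
    for k in kernels:
        x = relu(conv2d(x, k))
        acts.append(x)  # one activation map per convolutional 'layer'
    return acts
```

Plotting each element of the returned list (e.g. with `matplotlib.pyplot.imshow`) reproduces the kind of layer-by-layer visualization shown in Fig. 6 and Fig. 7, with deeper maps being smaller and more abstract.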
It is evident from Fig. 6 and Fig. 7 that as we go deeper into the convolutions, the selected features become more abstract in nature. This aspect is more pronounced when the activations of the max pooling layers are included as well. Further, we review the confusion matrices obtained by all the models under observation and find the metrics given by (1), (2), (3), (4), (5), and (7) as defined in Section IV. In Table 9, we denote each model's respective confusion matrix (CM) using the following convention: TN ← (0, 0), FN ← (0, 1), FP ← (1, 0) and TP ← (1, 1), where (x, y) are the abscissa and ordinate of the CM. Fig. 8 plots all the metrics given in Table 9 for each model for better visual interpretation. It is noticed that VGG16 and VGG19 perform relatively better than all other models when evaluated using said metrics. For better understanding, however, we discuss the performance of each model on each metric in Section VI.

VI. DISCUSSION
Detection of IDC, and hence BCa, is a problem of profound clinical importance for facilitating the development of AI-driven techniques in modern-day medical practice. Faster and more accurate diagnoses may be possible with augmented AI systems supervised by experts or clinicians, making their job easier and less intensive. Detection of IDC is an active area of research with numerous developments on different fronts for the diagnosis of BCa, as we saw in Section II. Of the techniques that employ CNNs, many have used transferred inference on models such as VGG16, VGG19, ResNet50, etc., mainly based on ImageNet weights [13], [106]-[111]. The bulk of the results of our work are given by Table 5, Table 8 and Table 9. However, Table 8 and Table 9 do not clearly single out any model as superior to the others. This is because, as discussed in Section V, Table 8 portrays the trends of T_E A and T_E L as having a steeper gradient of improvement for the models without transferred weights (LeNet-5 and C_best), along with better T_E A and T_E L. However, in Table 9 (and thus in Fig. 8) we notice that pre-trained models such as VGG16 and VGG19 perform comparably well, if not better, than LeNet-5 and C_best in terms of P, S_p, F and BAC. This observation was made in [109] as well, where pre-trained networks trained on non-medical images surprisingly performed comparably to those pre-trained on a medical image domain. We discuss this effect in detail in Section VI (A), and discuss a few other aspects in the following sub-sections.

FIGURE 4. Performance of C_best as opposed to other vanilla CNNs based on the optimization function $O(M_{\theta_i;\phi_i})$ and Table 5.

A. ABSENCE OF NEGATIVE TRANSFER
The fact that there is no clear superior model when pre-trained models are put against those trained from scratch means that negative transfer [97] does not play a major role in the application of transferred inference for the detection of IDC when using the datasets collected by [47], [48].
To define negative transfer formally, let us consider the nomenclature used in Section III (D). Let there be predictive learners f_1(·) and f_2(·), the former trained on D_a and the latter on (D_s + D_a). Then, negative transfer is the condition where f_1(·) performs better than f_2(·). The comparable performance of both training schemes is surprising in this case because the source domain D_s is ImageNet, which consists of images very different from those of breast histopathology. One reason that negative transfer does not impact model performance here could be the intra-class variability in IDC datasets, as also discussed in [109]. Intra-class variability means a high variance between different 50 × 50 patches of the same class. To demonstrate this variance, we show three different patches belonging to the same class in Fig. 9. Transfer learning provides a case for many different variations in the image to be detected. However, it can be detrimental when the task domain D_a has very specific features, which is not the case in IDC detection. There are many intra-class and fewer inter-class variations, making the effects of using transferred inference largely neutral.

B. METRIC-BASED ANALYSIS
From Table 8 we noted that LeNet-5 and C_best performed best in terms of RAC / T_E A, with maxima of 81.1% and 83.7% respectively. However, especially in medical domains, testing accuracy should almost never be considered alone: models can vary in their behaviour when predicting positive or negative cases, as we shall see, and at times false negatives may be more important to reduce.
In terms of precision P, it is noticeable from Fig. 8 that VGG16 and VGG19 performed better than the rest. This means that these two pre-trained models are better at correctly predicting the positive IDC cases, i.e., they have fewer FP. However, the same does not apply to sensitivity S_n, where the two trained-from-scratch models, LeNet-5 and C_best, perform better than all the SOTA models, meaning that they are better at finding the positive cases out of all the positive cases in the test split of the dataset. In other words, LeNet-5 and C_best have the fewest FN. Again, this does not hold for specificity S_p, which is the same as S_n but for negative cases. VGG16 and VGG19 having top values for S_p means that they are better at finding the negative cases out of all the negative cases in the test split of the dataset. Presenting a harmonic mean between P and S_n, we have the F1-score F, which is attained best by VGG16, LeNet-5 and C_best. In terms of balanced accuracy BAC, five models, VGG16, VGG19, DenseNet121, LeNet-5, and C_best, perform equally well. Finally, when we look at the Matthews' Correlation Coefficient MCC, it is noticed that LeNet-5 and C_best outperform the SOTA models. All these results are summarized in Table 10.
From Table 10 it is evident that there is no single superior model; however, it is clear that ResNet50, MobileNetV2, and ResNet50V2 do not perform as well as the other models, since they do not appear in the list of top-performing models.

C. THE CONUNDRUM OF IDC DETECTION ACCURACY
In machine learning applications, the validation or test set accuracy plays a major role in determining model performance. However, such is not the case with clinical applications: along with the accuracy, other metrics such as those discussed in Section IV also play a major role. Some suggest that MCC is the most informative single score for a binary classifier from which a 2 × 2 confusion matrix can be obtained [112]. Nonetheless, there are works in the literature that compare attained accuracy to that achieved in the past [62], [106], [107]. The problem with this is that in the case of detection of BCa, accuracy and other metrics are highly dependent on the dataset. Many characteristics may be attributed to a dataset, such as its size, class balance ratio, intra-class variance, inter-class variance, sample dimensions, etc. All these factors make comparing attained metrics to those reported in the past by other groups of researchers a futile strategy. To the best of our knowledge, there are no other works in the literature that compare all the models used in this implementation on the same dataset as comprehensively as we have, taking into consideration all the different metrics that we use for evaluation (as discussed in Section IV). Hence, we do not provide a comparison-based analysis of our work against other works done in the past. Enforcing this ideology, we notice that the detection accuracy attained by [47] and the F1-score attained by [48] (the two sources of the dataset that we use) are in agreement with what we achieved in this paper. Hence, there may be researchers achieving test accuracy as high as 95%, to which we argue that the dataset in consideration plays a major role.
The aspects discussed in sub-sections VI (A), VI (B), and VI (C) help us understand the dynamics and feasibility of transferred inference for the diagnosis of IDC. Transfer learning has either a positive effect or no effect on the detection of IDC, as discussed in Section VI (A). In this paper, we notice that on certain metrics, pre-trained models perform better than trained-from-scratch alternatives, and vice-versa (Table 10). It may be possible for SOTA models to outperform trained-from-scratch models when they are pre-trained on domains closely resembling the data distribution being used for comparison, unlike here, as we use ImageNet pre-trained weights for all SOTA models. However, one obvious advantage of making use of transfer learning in this application is that a lot of time and computation can be saved when training models having a large number of parameters. Our work differs from that in the literature because we analyze the applicability of transfer learning from a generic source domain in the detection of IDC. To the best of our knowledge, there has been no direct comparison of the performance of vanilla CNNs and larger models pre-trained on ImageNet for the detection of IDC. Through this work, we hope to inform readers facing the choice between vanilla, trained-from-scratch CNNs and a large ImageNet pre-trained CNN model.

VII. CONCLUDING REMARKS AND FUTURE DIRECTIONS
In this paper, we explore the dynamics and feasibility of transferred inference for the detection of invasive ductal carcinoma (IDC). We use the pre-trained models VGG16, VGG19, ResNet50, MobileNetV2, DenseNet121 and ResNet50V2, along with LeNet-5 and a custom CNN architecture C_best chosen by comparing various traditional small-scale CNNs through maximization of an optimization function. For all models except LeNet-5 and C_best, transferred ImageNet weights were used, and we tested the efficacy of both the non-pre-trained and pre-trained schemes on various metrics such as precision, sensitivity, specificity, F1-score, balanced accuracy and Matthews' correlation coefficient. We noticed that although LeNet-5 and C_best performed slightly better in terms of testing accuracy, transferred inference did not have a pronounced impact when all the other metrics were taken into account as a whole. The best results were shared largely between VGG16, VGG19, LeNet-5 and C_best (Table 10). Due to the significant difference between the source domain of the transferred weights (ImageNet) and the data distribution of the IDC dataset, the pre-trained models may not have been tested to their full potential. It may be possible to do so by using models pre-trained on a similar source distribution. To put these results into perspective, pre-training large CNN models on a generic source domain such as ImageNet does not provide a significant increase in the aforementioned performance metrics when compared to smaller, trained-from-scratch vanilla CNNs comprising only a few layers. Given the higher complexity and time involved in training larger models, it would almost always be better to use a vanilla CNN rather than a large model pre-trained on a generic source domain. Training models from scratch, as time- and computationally intensive as it may be, promises to be a worthy alternative when proper source domains for the transfer of weights are not available.
Admittedly, it is a challenging feat to achieve clinician-level accuracy with deep learning methods in the detection of IDC due to the high intra-class variance in the datasets. Future directions for the detection of IDC may involve detection of breast cancer (BCa) in whole slide images (WSI) using models trained to classify only patches of the WSI, as done in [65]. Models such as Fast R-CNN [14], Faster R-CNN [17], You Only Look Once (YOLO) [16], and Single Shot Detection (SSD) [15] may be used to localize the exact regions of the carcinoma in WSI. More emphasis may be given to tackling IDC detection using unsupervised deep learning methods such as extraction of high-level features through restricted Boltzmann machines [44] and deep Boltzmann machines [113], deep belief networks [114], autoencoders, etc., to open the door to more comparisons between the efficacy of different varieties of techniques for IDC detection. Future work in the detection of IDC is very important where deep CNNs are concerned, because large models, when trained from scratch, are very slow to train given the sheer amount of data needed to properly train neural networks to detect IDC. For this, faster methods using the deep CNN methodology must be developed that can be trained more quickly while providing similar performance benefits. Lastly, WSI-based patch dataset creators may consider the addition of two more classes, namely 'sparse-normal' and 'sparse-IDC', to handle patches having a majority of their regions empty (white), as seen in Fig. 9 (left), to help CNN-based techniques better identify classes and reduce the intra-class variance in IDC datasets.

CONFLICTS OF INTEREST/COMPETING INTERESTS
The authors hereby declare that there is no conflict of financial/personal interest or belief that could affect their objectivity.

APPENDIX A L1 & L2 NORMALIZATION
Regularizations such as L1 and L2 regularization [115], [116] are used to decrease model complexity by penalizing significantly larger weights, reducing them to avoid overfitting. Given a loss function L(x, y) with input vector x and model outputs y, we can define the model's predictions by a function f : x → y given as,

$f(x) = w_0 + \sum_{i=1}^{n} w_i x_i = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$ (14)

In (14), w_i denotes the weights, where f takes n input variables. Hence, L(x, y) can be given as,

$L(x, y) = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$ (15)

L1 regularization (sometimes also called Lasso regularization) introduces a regularization term into (15) with a regularization parameter λ that determines the extent of the penalty applied to the weights, as shown in (16).

$L_{L1}(x, y) = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \lambda \sum_{i=1}^{n} |w_i|$ (16)
On the other hand, L2 regularization (also known as Ridge regularization) penalizes the squared magnitudes of the weights, as given by (17).

$L_{L2}(x, y) = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \lambda \sum_{i=1}^{n} w_i^2$ (17)
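The effect of the two penalty terms on a base loss can be sketched in a few lines; the function names, λ, and the weight values in the example are arbitrary:

```python
def l1_penalty(weights, lam):
    """Lasso term: lambda times the sum of absolute weights, as in (16)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge term: lambda times the sum of squared weights, as in (17)."""
    return lam * sum(w * w for w in weights)

def regularized_loss(base_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to an already-computed base loss L(x, y)."""
    pen = l1_penalty if kind == "l1" else l2_penalty
    return base_loss + pen(weights, lam)
```

Note how L2 penalizes the large weight (−2) quadratically and therefore more heavily than L1 does, which is why Ridge shrinks large weights aggressively while Lasso tends to drive small weights to exactly zero.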

APPENDIX B BATCH NORMALIZATION
Batch normalization (also termed batch norm), proposed by Ioffe & Szegedy [123], speeds up the convergence of a deep neural network by normalizing each of the d dimensions in each layer. Having two learnable parameters p and q, the technique takes a mini-batch B of size k, B ← {x_{1,...,k}}, and normalizes each input x_i into x̂_i, which is then plugged into a linear transformation to restore the representational power of the model.

$\hat{x}_i = \frac{x_i - M_B}{\sqrt{V_B + c}}$ (18)
In (18), M_B is the mean of the values of the batch, V_B is the variance, and c is a constant added for numerical stability. The mean and variance are defined by (19) and (20) respectively as,

$M_B = \frac{1}{k} \sum_{i=1}^{k} x_i$ (19)

$V_B = \frac{1}{k} \sum_{i=1}^{k} \left(x_i - M_B\right)^2$ (20)

In the final transformation step, the learnable parameters p and q are used as,

$y_i = p\,\hat{x}_i + q$ (21)

Batch normalization has been used in various applications [124]-[127].
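Equations (18)-(21) amount to the following forward pass, sketched here with NumPy; scalar p and q are used for brevity, whereas in practice they are learned per dimension:

```python
import numpy as np

def batch_norm(x, p, q, c=1e-5):
    """Batch-norm forward pass: normalize a mini-batch x (rows = samples),
    then scale by p and shift by q, the two learnable parameters."""
    m_b = x.mean(axis=0)                  # batch mean, (19)
    v_b = x.var(axis=0)                   # batch variance, (20)
    x_hat = (x - m_b) / np.sqrt(v_b + c)  # normalization, (18)
    return p * x_hat + q                  # linear transformation, (21)
```

With p = 1 and q = 0, the output of each dimension has (approximately) zero mean and unit variance over the batch; training then adjusts p and q so that the layer can still represent the identity if that is optimal.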

APPENDIX C DROPOUT
Dropout, proposed by Srivastava et al. [24], is another technique used in deep neural networks to reduce the extent of overfitting. It does this by randomly ignoring a percentage of units (or nodes) in the layer it is applied to. By ''ignoring'', it is implied that these nodes are deactivated and do not fire or propagate any values. The randomization is obtained by sampling a Bernoulli distribution. Fig. 10 shows the simplistic nature of dropout.
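A minimal sketch of dropout using a Bernoulli keep-mask follows. The rescaling of survivors by 1/(1 − rate), known as inverted dropout, is one common implementation convention and is our choice here, not a detail stated above:

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale survivors so the expected activation is unchanged.
    At inference (train=False) the layer is the identity."""
    if not train:
        return x
    mask = rng.binomial(1, 1.0 - rate, size=x.shape)  # Bernoulli keep-mask
    return x * mask / (1.0 - rate)
```

Because surviving activations are scaled up during training, no compensating scaling is needed at inference time, which keeps the deployed network simple.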
G. M. HARSHVARDHAN is currently pursuing the B.Tech. degree with KIIT Deemed to be University, Odisha, India. He has authored over 20 papers which have been published in reputed conferences and SCI/ESCI journals. His research interests include unsupervised and self-supervised deep learning methods, clinical applications of machine learning for medicine, generative models in machine learning, recommender systems for content delivery, energy-efficient task scheduling in cloud computing, topic modeling in natural language processing using latent Dirichlet allocation, and modeling high dimensional clinical data in medicine.
AANCHAL SAHU is currently pursuing the Bachelor of Technology degree in computer science and communication engineering with KIIT Deemed to be University, Odisha, India. In the past, she has interned as a Technical Content Writer at Heu Technologies Private Ltd., HYD, India. She has also worked at a Texas-based software company HighRadius as a Product Engineer Intern. Her research interests include intersection of security, stability, and society. She has published papers covering various aspects of security and stability, like artificial intelligence for healthcare, financial security, and the environment. She has published five research papers and articles over the past year.
MAHENDRA KUMAR GOURISARIA (Member, IEEE) received the master's degree in computer application from Indira Gandhi National Open University, New Delhi, and the M.Tech. degree in computer science and engineering from the Biju Patnaik University of Technology, Rourkela. He is currently pursuing the Ph.D. degree with KIIT Deemed to be University, Bhubaneswar, Odisha. He is presently working as an Assistant Professor with the School of Computer Engineering, KIIT Deemed to be University. He has an experience of more than 18 years in academia and seven years in research. He has published more than 60 research papers in different book chapters, international journals, and conferences of repute. He has also served as an organizing committee member for various conferences and workshops. His research interests include cloud computing, machine learning, deep learning, data mining, soft computing, and internet and web technology. He is a member of IAENG and UACEE, and a Life Member of ISTE, CSI, and ISCA.