Lightweight Histological Tumor Classification Using a Joint Sparsity-Quantization Aware Training Framework

Cancer decision-making is a complex process that is made harder by the limited availability of oncological expertise. This is particularly true in rural areas and settings with fewer resources. Recently, there has been interest in the potential of artificial intelligence to provide reliable computer-aided diagnosis tools in such settings. Nevertheless, the majority of deep learning algorithms are resource-hungry in terms of data and storage requirements. In this work, we propose a novel lightweight deep learning model for histological tumor classification through a Joint Sparsity-Quantization Aware Training framework. Extensive experiments were conducted to evaluate the proposed framework. Promising performance has been achieved compared to the most relevant state-of-the-art work, with a classification accuracy of 94.26% and an average $5\times $ reduction in the memory footprint. This work aims at opening doors toward efficient point-of-care diagnostic devices suitable for environments with limited resources.


I. INTRODUCTION
The widespread use of deep learning models has many implications across multiple disciplines. Models are becoming larger, with massive amounts of data collected and fed to them, requiring tremendous computing power and energy. There is a current need to counteract the negative consequences of traditionally large models by looking into methods that scale back resource consumption. Deep learning that is economical in size and resource consumption is an attractive prospect, as it cuts costs, is accessible to more devices, and reduces the carbon footprint. The lower cost of entry incentivizes deployment to resource-constrained devices, which can reach communities that cannot afford the cost of higher-end devices and expensive training. This is especially important in medical devices that facilitate the clinical decision-making process, which is needed in areas with a deficiency in medical expertise.
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeswari Sundararajan.
Model compression is receiving a lot of attention due to the potential gains in conserving memory and computation. The traction that the Internet of Things (IoT) and mobile devices have gained in recent years has further increased the demand for efficient models. Quantization, pruning, and knowledge distillation are the main directions concerned with balancing computational cost against the associated accuracy degradation, both of which are crucial factors for enabling deployment to resource-constrained devices. Deep learning applied to medical data has the potential to improve diagnostic accuracy and subsequent patient prognosis through early detection. Deep learning has a track record in many medical applications, for example in drug prediction [1], medical image segmentation [2], cancer classification [3], abnormal speech recognition [4], and others.
Examination of histopathological images remains one of the essential methods for diagnosing cancer. Although it is an effective diagnostic procedure, expert knowledge about the disease is vital for a valid interpretation [5]. Moreover, histopathological examination is subject to the interpretation of the pathologist, which depends on their level of experience, thus leading to variations in patient assessment [6]. Therefore, a second opinion is often merited; however, it may be difficult and time-consuming to obtain. This has created a niche for computer-aided diagnosis systems for rapid diagnosis, in which deep learning has shown great promise [7], [8].
Despite the history of using deep learning in medical applications, the availability of medical data is severely restricted in a legal and natural sense. This is at odds with the data-driven nature of deep learning. Laws and ethics regulating patient data might be a hindrance to widespread data collection, alongside the privatization of medical datasets. To date, there is no standardized protocol in place for data collection, nor consistency in data quality [9]. In addition, the fact that some diseases manifest more rarely in the population makes it impossible to create balanced disease classes [10]. Furthermore, medical data needs to be annotated by experts, making abundant, high-quality, annotated medical data difficult to obtain. We arrive at a point where deep learning's abilities in pattern recognition, which are especially important in tasks dependent on recognizing abnormalities and are sometimes reported to be superior to those of human experts [11], [12], [13], are being undermined by the scarcity of medical data [14].
Motivated by the extensive computational and data resources required for deep learning, in this work we aim to provide an efficient framework that can aid medical professionals in cancer decision support and that is suitable for deployment to resource-constrained devices, such as edge and IoT devices for healthcare applications. The contributions of this paper can be summarized as follows:
• A Joint Sparsity-Quantization Aware Training (JSQAT) framework is proposed for lightweight histopathological classification. The approach presented interleaves transfer learning, sparsity, and quantization techniques through the training process.
• Extensive experiments were conducted to evaluate the impact of sparsity and quantization within the context of malignant tumor detection in breast histopathological images.
• We empirically evaluate the proposed framework on the BreakHis dataset and observe that the JSQAT results indicate the possibility of a performance/memory trade-off that balances the needs of a medical classifier with the memory bounds of a resource-constrained environment. We compare our results with other lightweight models oriented toward breast cancer histopathological classification.

II. RELATED WORK
The problem of histopathological classification is of great interest across multiple disciplines. Cancer diagnosis remains a challenging task for pathologists given cancer's heterogeneous nature [15]. Cancer is varied in its types, and classifying malignant tumors in images is a nuanced problem. Diagnostic accuracy and interpretability are priorities in medical applications, which also apply to histological classification [16], [17]. However, the majority of the literature on medical images is confined to the traditional deep learning paradigm that is both computationally and data intensive.
There is a need to address the demand for efficient models in histological imaging, and medical imaging in general, to enable performant models using limited resources [18]. This can contribute to more cost-effective solutions and wider accessibility, especially in resource-limited healthcare facilities that suffer from insufficient infrastructure to support large models and a shortage of medical specialists. We believe that compression methods such as pruning and quantization, in addition to transfer learning, can allow resource-efficient models to be integrated into clinical workflows in low-resource settings.
The reviewed literature prominently features CNNs, which are exceptional in pattern detection and suited for navigating the challenging topology of histological images. Previous works that tackle the problem of breast cancer classification can be grouped into ensemble or fused models [19], [20], lightweight models, and other deep learning or machine learning models. Models in the literature can also be classified according to whether they are magnification agnostic [21], [22] or magnification dependent [23], or whether data augmentation was used [24].
Another prominent approach is transfer learning, typically using ImageNet as the base dataset. As an example, the approach followed in [25] uses a pre-trained EfficientNetV2 [26] backbone with a modified dual squeeze and excitation network for binary classification of BreakHis images. The authors report a precision of 0.9858 and an F1-score of 0.9764 at 40X magnification. Using a pre-trained network is also attempted in [27], where the authors devise an approach, named MobileNet-SVM, that combines a MobileNet network [45] with a Support Vector Machine. This yielded an accuracy of 91% on the 400X magnification of BreakHis.
A subset of medical image classification research takes resource efficiency into consideration. However, works that tackle histopathological images are significantly fewer, as the main body of literature does not prioritize resource/performance trade-offs. This could be due to the anticipated impact on diagnostic accuracy from reducing computational resources. Nevertheless, striking a balance between the two will allow computer-aided diagnosis systems to reach low-resource devices and communities.
We wish to highlight efforts in this direction, as the following reviewed works are conceptually closer to the goal we aim to achieve. In [28], the authors used post-training quantization (PTQ). The results indicate a 3.3× compression at an accuracy of 87.31% on 40X magnification BreakHis images [29] and 89.94% accuracy for 16-bit PTQ. In another work [30], the authors followed a structured pruning approach by removing less important filters based on their absolute sum. The pruned VGG19 [31] based model achieved 90.08% accuracy on the magnification-independent binary classification task. The authors also reported a 47.54% parameter reduction and a 63.46% FLOPs reduction. In [32], the authors applied structured pruning on a hybrid Inception network [33]. Image pre-processing and data augmentation were applied to the dataset. The accuracy for 40X magnification is 85.7 ± 1.9% when 50% of the channels are pruned.
While not widely attempted in the medical literature, there has been interest, mainly in the computer vision community, in exploring ways to combine quantization and pruning to achieve lighter models. This combination has been implemented on medical images in [34]. The authors devise a toolbox for producing efficient medical models for constrained hardware, in which they provide customizable pruning and quantization options. Of the applications tested, the toolbox achieved a Dice score of 0.567 on a 2D nuclei segmentation task.

III. METHODOLOGY
This section details the methodology of the proposed JSQAT framework and the associated experimental configurations. The proposed framework consists of three major stages, starting with transfer learning and fine-tuning [46] on histopathological images of breast cancer labeled as malignant or benign. The second stage is sparsity-aware training, and the final stage is quantization-aware training, as shown in Figure 1. In the first stage of transfer learning, we started with two networks pre-trained on ImageNet [39]: VGG19 [31] and ResNet-50 [44]. Each network is extended with a global average pooling layer, a fully connected layer, and a classification (softmax) layer. We set the frozen/unfrozen layer ratio to around 70%/30% of total parameters. This results in most of the layers being frozen, which preserves the pre-trained weights, as shown in Figure 2, which demonstrates the state of the model layers at each of the three stages.

A. SPARSITY-AWARE TRAINING
Network pruning is a well-researched approach to model compression [47], with two main methods: unstructured and structured pruning. Our approach to pruning falls under the former category. We incorporate network pruning into our training to reduce model complexity and overcome overfitting, with the goal of reducing the storage required in order to suit the capabilities of resource-constrained devices.
To maximize the efficacy of the pruning step, we adopt $L_1$ regularization as a preliminary step before pruning by integrating it into the loss function of the fine-tuning stage. $L_1$ regularization-based pruning is used for its effectiveness in minimizing weights and encouraging sparsity in the training layers, thus facilitating the subsequent pruning step [48], [49]. When added to a training loss function, the $L_1$ regularization term imposes a penalty on the sum of absolute weights, which encourages more weights to become zero or near zero and allows the model to learn a sparse representation. The regularization term includes a penalty hyperparameter (λ) that can be tuned to control the severity of the penalization.
We also experiment with another regularization technique, $L_1L_2$. While $L_1$ regularization encourages sparsity in weights, $L_2$ regularization [50], [51] primarily controls model complexity by distributing weights evenly throughout the model. By promoting weight decay, $L_2$ regularization prevents overfitting and stark differences between weights. Typically, it is used in conjunction with $L_1$ regularization for a more pronounced effect in sparsification and regularization-based pruning, improving the overall generalization ability of the model.
During the pruning step, we use unstructured magnitude-based pruning [52], in which the smallest weights are removed until the sparsity constraint is satisfied. This is followed by re-training to allow the network to recover from potential accuracy loss. These steps are repeated iteratively, which is referred to as iterative pruning [53]. It is important to point out that the magnitude-based pruning schedule that is followed enforces the sparsity percentage layer-wise, not globally across the entire network. We chose unstructured magnitude-based pruning as our method due to its flexibility and more effective compression without compromising accuracy [54].
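The layer-wise magnitude criterion described above can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names `magnitude_mask` and `prune_layerwise` are hypothetical, the layers are random arrays, and the re-training between pruning rounds is omitted.

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries."""
    k = int(round(sparsity * weights.size))  # number of weights to remove
    mask = np.ones(weights.size, dtype=weights.dtype)
    if k > 0:
        # Indices of the k smallest magnitudes within this layer only.
        smallest = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
        mask[smallest] = 0
    return mask.reshape(weights.shape)

def prune_layerwise(layers, sparsity):
    """Apply the same sparsity target to every layer independently (not globally)."""
    return [w * magnitude_mask(w, sparsity) for w in layers]

# Hypothetical toy "network": two layers of different shapes.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 8)), rng.normal(size=(10,))]
pruned = prune_layerwise(layers, sparsity=0.5)
layer_sparsity = [float(np.mean(p == 0)) for p in pruned]
print(layer_sparsity)
```

Because the threshold is computed per layer, every layer ends up at the target sparsity, whereas a global threshold could leave some layers almost untouched and others nearly empty.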
Equations (1) and (2) show the loss function used. This includes the cross-entropy loss term ($L_{CE}$) and regularization penalty terms to encourage sparsity in the training layers. Here, $y$ represents the ground truth, $\hat{y}$ is the predicted output, and $N$ refers to the number of sample points.

$$L(\hat{y}, y) = L_{CE}(\hat{y}, y) + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N} w_i^2 \tag{1}$$

$$L_{CE}(\hat{y}, y) = -\frac{1}{N} \sum_{i=1}^{N} y_i \log(\hat{y}_i) \tag{2}$$

The second and third terms of the loss function (Equation (1)) are the regularization terms, which consist of the $\lambda_1$ and $\lambda_2$ hyperparameters and the $L_1$-norm and the $L_2$-norm, respectively. The λ hyperparameters control the severity of the regularization penalty. The regularization terms penalize the sum of absolute weights and the sum of squared weights, which incentivizes the optimizer to minimize the magnitude of the weights [55], [56]. In our experiments, we report the results for the following selections of hyperparameters: $\langle \lambda_1, \lambda_2 \rangle = \langle \lambda_0, 0 \rangle$, referred to going forward as $L_1$ regularization mode, and $\langle \lambda_1, \lambda_2 \rangle = \langle \lambda_0, \lambda_0 \rangle$, referred to as $L_1L_2$ regularization mode. The $\lambda_0$ values tested are {0, 0.001, 0.005, 0.01, 0.05} for the first network and {0, 0.001, 0.005, 0.01} for the second network.
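To make the two regularization modes concrete, the following NumPy sketch evaluates a cross-entropy loss with added $L_1$ and $L_2$ penalties on a toy example. It is only a sketch: the function names, the one-hot categorical form of the cross-entropy, and the toy predictions and weights are assumptions for illustration, and only the value $\lambda_0 = 0.001$ is taken from the text.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over samples, assuming one-hot ground truth."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def regularized_loss(y_true, y_pred, weights, lam1, lam2):
    """Cross-entropy plus lam1 * sum|w| plus lam2 * sum(w^2) over all given weights."""
    l1 = sum(float(np.sum(np.abs(w))) for w in weights)
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return cross_entropy(y_true, y_pred) + lam1 * l1 + lam2 * l2

# Toy data: two samples, two classes (benign / malignant), two small weight tensors.
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
weights = [np.array([0.5, -0.5]), np.array([[1.0, -2.0]])]

lam0 = 0.001
ce = cross_entropy(y_true, y_pred)
loss_l1 = regularized_loss(y_true, y_pred, weights, lam1=lam0, lam2=0.0)     # L1 mode
loss_l1l2 = regularized_loss(y_true, y_pred, weights, lam1=lam0, lam2=lam0)  # L1L2 mode
print(ce, loss_l1, loss_l1l2)
```

In this toy case the absolute weights sum to 4 and the squared weights sum to 5.5, so the $L_1$ mode adds 0.004 to the cross-entropy and the $L_1L_2$ mode adds a further 0.0055.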

B. QUANTIZATION-AWARE TRAINING
In order to obtain a more compact network, we use quantization to reduce the number of bits required for each parameter from 32-bit floating point to 8-bit integers after the sparsification methods previously mentioned. In addition to minimizing the required memory storage, weight quantization lessens the burden of computation by enabling fixed-point arithmetic. For the purposes of our application, we opt for quantization-aware training (QAT), which is able to achieve competitive performance when compared to post-training quantization (PTQ). This is because QAT accounts for the quantization error resulting from 8-bit quantization during the training process [57], which PTQ does not. As shown in Figure 1, this entails quantization and dequantization blocks in the forward propagation path of the training. No quantization takes place in the backpropagation. Quantization in the forward propagation occurs according to Equation (3):

$$w_q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{w}{s}\right) + z,\ q_{min},\ q_{max}\right) \tag{3}$$

where $w$ is the input tensor/weight, $w_q$ is the quantized weight, $q_{max}$ and $q_{min}$ are the maximum and minimum values for the desired bit quantization, $s$ is the scaling factor, and $z$ is the zero-point. Then, dequantization is achieved by $\hat{w} = (w_q - z)s$. These operations are carried out by the quantizer and dequantizer nodes, respectively. However, as the dequantization equation shows, the recovered weight $\hat{w}$ is not exactly equal to the original input weight $w$. As a result, quantization introduces a loss that the optimizer tries to minimize by adjusting the network's weights. The loss due to quantization for a network with $w$ inputs, $y$ labels, and $\theta$ parameters can be described as $L[f(w, q(\theta)), y]$, where $q$ is the quantization operation. Adding this new loss to the loss function in Equation (1), the overall objective function to be minimized becomes:

$$L_{total} = L_{CE}(\hat{y}, y) + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N} w_i^2 + L[f(w, q(\theta)), y] \tag{4}$$

The steps in the framework are described in Algorithm 1, where the model undergoes sparsification and then quantization in the training pipeline.

C. PRACTICAL CONSIDERATIONS
When performing transfer learning, we opted for the fine-tuning approach, where the network's upper layers plus the classifier layers are trainable and the lower layers are frozen, as demonstrated by the starting configuration in Figure 2. Results reported in the following section are for the VGG19 and ResNet-50 pre-trained networks. As previously mentioned, only about 30% of the parameters are trainable: of the 20,576,466 total parameters of VGG19, 7,631,506 are trainable. Similarly, of the 25,786,386 total parameters of ResNet-50, 7,720,082 are trainable.
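As an illustration of the quantizer and dequantizer nodes described in Section III-B, the NumPy sketch below performs an 8-bit affine quantize/dequantize round trip on a small weight vector. Several details are assumptions made for the sake of a runnable example and are not stated in the text: the function names, the signed range [-128, 127], and the min/max calibration used to choose the scale `s` and zero-point `z`.

```python
import numpy as np

def quant_params(w, q_min=-128, q_max=127):
    """Choose scale s and zero-point z from the min/max of w (assumed calibration)."""
    w_min = min(float(w.min()), 0.0)  # keep 0.0 exactly representable
    w_max = max(float(w.max()), 0.0)
    s = (w_max - w_min) / (q_max - q_min)
    if s == 0.0:  # all-zero tensor: any positive scale works
        s = 1.0
    z = int(round(q_min - w_min / s))
    return s, z

def quantize(w, s, z, q_min=-128, q_max=127):
    """w_q = clamp(round(w / s) + z, q_min, q_max), stored as int8."""
    return np.clip(np.round(w / s) + z, q_min, q_max).astype(np.int8)

def dequantize(w_q, s, z):
    """w_hat = (w_q - z) * s, back in floating point."""
    return (w_q.astype(np.float32) - z) * s

w = np.array([-1.0, -0.2, 0.0, 0.3, 1.5], dtype=np.float32)
s, z = quant_params(w)
w_q = quantize(w, s, z)
w_hat = dequantize(w_q, s, z)
max_err = float(np.abs(w - w_hat).max())
print(w_q, w_hat, max_err)
```

For values inside the calibrated range, the round-trip error is bounded by half the scale, which is the discrepancy between $\hat{w}$ and $w$ that quantization-aware training exposes to the optimizer.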
Experiments were conducted to test different configurations of the loss function presented in Equation (4). Namely, the following modes of the framework were tested: Quantization-only Aware Training (QAT), Sparsity-only Aware Training (SAT), and Joint Sparsity-Quantization Aware Training (JSQAT). We experimented with enforcing the sparsity using the two modes introduced in Section III-A, namely $L_1$ regularization mode and $L_1L_2$ regularization mode.
We test the effect of pruning at 50%, 70%, and 90% sparsity on performance. We empirically experiment with different λ values of the regularization term to assess its effect on the Joint Sparsity-Quantization Aware Training schedule. The training schedule of the framework follows a prune-first, then-quantize order, as this is the recommended course for classification tasks [37]. Quantization is applied to the trainable layers only, in order to minimize training time. For both networks tested (VGG19 and ResNet-50), the network was trained for 50 epochs at each stage.

D. DATASET
The dataset used is the Breast Cancer Histopathological Database (BreakHis) [29], which is the largest dataset of histopathological breast cancer images. Even though it is the most comprehensive medical dataset for this condition, with a total of 7,909 images, it is dwarfed by conventional image datasets like ImageNet. Originally, the dataset has eight classes divided into malignant and benign tumors, as shown in Figure 3. Using the 40X magnification images, we reduce the dataset to two classes by merging the malignant tumor samples into one class and the benign tumor samples into another. The dataset is highly imbalanced, with the number of malignant samples being about twice the number of benign samples; therefore, we use class weights in training to counteract the resulting bias. At 40X magnification, there are 1,370 malignant samples and 625 benign samples, for a total of 1,995 samples. The breakdown of the entire dataset is given in Table 1.
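As an illustration of the class weighting mentioned above, the short Python sketch below derives weights from the 40X class counts. The inverse-frequency ("balanced") formula, total / (number of classes × class count), is an assumption chosen for illustration; the text does not state which weighting scheme was used.

```python
# Class counts at 40X magnification, as reported in the text.
counts = {"benign": 625, "malignant": 1370}
total = sum(counts.values())

# Assumed "balanced" heuristic: weight = total / (num_classes * class_count).
class_weight = {name: total / (len(counts) * n) for name, n in counts.items()}

# With these weights, each class contributes equally to the weighted loss.
weighted_mass = {name: counts[name] * class_weight[name] for name in counts}
print(class_weight, weighted_mass)
```

Under this heuristic the minority (benign) class receives a weight of about 1.6 and the majority (malignant) class about 0.73, so errors on benign samples are penalized roughly twice as heavily.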

IV. RESULTS AND DISCUSSION
We present the results for the VGG19 and ResNet-50 networks when applying the proposed Joint Sparsity-Quantization Aware Training framework. We include the accuracies of the fine-tuned network at different λ values. The accuracies are averaged over three test runs. Firstly, the results of VGG19 are presented in Table 2, which includes the accuracies at different sparsity levels for different λ, followed by the results of ResNet-50 in Table 3. Both networks are fine-tuned on the 40X magnification partition of the BreakHis dataset. The results are divided by the mode of regularization and the framework configuration used. This is outlined for both networks. Moreover, after displaying the test accuracies, the resultant model sizes are included in Table 4. The sizes of all networks, in megabytes, are those of .tflite [58] files, with the exception of the baseline sizes of the model (underlined), which are those of .h5 [59] files.
Results on the 40X binarized BreakHis dataset at 0% and 50% sparsity levels are very close when Joint Sparsity-Quantization is incorporated into the training schedule, with $L_1$ and $L_1L_2$ regularization modes on VGG19 and ResNet-50. The accuracy at 50% sparsity exceeds that at 0% sparsity in some cases. At 50% sparsity in Joint Sparsity-Quantization Aware Training, there is the added advantage of a decreased storage requirement, which makes it the configuration that achieves the better accuracy/size trade-off. This can be due to the regularizing effect of the sparsification process on the network, in addition to the hypothesis [53] that introducing sparsity to networks allows for better generalization, which can translate to better performance. Considering the trade-off between sparsity and accuracy, Hoefler et al. [56] observe that moderate sparsity targets (defined as lower than 90%) benefit from magnitude-based pruning, especially iterative pruning, in contrast to high sparsity targets. This is the effect we have observed in our experiments, where the sparsity/accuracy trade-off is best at 50% and 70% sparsity, as shown in Figure 4.
119346 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
VGG19 performs best at λ = 0.001 for both types of regularization, with an accuracy of 94.26% at 50% sparsity. This combination results in a model size of 46.6 MB, which is a compression of 5× when compared to the original model size. We observe a similar occurrence with regard to ResNet-50 when applying Joint Sparsity-Quantization Aware Training at 50% or 70% sparsity alongside $L_1$ or $L_1L_2$ regularization mode. In most cases, the accuracy here exceeds the baseline, with more than a 2× decrease in size when compared to the baseline network at 0% sparsity. Overall, ResNet-50 demonstrates better relative performance at higher sparsity percentages when compared to VGG19. On the other hand, VGG19 displays higher compression ratios. ResNet-50's performance is best at λ = 0.001 using $L_1$ regularization mode, where 50% sparsity gives an accuracy of 94.81% and a size reduction of 2.5×. Notably, 70% and 90% sparsity at the same λ give accuracies of 93.87% and 86.34%, respectively, with size decreases of 4× and 9.8× when the Joint Sparsity-Quantization Aware Training framework is used.
To summarize our results, it is worth pointing to the enhanced accuracy at 50% sparsity when using the Joint Sparsity-Quantization Aware Training framework configuration for both VGG19 and ResNet-50 at most λ values used. This comes with a 2.5× and 5× decrease in size for ResNet-50 and VGG19, respectively, while training only 30% of the total parameters. We note that VGG19 displays a reduced memory footprint with higher compression ratios but slightly lower accuracy than ResNet-50. In low-resource settings, using VGG19 as a backbone would be a better practical choice, as it provides better compression ratios than ResNet-50 and competitive accuracy: VGG19 yields almost double the compression ratio for only a 0.55% decrease in accuracy. Thus, it proves to be a better choice when it comes to practical resource considerations in resource-constrained environments.
To provide a point of comparison, we list our results alongside those of other lightweight models trained on the BreakHis dataset, as shown in Table 5. We compare our best performing model against works that align with resource awareness, focusing specifically on pruning or quantization as compression methods. We limit the works to those that test on the BreakHis dataset in order to provide a common basis for comparison, since a fair and unbiased comparison is challenging due to the lack of standardization when reporting performance results on medical data. We note a 1.7× greater reduction in size and an increase of 6.96% in accuracy when comparing JSQAT (VGG19) to MobiHisNet [28]. One of the resource gains reported in [30] is a 47.54% reduction in model parameters, compared to our 50% model sparsity ratio, where JSQAT (VGG19) yielded a 4.18% increase in classification accuracy. Additionally, our unstructured sparsity approach at 50% sparsity resulted in an increase of 9.77% in accuracy compared to the structured pruning and hybrid model combination followed in [32].
The results demonstrate that it is possible to achieve a fair compromise between accuracy and size that meets the criteria of both resource-constrained environments and medical image classification with the proposed Joint Sparsity-Quantization Aware Training framework.

V. CONCLUSION
Clinical integration of computer-aided diagnostic devices depends on a multitude of factors. In order to decrease the dependency on human observers for a faster and more streamlined process of early breast cancer detection, deep learning in computer-aided diagnosis has become a heavily investigated research area. Deep learning in the medical domain has enjoyed much success; however, there remain difficulties that hinder its progress, namely massive resource and data consumption.
To navigate this issue, we explore the use of transfer learning and fine-tuning on the BreakHis dataset to counteract the effect of limited data samples. To classify histopathological tumors, we develop a Joint Sparsity-Quantization Aware Training framework that integrates model compression techniques such as quantization-aware training, regularization, and magnitude-based pruning in order to balance accuracy and the memory footprint. We empirically investigate the effectiveness of the approach on different pre-trained networks to assess their resilience to reduced precision and the removal of weights.
For future evaluation in extremely data-limited settings, few-shot learning has recently emerged as a promising method to address classes of limited size. It has significant implications for generalization when paired with compression, specifically sparsifying techniques, especially in medical applications. We leave this as future work.
Accordingly, focusing on learning with limited resources and data is essential for medical applications as overcoming these two constraints allows for designing better cost-effective applications to be used in low-resource clinical settings.

FIGURE 1. Diagram of the methodology of the system. The pre-trained model undergoes three stages of training (standard fine-tuning, sparsity-aware training, and quantization-aware training) in order to adapt it to the target dataset with the goal of subjecting it to the compound effect of sparsity and quantization.

FIGURE 2. Changes to the network at each stage, where the layers colored red are trainable.

Algorithm 1. Algorithm for Sparse-Quantization Aware Training
Input: Pre-trained network with parameters θ, total number of layers L, number of trainable layers k, number of iterations n, pruning mask m, and pruning rate p
Output: Lightweight classifier for histopathological images
Stage I: Transfer Learning (Fine-tuning)
  for one epoch do
    Minimize the loss function $L_{CE}$
  end for
  Freeze the first L − k layers of the pre-trained model
  for each epoch do
    Minimize the loss function $L_{CE} + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N} w_i^2$
  end for
Stage II: Sparse-Aware Training
  Freeze the first L − k layers of the pre-trained model
  for each epoch do
    Prune the smallest absolute weights $p^{1/n}$% layer-wise. The resulting mask m is applied to the network parameters, which become θ ⊙ m
    Re-train the network
    Repeat for n iterations
  end for
Stage III: Quantization-Aware Training
  for each epoch do
    Minimize the loss function $L_{CE} + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N} w_i^2 + L[f(w, q(\theta)), y]$
  end for
  Quantize the k layers to 8-bits

FIGURE 3. Sample images from each class in the BreakHis dataset. The first row consists of benign tumors, the second row is of malignant tumors. The images are 40X magnification.

TABLE 1. Statistics of the BreakHis dataset.

TABLE 4. Model sizes for different pruning percentages for the proposed Joint Sparsity-Quantization Aware Training framework (JSQAT) and the Sparsity-only Aware Training (SAT) configuration in megabytes. Baseline sizes are underlined.

TABLE 5. Comparison with related work on the BreakHis dataset.