FastNet: A Lightweight Convolutional Neural Network for Tumors Fast Identification in Mobile-Computer-Assisted Devices

Histopathology diagnosis is an important standard for identifying breast tumors. However, histopathology image analysis is complex, tedious, and error-prone because of the extremely high resolution of the images. In recent years, deep learning has been successfully applied to histopathology image analysis and has made great progress. Well-known deep neural networks usually have tens of millions of parameters, so deploying a state-of-the-art model consumes a large amount of memory. In addition, deep neural networks rely on high-performance hardware resources, which impedes the deployment of state-of-the-art models on portable equipment. In this work, a novel framework consisting of a weight accumulation method and a lightweight fast neural network (FastNet) is proposed for tumor fast identification (TFI) on mobile-computer-assisted devices. The weight accumulation method is designed to obtain the tissue mask regions of interest and remove the useless background area in histopathology images, which greatly reduces redundant computation. Furthermore, we propose the lightweight FastNet to improve computational efficiency on mobile devices. A novel attention loss (AttLoss) function is designed and applied in FastNet. The AttLoss function pays more attention to the positive samples and the indistinguishable samples, which greatly improves performance. The proposed FastNet was compared with three state-of-the-art methods commonly used for image classification and object detection. Experimental results indicate that FastNet achieves the highest recall of 96.94%, the highest $F_{1}$ score of 97.33%, and the highest accuracy of 97.34%, with the fewest trainable parameters (0.22M) and the fewest floating-point operations (210M FLOPs).


I. INTRODUCTION
A. Background
Cancer is the leading cause of death worldwide [1], [2]. Breast cancer incidence continues to rise by approximately 0.5% per year. Breast cancer accounts for 30% of cancers in women and causes the most deaths among women in the United States [3]. Histopathology diagnosis is the gold standard for confirming carcinoma [4], [5]. Deep learning has been successfully applied to histopathology image diagnosis and has surpassed pathologists in efficiency and accuracy [6], [7], [8], [9]. A histopathology image usually contains gigapixels of data, which makes it difficult to train a deep neural network to detect the tumor region on mobile-edge devices [10], [11], [12].
With the rapid advancement of the Internet of Things (IoT), research on lightweight models has become significant for resource-constrained IoT devices. The application of IoT-based deep learning in healthcare services has grown substantially in recent years, resulting in the advancement of diagnostic equipment and opening up new avenues for medical treatment [13]. When IoT devices share sensitive information between patients and centralized cloud servers, privacy and security are a serious problem. Some lightweight smart healthcare frameworks and lightweight authentication protocols have been proposed to protect the privacy of patient data in mobile-edge computing [14], [15]. In addition, deep neural networks rely on high-performance hardware resources, which impedes the deployment of state-of-the-art models on mobile devices with limited memory. Therefore, making the model lightweight is particularly crucial for deployment on resource-constrained devices. Quantization, pruning, and knowledge distillation are common techniques for building lightweight models. Depthwise separable convolutions [16], group convolution [17], and parallel 1-D depthwise convolutions [18] are commonly used instead of standard convolutions to reduce parameters. Some lightweight convolutional networks, combining cloud-edge computing with AI technology, have been proposed to meet real-time requirements [19], [20].
Previous research concentrated on the region of interest (ROI) in histopathology images [21]. A related algorithm processed the histopathology image at low resolution and then performed more complicated analysis at high magnification, in the way pathologists analyze a whole slide image [22]. Concentrating on the ROI avoids many unnecessary background regions, saves much computational time, and improves accuracy [23]. However, by using only the ROI data, these methods lost a tremendous amount of organizational structure and spatial information. Facing the extremely high resolution of histopathology images, most researchers first removed the background region from the images and then extracted small patches from the tissue region. Finally, they trained a deep convolutional neural network to classify these patches as normal or tumor patches [24], [25], [26]. Some researchers screened more discriminative patches as more effective training data to improve accuracy [27]. However, they took patches arbitrarily as training data and did not consider the spatial relationship of patches. Some researchers applied vision transformers to medical images [28], but these vision transformers were often demanding in computation and model size due to the quadratic complexity of self-attention, although stronger variants achieved better recognition performance. Some researchers found that texture information and spatial connections between patches are extremely important [29], [30], [31]. By considering the spatial relationship of adjacent patches, these researchers achieved more accurate tumor identification. However, it was time-consuming to identify tumors using all the patches and the spatial relationships among adjacent patches in extremely high-resolution images. Their models could not produce a diagnosis in the expected time because of the image resolution and model complexity, and they were not suitable for transfer to tiny portable equipment.
Some researchers focused on portable devices [16], [32], such as smartphones, unmanned aerial vehicles, and medical equipment. However, their methods achieved slightly worse results than the state-of-the-art methods.
In this article, a tumor fast identification (TFI) framework is proposed to address these difficulties. In the TFI framework, we remove the unnecessary background regions and obtain the tissue mask ROIs with a weight accumulation method. The tissue mask is obtained by combining the results of the RGB color model, the HSV color model, and threshold segmentation, which largely reduces the workload of the neural network without reducing performance. A novel attention loss (AttLoss) function is proposed to increase the weights of the positive samples and the indistinguishable samples, which improves classification performance. A lightweight fast neural network (FastNet) is proposed for patch classification in the TFI framework. Compared with well-known lightweight neural networks (MobileNet V2 [33], ShuffleNet V2 [34], and EfficientNet [35]), the proposed FastNet achieves the highest recall and $F_{1}$ score with the fewest trainable parameters and the lowest computational complexity.

B. Related Work
Early research concentrated on the ROI in histopathology images. Li and Huang [23] proposed a superpixel clustering method based on segmentation and ROI search, which combines coarse-to-fine superpixel clustering and boundary update. Bejnordi et al. [22] proposed a method to automatically detect the ROIs in histopathology images. The method targets ROIs at low resolution and then performs further analysis at high resolution, which simulates the way a pathologist analyzes a histopathology image.
Facing the extremely high resolution of histopathology images, most researchers extracted small patches from the images and trained a deep convolutional neural network to classify the patches. Vang et al. [36] used Inception V3 for patch-level classification and produced image-level predictions by using logistic regression, a gradient boosting machine (GBM), and majority voting. Their experiments achieved a 12.5% higher performance than the state-of-the-art model. Li et al. [27] proposed a patch filtering approach based on a convolutional neural network and a clustering method to screen more discriminative patches. Their method reached an accuracy of 88.89% on the whole test data and 95% on the partial test data.
Some researchers found that texture information and spatial connections between patches are extremely important. A neural conditional random field (NCRF) framework was designed to consider the spatial correlation among adjacent patches [30]. Li and Ping connected the NCRF framework directly to a fully connected conditional random field to detect cancer metastasis in histopathology images. Karimi et al. [37] used three different convolutional neural networks to predict tumors at multiple patch resolutions. By combining the prediction results, their method extracted more contextual information at different scales. Ding et al. [38] proposed a deep residual network to detect small targets with spatial structure information. Le Trinh et al. [39] achieved excellent tumor classification by training a deep neural network on multiscale histopathology images.
Dou et al. [40] applied a dilated fully convolutional network to pixelwise prediction in biomedical image segmentation. Chiang et al. [41] proposed an independent sliding window detector for extracting volumes of interest. They used a candidate aggregation algorithm, a 3-D CNN, and the sliding window detector to speed up tumor detection. Yuan et al. [42] proposed a novel end-to-end deep learning method that can extract more significant features for small targets. Ale et al. [43] proposed a novel edge computing scheme that proactively caches data, which makes it practical for people in rural areas to receive remote histopathology analysis. Chen et al. [44] proposed an event-driven attack method that exploits users' behavior to collect their private information. Inspired by this, remote histopathology analysis must ensure the privacy of users. Chen et al. [45] proposed a deep CNN to segment medical images, in which model performance was improved by using prior knowledge. Yang et al. [11] sampled patches at multiple scales from histopathology images and then fine-tuned three pretrained deep convolutional neural networks. They combined the three fine-tuned models into an ensemble model and achieved 91.75 ± 2.32% accuracy. Lin et al. [46] designed a novel approach that leverages the fully convolutional framework to accelerate prediction and improve accuracy.
Some researchers focused on lightweight neural networks. SqueezeNet [32] has 50× fewer parameters than AlexNet while maintaining accuracy. Howard et al. built lightweight neural networks named MobileNets [16] by using depthwise separable convolutions. They demonstrated that MobileNets achieve strong performance compared with other well-known methods in object detection. ShuffleNet [47] was designed for mobile devices with limited computation resources. Group convolution is applied in ShuffleNet to reduce parameters while maintaining performance. EfficientNet [35] studied the impact of network depth, width, and resolution on the model. Using a compound coefficient, EfficientNet scales all dimensions of depth, width, and resolution. Kumar et al. [13] presented an effective and portable CNN model for histopathology image categorization and demonstrated its performance on the Raspberry Pi and three mobile devices. MobileViTs [48] were developed specifically for mobile devices and combine the advantages of CNNs with those of vision transformers. With the same number of parameters, MobileViT outperformed MobileNets in image classification. Chen et al. proposed Mobile-Former [49], a parallel architecture of MobileNet and transformer with a two-way bridge in between, which enables bidirectional fusion of local and global features and outperforms MobileNetV3 at low FLOPs. Pan et al. [50] introduced a novel class of lightweight ViTs (EdgeViTs) that compete with the best lightweight CNNs in the tradeoff between accuracy and on-device efficiency. They directly took into account the latency and energy consumption of various models instead of depending on floating-point operations or parameters. Although stronger variants achieve superior recognition performance, these vision transformers often require heavy computation and a large model size due to the quadratic complexity of self-attention.

C. Contributions
To address these key challenges, we propose a TFI framework which consists of a weight accumulation method and a lightweight FastNet to largely reduce the computation cost and accelerate tumor prediction. The main contributions of our work are summarized as follows.
1) A weight accumulation method, consisting of ROI localization and patch generation from the ROIs of histopathology images, is proposed to greatly reduce the redundant computation cost and speed up the prediction of tumor regions.
The proposed FastNet achieves the highest recall and $F_{1}$ score with the fewest trainable parameters and the lowest computational complexity.
The remainder of this article is organized as follows. The data sets used in our experiments are described in detail in Section II, where the histopathology images are also visualized. The proposed TFI framework for tumor detection is introduced in Section III. In Section IV, we show all experimental implementation details and results. The conclusion of this article is given in Section V.

II. DATA SETS AND HISTOPATHOLOGY IMAGES
Our experiments were carried out on the data sets from the CAMELYON16 competition, which consist of two subsets from the medical centers of Radboud University and Utrecht in The Netherlands. In total, 400 histopathology images were obtained from breast cancer patients. The details are described as follows.

A. Training Data Sets and Test Data Sets
The training data sets contain two subsets. One subset, consisting of 170 histopathology images (100 healthy and 70 lesion images), was collected by the Radboud University Medical Center. The other subset, consisting of 100 histopathology images (60 healthy and 40 lesion images), was collected by the University Medical Center Utrecht. The test data sets, consisting of 130 histopathology images, were collected from both medical centers. The class compositions of the data sets are shown in Table I.
The data sets were annotated by experienced pathologists. The ground-truth labels contain the experts' annotations of the tumor area on the histopathology images. Two kinds of ground-truth labels are provided.
1) The vertices of the annotated outlines of the tumor metastasis regions, provided in xml format.
2) Binary masks of the cancer metastasis regions in the histopathology images.

B. Visualization of Histopathology Images
Ordinarily, histopathology images are scanned and saved in a multilevel pyramid structure. The multilevel pyramid image files consist of the original image and nine downsampled versions of the original image, as shown in Fig. 1. Each downsampled image in the pyramid contains many regions of interest. Because of the huge size of the histopathology images, it is laborious to train a convolutional neural network model with gigapixel images. We therefore extracted patches from the ROIs of the downsampled images for training.
The Automated Slide Analysis Platform (ASAP) is a well-known tool for visualizing, analyzing, and annotating histopathology images. We visualized the histopathology images with ASAP and analyzed the annotations. The visualization result is shown in Fig. 2.

A. Overview of the Tumor Fast Identification Framework
To achieve TFI on mobile-computer-assisted devices, we propose the TFI framework presented in Fig. 3. The framework consists of three main parts: 1) the weight accumulation method; 2) the AttLoss function; and 3) the lightweight FastNet. The weight accumulation method has two main steps. The first step obtains the ROIs in histopathology images, and the second step generates patches from the ROIs with the weight accumulation formula. The AttLoss function is used to address the extremely imbalanced data distribution. The FastNet has two main components: 1) the multiscale block and 2) the multibranch block. The multiscale block is designed to extract features at different scales, and the multibranch block is applied to extract more details from the previous layer.
As shown in Fig. 3, in the training step, we first removed the background area, which covers more than 70% of the histopathology images, with the weight accumulation method, which greatly reduces the redundant computation cost. Many patches were then extracted from the ROIs as the training data sets. Additionally, an AttLoss function was designed to increase the weights of the positive samples and the indistinguishable samples for better classification performance in the FastNet. Next, the FastNet model was trained with the training data sets and the AttLoss function. In the prediction step, the background area of the test images was removed with the same weight accumulation method, and the images were split into many patches for tumor probability prediction. Finally, all the predicted tumor probabilities were placed on the histopathology image in the order of the patches, forming a whole tumor probability heatmap. In the probability heatmap, the red region shows the tumor area with extremely high probability, and the green region represents the normal tissue area. In the following sections, we introduce the weight accumulation method, the AttLoss function, and FastNet in detail.
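The final step of the prediction stage, placing patch probabilities back in patch order, can be illustrated with a minimal Python sketch. The function name `build_heatmap`, the dictionary layout, and the choice of one heatmap cell per patch with 0 for discarded background patches are assumptions made for illustration; the article does not specify these details.

```python
import numpy as np

def build_heatmap(patch_probs, n_rows, n_cols):
    """Place per-patch tumor probabilities on an n_rows x n_cols grid.

    patch_probs maps (row, col) patch indices to a probability in [0, 1].
    Grid cells without an entry (assumed to be background patches that
    were never sent to the network) keep the value 0.
    """
    heatmap = np.zeros((n_rows, n_cols), dtype=np.float32)
    for (i, j), prob in patch_probs.items():
        heatmap[i, j] = prob
    return heatmap

# Hypothetical predictions for three tissue patches on a 3 x 2 patch grid.
probs = {(0, 1): 0.92, (1, 1): 0.15, (2, 0): 0.03}
heatmap = build_heatmap(probs, n_rows=3, n_cols=2)
```

A colormap that runs from green (low probability) to red (high probability) could then be applied to `heatmap` to obtain a rendering similar to the one described above.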

B. Weight Accumulation Method for Image Processing
1) Regions of Interest in Histopathology Images: Hematoxylin and eosin (H&E) staining is the most commonly used method of tissue staining for histopathology images. Hematoxylin stains the nucleus dark blue, eosin stains the cytoplasm and extracellular matrix pink, and other structures show different outlines and similar colors. A pathologist can easily distinguish between the nucleus and the cytoplasm in histopathology images stained with H&E.
Histopathology images contain large useless background areas. Locating the ROIs and removing the background areas can greatly reduce the redundant computation cost. Otsu's method [51], proposed by Otsu in 1979, is an efficient algorithm for image binarization that calculates a threshold dividing the original image into foreground and background. Therefore, we applied Otsu's method to obtain the tissue mask ROIs in histopathology images.
Typically, about 70% of a histopathology image is background, as shown in Fig. 4(a). To obtain the tissue mask ROIs, Otsu's method was applied automatically in the RGB color model, shown in Fig. 4(b), and in the HSV color model, shown in Fig. 4(c), to binarize the images, and we adjusted the threshold of the RGB color value to obtain the threshold tissue mask, as shown in Fig. 4(d). Finally, we combined the RGB tissue mask, the HSV tissue mask, and the threshold tissue mask to obtain the final tissue mask shown in Fig. 4(e). More details are described as follows.
First, Otsu's method was applied in the RGB color model, in which three mask images were obtained from the R, G, and B channels. The three mask images were then combined with a logical AND to obtain the tissue mask in the RGB color model, marked as RGB_MASK and shown in Fig. 4(b). According to previous research, Otsu's method performs better in the HSV color model. We therefore obtained a second tissue mask in the HSV color model, marked as HSV_MASK and shown in Fig. 4(c). An experimental analysis was designed to find the optimum background segmentation threshold. It was found that the background segmentation works well when the RGB values lie between 50 and 240. Taking 50 and 240 as the thresholds, we obtained the third tissue mask, marked as THRESHOLD_MASK and shown in Fig. 4(d). Finally, we ANDed RGB_MASK, HSV_MASK, and THRESHOLD_MASK to obtain the tissue mask ROIs, as shown in Fig. 4(e). Across all tissue masks, the ROIs cover approximately 30% of the image on average. Removing the useless background areas therefore eliminates a large amount of redundant computation.
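The combination of the three masks can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that the text does not state: tissue is taken to be the darker side of each RGB Otsu threshold, the HSV mask is computed on the saturation channel only (tissue taken as the more saturated side), and the helper names `otsu_threshold` and `tissue_mask` are hypothetical. Only the 50 to 240 range and the final AND come from the description above.

```python
import numpy as np

def otsu_threshold(channel):
    """Return the Otsu threshold t of a uint8 image; pixels <= t form class 0."""
    hist = np.bincount(channel.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                     # probability of class 0
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))              # maximizes between-class variance

def tissue_mask(rgb, rgb_min=50, rgb_max=240):
    """AND together RGB_MASK, HSV_MASK and THRESHOLD_MASK for a uint8 RGB image."""
    # RGB_MASK: AND of the per-channel Otsu masks (assumption: tissue is darker).
    rgb_mask = np.ones(rgb.shape[:2], dtype=bool)
    for c in range(3):
        rgb_mask &= rgb[..., c] <= otsu_threshold(rgb[..., c])

    # HSV_MASK: Otsu on saturation (assumption: tissue is more saturated).
    value = rgb.max(axis=-1).astype(np.float64)
    chroma = value - rgb.min(axis=-1)
    saturation = np.where(value > 0, chroma / np.maximum(value, 1.0), 0.0)
    sat8 = (saturation * 255).astype(np.uint8)
    hsv_mask = sat8 > otsu_threshold(sat8)

    # THRESHOLD_MASK: every channel value lies within [rgb_min, rgb_max].
    threshold_mask = np.all((rgb >= rgb_min) & (rgb <= rgb_max), axis=-1)

    return rgb_mask & hsv_mask & threshold_mask

# Synthetic example: a pink "tissue" block on a white background.
image = np.full((8, 8, 3), 255, dtype=np.uint8)
image[2:6, 2:6] = (200, 100, 150)
mask = tissue_mask(image)
```

On this toy image the mask selects exactly the pink block. On real slides the mask would be computed on a low-resolution pyramid level to keep the cost small.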
2) Patch Generating From Regions of Interest With Weight Accumulation Formula: After removing the numerous invalid background regions, we obtained the tissue mask ROIs. Considering the huge size of a histopathology image, it is difficult to feed the whole image into the convolutional neural network for training. We therefore extracted a multitude of patches from the ROIs with the weight accumulation method and trained the neural network with these patches. The main procedure is shown in Fig. 5.
Ground Truth: Each histopathology image was annotated by expert pathologists, and the annotations were stored in .xml files containing the vertices of the annotated outlines. We converted the .xml files to .json files for further processing. In the normal images, the whole tissue mask is the normal region of interest (NROI). Conversely, in the tumor images, the NROI is the region that remains after removing the tumor region of interest (TROI) from the whole tissue mask. The NROI is stored as a normal mask image.
Extract Patches From TROI and NROI With Weight Accumulation Formula: In this part, we describe the method for extracting patches from the TROI and NROI. The patches extracted from the TROI and NROI were considered as positive samples and negative samples, respectively. The four points formed a rectangle containing the whole TROI or NROI. Second, we defined the patch resolution as 256 pixels × 256 pixels (WIDTH = 256 pixels). We split the rectangle region into patches, with padding, in the original image.
Definition 1: I denotes the total number of rows of patches.
Definition 2: J denotes the total number of columns of patches.
Hence, we extracted I × J patches from the rectangle region in the TROI or NROI. Nevertheless, not all patches match the true labels. Each patch's mask image contains 256 × 256 pixel values, each of which is 0 or 1. We propose a weight accumulation method in which the pixel values of each mask image are accumulated into a weighted value W.
Definition 3: W of the weight accumulation formula is defined as

$$W_{i,j} = \sum_{m=1}^{256} \sum_{n=1}^{256} V^{i,j}_{m,n} \tag{3}$$

where $W_{i,j}$ denotes the weighted value of the patch $(i, j)$ in the ith row and jth column, and $V^{i,j}_{m,n}$ represents the pixel value at row m and column n of the patch $(i, j)$.
Definition 4: The patch $(i, j)$ is selected according to the following rule: if $W_{i,j} \geq \frac{1}{2} \times 256 \times 256$, the patch $(i, j)$ is used as training data; otherwise, if $W_{i,j} < \frac{1}{2} \times 256 \times 256$, the patch $(i, j)$ is discarded. The main procedure of extracting patches from the tissue mask ROIs in histopathology images is shown in Fig. 5.
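The patch selection rule can be sketched as follows, taking as input the binary mask cropped to the bounding rectangle of a TROI or NROI. The sketch assumes that the numbers of patch rows and columns are obtained by rounding up the rectangle height and width divided by WIDTH, which is one plausible reading of "with padding"; the function name `select_patches` and the zero padding are likewise assumptions.

```python
import math
import numpy as np

def select_patches(mask, width=256):
    """Split a binary ROI mask into width x width patches and keep those
    whose accumulated weight W is at least half of the patch area."""
    height, length = mask.shape
    n_rows = math.ceil(height / width)      # I (assumed: ceil because of padding)
    n_cols = math.ceil(length / width)      # J
    padded = np.zeros((n_rows * width, n_cols * width), dtype=np.int64)
    padded[:height, :length] = mask         # zero padding on the bottom and right

    kept = []
    for i in range(n_rows):
        for j in range(n_cols):
            patch = padded[i * width:(i + 1) * width, j * width:(j + 1) * width]
            w_ij = int(patch.sum())                 # W_{i,j}: sum of the 0/1 pixels
            if w_ij >= 0.5 * width * width:         # keep rule of Definition 4
                kept.append((i, j))
    return n_rows, n_cols, kept

# Toy example with 4 x 4 patches instead of 256 x 256.
roi = np.zeros((6, 10), dtype=np.uint8)
roi[0:4, 0:4] = 1     # patch (0, 0): W = 16, kept
roi[0:4, 4:6] = 1     # patch (0, 1): W = 8, exactly half, kept
roi[4:6, 0:1] = 1     # patch (1, 0): W = 2, discarded
n_rows, n_cols, kept = select_patches(roi, width=4)
```

Because a patch is kept only when at least half of its pixels belong to the ROI, patches that straddle the ROI boundary with little tissue are dropped rather than mislabeled.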

C. Imbalanced Categories and Loss Function
For extremely imbalanced categories, a neural network tends to predict negative samples. The prediction probability of negative samples will be very high, and the gradient of negative samples will be very large. Usually, the tumor regions take up only an extremely small part of the histopathology image, so only a few positive samples can be extracted from the tumor region. In our data, the negative patches far outnumber the positive patches. The cross-entropy loss function is usually applied in binary classification. Equation (5) is the standard cross-entropy loss for binary classification, written for the ith sample

$$L_{CE} = -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right]. \tag{5}$$

To solve the problem of extremely imbalanced data distribution, we designed our AttLoss function with reference to the focal loss function [52]. We added a hyperparameter α to the cross-entropy loss function to increase the weight of the positive samples. In general, each sample contributes the same loss weight as the others during training. To further improve the training accuracy, we added a hyperparameter γ to pay more attention to the indistinguishable samples and less attention to the distinguishable samples. Equations (6) and (7) are the focal loss function and our AttLoss function, respectively

$$L_{FL} = -\left[\alpha\, y_i (1 - p_i)^{\gamma} \log p_i + (1 - \alpha)(1 - y_i)\, p_i^{\gamma} \log (1 - p_i)\right] \tag{6}$$

$$L_{Att} = -\alpha^{y_i} \left[y_i \left(1 - p_i^{\gamma}\right) \log p_i + (1 - y_i) \left(1 - (1 - p_i)^{\gamma}\right) \log (1 - p_i)\right]. \tag{7}$$

In (7), $p_i$ represents the probability that the ith sample is predicted as the positive class, $y_i$ is the label (0 or 1) of the ith patch, and α and γ are the hyperparameters. We increase the weight of the positive samples by setting a larger α (α > 1) and decrease the weight of the positive samples by setting a smaller α (0 < α < 1).
The AttLoss function evolves from the standard cross-entropy loss. The factor $\alpha^{y_i}$ decides the loss weight of the positive samples and the negative samples. When the label $y_i$ is 1, $\alpha^{y_i}$ is equal to α, so the loss of a positive sample is α times the original loss. When the label $y_i$ is 0, $\alpha^{y_i}$ is equal to 1, so the loss of a negative sample remains the original loss.
To decrease the weights of the distinguishable samples, we used the factor $1 - p_i^{\gamma}$ for positive samples and the factor $1 - (1 - p_i)^{\gamma}$ for negative samples. For positive samples, the larger $p_i$ is, the smaller the factor $1 - p_i^{\gamma}$ is. For negative samples, the larger $1 - p_i$ is, the smaller the factor $1 - (1 - p_i)^{\gamma}$ is. As a result, the more easily a sample is classified, the less weight it has, which places more weight on the indistinguishable samples.
Compared with the focal loss function, we replaced the factors α and $1 - \alpha$ with $\alpha^{y_i}$, which only increases the weights of the positive samples. We also replaced the factors $(1 - p_i)^{\gamma}$ and $p_i^{\gamma}$ with $1 - p_i^{\gamma}$ and $1 - (1 - p_i)^{\gamma}$, respectively. To compare focal loss and AttLoss more clearly, we made two tables to investigate the impact of γ on the loss weight.
In the focal loss function, when $p_i = 0.5$, the factor $(1 - p_i)^{\gamma}$ represents the weight of an indistinguishable sample; when $p_i = 0.9$, it represents the weight of a distinguishable sample. As shown in Table II, when γ = 1, the weight of an indistinguishable sample ($p_i = 0.5$) is five times the weight of a distinguishable sample ($p_i = 0.9$). When γ = 2, the ratio is 25. When γ = 10, the ratio is about 9.76 million. The factor $(1 - p_i)^{\gamma}$ lowers the weights of the distinguishable samples extremely fast, which gradually degrades the performance on the distinguishable samples.
In our AttLoss function, as can be seen from Table III, when γ = 1, the weight of an indistinguishable sample ($p_i = 0.5$) is five times the weight of a distinguishable sample ($p_i = 0.9$). When γ = 2, the ratio is 3.95. When γ = 10, the ratio is 1.53. In contrast to $(1 - p_i)^{\gamma}$, the factor $1 - p_i^{\gamma}$ in AttLoss decreases the weight of distinguishable samples at a more reasonable rate, and the parameter γ is easier to tune. Our experiments demonstrated that the AttLoss function outperforms the focal loss function.
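The weight ratios quoted above can be reproduced with a short NumPy sketch. It assumes the reading of AttLoss described in this section: the class weight is α raised to the label, and the modulating factors are 1 − p^γ for positive samples and 1 − (1 − p)^γ for negative samples. Averaging over the batch, the clipping constant `eps`, and the function names are assumptions of this sketch; the article's Keras implementation is not shown in the text.

```python
import numpy as np

def att_loss(p, y, alpha=1.0, gamma=2.0, eps=1e-7):
    """Mean AttLoss over a batch of predicted probabilities p and labels y."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    y = np.asarray(y, dtype=np.float64)
    pos = y * (1.0 - p ** gamma) * np.log(p)
    neg = (1.0 - y) * (1.0 - (1.0 - p) ** gamma) * np.log(1.0 - p)
    return float(np.mean(-(alpha ** y) * (pos + neg)))

def focal_weight(p, gamma):
    """Focal-loss modulating factor for a positive sample."""
    return (1.0 - p) ** gamma

def att_weight(p, gamma):
    """AttLoss modulating factor for a positive sample."""
    return 1.0 - p ** gamma

# Ratio between the weight of a hard sample (p = 0.5) and an easy one (p = 0.9).
for gamma in (1, 2, 10):
    focal_ratio = focal_weight(0.5, gamma) / focal_weight(0.9, gamma)
    att_ratio = att_weight(0.5, gamma) / att_weight(0.9, gamma)
    print(f"gamma={gamma}: focal {focal_ratio:.2f}, AttLoss {att_ratio:.2f}")
```

The printed ratios grow as a power of five for focal loss but shrink toward one for AttLoss as γ increases, which is the behavior the two tables are meant to contrast.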

D. FastNet Details
A deep convolutional neural network is trained to classify patches as normal or tumor patches. However, training the network and identifying the tumor regions costs a large amount of computation because of the gigapixel scale of histopathology images. It has been demonstrated that increasing depth with 3 × 3 convolution filters significantly improves performance [53]. Shortcut connections can solve the degradation problem in deeper networks [54]. The multiple branches in the inception module obtain different receptive fields and extract multiscale features. It has been shown that factorizing convolutions and using aggressive regularization improve computational efficiency and reduce the parameter count when scaling up networks [55]. Depthwise separable convolution [56] has been confirmed to be smaller and faster in MobileNets [16].
Inspired by these studies, we propose FastNet (a fast convolutional neural network) to train a model for identifying tumor regions rapidly. FastNet is built by alternately stacking multiscale blocks and multibranch blocks. The multiscale block is designed to extract features at different scales, and the multibranch block is applied to extract more details from the previous layer. Both blocks include shortcut connections to solve the degradation problem when stacked deeper. ReLU6 is used as the nonlinear activation function in both blocks because of its robustness. Depthwise separable convolutions are applied in both blocks to reduce parameters and computation cost. It has been demonstrated that expanding the channels before depthwise separable convolutions reduces the cases in which parameters become zero [33].
The depthwise separable convolution, the multiscale block, the multibranch block, and the FastNet architecture are described in detail in the following sections.
Fig. 6. Standard convolution illustration. Input shape: 7 × 7 × 3, output shape: 5 × 5 × 8, parameters: 216, and FLOPs: 10 800.
Fig. 7. Depthwise separable convolution illustration. Input shape: 7 × 7 × 3, output shape: 5 × 5 × 8, parameters: 51, and FLOPs: 2550.
1) Depthwise Separable Convolution: Depthwise separable convolution is applied in many well-known neural networks [16], [33], [57] to reduce parameters and computation cost. It consists of a depthwise convolution and a pointwise convolution. In the depthwise convolution, a single filter extracts features from each input channel. The pointwise convolution uses a standard 1 × 1 convolution to combine the outputs of the depthwise convolution across channels. A depthwise separable convolution requires only a fraction $\frac{1}{N} + \frac{1}{D_K^{2}}$ of the parameters and computation cost of a standard convolution, where N is the number of output channels and $D_K$ is the kernel size [16]. The standard convolution process is shown in Fig. 6, and the depthwise separable convolution process is shown in Fig. 7. The depthwise separable convolution in Fig. 7 has 51 parameters and 2550 FLOPs, approximately four times fewer than the 216 parameters and 10 800 FLOPs of the standard convolution in Fig. 6.
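The figures quoted for Fig. 6 and Fig. 7 can be checked with a few lines of Python. The sketch assumes a 3 × 3 kernel without padding (inferred from the 7 × 7 input and 5 × 5 output), no bias terms, and a FLOP count of two operations (one multiply and one add) per weight per output position; these conventions are inferred from the numbers rather than stated in the text, and the function names are illustrative.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and FLOPs of a standard k x k convolution (no bias)."""
    params = k * k * c_in * c_out
    flops = 2 * params * h_out * w_out      # multiply + add per weight per position
    return params, flops

def separable_conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and FLOPs of a depthwise separable convolution (no bias)."""
    depthwise = k * k * c_in                # one k x k filter per input channel
    pointwise = c_in * c_out                # 1 x 1 convolution across channels
    params = depthwise + pointwise
    flops = 2 * params * h_out * w_out
    return params, flops

std_params, std_flops = standard_conv_cost(3, 3, 8, 5, 5)
sep_params, sep_flops = separable_conv_cost(3, 3, 8, 5, 5)
ratio = sep_params / std_params             # equals 1/N + 1/D_K^2 = 1/8 + 1/9
```

With eight output channels and a 3 × 3 kernel, the ratio is 1/8 + 1/9 ≈ 0.236, which matches the roughly fourfold reduction between the two figures.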
2) Multiscale Block: The multiscale block architecture is shown in Fig. 8(a). First, we used a standard 1 × 1 convolution to expand the input channels from M to M × N. Then, we used one 1 × 1 depthwise separable convolution, one 3 × 3 depthwise separable convolution, and one 5 × 5 depthwise separable convolution (factorized into two 3 × 3 depthwise separable convolutions [55]) to extract multiscale features. Next, we concatenated the 1 × 1, 3 × 3, and 5 × 5 branches. Finally, we added a standard 1 × 1 convolution to reduce the channels and added a residual connection to the result. The output shape is the same as the input shape. W represents the length and width of the input, M is the number of input channels, N is the channel expansion factor, and B is the number of channels in each scale branch. Increasing N and B gives the multiscale block more robustness and multiscale feature extraction capability, but also more parameters and computation cost.
3) Multibranch Block: The multibranch block architecture is shown in Fig. 8(b). First, we used the same standard 1 × 1 convolution to expand the input channels from M to M × N. Then, B branches of 3 × 3 depthwise separable convolutions were applied to extract more features. Next, we concatenated the B branches and added a residual connection to the result. The output shape is again the same as the input shape. As in the multiscale block, W represents the length and width of the input, M is the number of input channels, and N is the channel expansion factor, but B is the number of branches used in the multibranch block. Increasing N and B gives the multibranch block more robustness and detailed feature extraction capability, but also more parameters and computation cost. We adjusted the channel expansion factor N and the branch number B to trade off robustness, efficiency, and computation cost.

E. FastNet Architecture
In the previous section, we investigated the effects of different scales and branches on parameters, computation cost, and experimental performance. We then built FastNet after comprehensively considering the parameters and computation cost of the multiscale and multibranch blocks. The details of the network architecture are presented in Table IV. First, one convolutional layer was used to extract low-dimensional features and reduce the size of the original image with a 2 × 2 stride. Second, we stacked a convolutional layer, a multiscale block, and a multibranch block alternately three times to downsample the feature maps and to extract multiscale features and finer details. The settings of each convolutional layer, multiscale block, and multibranch block are listed in Table IV. In the multiscale block, M represents the number of input channels, N represents the channel expansion factor, and B represents the number of channels in each multiscale branch. In the multibranch block, M and N have the same meaning, but B is the number of branches used in each multibranch block. Third, global average pooling was used to aggregate the features over the spatial dimensions. Finally, a fully connected layer was used to output the binary classification result. The FastNet architecture is shown in Fig. 9.
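As a rough illustration of the downsampling path, the sketch below traces the side length of a feature map through the network. It assumes a 256 × 256 input patch (the patch size used in the experiments), "same" padding, and a stride of 2 in the convolutional layer of each of the three stages. The per-stage stride is an assumption; the actual values are those listed in Table IV.

```python
import math


def same_conv_output_size(size: int, stride: int) -> int:
    """Output side length of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def trace_spatial_sizes(input_size: int, strides: list[int]) -> list[int]:
    """Side length of the feature map after each strided layer."""
    sizes = [input_size]
    for stride in strides:
        sizes.append(same_conv_output_size(sizes[-1], stride))
    return sizes


# Stem convolution plus three stages, each assumed to downsample by 2.
sizes = trace_spatial_sizes(256, [2, 2, 2, 2])
print(sizes)  # [256, 128, 64, 32, 16]
```

Global average pooling then collapses the final map to one value per channel, so the fully connected layer does not depend on the spatial size.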

A. Data Sets and Experimental Environment
We used the CAMELYON16 challenge data sets as our experimental data. The data consisted of two subsets collected at the medical centers of Radboud University and Utrecht in The Netherlands, containing 400 histopathology images in total obtained from breast cancer patients. All experiments were run on a workstation with an Intel Core i7-8700K CPU, an Nvidia RTX 3090 GPU, and 64 GB of system RAM, running Ubuntu 20.04 LTS. The project was implemented with the deep learning framework Keras. Cross-entropy loss, focal loss, and AttLoss were compared as loss functions, and Adam was used as the optimizer for all experimental models.

B. Evaluation Metrics
We used params, FLOPs, recall, precision, $F_{1}$ score, and accuracy to evaluate our model. Params is the number of trainable parameters of the model. FLOPs is the number of floating point operations, which measures algorithm or model complexity and represents the total number of floating-point computations performed on the CPU and GPU. The recall, precision, $F_{1}$ score, and accuracy are defined in (8)-(11), respectively, where TP, TN, FP, and FN denote the numbers of true-positive, true-negative, false-positive, and false-negative samples. The recall is the percentage of actual positive samples that are correctly predicted as positive. The precision is the percentage of samples predicted as positive that are truly positive. In our experiments, we preferred high recall as well as high precision, so we introduced the $F_{1}$ score, which is the harmonic mean of precision and recall. In classification with extremely imbalanced categories, the $F_{1}$ score is more important than accuracy

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{8}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{9}$$

$$F_{1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{11}$$

Fig. 9. FastNet architecture. Numbers 2, 4, and 8 are the branch numbers used in each multibranch block.

[Table caption note: "ID" means the experiment number; "FLOPs" is measured by floating point operations; "Training Acc" is the training accuracy; "Val Acc" is the validation accuracy; bold font indicates the state-of-the-art result.]
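The four classification metrics can be computed directly from confusion-matrix counts. The following is a minimal sketch; the counts are hypothetical and the zero-division guards are a convention chosen here, not something specified in the text.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Recall, precision, F1 score, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


# Hypothetical, heavily imbalanced example: 10 tumor patches among 1000.
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)
modest_detector = classification_metrics(tp=8, tn=970, fp=20, fn=2)
print(always_negative)
print(modest_detector)
```

In this example a classifier that never predicts tumor reaches a higher accuracy than one that finds 8 of the 10 tumor patches, yet its $F_{1}$ score is zero, which illustrates why the $F_{1}$ score is preferred for imbalanced data.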

C. Experimental Setup
We used the weight accumulation method to remove the background from the training data. Tissue images were split into 256 × 256 patches of two classes at the high-resolution level. The data sets were randomly divided into training and testing data sets. All models were trained on the training data sets and evaluated on the testing data sets. First, we investigated the effect of different scales and branches on the experimental performance. Considering the parameters and computation cost, we chose suitable scales and numbers of branches for FastNet based on these results. Second, we compared the performance of depthwise separable convolution with that of standard convolution, and additionally investigated the impact of using shortcuts in the multiscale and multibranch blocks on network performance. Third, we compared our AttLoss with cross-entropy loss and focal loss on four popular lightweight networks. Finally, we compared FastNet with MobileNet V2, ShuffleNet V2, and EfficientNet-B0 under each of the three loss functions.
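The patch extraction and random split can be sketched as follows. This is an illustration under assumptions: tiles are non-overlapping, partial tiles at the image border are dropped, and the 0.8 training fraction is a placeholder because the split ratio is not given in the text. Tissue-mask filtering and class labeling are omitted, and the function names are hypothetical.

```python
import random


def patch_origins(width: int, height: int, patch: int = 256) -> list[tuple[int, int]]:
    """Top-left corners of non-overlapping patch x patch tiles inside the image."""
    return [
        (x, y)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]


def random_split(items: list, train_fraction: float, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of items and cut it into training and testing parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical 1024 x 768 tissue region; 0.8 is an assumed split ratio.
origins = patch_origins(1024, 768)
train, test = random_split(origins, train_fraction=0.8)
print(len(origins), len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when several networks and loss functions are compared on the same data.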

D. Performance Comparison of Different Scales and Different Branches
We designed a network with a single multiscale block to investigate the effects of different scales on the experimental results. In experiment 1, we used only one 1 × 1 convolution branch in the multiscale block as the baseline. In experiment 2, a 1 × 1 convolution branch and a 3 × 3 convolution branch were concatenated in the multiscale block. In experiment 3, 1 × 1, 3 × 3, and 5 × 5 convolution branches were concatenated. In experiment 4, we concatenated 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolution branches in the multiscale block. The experimental results are shown in Table V, which demonstrate that more scale branches in the multiscale block can significantly improve network performance but bring more parameters and computation cost. We therefore made a tradeoff between performance and model complexity and adopted the 1 × 1, 3 × 3, and 5 × 5 convolution branches to extract multiscale features in the multiscale block. Similarly, we designed a network with a single multibranch block to investigate the effects of different numbers of branches on the experimental results. We conducted five experiments with 1, 2, 4, 8, and 16 branches, respectively. The experimental results are shown in Table VI, which demonstrate that more branches can significantly improve network performance but bring more parameters and computation cost. Consequently, we adjusted the number of branches in FastNet as required.

[Table caption fragments: the loss function is the cross-entropy loss function; "FLOPs" is measured by floating point operations; bold font indicates the state-of-the-art result. Table VIII: Performance comparisons.]

E. Comparison Among Standard Convolution, Depthwise Separable Convolution and Shortcut
First, we evaluated the FastNet architecture using standard convolution and depthwise separable convolution, respectively. As shown in Table VII, comparing experiment 1 with 3 and experiment 2 with 4, FastNet with depthwise separable convolution achieves higher accuracy, precision, and $F_{1}$ score, and a competitive recall, compared with FastNet with standard convolution, while using far fewer parameters and much less computation.
Second, we investigated the impact of using shortcuts in the multiscale and multibranch blocks. Comparing experiment 1 with 2 and experiment 3 with 4, we found that FastNet with shortcuts outperforms FastNet without shortcuts in accuracy, recall, precision, and $F_{1}$ score. Considering the efficiency of depthwise separable convolution and the strong performance of shortcuts, in the following experiments FastNet used depthwise separable convolution instead of standard convolution and included shortcuts by default.

F. Performance Comparison for Attention Loss Function With Cross-Entropy Loss Function and Focal Loss Function
In tumor detection, both recall and precision are extremely important. Because the $F_{1}$ score is the harmonic mean of precision and recall, we considered it the primary evaluation metric and accuracy the secondary one. We evaluated the performance of the AttLoss function and compared it with the cross-entropy and focal loss functions. The loss function comparison experiments were performed on MobileNet V2, ShuffleNet V2, EfficientNet-B0, and FastNet, respectively. Experimental results are shown in Table VIII. In the 4th group of experiments, the proposed AttLoss function achieved the best accuracy, recall, and $F_{1}$ score, with competitive precision. The 1st, 2nd, and 3rd groups are the ablation experiments, in which AttLoss also achieved the best $F_{1}$ score and accuracy, with competitive recall and precision, across the different neural networks.
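For reference, the two baseline losses can be sketched for a single binary prediction as follows. The code shows the standard cross-entropy and the standard α-balanced focal loss, not the proposed AttLoss, whose exact form is not reproduced here. The defaults α = 0.25 and γ = 2 are commonly used focal loss settings and are an assumption, not necessarily the values used in these experiments.

```python
import math


def cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for predicted tumor probability p and label y."""
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    return -math.log(max(p_t, eps))


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0,
               eps: float = 1e-12) -> float:
    """Alpha-balanced focal loss: down-weights well-classified samples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# An easy positive (p = 0.95) versus a hard positive (p = 0.30).
for p in (0.95, 0.30):
    print(p, round(cross_entropy(p, 1), 4), round(focal_loss(p, 1), 6))
```

The modulating factor shrinks the loss of the easy sample far more than that of the hard one, which is the behavior that loss functions emphasizing indistinguishable samples build on.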

G. Performance Comparison for FastNet With MobileNet V2, ShuffleNet V2, and EfficientNet-B0
We evaluated and compared the performance of FastNet with MobileNet V2, ShuffleNet V2, and EfficientNet-B0 using different loss functions. The experimental results are shown in Table IX. In the 1st group of experiments, using the cross-entropy loss function, FastNet achieved the best accuracy, precision, and $F_{1}$ score, and a suboptimal recall, with the fewest trainable parameters and the lowest computational complexity. In the 2nd group, using the focal loss function, FastNet achieved the best accuracy, precision, and $F_{1}$ score, and a competitive recall, again with the fewest trainable parameters and the lowest computational complexity. In the 3rd group, using the proposed AttLoss function, FastNet achieved the best accuracy, recall, and $F_{1}$ score, and a suboptimal precision, with the fewest trainable parameters and the lowest computational complexity. Compared with the 1st and 2nd groups, FastNet with the AttLoss function in the 3rd group achieved the best accuracy, recall, and $F_{1}$ score. Therefore, the experimental results demonstrate that the proposed FastNet outperforms MobileNet V2, ShuffleNet V2, and EfficientNet-B0 in our medical image classification task. Its small number of trainable parameters and low computational complexity make FastNet more efficient on portable devices.

V. CONCLUSION
In this article, a TFI framework was presented to detect tumor regions quickly. In the TFI framework, a weight accumulation method for histopathology images was designed to greatly reduce redundant computation. We removed the background area in the RGB color model and in the HSV color model separately and then combined the tissue masks from the two color models, which increases the accuracy of the tissue mask. A novel AttLoss function was proposed to increase the weights of positive samples and indistinguishable samples. In the AttLoss function, we added a hyperparameter α to the cross-entropy loss function to decrease the weight of negative samples and a hyperparameter γ to pay more attention to indistinguishable samples and less attention to distinguishable samples. A lightweight FastNet was proposed for patch classification in the TFI framework. Compared with MobileNet V2, ShuffleNet V2, and EfficientNet-B0, the proposed FastNet achieved the highest recall, accuracy, and $F_{1}$ score with the fewest trainable parameters and the lowest computational cost. However, the approach has some drawbacks. Regarding a whole patch as either tumor or normal is not precise enough, because patches at the edge of tumor regions contain both tumor pixels and normal tissue pixels. Future work will focus on pixel-level tumor segmentation to obtain more precise tumor boundaries.