A Regional-Attentive Multi-Task Learning Framework for Breast Ultrasound Image Segmentation and Classification

Breast ultrasound (BUS) imaging is commonly used in the early detection of breast cancer as a portable, valuable, and widely available diagnosis tool. Automated BUS image classification and segmentation can assist radiologists in making accurate and fast decisions. Recent studies illustrate that tumor, peritumoral, and background regions of BUS images provide valuable information for BUS image segmentation or classification. However, few studies have investigated the influence of these three regions on multi-task learning. In this study, we propose an RMTL-Net to simultaneously segment tumor regions and classify tumors in BUS images into benign or malignant categories. To improve both segmentation and classification performance, we design a regional attention (RA) module that employs the predicted probability maps to automatically guide the classifier to learn important category-sensitive information in the tumor, peritumoral, and background regions and seamlessly fuse them to obtain a better feature representation. We conduct detailed ablation experiments of the proposed RA module and comparative experiments with four recent state-of-the-art peer multi-task learning methods, four single-task segmentation methods, and three single-task classification methods on two public BUS datasets. Experimental results show that the proposed RMTL-Net achieves the best overall segmentation and classification accuracy in terms of five segmentation metrics and six classification metrics.


I. INTRODUCTION
Breast cancer is a significant threat to women's health: in 2020, it was the most commonly diagnosed cancer and the leading cause of cancer mortality among women worldwide [1]. Mortality rates are much higher in low- and middle-income countries than in high-income countries due to delayed detection and treatment [2], [3]. Mammography and breast ultrasound (BUS) are two popular screening modalities for early breast cancer detection, which leads to appropriate treatment and increased survival rates. BUS has been commonly used in the early diagnosis of breast cancer in women of all ages, especially in low- and middle-income countries, because it is portable, widely available, low-cost, and highly sensitive [4], [5]. Computer-aided-diagnosis (CAD) systems are proposed to help radiologists interpret BUS images, make a more accurate diagnosis, and reduce their workload [6], [7]. In general, a CAD system for breast cancer detection includes automated segmentation and classification as two primary steps for further processing. Automated analysis of BUS images can help radiologists make efficient diagnoses of breast cancer. However, it is still challenging due to the lack of public training data and the high variability of tumors in shape, size, and location [8], [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Ravibabu Mulaveesala.
BUS image segmentation methods can be classified into semi-automated [10], [11], [12] and fully automated methods [13], [14] based on human intervention. Fully automated BUS image segmentation is the trend in future BUS CAD systems since it is reproducible and suitable for large-scale tasks [15]. Fully automated deep learning-based methods, especially U-Net [16] based methods [9], [17], [18], have recently gained increased popularity. For example, Wang et al. [9] propose a fusion deep learning network to address issues of unclear boundaries and large variations in tumors in BUS images. It uses an encoder to capture the context information, a decoder to localize prediction, and a fusion to combine information from the encoder and the decoder. Amiri et al. [17] propose a two-stage U-Net architecture: one for tumor detection and one for tumor segmentation. They also prove that detection and its evaluation in the first stage improve segmentation results in the second stage.
Convolutional neural networks (CNNs) have recently achieved superior performance compared to traditional machine learning classification methods such as support vector machine [19], K-nearest neighbors [20], random forest [21], and Gaussian mixture models [22]. Among them, VGG [23], ResNet [24], and their variants are widely used for BUS image classification. Liao et al. [25] adopt a supervised block-based segmentation algorithm to separate tumor regions from BUS images and then use VGG-19 to classify segmented tumor regions as benign or malignant. Cui et al. [26] propose to use ResNet-34 as the backbone feature extractor and design a fused network to combine features of tumor, peritumoral, and combined-tumoral (combination of tumor and peritumoral) regions to achieve better classification results.
Multi-task learning (MTL) for simultaneous BUS segmentation and classification has recently been extensively studied in the computer vision community. Benign and malignant breast tumors have different characteristics [27], [28]. For example, benign tumors tend to be smooth, round, and well circumscribed whereas malignant tumors are typically rough and spiculated. In addition, malignant tumors tend to have spiculated margins and posterior acoustic shadows. Based on these observations, many MTL [29], [30], [31], [32] studies are proposed to join BUS image segmentation and classification tasks in one network to encourage feature sharing during training to improve both tasks. These MTL methods are mostly based on a U-Net structure (i.e., an encoder-decoder network for segmentation) and some of them [30], [32] include attention mechanisms to achieve better classification performance. For example, Zhou et al. [29] propose an MTL framework with a light-weight multi-scale network to iteratively refine features to highlight tumor regions for better 3D BUS image classification. Chowdary et al. [31] propose an MTL framework with a dense branch to combine multi-scale features from different layers of the network for efficient classification of BUS images. Zhang et al. [30] propose an MTL framework with soft and hard attention mechanisms to guide the model to pay more attention to tumor regions to boost classification accuracy. Xu et al. [32] propose an MTL framework with a context-oriented self-attention (COSA) module to incorporate prior medical knowledge to guide the model to learn contextual relationships for better segmentation and classification performance.
Recently, several studies have demonstrated that tumor, peritumoral (the tumor-adjacent area surrounding the tumor), and background regions in BUS images help to improve the diagnosis accuracy of breast cancer in CAD methods [26], [33], [34], [35]. Lee et al. [34] use the mask R-CNN to extract tumor regions from BUS images and obtain peritumoral regions via a dilation operation. They then use a deep learning model to train tumor, peritumoral, and their combined-tumoral regions to predict axillary lymph node (ALN) metastasis status, which is important in guiding treatment in breast cancer. Sun et al. [33] build two models based on tumor, peritumoral, and combined-tumoral regions and compare their performance to show that peritumoral and combined-tumoral regions achieve significantly better performance in predicting ALN metastasis in BUS images for both models.
Tumor, peritumoral, and background regions of a BUS image have been further studied to provide important category-sensitive information to improve the aforementioned methods to achieve better segmentation or classification results. Specifically, the peritumoral region in BUS images was discussed in the BUS image classification task [26] and the ALN metastasis prediction task [33], [34] to further improve their accuracy. Cui et al. [26] use an encoder-decoder structure to obtain three tumoral regions at different resolutions to extract tumor features (e.g., component, internal echo, and aspect ratio), peritumoral features (e.g., tumor boundary patterns), and background features (e.g., contextual relationship between the tumor and surrounding tissues). These features lead to higher computational costs but better classification results. Despite the success of the utilization of three tumoral regions, they have hardly been employed in simultaneous BUS image segmentation and classification. To the best of our knowledge, the research work of Xu et al. [32] is the pioneer in this direction. They employ three tumoral regions in a BUS image to improve the MTL performance. However, their extracted peritumoral region is small, which may not provide sufficient information for simultaneous BUS image segmentation and classification.
5378 VOLUME 11, 2023
FIGURE 2. Illustration of two examples of BUS images, their ground truth and pseudo ground truth regions, and three probability maps generated by the proposed RMTL-Net. First column: Original BUS images with a benign tumor shown at the top row and a malignant tumor shown at the bottom row. Second column: Pseudo ground truth regions produced by the proposed pre-processing method, where the peritumoral region is shown in green and the background region is shown in black. The ground truth tumor region is shown in red. Third column: Three regions containing category-sensitive information overlaid on the original image, where the tumor region is within the red line, the peritumoral region is between green and red lines, and the background region is outside the green line. Fourth column: Probability map of the tumor region. Fifth column: Probability map of the peritumoral region. Sixth column: Probability map of the background region.
In this paper, we propose a regional attention (RA) module to learn category-sensitive features from three regions (i.e., tumor, peritumoral, and background regions) in BUS images and investigate their influence on MTL. We also apply the proposed RA module to a two-stage MTL framework to demonstrate its efficacy in BUS image segmentation and classification. The proposed regional-attentive multi-task learning framework (RMTL-Net) consists of an encoder-decoder network for segmentation and a light-weight network for classification. Both segmentation and classification share features extracted from the encoder. In addition, the RA module utilizes the predicted probability maps to guide the classification network to learn weighted region-attentive features for more accurate classification. The overall framework of the proposed RMTL-Net is illustrated in Fig. 1. We conduct extensive experiments on two public BUS datasets that include 810 BUS images in total to evaluate the performance of RMTL-Net and its variants and compare RMTL-Net with several state-of-the-art single-task and multi-task methods. Experimental results show that RMTL-Net boosts the performance of both segmentation and classification tasks. Our main contributions are summarized as follows:
• We design a novel MTL framework, named RMTL-Net, for simultaneous tumor segmentation and classification in BUS images. The proposed RMTL-Net outperforms recent state-of-the-art segmentation and classification methods on two public BUS datasets.
• We propose an RA module to improve both segmentation and classification performance. It employs the predicted probability maps to automatically guide the classifier to learn important category-sensitive information in the tumor, peritumoral, and background regions.
• We conduct extensive experiments on two public BUS datasets. Experimental results demonstrate the efficacy of RMTL-Net in BUS image segmentation and classification and the importance of tumor, peritumoral, and background regions of BUS images.

II. MATERIALS AND METHODS
In this section, we first present the materials in terms of two datasets and the proposed pre-processing method to prepare the training images and their pseudo ground truth images. We then describe the proposed method in terms of its network architecture and the regional attention (RA) module.

A. MATERIALS
1) DATASETS
Two public BUS datasets used in this study are UDIAT [36] and BUSI [37]. From dataset BUSI, we use 647 images with benign or malignant tumors for binary classification in this study. Its ground truth is labeled by radiologists from Baheya.

2) PRE-PROCESSING
In the proposed method, all images are resized to 256 × 256 by bilinear interpolation before being fed into RMTL-Net. During training, each input BUS image is augmented by four transformations applied in the following order: (i) rotation about the image center by an angle between -5 and 5 degrees, (ii) random flipping horizontally, vertically, or both, (iii) Gaussian blur, and (iv) median blur. Given a ground truth BUS image that contains the tumor contour, we generate two pseudo ground truth regions: peritumoral and background regions. First, we employ a Laplacian edge detector on the ground truth image to find the contour of the tumor region. Second, we dilate the tumor region by 32 pixels and subtract the tumor region from the dilated result to obtain the peritumoral region. We choose 32 pixels in dilation so that the peritumoral region is still present at the lowest resolution: the series of down-sampling operations in RMTL-Net reduces a 256 × 256 input to 8 × 8 feature maps, a factor of 32, so a 32-pixel-wide ring remains about one pixel wide. Third, we treat the remaining region as the background region. The first three columns in Fig. 2 present example BUS images, their ground truth tumor regions labeled by radiologists together with their pseudo ground truth peritumoral and background regions produced by the proposed pre-processing method, and the three regions overlaid on the original images. An image containing the ground truth tumor region, the pseudo ground truth peritumoral region, and the pseudo ground truth background region is further used during the training process to learn the boundaries delineating tumor, peritumoral, and background regions.
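The region-generation steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it starts from an already filled binary tumor mask (skipping the edge-detection step), assumes a square structuring element for the dilation, and uses hypothetical class indices (0 = background, 1 = peritumoral, 2 = tumor). The function names `dilate` and `make_region_labels` are illustrative only.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation by `radius` pixels using repeated 3 x 3 dilations.

    Assumption: a square structuring element; the paper does not state one.
    """
    out = mask.astype(bool).copy()
    rows, cols = out.shape
    for _ in range(radius):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask (3 x 3 neighborhood).
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + rows, dx:dx + cols]
        out = grown
    return out


def make_region_labels(tumor_mask: np.ndarray, radius: int = 32) -> np.ndarray:
    """Build a three-class label map from a binary tumor mask.

    Hypothetical class indices: 0 = background, 1 = peritumoral, 2 = tumor.
    """
    tumor = tumor_mask.astype(bool)
    # Peritumoral ring = dilated tumor minus the tumor itself.
    peritumoral = dilate(tumor, radius) & ~tumor
    labels = np.zeros(tumor.shape, dtype=np.uint8)  # everything else is background
    labels[peritumoral] = 1
    labels[tumor] = 2
    return labels
```

Calling `make_region_labels(mask)` on a 256 × 256 boolean mask returns a label map of the same size that could serve as the three-class training target described in this section.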

B. METHODS
The proposed RMTL-Net improves upon its peer MTL-COSA [32] in the following five aspects:
• Unlike MTL-COSA that generates a binary segmentation result, RMTL-Net generates a binary segmentation result and three probability maps for tumor, peritumoral, and background regions, respectively.
• Unlike MTL-COSA that uses the contour of segmented tumors to find binary segmentation masks for tumor, peritumoral, and background regions, RMTL-Net uses probability maps generated from the network to estimate tumor, peritumoral, and background regions in BUS images and feeds them as estimated prior medical knowledge into the RA module to guide the classification task.
• Unlike MTL-COSA that extracts the peritumoral region by dilating the segmented tumor boundary, RMTL-Net is trained to generate respective probability maps for tumor, peritumoral, and background regions to gather more detailed categorical information than the binary masks extracted by MTL-COSA.
• Unlike MTL-COSA whose peritumoral region has a ring area of width of 5 pixels evenly covering the background and tumor areas, RMTL-Net extracts a bigger peritumoral region with a ring-like area of width of 32 pixels outside of the tumor to provide sufficient information at the lowest resolution to facilitate classification.
• Unlike MTL-COSA that uses self-attention to learn important classification features, RMTL-Net replaces it with the RA module to significantly reduce network parameters by 14.40% and reduce both training and testing times yet achieve better overall segmentation and classification performance.

1) NETWORK ARCHITECTURE
The detailed network architecture of the proposed RMTL-Net is illustrated in Fig. 3. RMTL-Net is a two-stage framework that consists of a segmentation stage and a classification stage. The segmentation stage utilizes a U-shaped architecture consisting of an encoder, a decoder, and skip connections to extract multi-scale features and predict three respective probability maps for tumor, peritumoral, and background regions, as shown in the last three columns in Fig. 2. The classification stage uses shared features extracted from the encoder and three probability maps generated from the segmentation stage to produce classification results. Specifically, we use the peritumoral region to capture boundary characteristics, which are useful to differentiate benign and malignant tumors. We use the tumor region to capture the shape properties of tumors, which are useful for both tumor segmentation and classification. We use the background region to capture posterior acoustic shadowing, which is observed more for malignant lesions and less for benign tumors due to attenuation of the sonographic signal [27], [38]. Sharing features makes segmentation and classification promote each other during the training process. In addition, it addresses the problem of having insufficient training images for classification: because each pixel is a training sample in segmentation, the segmentation stage has sufficient training samples, and sharing features with it improves the overall accuracy and robustness of the classification stage.
We use ResNet-101 [24] as the backbone of the segmentation stage of RMTL-Net due to its great performance in BUS image segmentation and classification [18], [26]. The architecture of ResNet-101 remains the same. Specifically, the encoder utilizes one convolutional layer Conv1 together with four residual blocks (Conv2_x to Conv5_x) to perform five down-sampling operations to extract multi-scale features from input images. Multi-scale features extracted by Conv1 to Conv5_x are of sizes 128 × 128 × 64, 64 × 64 × 256, 32 × 32 × 512, 16 × 16 × 1024, and 8 × 8 × 2048, respectively. The decoder symmetrically utilizes four deconvolutional blocks (Deconv4 to Deconv1) and one convolutional layer (Conv2) followed by bilinear interpolation and softmax operations to perform up-sampling operations. Skip connections between the encoder and decoder combine feature maps in different scales to compensate for the loss of spatial information during down-sampling operations and to refine segmentation outcomes. As a result, multi-scale features are restored to the original input size and are further interpreted to predict three probability maps.
We use three probability maps generated from the segmentation stage of RMTL-Net and multi-scale high-level features shared by both segmentation and classification stages to produce classification results.
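The feature-map sizes quoted above follow from halving the spatial resolution at each of the five encoder stages. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the ring-width estimate is a rough integer approximation that ignores how interpolation and pooling actually blend neighboring pixels.

```python
def encoder_feature_shapes(input_size=256, channels=(64, 256, 512, 1024, 2048)):
    """Spatial size and channel count after each of the five encoder stages."""
    shapes = []
    size = input_size
    for num_channels in channels:
        size //= 2  # each stage halves the spatial resolution
        shapes.append((size, size, num_channels))
    return shapes


def ring_width_at_lowest_resolution(ring_width_px, num_stages=5):
    """Rough width (in feature-map cells) of a ring after all down-sampling stages."""
    return ring_width_px // (2 ** num_stages)
```

Under these assumptions, `encoder_feature_shapes()` ends at (8, 8, 2048), and a 32-pixel ring maps to one cell of the 8 × 8 feature map while a 5-pixel ring maps to less than one cell, which is consistent with the motivation given for the wider peritumoral region.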

2) REGIONAL ATTENTION MODULE
Unlike classical image classification networks (e.g., VGG [23] and ResNet [24]), we add a regional attention (RA) module to further encourage information sharing. This RA module outputs a weighted feature vector of size 1 × 2048 that is passed to a fully connected layer to generate more accurate classification results.
We observe that benign and malignant tumors exhibit different characteristics. For example, benign tumors tend to be smooth and round, whereas malignant tumors tend to be rough with an aspect ratio greater than 1 [27], [28]. Benign tumors tend to have smooth, thin, and regular margins and malignant tumors tend to have spiculated, thick, and irregular margins. Benign tumors tend to have less posterior acoustic shadowing in the background region than malignant lesions. As a result, we propose to utilize tumor, peritumoral, and background regions to learn their inherently important characteristics including tumor features (e.g., component, internal echo, and aspect ratio), tumor boundary patterns (e.g., smoothness, shape, and contextual texture between tumor and surrounding tissues), and background features (posterior acoustic shadowing) [27], [35] to help with the joint segmentation and classification tasks. In addition, we propose to include an RA module in the classification stage of RMTL-Net to encourage information sharing and output a weighted feature vector to facilitate classification. This RA module combines multi-scale high-level features with three probability maps generated from the segmentation stage to guide the learning of category-sensitive features from three regions, namely, tumor, peritumoral, and background regions. Category-sensitive features are represented as a weighted feature vector, which is passed to a fully connected layer to generate more accurate classification results. Fig. 4 shows six examples of BUS images from each of the two datasets that contain benign and malignant tumors, respectively. Tumor regions with high variability in shape, size, and location are delineated by red lines. When using these images as training images, we generate their pseudo ground truth peritumoral and background regions using the pre-processing method explained in Section II-A2.
When using these images as testing images, RMTL-Net predicts their probability maps as shown in Fig. 2.
The structure diagram of the proposed RA module is shown in Fig. 5. The algorithmic view of the RA module is summarized below:
Input: $C_5$ (the feature map of size 8 × 8 × 2048 extracted by Conv5_x of the encoder) and $P$ (the probability map of size 256 × 256 × 3 generated by the last convolutional layer Conv2 of the decoder).
Output: A weighted feature map $F_W$ of size 1 × 2048.
1) Split $P$ into three probability maps $P_T$, $P_P$, and $P_B$ of size 256 × 256, where subscripts $T$, $P$, and $B$ represent tumor, peritumoral, and background, respectively.
2) Employ the nearest neighbor method to resize $P_T$, $P_P$, and $P_B$ to obtain coarse probability maps $P'_T$, $P'_P$, and $P'_B$ of size 8 × 8.
3) Utilize a threshold of 0.5 to filter coarse probability maps $P'_T$, $P'_P$, and $P'_B$ to obtain three noise-free probability maps $P''_T$, $P''_P$, and $P''_B$, respectively. Specifically, values greater than 0.5 in coarse probability maps are kept intact, and values smaller than or equal to 0.5 are set to 0:
$$P''_x(i, j) = \begin{cases} P'_x(i, j), & \text{if } P'_x(i, j) > 0.5 \\ 0, & \text{otherwise} \end{cases}$$
where subscript $x$ can be replaced with $T$, $P$, or $B$.
4) Individually and elementwisely multiply $P''_x$ with each channel of $C_5$ to generate multi-channel weighted regional feature maps $C_x$.
5) Apply the global average pooling (GAP) on $C_x$ to capture weights of each region in its corresponding $G_x$ of size 1 × 2048:
$$G_x = \mathrm{GAP}(C_x)$$
6) Concatenate $G_T$, $G_P$, and $G_B$ to construct a new feature vector $F$ of size 3 × 2048:
$$F = \mathrm{Concat}(G_T, G_P, G_B)$$
7) Apply a 1 × 1 convolution filter to $F$ to generate a weighted feature map $F_W$ of size 1 × 2048.
It should be noted that all non-zero pixels in $P''_T$, $P''_P$, and $P''_B$ have high likelihood values larger than 0.5, which indicate high strength of tumor, peritumoral, and background features, respectively. We choose 0.5 as the threshold because it classifies a pixel into one of the three classes. The multiplication of $C_5$ and $P''_T$, $P''_P$, and $P''_B$ leads to multi-channel weighted tumor, peritumoral, and background features $C_T$, $C_P$, and $C_B$. The GAP operation further finds the features in each channel of $C_T$, $C_P$, and $C_B$ to best represent three respective regions. The concatenation operation followed by the 1 × 1 convolution constructs a weighted sum of multi-view features from three parallel channels (i.e., $G_T$, $G_P$, and $G_B$), which can be formulated as:
$$F_W = w_1 G_T + w_2 G_P + w_3 G_B$$
where $w_1$, $w_2$, and $w_3$ indicate the importance of tumor, peritumoral, and background regions, respectively. These weights are automatically learned during the training process. Finally, $F_W$ is passed to a fully connected layer followed by a softmax activation function for automated tumor classification. $F_W$ captures the importance of each region for better feature representation and therefore leads to better classification results than using a non-weighted feature map (i.e., convolving $C_5$ with a feature vector of 1 × 2048). In summary, the proposed RA module follows the perspectives of radiologists to learn multi-view features from three regions in BUS images to achieve better segmentation and classification performance. Specifically, the tumor region helps to extract the basic features of breast tumors. The peritumoral region helps to capture tumor boundary patterns. The background region helps to collect contextual information.
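The seven steps of the RA module can be sketched as a forward pass in NumPy. This is an illustrative sketch under several assumptions rather than the authors' code: the probability channels are assumed to be ordered (tumor, peritumoral, background), nearest-neighbor resizing is assumed to sample at index floor(i × 256 / 8), the 1 × 1 convolution is modeled as three scalar weights without a bias term, and the weights are passed in as fixed numbers instead of being learned. The names `nearest_resize` and `regional_attention` are hypothetical.

```python
import numpy as np


def nearest_resize(prob_map: np.ndarray, out_size: int) -> np.ndarray:
    """Nearest-neighbor down-sampling of a square 2-D map to out_size x out_size."""
    in_size = prob_map.shape[0]
    idx = (np.arange(out_size) * in_size) // out_size  # assumed sampling rule
    return prob_map[np.ix_(idx, idx)]


def regional_attention(c5, prob, w, threshold=0.5):
    """Forward pass of the RA module sketch.

    c5:   (H, W, C) encoder features, e.g. 8 x 8 x 2048.
    prob: (S, S, 3) probability maps, assumed order (tumor, peritumoral, background).
    w:    three scalars standing in for the learned 1 x 1 convolution weights.
    Returns the weighted feature vector F_W of shape (C,).
    """
    h = c5.shape[0]
    pooled = []
    for x in range(3):                                      # step 1: split P into three maps
        coarse = nearest_resize(prob[:, :, x], h)           # step 2: resize to H x W
        gated = np.where(coarse > threshold, coarse, 0.0)   # step 3: zero out values <= 0.5
        weighted = c5 * gated[:, :, None]                   # step 4: weight every channel
        pooled.append(weighted.mean(axis=(0, 1)))           # step 5: global average pooling
    f = np.stack(pooled)                                    # step 6: F of shape 3 x C
    return np.asarray(w, dtype=float) @ f                   # step 7: w1*G_T + w2*G_P + w3*G_B
```

In a trainable model the three weights would be parameters of a convolution layer updated by backpropagation, and the output would feed a fully connected layer followed by softmax.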

3) LOSS FUNCTION
The overall loss of RMTL-Net is computed by the weighted sum of the loss of the segmentation task $L_{seg}$ and the loss of the classification task $L_{cls}$:
$$L = \lambda L_{seg} + (1 - \lambda) L_{cls} \qquad (7)$$
where $\lambda$ and $1 - \lambda$ are contribution weights of losses from segmentation and classification tasks, respectively. Cross entropy is employed to compute both $L_{seg}$ and $L_{cls}$. Let $K$ denote the number of classes in a given task, $N$ denote the number of images, and $P$ denote the number of pixels in an image. In the segmentation task, there are 3 classes representing tumor, peritumoral, and background regions. In other words, $K = 3$. The pixel-wise cross entropy $L_{seg}$ of the segmentation task is computed as follows:
$$L_{seg} = -\frac{1}{P} \sum_{p=1}^{P} \sum_{k=1}^{K} y_{p,k} \log \hat{y}_{p,k}$$
where $y_{p,k}$ and $\hat{y}_{p,k}$ represent the true and predicted probability of pixel $p$ belonging to class $k$, respectively. The true probability $y_{p,k}$ is either 0 or 1 since each pixel belongs to one of the three classes. The predicted probability $\hat{y}_{p,k}$ is in the range of [0, 1].
In the classification task, there are 2 classes representing benign and malignant tumors. In other words, $K = 2$. The image-wise cross entropy $L_{cls}$ of the classification task is computed as follows:
$$L_{cls} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log \hat{y}_{n,k}$$
where $y_{n,k}$ and $\hat{y}_{n,k}$ represent the true and predicted category of image $n$ belonging to class $k$, respectively. Both $y_{n,k}$ and $\hat{y}_{n,k}$ are either 0 or 1.
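The weighted combination of the two cross-entropy losses can be illustrated with a small NumPy sketch. It assumes one-hot targets, averages over samples (pixels for segmentation, images for classification), and clips predictions to avoid taking the logarithm of zero; the clipping constant and the function names are illustrative choices, not details from the paper.

```python
import numpy as np


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross entropy over samples.

    y_true: one-hot targets of shape (num_samples, K).
    y_pred: predicted probabilities of shape (num_samples, K).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())


def total_loss(l_seg, l_cls, lam=0.9):
    """Weighted sum of segmentation and classification losses."""
    return lam * l_seg + (1.0 - lam) * l_cls
```

For the segmentation term, the rows of `y_true` and `y_pred` would be the pixels of an image with K = 3; for the classification term, the rows would be images with K = 2.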

III. EXPERIMENTAL SETUP AND RESULTS
In this section, we first present the implementation details. We then describe the performance evaluation metrics followed by the competing methods. Finally, we present the experimental results of the proposed RMTL-Net method, its ablation study, and its comparison with the competing methods.

A. IMPLEMENTATION DETAILS
The implementation of the proposed method is based on the public platform PyTorch 1.4. All experiments are conducted on an Ubuntu 18.04 system with an Intel(R) Core(TM) i5-11600K CPU at 3.9 GHz. All models are trained and tested on a GeForce RTX 3080 Ti graphics card with 12 GB memory using the Adam optimizer with momentum $\beta_1$ of 0.9, momentum $\beta_2$ of 0.99, a weight decay of 0.0001, and a learning rate initialized at 0.0001 and decayed at 10% after every 20 epochs.
In the training procedure, the batch size is set as 16 and the number of training epochs is set as 100. Following the empirically optimal setup [24], we adopt batch normalization right after each convolution and before activation. To reduce overfitting, we adopt dropout with a probability of 0.5 in the fully connected layer of the classification network. The contribution weight of loss from the segmentation task (i.e., λ) is empirically set to be 0.9. All competing methods, including ResNet, U-ResNet, MTL-Net, MTL-COSA, and RMTL-Net models, are pre-trained on ImageNet and fine-tuned with training images selected from datasets UDIAT and BUSI. To evaluate the performance of different methods, we conduct five-fold cross-validation in all experiments, including multi-task learning, ablation, and comparative studies. Because dataset UDIAT is small, classification performance differs by 3% between multiple runs even when we use five-fold cross-validation to train and test on it. To increase the credibility of experimental results, we train all competing methods on the two datasets together and test on the two datasets separately. Specifically, for each dataset, we split the data into five groups, where each group keeps the same proportion of benign and malignant cases as in the original dataset. In each fold experiment, four groups of each dataset are combined and used as the training set, and the other group is used as the testing set. In this study, all experimental results are reported by averaging the five-fold cross-validation performance.
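The joint cross-validation protocol described above (stratified groups per dataset, joint training, separate testing) can be sketched with the Python standard library alone. The function names, the round-robin assignment of shuffled indices to groups, and the dictionary keys "A" and "B" for the two datasets are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k groups that keep the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the k groups.
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds


def joint_cross_validation(labels_a, labels_b, k=5, seed=0):
    """Yield k splits: train on four groups of both datasets, test on each dataset separately."""
    folds_a = stratified_folds(labels_a, k, seed)
    folds_b = stratified_folds(labels_b, k, seed)
    for f in range(k):
        train_a = [i for g in range(k) if g != f for i in folds_a[g]]
        train_b = [i for g in range(k) if g != f for i in folds_b[g]]
        yield {
            "train": {"A": train_a, "B": train_b},   # combined into one training set
            "test": {"A": folds_a[f], "B": folds_b[f]},  # evaluated per dataset
        }
```

Each yielded split holds index lists into the two datasets; the reported result for a dataset would be the average of its five per-fold test scores.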

B. PERFORMANCE EVALUATION
We employ commonly used BUS segmentation metrics [9], [15], [17], [18], [30], [31] including sensitivity (SEN), specificity (SPE), accuracy (ACC), dice similarity coefficient (DSC), and intersection over the union of tumor (tumor IoU) to quantitatively evaluate the segmentation performance. Higher values of these metrics represent better segmentation performance. Specifically, SEN and SPE measure the ability of a model to correctly identify all tumor pixels and background pixels in BUS images, respectively; ACC reports the percentage of correctly classified pixels in BUS images; both DSC and tumor IoU are positively correlated and measure the spatial overlap between the predicted segmentation result and ground truth. However, DSC tends to measure the average-case performance and tumor IoU tends to measure the worst-case performance. These metrics are calculated as follows:
$$\mathrm{SEN} = \frac{TP}{TP + FN}, \quad \mathrm{SPE} = \frac{TN}{TN + FP}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP represents true positives (i.e., the number of true tumor pixels that are correctly predicted to be tumor pixels), FP represents false positives (i.e., the number of true background pixels that are wrongly predicted to be tumor pixels), FN represents false negatives (i.e., the number of true tumor pixels that are wrongly predicted to be background pixels), and TN represents true negatives (i.e., the number of true background pixels that are correctly predicted to be background pixels). Since only two kinds of pixels (tumor and background) are involved in evaluating the segmentation performance, we consider all the pixels in the predicted background and peritumoral regions as background pixels and all the pixels in the predicted tumor region as tumor pixels.
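As a worked illustration of the evaluation described above, the following sketch collapses a three-class prediction into tumor versus background and computes the five metrics from the standard confusion-count definitions. The class index used for tumor and the function names are hypothetical, and the sketch does not guard against empty denominators (e.g., an image with no tumor pixels).

```python
def confusion_counts(pred_labels, true_tumor, tumor_class=2):
    """Count TP, FP, FN, TN over flattened pixels.

    pred_labels: predicted class per pixel (peritumoral and background both count as background).
    true_tumor:  ground truth per pixel, truthy for tumor pixels.
    """
    tp = fp = fn = tn = 0
    for pred, truth in zip(pred_labels, true_tumor):
        pred_tumor = pred == tumor_class
        if pred_tumor and truth:
            tp += 1
        elif pred_tumor:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn, tn):
    """SEN, SPE, ACC, DSC, and tumor IoU from confusion counts."""
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "IoU": tp / (tp + fp + fn),
    }
```

For six pixels with predictions `[2, 2, 1, 0, 2, 0]` and ground truth `[1, 1, 1, 0, 0, 0]`, the counts are TP = 2, FP = 1, FN = 1, TN = 2, giving DSC = 2/3 and IoU = 1/2, which shows that IoU is never larger than DSC for the same prediction.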
We employ commonly used BUS classification metrics [19], [22], [26], [30], [31] including SEN, SPE, ACC, precision (PRE), F1-score (F1), and area under receiver operating characteristic curve (AUC) to quantitatively evaluate the classification performance. Higher values of these metrics represent better classification performance. Specifically, SEN, SPE, and ACC are computed in the same manner as the segmentation metrics of the same names. However, TP, TN, FP, and FN are defined differently when evaluating classification. TP and TN respectively represent the number of BUS images that are correctly predicted as benign images (i.e., a positive class) and malignant images (i.e., a negative class). FP and FN respectively represent the number of BUS images that are incorrectly predicted as benign and malignant images. F1-score is the same as DSC. AUC is a summary of the receiver operating characteristic (ROC) curve, which shows the performance of a model at all classification thresholds. A higher AUC value represents better classification performance. PRE computes the ratio of correctly predicted positive samples to the total predicted positive samples. It is computed as follows:
$$\mathrm{PRE} = \frac{TP}{TP + FP}$$
C. COMPETING METHODS
Table 1 briefly summarizes the task nature and enhanced features of the proposed RMTL-Net and 11 state-of-the-art (SOTA) methods. Specifically, we compare RMTL-Net with three recent single-task classification methods (i.e., VGG-16 [23], ResNet-101 [24], and DenseNet [39]), four recent single-task segmentation methods (i.e., FCN [40], PSPNet [41], Deeplab v3+ [42], and U-ResNet), and four recent MTL methods (i.e., MTL-Net, MTL-COSA [32], SHA-MTL [30], and Residual U-Net [31]). U-ResNet is a U-Net [16] with ResNet-101 as its backbone. MTL-Net passes features extracted by Conv5_x of U-ResNet into a GAP layer followed by a fully connected layer for classification.
Table 1 shows that some of these compared methods employ feature enhancement strategies such as attention mechanism and skip connections to improve segmentation and classification performance.

1) MULTI-TASK LEARNING
All compared multi-task learning (MTL) methods including MTL-Net, MTL-COSA [32], SHA-MTL [30], Residual U-Net [31], and the proposed RMTL-Net compute their total loss as the weighted sum of both segmentation and classification losses. In other words, they use the hyperparameter λ in (7) to balance segmentation and classification performance during MTL. In this section, we evaluate the segmentation and classification performance of RMTL-Net under different λ values. We anticipate observing similar trends for the other compared multi-task methods since MTL-Net, MTL-COSA, and RMTL-Net use U-ResNet and others use a similar network as their backbones. Fig. 6 compares the segmentation results of RMTL-Net under five λ values (i.e., 0.1, 0.3, 0.5, 0.7, and 0.9) on two datasets. We calculate all five segmentation metrics to evaluate the segmentation results on two datasets under five λ values. It is interesting to observe that SPE and ACC segmentation metrics yield similar values when using different λ values. Specifically, SPE ranges between 98.97% and 99.25% on dataset UDIAT and between 97.75% and 98.02% on dataset BUSI. Similarly, ACC ranges between 98.20% and 98.79% on dataset UDIAT and between 94.96% and 96.28% on dataset BUSI. As a result, we omit SPE and ACC results from Fig. 6 and show only the values of segmentation metrics SEN, DSC, and IoU, where the narrow bar near the top of each bar indicates the standard deviation and the values above two selected narrow bars present the largest and smallest metric values obtained under five λ values in five-fold experiments. Fig. 6 demonstrates that SEN, DSC, and IoU values increase on both datasets when λ increases, except for λ = 0.7 on dataset UDIAT. Fig. 7 compares the classification results of RMTL-Net under five λ values (i.e., 0.1, 0.3, 0.5, 0.7, and 0.9) on two datasets.
We calculate all six classification metrics to evaluate the classification results on two datasets under five λ values. We re-scale AUC to the range of [0, 100] to ensure all classification values are in the same range for easy display and better understanding. Similar to Fig. 6, we use a narrow bar to indicate the standard deviation for each metric and present the largest and smallest metric values obtained under five λ values in five-fold experiments. It is clear that the overall classification performance of RMTL-Net tends to increase on both datasets when λ increases, except for the SEN values on both datasets.
RMTL-Net uses predicted probability maps to guide the classification task to learn better feature representations and achieve better classification results. As a result, accurate segmentation may lead to a better classifier. Fig. 6 and Fig. 7 confirm that segmentation and classification accuracy tend to improve hand in hand when λ increases. Therefore, we set λ = 0.9 for RMTL-Net to ensure that more weight is given to the dominating task in the MTL framework. We also use the same setting for all MTL methods to ensure a fair comparison.
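The role of λ can be illustrated with a minimal sketch. Since (7) is not reproduced in this section, the sketch assumes that the total loss is a convex combination in which λ weights the segmentation loss and 1 − λ weights the classification loss; the function name `total_loss` and the loss values are hypothetical and serve only as an illustration.

```python
def total_loss(seg_loss: float, cls_loss: float, lam: float) -> float:
    """Weighted sum of two task losses (assumed form of Eq. (7)).

    Assumption: lam weights segmentation and (1 - lam) weights classification.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * seg_loss + (1.0 - lam) * cls_loss


# Hypothetical per-batch loss values, for illustration only.
seg_loss, cls_loss = 0.40, 0.70
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"lambda={lam:.1f}  total={total_loss(seg_loss, cls_loss, lam):.3f}")
```

Under this assumed form, λ = 0.9 leaves only one tenth of the total weight for the classification loss, so the gradient signal is dominated by segmentation.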

2) ABLATION STUDY OF RA MODULE
The regional attention (RA) module is a crucial component of RMTL-Net. It utilizes predicted probability maps to guide the classification network to learn multi-view features from tumor, peritumoral, and background regions in BUS images. To validate the effectiveness of the proposed RA module, we conduct a detailed ablation study using different combinations of regions. We list all variants of RMTL-Net below. For Variant 1, the feature map extracted by Conv5_x of the encoder is directly passed to a GAP layer followed by a fully connected layer for classification. For Variants 2, 3, and 4, the weighted regional feature maps C_P, C_T, and C_B are respectively passed to a GAP layer to obtain new feature vectors G_P, G_T, and G_B of size 1 × 2048, which are then respectively passed to a fully connected layer for classification. For Variants 5, 6, and 7, the multi-channel weighted regional feature maps C_T and C_P, C_P and C_B, and C_T and C_B are respectively passed to a GAP layer and concatenated to obtain a new feature vector F of size 2 × 2048. The corresponding F is then filtered by a 1 × 1 convolution to get the associated weighted feature vector F_w of size 1 × 2048. Lastly, F_w is passed to a fully connected layer for classification.
Tables 2 and 3 present the segmentation results of the eight systems in the ablation study in terms of SEN, SPE, DSC, ACC, and tumor IoU on datasets UDIAT and BUSI, respectively. Tables 4 and 5 present the classification results of the eight systems in the ablation study in terms of SEN, SPE, PRE, ACC, F1, and AUC on datasets UDIAT and BUSI, respectively. We observe the following from the results shown in these four tables: (1) Variant 1, which does not incorporate RA, achieves the worst overall segmentation performance when compared with the other seven variant systems. It achieves comparable overall classification performance as
For most BUS images, we observe that the background region is the largest and the peritumoral region is the smallest. As a result, we hypothesize that the larger the region, the more information it can provide for both segmentation and classification tasks. The experimental results shown in Tables 2, 3, 4, and 5 seem to support this hypothesis. First, each of the background, tumor, and peritumoral regions plays an important role in the segmentation task since Variants 2, 3, and 4 outperform Variant 1, which does not use the RA module, in all segmentation metrics. Second, the background region of C_5 provides the most valuable information for both segmentation and classification tasks since Variant 4 achieves the best performance among the variants involving one region in the RA module. The tumor region of C_5 provides the second most valuable information, followed by the peritumoral region.
Third, variants involving two regions in the RA module outperform variants involving one region since a larger combined region provides more information to facilitate the learning process. Fourth, the variant involving three regions in the RA module achieves the best performance. Fifth, the weighted feature vector F_w, which obtains valuable information from multiple regions, represents BUS images better than C_5 without the RA module.
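The fusion path used by the multi-region variants (regional weighting, GAP, concatenation, and a 1 × 1 convolution) can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the actual implementation: the weighted regional feature maps are assumed to be the element-wise product of the encoder feature map and a region probability map, the 1 × 1 convolution is modeled as a bias-free weighted sum over the region axis, and the helper names and the fusion weights are hypothetical.

```python
from typing import List

Map2D = List[List[float]]  # H x W
Map3D = List[Map2D]        # C x H x W


def weighted_gap(features: Map3D, prob: Map2D) -> List[float]:
    """Weight every channel by a region probability map, then apply GAP."""
    h, w = len(prob), len(prob[0])
    return [
        sum(ch[i][j] * prob[i][j] for i in range(h) for j in range(w)) / (h * w)
        for ch in features
    ]


def fuse_regions(region_vectors: List[List[float]], weights: List[float]) -> List[float]:
    """Bias-free 1x1 convolution over the region axis: R x C -> 1 x C."""
    n_channels = len(region_vectors[0])
    return [
        sum(w_r * vec[c] for w_r, vec in zip(weights, region_vectors))
        for c in range(n_channels)
    ]


# Toy input: 2 channels on a 2 x 2 grid (the paper uses 2048 channels).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
p_tumor = [[1.0, 0.0], [0.0, 0.0]]
p_peri = [[0.0, 1.0], [0.0, 0.0]]
p_back = [[0.0, 0.0], [1.0, 1.0]]

g_t = weighted_gap(features, p_tumor)
g_p = weighted_gap(features, p_peri)
g_b = weighted_gap(features, p_back)

# Hypothetical fusion weights; in training they would be learned.
f_w = fuse_regions([g_t, g_p, g_b], weights=[0.5, 0.2, 0.3])
print(f_w)
```

Passing only two of the three pooled vectors to `fuse_regions` mimics the two-region variants. When the three probability maps sum to one at every pixel and all fusion weights equal one, the fused vector reduces to plain GAP of the feature map, which corresponds to the variant without the RA module.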

3) COMPARISON WITH COMPETING METHODS
We implement all compared methods except for SHA-MTL and Residual U-Net and conduct experiments using the same parameters to ensure a fair comparison. The authors of SHA-MTL and Residual U-Net did not provide sufficient details on their methods and did not publish their code either. As a result, we directly use their reported segmentation and classification results on dataset BUSI in our comparison. We use the symbol "-" to represent a missing result since they did not report results for every metric. Neither method provided any results on dataset UDIAT, so they are not included when comparing segmentation and classification results on dataset UDIAT.
Table 6 summarizes the segmentation results of RMTL-Net and six methods in terms of five metrics on dataset UDIAT. Among the four single-task segmentation methods, Deeplabv3+ achieves the best overall segmentation performance with the highest values of SEN, DSC, ACC, and tumor IoU. PSPNet achieves the second-best overall segmentation performance, followed by UResNet and FCN. Among the three MTL methods, the proposed RMTL-Net achieves the best segmentation performance in all metrics except for SPE. It improves on the second-best method, MTL-COSA, by 2.54%, 1.62%, 0.02%, and 1.79% for SEN, DSC, ACC, and tumor IoU, respectively.
Table 7 summarizes the segmentation results of RMTL-Net and eight methods in terms of five metrics on dataset BUSI. The single-task segmentation methods exhibit similar performance trends on dataset BUSI as on dataset UDIAT. The three MTL methods MTL-Net, MTL-COSA, and RMTL-Net also exhibit similar performance trends on dataset BUSI as on dataset UDIAT. The proposed RMTL-Net achieves the best overall segmentation performance and improves on the second-best method, MTL-COSA, by 3.23%, 1.14%, 0.06%, and 1.28% for SEN, DSC, ACC, and tumor IoU, respectively.
The results of two MTL methods, Residual U-Net and SHA-MTL, seem to lack credibility: Residual U-Net did not report standard deviations over its five runs for any evaluation metric, and SHA-MTL reported different values for the two equivalent metrics DSC and F1 without giving any explanation. In addition, Residual U-Net seems to have an overfitting issue since its AUC values over five runs are 0.98, 1, 0.99, 0.97, and 1. As a result, we do not include these two methods in the comparison here and list their results in the tables only for completeness.
Table 8 summarizes the classification results of RMTL-Net and five methods in terms of six metrics on dataset UDIAT. Among the three single-task classification methods, ResNet achieves the best overall classification performance with the highest values of SEN, ACC, F1, and AUC. DenseNet achieves the second-best overall classification performance, followed by VGG-16. Among the three MTL methods, the proposed RMTL-Net achieves the best classification performance in all metrics. It improves on the second-best method, MTL-COSA, by 3.68%, 5.82%, 2.83%, 4.36%, 3.34%, and 1.02% for SEN, SPE, PRE, ACC, F1, and AUC, respectively.
Table 9 summarizes the classification results of RMTL-Net and seven methods in terms of six metrics on dataset BUSI. The single-task classification methods exhibit similar performance trends on dataset BUSI as on dataset UDIAT. The three MTL methods MTL-Net, MTL-COSA, and RMTL-Net also exhibit similar performance trends on dataset BUSI as on dataset UDIAT. The proposed RMTL-Net achieves the second-best overall classification performance, and MTL-COSA outperforms RMTL-Net by a small margin in all metrics. Due to the lack of credibility, Residual U-Net and SHA-MTL are not included here for comparison and are listed in the tables only for completeness.
Tables 6, 7, 8, and 9 demonstrate that RMTL-Net achieves the best overall segmentation and classification results on both datasets.
It incorporates the RA module to improve MTL-COSA by learning the importance of three predicted probability maps representing tumor, peritumoral, and background regions. MTL-COSA incorporates self-attention to improve MTL-Net by learning the importance of three regions constructed from the predicted binary segmentation mask. MTL-Net decreases the values of three segmentation metrics, namely SEN, DSC, and tumor IoU (i.e., it decreases the segmentation performance), when compared with the best single-task segmentation method UResNet. This decrease is caused by the reduced segmentation weight: part of the loss weight is reassigned to the classification task, so less weight is devoted to reducing segmentation errors during training. However, incorporating attention into MTL-Net addresses this issue and achieves comparable or better segmentation results than UResNet and comparable or better classification results than ResNet.
Table 10 lists the number of trainable parameters of all compared methods. It shows that MTL-Net increases the number of trainable parameters of UResNet by 0.004% by adding a light-weight classification task. This simple addition utilizes segmentation results to guide the classification task, which leads to segmentation results comparable to those of single-task segmentation methods and better classification results than single-task classification methods. Table 10 also shows that MTL-COSA and RMTL-Net add different numbers of trainable parameters to networks such as ResNet and UResNet through their attention modules, which learn important regions. RMTL-Net has a simpler attention mechanism than MTL-COSA and therefore has 16.8% fewer trainable parameters than MTL-COSA. It also outperforms MTL-COSA in segmentation on both datasets and in classification on dataset UDIAT.
Fig. 8 row contains a small tumor with a blurry boundary. This small tumor is located on the right side towards the middle row.
MTL-Net segments a completely wrong tumor region and obtains the lowest IoU value of 0.00%. FCN, Deeplabv3+, and MTL-COSA segment a partial tumor region and mistakenly segment another tumor-like region.
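The behavior of the five segmentation metrics in such a failure case can be checked with a small sketch. It assumes the standard pixel-wise definitions based on true/false positives and negatives, which are not restated in this section; the function name `segmentation_metrics` and the toy masks are hypothetical.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask, 1 = tumor pixel


def segmentation_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    """Pixel-wise SEN, SPE, DSC, ACC, and IoU (standard definitions assumed)."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "IoU": ratio(tp, tp + fp + fn),
    }


# Toy 4 x 4 example: the prediction misses the tumor entirely.
truth = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

In this toy case SEN, DSC, and IoU are all zero while ACC is 0.75 and SPE is about 0.86, because background pixels dominate the image. This is consistent with the earlier observation that SPE and ACC vary little across settings, whereas SEN, DSC, and IoU are more sensitive to the quality of the tumor mask. The DSC expression is also identical to the pixel-wise F1 score, which is why the two metrics are expected to agree.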

A. ADVANTAGES AND POTENTIAL USEFULNESS
In this paper, we propose a novel MTL framework with an RA module for BUS image segmentation and classification. The advantages and potential usefulness of RMTL-Net can be summarized as follows. First, RMTL-Net simultaneously performs segmentation and classification by utilizing predicted probability maps to guide the classification task to focus on regions of different importance. Single-task segmentation and classification methods have been well studied in the BUS research community. However, simultaneous segmentation and classification is more practical and appealing than separate segmentation and classification tasks, as it provides both the tumor boundary and the tumor category. As a result, MTL in BUS image segmentation and classification is a promising direction that is worthy of more exploration. Our study shows that adding a light-weight classification branch to an existing segmentation method, at least a U-Net-based one (e.g., UResNet), adds very few parameters but yields both good segmentation and classification results.
Second, RMTL-Net incorporates a three-region-based attention module (i.e., the RA module) to automatically assign appropriate weights to tumor, peritumoral, and background regions during the training procedure. The learned weights help to find regions of importance for better feature representations and therefore improve both the segmentation and classification performance of an MTL method. The RA module aligns well with doctors' clinical perspectives on the importance of tumor, peritumoral, and background regions. The proposed RA module can be easily applied to any existing MTL method to incorporate prior medical knowledge into the attention model and improve the performance of multiple tasks.

B. LIMITATION AND FUTURE WORK
The proposed RMTL-Net has some limitations. First, a preprocessing step is needed to generate pseudo ground truths of peritumoral and background regions, which are indispensable in the training procedure to help the network learn and produce three regions in any test image. Second, more comparisons between the proposed RA module and other traditional spatial or channel attention modules need to be conducted to prove the effectiveness of the RA module.
Due to the limited number of public BUS images, we do not have a separate testing set and use five-fold cross-validation to have every BUS image in the dataset validated and tested. As a result, more BUS images need to be collected to generate a large testing set to thoroughly test the generalization ability of RMTL-Net and other competing methods on new BUS images. These experiments are needed to prove the superiority of RMTL-Net without overfitting concerns.
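As a rough illustration of the five-fold protocol, the sketch below partitions image indices so that every image is held out exactly once. The exact splitting procedure used in our experiments is not detailed in this section, so the shuffling, the seed, the function name `k_fold_indices`, and the sample count are all hypothetical.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, held_out) index splits; each sample is held out once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i in range(k):
        held_out = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, held_out))
    return splits


# Hypothetical dataset of 10 images, for illustration only.
splits = k_fold_indices(n_samples=10, k=5)
for fold_id, (train, held_out) in enumerate(splits):
    print(f"fold {fold_id}: train={len(train)} held_out={held_out}")
```

Because the held-out folds together cover the whole dataset, the reported means and standard deviations summarize performance on every image, but no image is fully independent of model selection, which is why a separate large testing set remains desirable.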
In the future, we will test our RMTL-Net on larger nuclei segmentation and classification datasets and explore more strategies to improve its generalization ability. We will also compare the proposed RA module with more recent spatial and channel attention modules to not only validate its effectiveness but also find a new perspective to improve it.

V. CONCLUSION
In this study, we propose a regional-attentive multi-task learning framework (RMTL-Net) for simultaneous BUS image segmentation and classification. The proposed RMTL-Net adopts ResNet-101 as its backbone to extract features and utilizes a regional attention (RA) module to automatically learn weighted category-sensitive information from the tumor, peritumoral, and background regions in BUS images, which represents each BUS image more accurately and leads to better segmentation and classification performance. We conduct extensive five-fold cross-validation experiments on two public BUS datasets, UDIAT and BUSI. The experiments show that RMTL-Net outperforms recent state-of-the-art single-task segmentation methods, single-task classification methods, and most MTL methods on the two datasets.
Our proposed RMTL-Net sheds light on a new research direction toward multi-task learning (MTL) in general and simultaneous segmentation and classification for BUS images in particular. To this end, any existing segmentation network architecture can be converted to its counterpart MTL network architecture at a low cost by adding a classification branch, achieving comparable segmentation results and better classification results. Adding an RA module to incorporate prior medical knowledge regarding the importance of tumor, peritumoral, and background regions in BUS images can help to learn a better feature representation for better segmentation and classification results. The proposed RA module can be easily applied to any existing MTL method and easily modified based on different prior knowledge.