Multiscale Attention U-Net for Skin Lesion Segmentation

Skin cancer is the most common type of cancer in the world, and it is more treatable when diagnosed early. Diagnosis usually starts with dermatologists segmenting the skin lesion area and then planning a follow-up treatment. The segmentation step therefore plays a critical role in the treatment process. In recent years, machine learning methods, especially deep convolutional neural networks, have been proposed to address the segmentation challenge. Common segmentation methods (e.g., U-Net) deploy a series of encoding blocks to model the local representation, followed by a series of decoding blocks to capture the semantic relation. However, these structures are usually limited in modeling multi-scale objects with large variations in texture and shape. To address these limitations, we propose a Multi-Scale Attention U-Net (MSAU-Net) for skin lesion segmentation. In particular, we improve the typical U-Net by inserting an attention mechanism at the bottleneck of the network to model the hierarchical representation. The attention module aggregates the multi-level representation in a non-linear fashion to selectively adjust the representative features. It then deploys a Bidirectional Convolutional Long Short-Term Memory (BDC-LSTM) structure to fetch the common discriminative features and suppress the less informative ones. We incorporate the resulting features in each block of the decoding path to highlight the important regions. We evaluated the proposed network on three public skin lesion datasets: ISIC 2017, ISIC 2018, and PH2. The experimental results demonstrate that the proposed pipeline outperforms the existing alternatives.


I. INTRODUCTION
The skin is the largest organ in the body and plays important roles such as protecting the body from the outside environment, receiving sensory stimuli from the external environment, regulating body temperature through sweating, and raising hair when cold. When skin cells become disordered due to disease and grow out of control, they can turn into skin cancer and sometimes even spread to other parts of the body. Skin cancer is the most common type of cancer in the United States [1] and worldwide, and it threatens the lives of many people every year. Skin cancer can be divided into two groups: melanoma and non-melanoma types. Melanoma is the most dangerous type of skin cancer and is reported as the most lethal one [2]. This type of skin cancer is the result of the unusual growth of melanocytes [3]. Melanocytes are cells located in the lower part of the skin epidermis and are responsible for making melanin pigments. Any change in the number of melanocytes, or an increase or decrease in their activity, causes disorders. Although melanoma is not as common as other types of skin cancer, it is very dangerous because of its high rate of spread to other parts of the body, with a mortality rate of 1.62% [2]. According to the World Health Organization, approximately three million non-melanoma skin cancers and 132,000 melanoma skin cancers are recorded worldwide annually [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Aasia Khanum.
As with many cancers, early detection is the key to treating melanoma, since it is more treatable in the early stages of the disease. According to [5], the five-year relative survival rate for localized-stage melanoma is 98%, which drops to about 14% in the latest stage. Therefore, rapid detection of melanoma or suspected skin lesions is important and requires a method that can detect the disease as quickly as possible. In this regard, dermatologists use dermoscopic images to diagnose the disease. However, the examination of these images by dermatologists is not only associated with a significant error rate but is also very time-consuming, and in some cases not enough specialists are available. In recent years, machine vision methods have had many applications in the examination of pathology images [6]. Among these methods, automatic image segmentation is very useful and efficient for detecting disease [7]. In these methods, dermoscopic images are given to a deep learning model, and the regions that show a disease pattern are segmented in the output, so that dermatologists can later focus directly on the disease areas and apply appropriate treatment methods. Figure 1 shows some examples of dermatology images, as inputs, and the skin lesion segmentation results generated by a deep segmentation model. However, medical image segmentation, which separates the affected areas from the surrounding healthy tissue, is a challenging task due to factors such as low contrast in medical images, the presence of multiple similar tissues, varying lesion sizes, color shift, and non-uniform lighting between different laboratories.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Moreover, in skin lesion segmentation, other obstacles such as body hair, air bubbles, blood vessels, ebony frames, color illumination, and patient-specific properties that may change skin color make this task more complicated. Figure 2 shows some typical challenges in dermoscopic images [8].
Several methods have been proposed in the literature to address the semantic segmentation task in the medical domain. Among these approaches, deep-learning strategies have made significant advances in medicine, making them the best available methods for processing medical images. One of the first convolutional networks introduced for the image segmentation task is the fully convolutional network (FCN) [9]. This deep model is an end-to-end, pixels-to-pixels network that generates a semantic segmentation map from the input image. In FCNs, all fully connected layers are replaced with convolution and deconvolution layers to keep the original resolution. Ronneberger et al. [10] further extended the idea of the FCN into a U-shaped structure. This network architecture consists of symmetric encoding and decoding paths. The encoder reduces the dimensionality of the input data and extracts a large number of feature maps. The decoder, on the other hand, applies a hierarchical series of up-convolutional layers to model the semantic information and produce the segmentation maps.
Many extensions of U-Net have been proposed [11]-[20] to improve its performance. These methods have tried to strengthen the original U-Net using techniques such as recurrent residual strategies, applying probabilistic functions to resolve uncertainty [21], inserting attention mechanisms, or using other non-linear functions in the convolutional layers.
CNNs facilitate the learning of abstract data representations, which makes the network robust to local feature variations. However, in semantic segmentation, the abstraction of spatial information may be undesirable. To address this issue, several methods have been proposed. Chen et al. [22] utilized "Atrous spatial pyramid pooling" (ASPP) and introduced DeepLab. This method uses several parallel Atrous convolutions with different rates to capture contextual information at multiple scales [23]. Furthermore, the approach was improved by utilizing skip connections in the decoding path, similar to the U-Net approach. Although the pyramid representation improved the performance, it fails to capture the common representation shared among the hierarchy of the deep model (no attention mechanism is incorporated), which is needed to model robust and noise-invariant features.
In recent years, attention-based techniques have been introduced into deep models and have been widely used in various computer vision tasks [24]. Unlike conventional methods that use multiple similar feature maps, the attention strategy increases network performance, mostly in semantic segmentation tasks [24]-[27], by avoiding the use of similar feature maps and selecting the most informative features for a given task without additional supervision. In this paper, we propose a Multi-Scale Attention U-Net (MSAU-Net) for skin lesion segmentation. In particular, we improve U-Net by inserting an attention mechanism at the bottleneck of the network. The attention module aggregates the multi-level representation in a non-linear fashion to selectively adjust the representative features, and finally deploys a BDC-LSTM to fetch the common discriminative features and suppress the less informative ones. We incorporate the resulting features in each block of the decoding path.
We perform the attention mechanism in a two-step process. First, to recalibrate the feature map and pay more attention to the more informative channels in each layer, we apply a channel-wise attention process. In other words, by assigning different weights to different channels of the feature maps, the network focuses more on channels with more discriminative information. In the second step, to aggregate the features extracted by the different blocks of the encoder module, we apply the BDC-LSTM module. The objective of this module is to use the hierarchical representation to jointly encode both local and global representations into a unique transformed space, where the feature map can be used by the decoder path to effectively emphasize the informative regions. Furthermore, the hierarchical representation provided by the encoder module helps the BDC-LSTM layer learn objects at multiple scales and levels. Thus, the resulting features are less sensitive to variations in shape and texture. The main contributions of the paper are as follows:
• Multi-scale attention mechanism to capture hierarchical representation
• Including Bi-directional Convolutional LSTM module to capture discriminative features
• Significant improvement over the state-of-the-art methods
The rest of the paper is organized as follows. Section II reviews related work. The proposed network is presented in Section III. The experimental results are described in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
The semantic segmentation task plays one of the most important roles in dermoscopic image processing. Numerous automatic and semi-automatic methods for skin lesion segmentation have been proposed. Like other research lines in the computer vision field, skin lesion segmentation methods can be categorized into handcrafted and deep learning-based approaches. The earlier approaches focus on designing specific features to learn discriminative patterns from the image itself. Histogram thresholding methods [28]-[30] try to find a threshold that divides the image into two sections: skin lesions and adjacent tissues. Unsupervised color-based methods [31]-[33] use the color space properties of RGB dermoscopic images to determine homogeneous regions for skin lesion areas and other tissues and perform segmentation accordingly. Region-merging-based approaches [34]-[36] compare neighboring regions and merge them if they are close enough in some properties. Active contour methods [37]-[39] segment lesion areas by utilizing algorithms such as metaheuristic, genetic, and snake algorithms. Morphological operations-based methods [40], [41] rely on the relative ordering of pixel values for segmentation. However, these traditional image segmentation methods do not show satisfactory results and cannot overcome problems such as fuzzy lesion borders, hair artifacts, low contrast, and ebony frames.
In recent years, deep learning methods have returned to the field of artificial intelligence with more power and have achieved outstanding results in many machine learning tasks [42], particularly semantic segmentation tasks. These deep learning methods, especially CNNs, have become standard baselines in many semantic segmentation problems. Most of the CNN breakthroughs result from their capability of learning hierarchical as well as higher-level features that are more robust than raw image features.
The state-of-the-art CNN segmentation architectures include, but are not limited to, the Fully Convolutional Neural Network (FCN) [9], U-Net [10], SegNet [43], hourglass [44], and DeepLab [22]. Recently, many researchers have used CNN architectures for skin lesion semantic segmentation because of their high capability of learning from diverse datasets. Some of the state-of-the-art methods based on CNNs are reviewed in the following.
Xie et al. [45] proposed MB-DCNN for improving skin lesion segmentation performance by using a collaboration between segmentation and classification, in which each task facilitates the other in a bootstrapping way. This method mutually transfers coarse masks and location information between a coarse segmentation network (coarse-SN) and a mask-guided classification network (mask-CN). Maninis et al. [46] proposed the Deep Extreme Cut (DEXTR) model, which combines the original RGB image with extreme points (corner points on the contours) as the network input. Although this method requires extreme points as input, and their quality affects the segmentation performance, the authors showed that this combination can improve the performance of instance segmentation. Abhishek et al. [47] designed an algorithm that improves skin lesion semantic segmentation by utilizing illumination-invariant information of different tissues. They combined information from illumination-invariant grayscale images, specific color bands, and shading-attenuated images.
Based on the classical encoder-decoder architecture, Wu et al. [8] proposed a feature adaptive transformer network (FAT-Net) that effectively captures global context information and long-range dependencies by integrating an extra transformer branch. Their approach uses a feature adaptation module and a memory-efficient decoder to enhance the feature fusion between adjacent-level features. In this way, they activate the effective channels and restrain the irrelevant background noise. The Laplacian Pyramid Super-Resolution Network (LapSRN) proposed by Lai et al. [48] is capable of progressively reconstructing the sub-band residuals of high-resolution images for image super-resolution. It predicts the high-frequency residuals by taking coarse-resolution feature maps as input.
Azad et al. [20] proposed a two-stage attention mechanism for skin lesion segmentation. They set a weight for each channel, which is determined by a set of feature maps, to capture the relationship between the channels. Similar to the bi-directional strategies [14], this context gating mechanism is capable of putting more emphasis on the informative and meaningful channels. In addition, they use a second-level attention strategy to integrate the different layers of Atrous convolution, allowing the network to focus on a more goal-related field of view. Liu et al. [49] used auxiliary information based on an edge prediction technique for the skin lesion segmentation task. To make the network focus on the boundary region, they used a cross-connection layer module, which feeds the intermediate feature maps of each task into the sub-blocks of the other task. They also used a multi-scale feature aggregation module to increase network performance using features at different scales. Dai et al. [50] segmented a variety of skin lesions by taking advantage of multi-scale residual encoding and decoding fusion (MS RED) to fuse multi-scale features adaptively. Furthermore, they proposed a multi-resolution and multi-channel feature fusion module to enhance the capability of learning the feature representation. In the down-sampling stages, they used a new pooling module (Soft-pool), which retains more helpful information and enhances the segmentation performance. One central limitation of these multi-level fusion strategies is their weak aggregation, which cannot effectively combine features from different levels. To address this problem, we include an attention mechanism on top of the multi-level features to capture discriminative features.

III. PROPOSED METHOD
We propose MSAU-Net, an attention-incorporated U-Net model for skin lesion segmentation. An overview of the proposed network is shown in Figure 3. In our structure, we apply the encoder module to extract the hierarchical representation, and then perform the feature recalibration process in a non-linear fashion by utilizing the attention mechanism. Each part of the proposed method is detailed in the following subsections.

A. ENCODER
Our proposed method utilizes a U-Net structure to model the segmentation problem. The U-Net model follows a symmetric structure and applies encoder and decoder modules to learn the segmentation map [10]. Although the U-Net model is capable of capturing local information, its structure does not pay particular attention to the boundary area [51]; thus, it is less precise in separating skin lesions from the overlapping background. In other words, to accurately segment the skin lesion from the surrounding parts, both the local appearance and the entropy of the area should be learned through the training process. To model such a region-sensitive representation, we include an attention mechanism on top of the encoder blocks. The purpose of the attention layer is to model the multi-scale representation and highlight the importance of each activated feature map during the recognition process [52]. The resulting feature map from the attention module provides a rich and scale-dependent description, which is crucial for skin lesion segmentation, where lesion patterns appear at various scales.

B. FEATURE RECALIBRATION
In conventional CNNs, the resolution of the spatial features is significantly reduced due to the use of a set of consecutive max-pooling and down-sampling functions. In addition, images can contain objects with different scales [53]. To mitigate this problem, we propose to use the multi-scale representation resulting from each block of the encoder module. In our design, we concatenate the different feature maps resulting from the encoder blocks to form a multi-scale representation. To scale the different feature maps into the same shape, we use an Atrous convolution. To this end, on top of the last convolutional layer of each encoder block, we use the Atrous operation to up-sample the representation filters. To up-sample the filters, a convolutional filter with holes is applied to the full-resolution input, i.e., zeros are inserted between the filter values. In this operation, the number of parameters stays constant because only the non-zero filter values are considered in the calculations. The Atrous convolution provides a way to control the spatial resolution of the feature responses. In addition, when calculating the feature responses in each layer, we can enlarge the field of view of the filters, which incorporates larger context information. The Atrous convolution [22] for a one-dimensional signal is calculated as:
$$y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \tag{1}$$
where $x$ is the input feature map, $y$ is the output feature map, $i$ refers to a spatial location on $y$, and $w$ is a convolution filter. Moreover, $r$ refers to the Atrous rate and determines the stride with which we sample the input signal. By applying the Atrous convolution, we build a feature pyramid to form a multi-scale representation (shown in Figure 4). To normalize the feature pyramid, we utilize the squeeze and excitation module [54]. Using this strategy, the network uses the global information of the input data to selectively emphasize the informative features and suppress the less useful ones.
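The one-dimensional Atrous convolution described above can be illustrated with a minimal pure-Python sketch. The function name `atrous_conv1d`, the "valid" boundary handling (no padding), and the toy signal and kernel are assumptions made for illustration; they are not taken from the implementation used in the experiments.

```python
def atrous_conv1d(x, w, rate):
    """1-D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k].

    Only positions where every sampled index lies inside x are returned
    ("valid" mode, an assumption of this sketch; no padding is applied).
    """
    span = rate * (len(w) - 1)  # distance covered by the dilated filter
    out_len = len(x) - span
    if out_len <= 0:
        return []
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(out_len)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases below

dense = atrous_conv1d(signal, kernel, rate=1)    # field of view: 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # field of view: 5 samples
```

With the same three filter weights, raising the rate from 1 to 2 widens the field of view from three to five input samples, which is the property the feature pyramid relies on.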
To produce the weight of each input channel, the model exploits the global context information of the input features. Therefore, the global average pooling is calculated for each channel as:
$$z_f = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_f(i, j) \tag{2}$$
where $H \times W$ is the size of the channel, $x_f$ is the $f$-th channel, and $z_f$ is the output of the global average pooling. In the next step, we learn the nonlinear interaction and the non-mutually-exclusive relationship between channels. To capture the channel-wise dependencies, two fully connected layers are then utilized, and their output is calculated as in the squeeze and excitation module [54].
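A minimal sketch of this recalibration step is given below. It assumes the standard squeeze and excitation form of [54], i.e., a ReLU after the first fully connected layer and a sigmoid after the second, with bias terms omitted; the function names and the weight matrices `w1` and `w2` are hypothetical values chosen for illustration.

```python
import math


def global_average_pool(channel):
    """Squeeze step: mean over all H x W positions of one channel."""
    height, width = len(channel), len(channel[0])
    return sum(sum(row) for row in channel) / (height * width)


def dense(vector, weights):
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def squeeze_excite(feature_maps, w1, w2):
    """Reweight each channel by a gate in (0, 1) computed from global context."""
    z = [global_average_pool(ch) for ch in feature_maps]             # squeeze
    hidden = [max(0.0, h) for h in dense(z, w1)]                     # FC + ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2)]  # FC + sigmoid
    scaled = [
        [[gate * value for value in row] for row in ch]
        for gate, ch in zip(gates, feature_maps)
    ]
    return z, gates, scaled


# Two 2x2 channels; w1 reduces 2 -> 1 and w2 expands 1 -> 2 (hypothetical weights).
maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
z, gates, scaled = squeeze_excite(maps, w1, w2)
```

Because each gate is an independent sigmoid rather than a softmax over channels, several channels can be emphasized at once, which matches the non-mutually-exclusive relationship mentioned above.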

C. BI-DIRECTIONAL ConvLSTM
Standard LSTM uses full connections in the input-to-state and state-to-state transitions, which is its main disadvantage because these connections do not take the spatial correlation into account. ConvLSTM [55] has been proposed to address this problem by using convolution operations in the input-to-state and state-to-state transitions. A ConvLSTM consists of an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, and a memory cell $C_t$. The input, output, and forget gates act as controlling gates to access, update, and clear the memory cell. The ConvLSTM is formulated as follows (for simplicity, we have avoided writing subscripts and superscripts):
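A sketch of these updates, assuming the standard ConvLSTM formulation of [55] without peephole connections. The gate subscripts on the kernels ($W_{xi}$, $W_{hi}$, and so on), the per-gate biases $b_i$, $b_f$, $b_o$, $b_c$, and the sigmoid $\sigma$ are notation assumed here for illustration:

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \bullet C_{t-1} + i_t \bullet \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \bullet \tanh\left(C_t\right)
\end{aligned}
```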
where * denotes the convolution operator and • denotes the Hadamard product. $H_t$ is the hidden state tensor, $X_t$ is the input tensor, and $C_t$ is the memory cell tensor. $W_{x*}$ and $W_{h*}$ are 2D convolution kernels corresponding to the input and the hidden state, respectively. Finally, the bias terms are denoted by $b$.
In the proposed model, we utilize BConvLSTM [56] for encoding the recalibrated feature pyramid into a single multi-scale representation. A BConvLSTM consists of two ConvLSTMs, one processing the input data in the forward direction and the other in the backward direction. Unlike a standard ConvLSTM, which only processes dependencies in the forward direction, the BConvLSTM considers data dependencies in both directions when making a decision for the current input. Cui et al. [57] have shown that considering both forward and backward temporal perspectives boosts the network performance. Since the BConvLSTM consists of two standard ConvLSTMs, we have two sets of parameters for the backward and forward states. The output of the BConvLSTM is calculated as
$$Y_t = \tanh\left(W_y^{\overrightarrow{H}} * \overrightarrow{H}_t + W_y^{\overleftarrow{H}} * \overleftarrow{H}_t + b\right) \tag{5}$$
where $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$ denote the hidden state tensors of the forward and backward states, respectively, and $W_y^{\overrightarrow{H}}$ and $W_y^{\overleftarrow{H}}$ are the corresponding convolution kernels. $Y_t \in \mathbb{R}^{F_l \times W_l \times H_l}$ denotes the final output considering bidirectional spatio-temporal information, and $b$ is the bias term. We use the hyperbolic tangent (tanh) to combine the outputs of the forward and backward states in a non-linear way. The detailed structure of the proposed mechanism is shown in Figure 4.
Figure 4. Attention mechanism proposed in our method to learn hierarchical representation. This attention mechanism applies the squeeze and excitation module to calibrate the feature pyramid based on the informative channels and then uses a bi-directional convolutional LSTM to aggregate different levels of the pyramid into a single representation.
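The bidirectional combination can be illustrated with a toy sketch in which a scalar recurrent cell stands in for each ConvLSTM. The cell `run_cell`, all weights, and the per-level scalar features are hypothetical simplifications; only the structure (one forward pass, one backward pass, and a tanh combination of the two hidden states) mirrors the description above.

```python
import math


def run_cell(sequence, w_in=0.5, w_rec=0.5):
    """Toy recurrent cell (stand-in for a ConvLSTM): h_t = tanh(w_in*x_t + w_rec*h_{t-1})."""
    hidden, states = 0.0, []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states


def bidirectional_outputs(sequence, w_fwd=1.0, w_bwd=1.0, bias=0.0):
    """Combine both directions: y_t = tanh(w_fwd*hf_t + w_bwd*hb_t + bias)."""
    forward = run_cell(sequence)
    # Process the reversed sequence, then flip back so index t lines up.
    backward = run_cell(sequence[::-1])[::-1]
    return [
        math.tanh(w_fwd * hf + w_bwd * hb + bias)
        for hf, hb in zip(forward, backward)
    ]


levels = [0.2, -0.4, 0.9]  # hypothetical per-level scalar features
outputs = bidirectional_outputs(levels)
symmetric = bidirectional_outputs([1.0, 0.0, 1.0])
```

Each output position sees the levels before it through the forward pass and the levels after it through the backward pass, so a palindromic input yields a palindromic output.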

D. DECODER
In our proposed model, the decoder is implemented according to the regular U-Net. The features up-sampled from the previous decoder layer are concatenated with the features imported directly from the encoder, along with the multi-scale representation derived from the attention module. In each block of the decoding path, we use two convolutional layers followed by batch normalization and an activation layer to learn the semantic representation. Finally, at the last decoding block, we deploy a softmax activation to produce the segmentation map.

IV. EXPERIMENTAL RESULT
In this section, we provide (i) details about the training process, (ii) the evaluation metrics we used to evaluate our approach, and (iii) a description of each dataset we used during our experimental evaluation.

A. TRAINING PROCESS
The proposed method is implemented with the PyTorch library and was trained on an NVIDIA RTX 3090 GPU with a batch size of 8 and without any data augmentation. We trained all the models with an initial learning rate of 1e-3 and a decay rate of 1e-4 for 100 epochs. For model weight initialization, we used a standard normal distribution, which provides a stable starting point for the network. Furthermore, if the validation performance does not change for 10 consecutive epochs during training, we stop the training process. The baseline network in our experiments has the same structure as a U-Net model without the proposed attention mechanism. It is worth mentioning that during training on each dataset, the optimization algorithm steadily decreased the loss value on both the training and validation sets and eventually converged. Thus, we did not observe any instability during the training process.
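The stopping rule can be sketched as follows. The hyperparameter values are those stated above, while the dictionary `TRAINING_CONFIG`, the function `stop_epoch`, and the reading of "does not change" as "does not improve" are assumptions of this sketch; the list of validation scores stands in for real training and validation steps.

```python
TRAINING_CONFIG = {
    "batch_size": 8,
    "learning_rate": 1e-3,
    "decay_rate": 1e-4,
    "max_epochs": 100,
    "patience": 10,  # epochs without improvement before stopping
}


def stop_epoch(val_scores, max_epochs, patience):
    """Return (last epoch run, best score); epochs are 1-based.

    val_scores: per-epoch validation scores, higher is better.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience or epoch >= max_epochs:
            break
    return epoch, best_score


# Validation score plateaus after epoch 3, so training stops 10 epochs later.
plateau = [0.5, 0.6, 0.7] + [0.7] * 20
stopped_at, best = stop_epoch(
    plateau, TRAINING_CONFIG["max_epochs"], TRAINING_CONFIG["patience"]
)
```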

B. EVALUATION METRICS
To evaluate the performance of our method experimentally, we employed several well-known metrics: accuracy (AC), sensitivity (SE), specificity (SP), F1-Score, and Jaccard similarity (JS). The terms used to describe how the metrics are calculated are given below. True-Positive (TP) refers to a pixel that is correctly predicted as the lesion class. False-Positive (FP) refers to a pixel that is falsely predicted as the lesion class. True-Negative (TN) refers to a pixel that is correctly labelled as background. False-Negative (FN) refers to a pixel that is falsely labelled as background. Accuracy is the percentage of correct predictions. Specificity measures the proportion of actual background pixels that are correctly identified by the model. Sensitivity measures the proportion of actual lesion pixels that are correctly identified by the model. The F1 score, also known as the balanced F-score or F-measure, is the harmonic mean of precision and recall. The Jaccard similarity, also known as the mean intersection over union (mIoU) in segmentation tasks, measures the similarity between the predicted values $\hat{y}$ and the real values $y$ by comparing the members of the two sets to see which are shared and which are distinct:
$$\text{Jaccard similarity} = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|} \tag{10}$$
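These metrics can be computed from the four pixel counts as in the following sketch. The function name `segmentation_metrics` and the example counts are hypothetical; the formulas are the standard confusion-matrix definitions, and the sketch assumes that no denominator is zero.

```python
def segmentation_metrics(tp, fp, tn, fn):
    """Pixel-wise metrics from confusion-matrix counts (denominators assumed non-zero)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall: share of lesion pixels that were found
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),  # share of background pixels kept as background
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "jaccard": tp / (tp + fp + fn),  # |y ∩ ŷ| / (|y| + |ŷ| − |y ∩ ŷ|)
    }


# Hypothetical counts for a 200-pixel image.
scores = segmentation_metrics(tp=80, fp=10, tn=100, fn=10)
```

For binary masks, the F1 score and the Jaccard similarity are monotonically related through F1 = 2·JS / (1 + JS), so they rank methods identically on a single image.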

C. DATASETS
The proposed method was evaluated on three publicly available datasets: ISIC 2017 [58], ISIC 2018 [59], and PH2 [60]. The following subsections provide more details about each dataset.

1) ISIC 2017 DATASET
The International Skin Imaging Collaboration (ISIC) 2017 dataset is one of the most well-known datasets in skin cancer diagnosis. This dataset consists of 2,000 dermoscopic images of the skin, acquired with a technique that eliminates the surface reflection of the skin and thus gives a deeper level of skin visualization [58]. For each instance, an expert clinician has annotated the ground-truth label using either a semi-automated or a manual process. The annotation data provides further information for three subtasks: lesion segmentation, localization, and skin disease classification. In this work, we focus on the segmentation task. Following the literature [13], we divided the original dataset into a training set with 1250 samples, a validation set with 150 samples, and a test set with 600 samples. Furthermore, we resized the input images to 256×256 pixels.

2) ISIC 2018 DATASET
The ISIC 2018 dataset, like the former ISIC datasets, includes a large collection of quality-controlled dermoscopic images of skin lesions, introduced by an international collaboration to improve melanoma diagnosis [59]. This dataset contains 2594 images, each of which is accompanied by a corresponding ground-truth mask. Similar to ISIC 2017, this dataset also defines three sub-tasks: lesion segmentation, lesion attribute detection, and disease classification. We divided the dataset into three subsets: training data with 1815 images, validation data with 259 images, and test data with 520 images. Furthermore, to reduce the computational and network training cost, we resized the input images from 2016×3024 pixels to 256×256 pixels.
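A split with these sizes can be produced as in the sketch below. The subset sizes are those stated above; the helper `split_indices`, the random shuffling, and the fixed seed are assumptions made for illustration, since the exact assignment of images to subsets is not specified here.

```python
import random


def split_indices(n_items, n_train, n_val, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists."""
    if n_train + n_val > n_items:
        raise ValueError("train and validation sizes exceed the dataset size")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# ISIC 2018 sizes: 2594 images -> 1815 / 259 / 520.
train, val, test = split_indices(2594, n_train=1815, n_val=259)
```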

3) PH2
The PH2 dataset consists of 200 dermoscopic images of skin lesion regions, acquired at the dermatology service of Pedro Hispano Hospital, Matosinhos, Portugal. The main objective of this dataset is to enable future research on the classification and segmentation of cancerous regions in dermoscopic images. Similar to [13], we randomly divided the dataset into two sets of 100 instances, one of which is used as training data and the other for evaluation.

D. RESULTS
The quantitative results of the proposed method on ISIC 2017 are reported in Table 1. The results show that the proposed method outperforms the state-of-the-art (SOTA) methods in almost all metrics. Compared to the recent MCGU-Net model, which utilizes an attention mechanism inside the network, our strategy produces a better segmentation map, which further supports the effectiveness of our method. Figure 5 shows some visualization results of the proposed method on ISIC 2017.
We further evaluated our method on ISIC 2018 to compare the results with SOTA approaches. As is clear from Table 2, our method marginally increases the performance compared to the counterpart approaches. On the other hand, incorporating the proposed attention mechanism increases the U-Net F1 score by 0.25, as shown in Table 2. To further demonstrate the effectiveness of the proposed method from a qualitative perspective, we provide Figure 6.
In the third experiment, we evaluated our approach on the PH2 dataset. The results, compared to the SOTA strategies, are shown in Table 3. We can observe that our method significantly improves the performance (DSC 0.937) compared to both the baseline (DSC 0.867) and the recent MCGU-Net [13] approach (DSC 0.926). Figure 7 shows some segmentation results of the proposed method. As can be seen from the visual results, our network produces a smooth segmentation output in the boundary area, which is remarkably useful from a clinical perspective.
To further compare the qualitative results of the proposed method with the SOTA approaches, Figure 8 visualizes a sample of the segmentation results achieved by applying different methods to the ISIC 2018 dataset. The proposed method clearly pays more attention to the boundary area than the U-Net model and outperforms it. Additionally, compared to the BCDU-Net method, our network produces a smooth segmentation boundary without extra noisy areas.

E. ABLATION STUDY
This section provides an ablation study on the effect of the proposed modules. To analyze the contribution of each module individually, we experimented with different settings. In these settings, we designed possible combinations of the proposed modules to provide a clear picture of how they can be effectively incorporated to increase the generalization performance of the model on a skin lesion segmentation task. We further included the one-directional version of the ConvLSTM to show the capability of the bidirectional form to encode a stronger representation and consequently boost the model performance. Our findings indicate that each module contributes to the model performance and that together they provide a strong feature representation for the network. Table 4 shows the obtained results.
The conducted experiments (Table 4) show that adding the ConvLSTM module on top of the hierarchical features provided by the seminal U-Net (baseline) model helps the model learn a rich and generic multi-scale representation and increases the performance considerably. In addition, making the ConvLSTM bi-directional further enhances the generalization performance. This finding is in line with previous work [14], which included the bi-directional ConvLSTM in the skip connections of the U-Net model and obtained a significant improvement. Besides the ConvLSTM module, we can observe that incorporating the SE block inside the proposed pipeline also increases the model performance. Finally, the combination of these modules with the U-Net model provides a strong feature learning strategy for the medical image segmentation task, which is novel and unique in its design. It is also worth mentioning that processing each batch of eight samples in our pipeline takes only four seconds, which demonstrates the suitability of the suggested network for real-time and commercial applications.

F. DISCUSSION
The proposed method has been evaluated using both quantitative and qualitative studies to demonstrate its capability of learning a rich and generic representation for skin lesion segmentation tasks. The contribution of each proposed component was also evaluated to ensure the effectiveness of the suggested design. Although our pipeline uses a U-Net-based model, the proposed strategy places no restriction on the choice of the segmentation network and can be incorporated into any segmentation network, which further supports our contribution in terms of the generalizability and scalability of the network design. Moreover, Figure 9 provides sample results where the model fails to segment the skin lesion area. The model performance is largely affected by the accuracy of the image annotations. Therefore, noisy annotation, which is common in the clinical domain, degrades the training performance. Our model cannot detect inaccurate annotations and considers all images, even if they are inaccurately annotated. As discussed in the ablation study, the proposed method contains several components that gradually increase the overall performance of the model. One drawback of these modules is their need for computational resources. Hence, although these modules increase the generalization performance of the model, they also increase the number of parameters and consequently require more computational power. There is thus a trade-off between the performance and the complexity of the model.

V. CONCLUSION
In this paper, we proposed a multi-scale attention mechanism to learn a hierarchical representation. Our attention module receives multi-level feature maps from the encoding path and applies a channel-wise normalization method to recalibrate the feature vectors based on their contribution to object recognition; it then utilizes a bi-directional ConvLSTM to learn a hierarchical non-linear representation. By including the resulting features in each block of the decoding path, we incorporate scale-invariant features inside the network to further boost the performance. The experimental results described throughout the paper demonstrate the effectiveness of our proposal. One possible direction for future work is to model the underlying uncertainty in the skin lesion annotation task. More specifically, by precisely modelling the weak annotation of skin lesions during the training process, the model could further increase its performance.