CU-Net: A New Improved Multi-Input Color U-Net Model for Skin Lesion Semantic Segmentation

Melanoma is considered one of the most dangerous skin cancers threatening human health and life. Early diagnosis of melanoma is a major challenge, especially given the color variations across similar lesion types. Automatic skin lesion segmentation is an essential step in building a successful skin disease classification system. Recent deep learning architectures have significantly improved skin lesion segmentation results. In particular, the U-Net deep convolutional neural network (CNN) is considered one of the state-of-the-art models with promising performance. Most deep CNNs, and the U-Net model in particular, use a single RGB input image for skin lesion semantic segmentation. However, the RGB color space is not usually the best choice for representing the invariant characteristics of skin lesion chromatic information. The choice of color space significantly affects segmentation performance. In this paper, three novel variants of the U-Net model with single, dual, and triple inputs, namely Single Input Color U-Net (SICU-Net), Dual Input Color U-Net (DICU-Net), and Triple Input Color U-Net (TICU-Net), are proposed. SICU-Net, DICU-Net, and TICU-Net contain one, two, and three encoder sub-networks, respectively, connected to a single decoder path. Each encoder sub-network is fed with a different color space of the input image. A channel-wise attention module fuses the learned feature maps from each encoder sub-network, and the fused features are fed to the decoder sub-network to generate the segmentation map. Moreover, a composite loss function is designed to improve the performance of the proposed CU-Net models. Three public benchmark datasets, namely International Skin Imaging Collaboration (ISIC 2017, ISIC 2018) and PH2, are used to evaluate the proposed models. Experimental results reveal that the proposed models significantly improve on the original U-Net model and achieve performance comparable to other state-of-the-art methods.


I. INTRODUCTION
Skin cancer occurs when skin cells grow abnormally. It commonly appears in areas of the body that are most exposed to ultraviolet (UV) radiation [1]. Skin cancer is divided into three major types, namely squamous cell carcinoma, basal cell carcinoma, and melanoma [2]. Melanoma is the most dangerous type of skin cancer since it appears and grows in the melanocyte cells that produce melanin [3]-[5]. The main cause of melanoma is not yet known, but it is scientifically proven that direct exposure to ultraviolet rays from sunlight or tanning lamps and beds increases the risk of melanoma. Early detection and diagnosis are necessary to improve survival [6]. Skin cancer diagnosis, like that of many diseases, is vulnerable to human error and may be costly. Thus, recent studies [7]-[10] tend to rely on Computer Aided Diagnosis (CAD) systems based on dermoscopic images to reduce the probability of errors. Skin lesion image analysis is an essential stage in early skin cancer diagnosis. As a crucial step, skin lesion image segmentation helps to distinguish lesions from the background by labeling each pixel in the image as healthy or infected. Skin lesion segmentation is a challenging problem due to the low contrast of skin lesion images, irregular borders, and the presence of extraneous elements such as hair, pen markers, and oil drops. Other factors such as the irregular color, size, scale, and precision of the lesion area add further challenges to the segmentation problem. Figure 1 shows some examples of skin lesion dermoscopy images with such variations. All of these factors pose major obstacles to the success of many skin lesion segmentation methods. The growing number of collected skin lesion images makes it imperative to use computers to perform automatic segmentation easily and effectively, since manually labeling such a large number of images by an expert is tedious and very costly.
A few research works have studied color contrast variations in skin lesion image segmentation [11]-[16]. Most of these works use color space conversion-based methods to represent the infected parts efficiently and discriminate them from background pixels. These methods convert the RGB input image to another color space such as YCbCr, Lab, or HSV. A significant number of studies applied specialized pre-processing operations to the input image [11], [14], [17]-[19], while other works employed post-processing [11], [12], [17], [20], [21] to enhance the segmentation results of the proposed architectures. On the other hand, recent works using deep learning methods [22]-[28] rely on extending the architecture of existing deep convolutional neural networks to improve skin lesion segmentation results.
One of the main issues in skin lesion segmentation is the color contrast variation among similar lesion types. Whereas previous works handled this issue by exploiting various color spaces with a single-input deep CNN, this paper investigates the effect of combining multiple color spaces of the input image using a multi-input deep CNN. A new modified Color U-Net (CU-Net) model with multiple encoders and one decoder sub-network is developed to take advantage of combining features extracted from various color space representations of the input image. The U-Net model [29] is used as the backbone architecture to build three variants of the CU-Net model. The proposed models connect multiple encoders to a single decoder sub-network in order to capture different color features of the skin lesions and overcome the color variation problem in skin lesion image segmentation. Three models are proposed based on the U-Net architecture using single, dual, and triple inputs, namely Single Input Color U-Net (SICU-Net), Dual Input Color U-Net (DICU-Net), and Triple Input Color U-Net (TICU-Net). The proposed DICU-Net and TICU-Net models comprise two and three separate encoder sub-networks, respectively, based on the traditional U-Net structure; each encoder path accepts a different color space of the input image, and the encoder paths are interconnected to learn richer features. The learned feature maps from each encoder sub-network are fused through a Channel-wise Attention Network (CAN) and fed into the decoder sub-network to learn the contribution of each color space to the segmentation results. Various combinations of color spaces are examined to find the combination that achieves the best performance.
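The fusion step described above, in which feature maps from several encoders are combined through channel-wise attention, can be illustrated with a minimal NumPy sketch. The structure of the CAN is not specified in this section, so the squeeze-and-excitation style gating used here (global average pooling, a small bottleneck, a sigmoid gate) is an assumption, as are the function name `channel_attention_fuse`, the reduction ratio, and the random weights that stand in for learned parameters.

```python
import numpy as np

def channel_attention_fuse(feature_maps, w1, w2):
    """Fuse encoder feature maps with an assumed SE-style channel gate.

    feature_maps: list of arrays shaped (C, H, W), one per encoder.
    w1: (C_total // r, C_total) and w2: (C_total, C_total // r) gate weights.
    """
    x = np.concatenate(feature_maps, axis=0)      # stack encoders: (C_total, H, W)
    squeeze = x.mean(axis=(1, 2))                 # global average pool: (C_total,)
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid weight per channel
    return x * gate[:, None, None], gate          # re-weight every channel

# Hypothetical feature maps from two encoders (e.g. one per color space).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((64, 32, 32))
feats_b = rng.standard_normal((64, 32, 32))

c_total, reduction = 128, 16                      # assumed reduction ratio
w1 = rng.standard_normal((c_total // reduction, c_total)) * 0.1
w2 = rng.standard_normal((c_total, c_total // reduction)) * 0.1

fused, gate = channel_attention_fuse([feats_a, feats_b], w1, w2)
print(fused.shape)
```

In a trained network the gate weights would be learned, so that channels from the more informative color space receive larger weights before the fused maps reach the decoder.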
Due to the significant effect of the loss function on the segmentation results, a new hybrid binary-weighted loss function is proposed that combines three different loss functions, namely cross-entropy, generalized Dice, and sensitivity-specificity. We consider various performance metrics, namely accuracy, sensitivity, specificity, Jaccard index, and Dice coefficient, to evaluate the segmentation results. The contributions of the paper can be summarized as follows:
• Exploring the effect of changing the input image color space on the segmentation performance of the proposed color U-Net model.
• Proposing two multi-input networks based on the color U-Net architecture. These networks are composed of multiple encoder sub-networks and a single decoder to combine various color features of the input image.
• Using a channel-wise attention module to interconnect the encoder and decoder paths of the proposed models.
• Designing a new hybrid loss function as a combination of cross-entropy, generalized Dice, and sensitivity-specificity losses to improve the performance of the proposed models.
• Conducting experiments on three standard skin lesion databases to validate the proposed models and comparing the results with other state-of-the-art methods.
The rest of the paper is organized as follows. Related works are reviewed in Section II. Our proposed models are detailed in Section III. The experimental results and their analysis are presented in Section IV. A discussion is presented in Section V. Finally, the conclusion is given in Section VI.
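For reference, the five evaluation metrics named in this section are all functions of the pixel-wise confusion counts between a predicted mask and its ground truth. The following is a minimal sketch using the standard definitions; the function name `segmentation_metrics` and the toy masks are illustrative only, and the sketch assumes both masks contain lesion and background pixels so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for binary masks (1 = lesion, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)      # lesion pixels correctly detected
    tn = np.sum(~pred & ~truth)    # background pixels correctly rejected
    fp = np.sum(pred & ~truth)     # background predicted as lesion
    fn = np.sum(~pred & truth)     # lesion pixels that were missed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "jaccard": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }

# Toy 2x4 masks: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
truth = np.array([[1, 1, 1, 0], [0, 0, 0, 0]])
metrics = segmentation_metrics(pred, truth)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the Jaccard index J and the Dice coefficient D are monotonically related by D = 2J / (1 + J), so they rank methods identically on a single image even though their values differ.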

II. RELATED WORK
Several methods have been introduced over the last decade to solve the automatic skin lesion segmentation problem. In this section, we briefly review the latest developments in skin lesion segmentation that are closely related to our work. Recent skin lesion segmentation methods have moved from shallow hand-crafted feature extraction algorithms to deep feature learning using Convolutional Neural Network (CNN) architectures. These methods can be classified into three categories: traditional, deep learning, and color-based skin lesion segmentation methods.

A. TRADITIONAL SKIN LESION SEGMENTATION METHODS
An efficient skin lesion image segmentation technique helps to enhance the early detection and diagnosis of melanoma. Several approaches have been adopted according to the nature of the segmented lesion. Region growing approaches [30], [31] and thresholding approaches [32]-[34] are used to find the fine border in lesion images, while clustering approaches [35], [36] are used to address the problem of lesions with unclear borders. Other approaches [37], [38] employ histogram-based cluster estimation to distinguish between healthy and infected parts of the images. Active contour-based methods [39], [40] focus on separating the required pixels from the image for further processing and analysis by using energy forces with some constraints. In addition, edge-based methods [41] were suggested to identify the region of interest. All of the segmentation techniques mentioned above are considered primitive, as they depend only on low-level pixel-wise features. Therefore, they could not achieve the desired performance in comparison with deep learning methods.

B. DEEP LEARNING SKIN LESION SEGMENTATION METHODS
Recently, Deep Convolutional Neural Networks (DCNNs) have played a prominent role in developing new medical image segmentation methods [10], [17]-[19], [21], [42]-[47] owing to their highly accurate segmentation results. In 2011, Yuan et al. [17] utilized a 19-layer deep convolutional neural network that is trained end-to-end. Their fully automatic skin lesion segmentation method does not require any prior knowledge of the data. Furthermore, they proposed a novel loss function based on Jaccard distance that dispenses with sample re-weighting. Their technique eliminates the need for data re-balancing when the numbers of foreground and background pixels are unbalanced.
In 2018, Poap et al. [46] presented a smart home system that uses built-in sensors and artificial intelligence methods to diagnose the skin health condition of the residents of the house. They compared the results of their proposed method with those of some similar methods. In 2019, Hashemi et al. [45] used an asymmetric similarity loss function to train a fully convolutional deep neural network in order to overcome the data imbalance issue, and achieved a much better trade-off between precision and recall. Specifically, they developed a 3D fully convolutional densely connected network (FC-DenseNet) with an asymmetric similarity loss layer based on the Tversky index. They also used large overlapping image patches as inputs for intrinsic and extrinsic data augmentation, a patch selection algorithm, and a patch prediction fusion strategy using B-spline weighted soft voting to account for the uncertainty of prediction at patch borders.
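To make the idea of an asymmetric similarity loss concrete, the sketch below implements the general Tversky index, which weights false positives and false negatives separately. It is not the exact formulation of [45]; the parameter names `alpha` and `beta`, their assignment to false positives and false negatives, and the smoothing constant `eps` are assumptions chosen for illustration.

```python
import numpy as np

def tversky_index(pred, truth, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index between soft predictions in [0, 1] and a binary mask.

    alpha weights false positives and beta weights false negatives.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    tp = np.sum(pred * truth)
    fp = np.sum(pred * (1.0 - truth))
    fn = np.sum((1.0 - pred) * truth)
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, truth, alpha=0.5, beta=0.5):
    """Loss to minimise: one minus the Tversky index."""
    return 1.0 - tversky_index(pred, truth, alpha, beta)

pred = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
truth = np.array([[1, 1, 1, 0], [0, 0, 0, 0]])

dice_like = tversky_index(pred, truth, alpha=0.5, beta=0.5)     # equals Dice
jaccard_like = tversky_index(pred, truth, alpha=1.0, beta=1.0)  # equals Jaccard
recall_biased = tversky_loss(pred, truth, alpha=0.3, beta=0.7)  # penalises misses more
print(dice_like, jaccard_like, recall_biased)
```

Setting `beta` larger than `alpha` penalises missed lesion pixels more heavily than false alarms, which is one way such a loss can shift the balance between precision and recall on imbalanced masks.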
An automatic semantic segmentation network for skin lesion segmentation named Dermoscopic Skin Network (DSNet), together with a new loss function that combines binary cross-entropy and intersection over union, was presented by Hasan et al. [10] in 2020. They succeeded in reducing the number of parameters and making the network lightweight. They used depth-wise separable convolution instead of standard convolution to bring out discriminative features in the pixel space at different stages of the encoder. Their proposed loss function played an important role in semantic segmentation and achieved higher true positive rates in the conducted experiments. Also in 2020, a novel segmentation methodology was proposed by Al-masni et al. [18]. They used Full resolution Convolutional Networks (FrCN), which do not require any pre- or post-processing operations but learn the full-resolution features of each individual pixel of the input data to improve pixel-wise segmentation performance. They achieved high evaluation metric values on the tested datasets. Xie et al. [8] devised a high-resolution feature block with three branches. The first is the main branch, which accepts high-resolution feature maps to extract spatial details around boundaries. The second and third branches are the spatial attention and channel-wise attention branches, which are used to enhance the discriminative features in the main branch along the spatial and channel-wise dimensions. Robust features with detailed spatial information were extracted, and accurate skin lesion boundaries were obtained by fusing the branch outputs.
In 2021, Khan et al. [43] proposed a fully automated computer-aided diagnosis system based on a deep learning framework. In their scheme, they pre-processed the original dermoscopic images using the decorrelation formulation technique. The resultant images are then passed to the MASK-RCNN, which is trained using the segmented RGB images generated from the ground truth images of the datasets used. Next, the resultant segmented images are passed to the DenseNet deep model for feature extraction. They combined the outputs of the average pooling and fully connected layers for feature extraction, and the resultant vector is forwarded to the feature selection block for down-sampling using the proposed entropy-controlled least square SVM. Their proposed model had the lowest computational time compared with other works. Kadry et al. [44] extracted the skin melanoma section by applying the VGG-SegNet scheme to the digital dermoscopy image. Then, they performed a relative assessment between the segmented skin melanoma and the ground truth, and computed the essential performance indices. Their proposed technique is significant for evaluating the clinical grade of digital dermoscopy images. Khan et al. [47] extended their previous works and proposed a fully automated approach for multi-class skin lesion segmentation and classification that uses the most discriminant deep features. They used a local color-controlled histogram to enhance the image intensity values. Then, they used a novel Deep Saliency Segmentation method, a custom ten-layer convolutional neural network (CNN), to estimate saliency. By applying a thresholding function, they obtained a binary image from the generated heat map. They used the segmented color lesion images for feature extraction via a deep pre-trained CNN model. They implemented an Improved Moth Flame Optimization (IMFO) algorithm to select the most discriminant features and avoid the curse of dimensionality; the resultant features are fused using Multi-set Maximum Correlation Analysis (MMCA) and classified using the Kernel Extreme Learning Machine (KELM) classifier. Their approach showed an improvement in accuracy but with high computational time.
Currently, several studies in the skin lesion segmentation field employ a Fully Convolutional Network model called U-Net [9], [48]-[53]. In 2019, a psoriasis lesion segmentation network (PsLSNet) was introduced by Dash et al. [51] as an automated method based on a modified U-Net architecture. The architecture is a 29-layer deep fully convolutional network that extracts spatial information automatically. Their structure accelerated training by reducing the covariate shift through batch normalization, and could segment the lesion even in challenging cases. Qamar et al. [48] combined DenseNet [54] and ResNet [55] and presented an encoder-decoder CNN for skin lesion segmentation based on the U-Net architecture, with the aim of improving skin lesion segmentation performance. In the encoder path, they used atrous spatial pyramid pooling (ASPP) to generate multi-scale features from different dilation rates, and used dense skip connections to combine the encoder and decoder feature maps. They proposed a deep learning method to model lesion patterns to perform melanoma detection and lesion segmentation. They were able to exploit multi-scale contextual information, retrieve accurate information, and improve segmentation performance. Pham et al. [49] combined multiple hypotheses into a single decision point. For melanoma detection and seborrheic keratosis classification, they trained Inception-v4 [56], ResNet-152, and DenseNet-161 [54], while for lesion segmentation, U-Net and U-Net with a VGG-16 encoder [57] were trained to produce segmentation masks. Their model ranked 5th in classification and 8th in segmentation among 23 and 21 international teams, respectively. Also, Tang et al. [50] exploited the advantages of the U-Net architecture and the separable convolutional block to propose a skin lesion segmentation method based on separable U-Net with stochastic weight averaging to obtain richer semantic feature information. The stochastic weight averaging scheme was introduced to reach a broader optimum and better generalization. They enhanced the pixel-level discriminative representation capability.
Azad et al. [53] proposed a frequency re-calibration U-Net (FRCU-Net) for medical image segmentation to reduce the effect of texture bias and obtain better generalization in a low-data regime. They applied the Laplacian pyramid in the bottleneck stage of the U-structure. They used a channel-wise attention mechanism to capture the relationship between the channel feature maps in a layer of the frequency pyramid. The extracted features of each level of the pyramid are then combined through a non-linear function based on their impact on the final segmentation output. Their network achieved state-of-the-art results. However, employing high-frequency information can amplify the noise present in skin images, which may deteriorate the performance of the system in real-time scenarios. Alom et al. [58] built on U-Net to propose a Recurrent Convolutional Neural Network (RCNN) and a Recurrent Residual Convolutional Neural Network (RRCNN), both based on U-Net, named RU-Net and R2U-Net, respectively. The proposed models showed better performance in segmentation tasks with the same number of network parameters when compared to existing methods, including the U-Net and residual U-Net (ResU-Net) models. However, the computational time for training and testing increased due to the recurrence operations. Asadi et al. [59] took full advantage of U-Net, the Squeeze and Excitation (SE) block, bidirectional ConvLSTM (BConvLSTM), and dense convolutions to create a new model called MCGU-Net as an extension of U-Net for medical image segmentation. The network was able to capture more discriminative information and produce more precise segmentation results. However, using dense connections in the bottleneck increases the complexity of the network.
Few researchers [11], [12], [14], [60] have adopted the idea of using color spaces in segmentation methods that rely on convolutional neural networks. Schaefer et al. [19] presented an effective approach that combines an enhancement stage and two different segmentation algorithms. The enhancement stage was a pre-processing step to counter the weak contrast and lack of color calibration in the dermoscopy images. They showed that applying a color normalization technique was necessary to reduce color variations and enhance image contrast in order to improve skin lesion segmentation. Pour et al. [11] developed a segmentation model for skin lesion segmentation and dermoscopic feature segmentation. They trained the network from scratch despite the limited data size, without applying any data augmentation or any pre-processing techniques to remove artifacts or enhance images. Instead, they increased the depth of the input to the convolutional layers by concatenating feature maps from the transform domain and by using the CIELAB colour space together with the RGB colour channels. Their proposed method improved the segmentation process in two ways: (1) the feature maps concatenated to the network input provided an excellent realization of the input to the model; (2) applying the CIELAB colour space together with the RGB colour channels provided more information to the network. Likewise, a deep fully Convolutional-Deconvolutional Neural Network (CDNN)-based framework was proposed by Yuan et al. [60] to automatically segment dermoscopic images of skin lesions. They did not use pre- or post-processing algorithms or any hand-crafted features. They focused on designing a suitable network architecture and an effective training method so that the network can handle images acquired under various conditions. Besides the RGB color channels of the input image, they used the HSV color space together with the L channel from the Lab color space as the training input.
De Angelo et al. [12] presented a methodology that combines deep learning, color spaces, and conditional random fields to segment a dataset created in partnership with a group of dermatologists. For this purpose, they developed an application to collect skin lesion images using smartphone cameras and created a new clinical dataset. Their investigation of color spaces and post-processing enabled them to make some important remarks about how the ground truth images for skin lesions affect the final segmentation results. Abbas et al. [14] enhanced the quality of dermoscopic images before segmentation by applying three pre-processing operations. First, they removed image noise using a median filter and morphological operations. Second, they selected the green channel as the optimal color channel from the RGB values. Third, they utilized a combined Spline and B-spline to enhance the image before segmentation. After image pre-processing, they used an empirical threshold value on the optimal color channel to complete the lesion segmentation. Finally, post-processing (a morphological operation) was utilized to fuse the smaller regions with the main lesion region and extract the lesion border. Although their approach achieved good performance and a high accuracy value, it depends on the quality of the pre- and post-processing results. Table 1 summarizes the most related skin lesion segmentation methods listed in this section in terms of their advantages and limitations, along with the datasets employed in testing.

III. PROPOSED METHOD
In this section, we explain the details of the three proposed color U-Net models, which are based on the well-known U-Net architecture. First, we explain the image pre-processing technique employed in this work. Then, we explain the structure of the proposed single input color U-Net (SICU-Net). Next, the other two proposed architectures, namely DICU-Net and TICU-Net, are explained in detail. Finally, the proposed composite loss function is presented.

A. IMAGES PRE-PROCESSING
The performance of the proposed skin lesion segmentation method can be improved by applying a simple pre-processing technique to the images before training and testing. Since the skin lesion images used for segmentation contain considerable noise, complex structures, and different textures, pre-processing was necessary to mitigate the effects of these obstacles. We used a two-step pre-processing procedure consisting of color space transformation and image contrast normalization.

1) Color Space Transformation
The captured skin lesion images are usually available as 3-channel images with tri-stimulus colors, red (R), green (G), and blue (B), known as the RGB color space. RGB is the most commonly used color space in the image processing field. Although the RGB model is commonly used, it is a device-dependent color space and is not perceptually uniform. Also, it is difficult to specify a particular color in that model, and there are non-linear differences among similar colors. Figure 4 shows the irregular color contrast present in the ISIC dataset. The problem of color contrast variation among lesions can be alleviated by selecting an appropriate color space representation of the input image. Various color spaces are commonly used to solve skin lesion segmentation problems, such as Lab, HSV, YCbCr, and XYZ.
While RGB represents colors as a combination of red, green, and blue signals, the YCbCr color space is defined as a transformation from RGB color space. YCbCr represents colors as a combination of the luma component of the color signal and two chroma signals (blue-difference and red-difference). HSV maps the RGB primary colors into dimensions that are easier for humans to understand: hue, saturation, and value. HSV is a cylindrical color model, which is more natural than the additive (RGB) color components. The XYZ color space is the conversion of the channels of the RGB image to CIE 1931 XYZ values. The Y channel refers to luminance, the Z channel is approximately equal to the blue channel, and X is a mix of response curves chosen to be non-negative and orthogonal to luminance. Lab is a color model introduced to obtain near-uniform spacing of perceived color differences. The distance between two points shows the difference of the colors in luminance, chroma, and hue. YIQ is the color model used in the NTSC color TV system. The Y channel represents the luma information or brightness of the image, while I is the amount of blue or orange tones in the image and Q (Quadrature) is the amount of green or purple tones in the image. The gray-scale image is the image whose pixel values equal the amount of light at each pixel.

TABLE 1. Summary of the advantages and limitations of the most related methods concerned with skin lesion segmentation.

Reference | Advantages | Tested Dataset | Limitations
Arora et al. [9] | Using Group Normalization (GN) led to efficient extraction of the feature maps, and using Attention Gates (AG) distinguishes high-dimensional information from low-level irrelevant background regions. | ISIC 2018 | This method relied on specific pre-processing and post-processing techniques to improve the results.
Al-masni et al. [18] | Using the full spatial resolution of the image enabled the proposed FrCN to learn better features and improve the pixel-wise segmentation performance. | ISBI 2017 & PH2 | Expensive computational cost due to the elimination of subsampling layers.
Unver et al. [21] | Combining a deep convolutional neural network named You Only Look Once (YOLO) with the GrabCut algorithm allows segmenting images at higher resolution and independently of their dimensions. | PH2 & ISBI 2017 | The method produces less accurate segmentation than the deep learning-based methods. Also, it does not detect the lesion in low-contrast situations or when the lesion occupies a large area.
Yuan et al. [60] | Developed a deeper network architecture with smaller kernels to increase the discriminant capability of the network, and combined information from multiple color spaces to improve segmentation performance. | ISBI 2017 | The proposed method is evaluated using a single dataset only.
Lin et al. [52] | Using U-Nets achieved a significantly higher Jaccard index, and using histogram equalization improved the U-Net segmentation results. | ISIC 2017 | The algorithm performs poorly when artifacts are present, and incorrectly identifies the black border, the medical gauze, and other dark objects as a lesion.
Reza et al. [61] | Proposed an extension of U-Net called Bi-directional ConvLSTM U-Net with Densely connected convolutions (BCDU-Net). They accelerated the convergence speed of the proposed network by employing batch normalization (BN). |  | The densely connected convolutional blocks highly increase the computational cost of the model.
Garcia et al. [62] | The developed method, based on fuzzy classification of pixels and histogram thresholding, is efficient, obtains very good results in terms of reliability, is fast, and has low computational cost. | ISBI 2016 & ISBI 2017 | The method cannot handle severe variations in imaging conditions.
Each color space of the skin lesion image provides a different representation, which helps to capture the invariant color properties of the lesion. Figure 2 (left part) shows the appearance of the components of each color space. It is clear from the figure that some channels carry discriminative skin lesion information with high contrast between lesion and normal skin, such as the B, Y, V, X, Y, and L channels in the RGB, YCbCr, HSV, XYZ, YIQ, and Lab color spaces, respectively. In addition, some channels such as Cr, IQ, and ab retain invariant characteristics of the lesion despite their low-contrast appearance.
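Two of the color spaces discussed above, YCbCr and YIQ, are plain linear transforms of RGB and can be sketched in a few lines of NumPy. The coefficients below are the commonly quoted BT.601 full-range (JPEG-style) YCbCr matrix and a rounded NTSC YIQ matrix; which exact variants and library routines were used in this work is not stated, so both matrices and the function names are assumptions for illustration. Lab, HSV, and XYZ need non-linear or illuminant-dependent steps and are omitted here.

```python
import numpy as np

# BT.601 full-range RGB -> YCbCr coefficients (assumed variant).
RGB_TO_YCBCR = np.array([
    [ 0.299,     0.587,     0.114   ],   # Y  (luma)
    [-0.168736, -0.331264,  0.5     ],   # Cb (blue-difference)
    [ 0.5,      -0.418688, -0.081312],   # Cr (red-difference)
])

# Rounded NTSC RGB -> YIQ coefficients (assumed variant).
RGB_TO_YIQ = np.array([
    [0.299,  0.587,  0.114],             # Y (luma)
    [0.596, -0.274, -0.322],             # I
    [0.211, -0.523,  0.312],             # Q
])

def rgb_to_ycbcr(rgb):
    """rgb: float array (H, W, 3) in [0, 1]; Cb and Cr are shifted to centre on 0.5."""
    out = rgb @ RGB_TO_YCBCR.T
    out[..., 1:] += 0.5
    return out

def rgb_to_yiq(rgb):
    """rgb: float array (H, W, 3) in [0, 1]; I and Q are signed."""
    return rgb @ RGB_TO_YIQ.T

def rgb_to_gray(rgb):
    """Gray-scale image taken as the luma channel."""
    return rgb @ RGB_TO_YCBCR[0]

# A 1x2 test image: one white pixel and one pure red pixel.
image = np.array([[[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]])
ycbcr = rgb_to_ycbcr(image)
yiq = rgb_to_yiq(image)
gray = rgb_to_gray(image)
print(ycbcr[0, 0], yiq[0, 0], gray[0])
```

For any gray pixel the chroma channels stay at their neutral value (0.5 for Cb and Cr, 0 for I and Q), which is why these channels look low in contrast yet isolate purely chromatic information about the lesion.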

2) Image Contrast Normalization and Enhancement
In order to increase the contrast of skin images after converting them into another color space, we applied a simple image contrast enhancement technique to improve the contrast between background and foreground (skin lesion) in each channel. The contrast enhancement method maps the intensity values of each component of the input image to new values. It also controls the relationship between the values in the input and output images by defining the shape of the curve that describes this relationship, producing brighter, darker, or linearly mapped output values. In the normalized image, 1% of the values are saturated at the low and high intensities. In this paper, we chose the linear mapping method and investigated its performance on the proposed models. Figure 2 (right part) shows the appearance of each color space component after applying contrast normalization. It is clear that the distinction between foreground and background pixels is improved by applying the simple contrast enhancement technique.
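A linear contrast stretch with saturation at both ends can be sketched as follows. The sketch assumes that 1% of the pixels are saturated at each end of the intensity range, implemented through the 1st and 99th percentiles; the function names `stretch_contrast` and `normalize_image` are hypothetical, and the routine actually used in this work may differ.

```python
import numpy as np

def stretch_contrast(channel, low_pct=1.0, high_pct=99.0):
    """Linearly map [low percentile, high percentile] to [0, 1], clipping the tails."""
    channel = np.asarray(channel, dtype=float)
    lo, hi = np.percentile(channel, [low_pct, high_pct])
    if hi <= lo:                       # flat channel: nothing to stretch
        return np.zeros_like(channel)
    return np.clip((channel - lo) / (hi - lo), 0.0, 1.0)

def normalize_image(image):
    """Apply the stretch independently to every channel of an (H, W, C) image."""
    channels = [stretch_contrast(image[..., c]) for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)

# Synthetic low-contrast image with all values squeezed into [0.3, 0.6].
rng = np.random.default_rng(0)
low_contrast = 0.3 + 0.3 * rng.random((64, 64, 3))
enhanced = normalize_image(low_contrast)
print(low_contrast.min(), low_contrast.max(), enhanced.min(), enhanced.max())
```

Processing each channel separately matches the per-component mapping described above; a gamma exponent applied to the stretched values would give the brighter or darker non-linear mappings instead of the linear one chosen here.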

B. SINGLE INPUT COLOR U-NET ARCHITECTURE
The use of convolutional neural networks (CNNs) in the image classification field has been prevalent for a long time. A CNN is a hierarchical model used to learn multi-level features of the image, and these feature maps are transformed into a vector which is then used for the classification task. However, segmentation of complex objects requires not only converting the feature maps into a vector but also reconstructing the image from this vector, so traditional convolutional neural networks are inadequate for this task. Therefore, the need for an alternative structure that can handle these complex images emerged, which led to the U-Net architecture. The performance of the classical U-Net is not satisfactory when applied to skin lesion segmentation due to the low contrast and color discrepancy of skin lesions. To overcome these problems, we propose three network architectures based on the U-Net structure, which exploit different color spaces at each input path. The proposed networks have single, dual, and triple inputs and one output path. The inputs can be fed with any color space or any selected channel. The proposed networks are denoted single input color U-Net (SICU-Net), dual input color U-Net (DICU-Net), and triple input color U-Net (TICU-Net).

FIGURE 2. Different component representations of each color space. Left: channels of each color space without image adjustment. Right: channels of each color space after applying image contrast normalization and enhancement.
U-Net was proposed by Ronneberger et al. [29] for biomedical image segmentation. Its input is an image and its output is a single label for each pixel in this image. U-Net is able to localize and recognize borders because it performs classification on every pixel, so the input and output images have the same size. The idea of U-Net is that the feature maps learned from the input image can be transformed back into another image instead of a vector of classes. Figure 3 shows the architecture of the proposed single input color U-Net (SICU-Net) model. The design of the SICU-Net resembles a 'U' shape, hence its name. The SICU-Net structure consists of three parts: the contracting/down-sampling (encoder) path, the bottleneck, and the expanding/up-sampling (decoder) path. The encoder path consists of 4 blocks. Each block contains two 3 × 3 convolution layers followed by an activation function and a 2 × 2 max pooling layer. The number of feature maps is doubled after each pooling: the network begins with 64 feature maps in the first block, doubles to 128 in the second, and so on. The function of the encoder path is to capture the context information of the input image and produce feature maps. Those feature maps are transferred to the up-sampling path through a channel-wise attention module instead of the simple skip connections (dashed gray arrows in Fig. 3) of the classical U-Net model. The second part is the bottleneck, which is located between the contracting and expanding paths. The bottleneck is also built from two 3 × 3 convolution layers followed by a 2 × 2 up-convolution layer with dropout. The third part is the decoder path, which is again composed of 4 blocks. Each block consists of a deconvolution layer with stride 2, concatenated with the corresponding cropped feature map from the contracting path, and two 3 × 3 convolution layers with the ReLU activation function.
At the end of each block, the number of feature maps is halved to keep symmetry. The decoder path enables precise localization using transposed convolutions. At every step, its input is concatenated with the attention-weighted feature maps from the corresponding contraction layer. In this way, the features learned while contracting the input image are passed through a channel-wise attention module and used to reconstruct it. The number of expansion blocks is equal to the number of contraction blocks. In the end, the attention-weighted feature maps pass through a 1 × 1 convolutional layer whose number of feature maps equals the number of desired classes, producing the segmentation map by assigning a label to each pixel in the image.
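The encoder arithmetic described above (four blocks, 64 starting feature maps, doubling after each 2 × 2 pooling) can be traced with a short sketch. It assumes that the convolutions preserve spatial size ('same' padding) and that the doubling continues into the bottleneck; neither detail is stated in the text, and the function name `sicu_net_shapes` is hypothetical.

```python
def sicu_net_shapes(height, width, base_filters=64, num_blocks=4):
    """Return (stage, height, width, channels) for each encoder block and the bottleneck."""
    shapes = []
    h, w, filters = height, width, base_filters
    for block in range(1, num_blocks + 1):
        # Convolutions keep h x w (assumed 'same' padding) and set the channel count.
        shapes.append((f"encoder{block}", h, w, filters))
        h, w = h // 2, w // 2   # 2x2 max pooling halves each spatial side
        filters *= 2            # feature maps double after each pooling
    shapes.append(("bottleneck", h, w, filters))
    return shapes

# Trace the 160 x 224 input size used for ISIC 2017 and PH2.
for row in sicu_net_shapes(160, 224):
    print(row)
```

Under these assumptions a 160 × 224 input reaches the bottleneck at 10 × 14, and the four decoder blocks mirror the encoder back to the input resolution.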
In the single input color U-Net (SICU-Net) structure, the color space/channels of the input image can be chosen according to the type of medical input image. The segmentation results can be improved by choosing an appropriate color space to represent the class information. Thus, the SICU-Net model allows the appropriate color space/channel to be chosen manually to improve segmentation results. The contraction path of the SICU-Net network serves as a deep feature extractor, which is connected to the expansive path through channel-wise attention modules. The expansive path enables precise localization information to be combined with contextual information coming from the contracting path. The SICU-Net model inherits all advantages of the traditional U-Net and provides the general (context and localization) information necessary to predict the segmentation map. Input images can be of any size and color space/channels. Because the number of annotated samples is limited, data augmentation can further improve the results.

C. PROPOSED DUAL INPUT CU-NET (DICU-NET) AND TRIPLE INPUT CU-NET (TICU-NET)
Like most convolutional neural networks, the SICU-Net architecture accepts only one input image, which is processed by consecutive convolution and pooling layers. This paper extends the proposed SICU-Net architecture to accept multiple inputs. The proposed network architectures, namely, dual input color U-Net (DICU-Net) and triple input color U-Net (TICU-Net), are adapted from the proposed SICU-Net structure with multiple encoders and a single decoder path. The contracting path in each network has four stages of convolution blocks followed by a bottleneck stage. Each block contains two convolution layers with the ReLU activation function followed by a max-pooling layer that downsamples the input by a factor of two. Each encoder sub-network receives the same input image but in a different color space. The output of the max-pooling layer in each stage of an encoder sub-network is added to the output of the corresponding stage in the other encoder sub-network(s). Then, the output of the addition layer of each stage is passed on to the next stage. In contrast, the feature maps produced by each stage of each encoder sub-network are fed to a channel attention network in the expansive path. The output is then concatenated with the output of each sub-network's bottleneck stage and passed through decoding blocks, each of which begins with a deconvolution layer followed by two convolutional layers that halve the number of feature maps. Figure 5 and Figure 6 show in detail the architectures of DICU-Net and TICU-Net, respectively. The features extracted in the decoding path are passed to the channel attention network at the end of the decoding path and fed to the final classification layer. The combination of various feature maps from different input color spaces improves the segmentation results.
In addition, the entire context of the input images is preserved because the network is trained end-to-end from the multiple input images to the corresponding segmentation map. The structure of the attention network is described in detail in Section III-D.
The proposed DICU-Net and TICU-Net structures are fed with an RGB input image followed by a distribution layer, which converts the input image into other color spaces and feeds them to the encoder paths of the multi-input CU-Net. The encoder sub-networks are identical to each other in the number of layers and in the size and number of convolutional filters. The feature maps from the last deconvolutional layer of the decoding path are fed to an attention network before entering the final 1 × 1 convolution layer, which produces the semantic segmentation map.
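The additive coupling between encoder sub-networks can be illustrated with a small sketch. This is one reading of the description, not the authors' implementation: random arrays stand in for the convolution outputs of two encoders, the element-wise sum of the pooled maps is treated as what the next stage receives, and the names `max_pool_2x2` and `fuse_stage` are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an H x W x C array (H and W must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fuse_stage(pooled_per_encoder):
    """Element-wise sum of the same-stage pooled outputs of all encoders."""
    return np.sum(np.stack(pooled_per_encoder, axis=0), axis=0)

rng = np.random.default_rng(0)
# Stand-ins for the stage-1 feature maps of two encoders (two color spaces).
feat_a = rng.random((16, 24, 8))
feat_b = rng.random((16, 24, 8))
pooled_a, pooled_b = max_pool_2x2(feat_a), max_pool_2x2(feat_b)
fused = fuse_stage([pooled_a, pooled_b])   # shape (8, 12, 8)
```

Summation keeps the channel count unchanged, so identical encoder sub-networks can exchange information at every stage without changing their layer sizes; a third encoder for TICU-Net would simply add one more entry to the list.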
The main characteristics of the proposed SICU-Net, DICU-Net, and TICU-Net models can be summarized as follows:
1) The input image is first resized and normalized, then converted to the desired color spaces.
2) All channels resulting from this conversion are stacked one after another to form an input image with multiple channels from different color spaces.
3) The constructed multi-channel input image is sent to the distribution layer. This layer is responsible for dividing the input image channels into single, dual, or triple subset(s) of channels according to the type of network.
4) In the single input color U-Net model (SICU-Net), all input channels are fed into the CU-Net.
5) In the dual input color U-Net model (DICU-Net), the first subset of channels is fed into the first encoder sub-network while the second subset is fed into the second encoder sub-network.
6) In the triple input color U-Net model (TICU-Net), the first subset of channels is fed into the first encoder sub-network, while the second and third subsets are fed into the second and third encoder sub-networks, respectively.
7) The output feature maps from SICU-Net, or from each encoder sub-network of the DICU-Net and TICU-Net models, are fed to a simple channel-attention module until they reach the final classification layer.
8) The final classification layer is a convolution layer with a filter size equal to 1 × 1 to accomplish the semantic segmentation task.
9) The training process of the proposed models utilizes the backpropagation learning algorithm with gradient descent optimization and a composite loss function.
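Steps 2 to 6 above (stacking converted channels, then splitting them per encoder) can be sketched as follows. This is a hypothetical illustration: the helper names `rgb_to_gray` and `distribute` are not from the paper, and the gray conversion uses the common ITU-R BT.601 luma weights as a stand-in for whichever color conversions are actually chosen.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Luma-weighted gray channel (BT.601 weights), returned with shape H x W x 1."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb @ weights)[..., None]

def distribute(stacked, channels_per_input):
    """Split an H x W x C stack into consecutive channel subsets, one per encoder."""
    if sum(channels_per_input) != stacked.shape[-1]:
        raise ValueError("subset sizes must add up to the number of stacked channels")
    split_points = np.cumsum(channels_per_input)[:-1]
    return np.split(stacked, split_points, axis=-1)

rng = np.random.default_rng(0)
rgb = rng.random((160, 224, 3))
# Step 2: stack the channels of every chosen color space (here RGB + Gray).
stacked = np.concatenate([rgb, rgb_to_gray(rgb)], axis=-1)
# Steps 3 and 5: the distribution layer hands one subset to each DICU-Net encoder.
rgb_part, gray_part = distribute(stacked, [3, 1])
```

Passing a single-element list such as `[4]` reproduces the SICU-Net case, in which all stacked channels go to one encoder, and a three-element list gives the TICU-Net case.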

D. CHANNEL-WISE ATTENTION MODULE
The proposed multi-input deep convolutional neural network generates multiple feature maps from each input. It is beneficial to fuse the extracted features by focusing attention on significant feature maps, emphasizing informative features and suppressing redundant ones. A channel-wise attention module is employed to draw the attention of the convolutional neural network towards important color features. Channel-wise attention modules are usually based on learning the interrelation between feature channels. The input to the channel-wise attention module is a set of feature maps $FM_c \in \mathbb{R}^{h \times w \times c}$, where h, w, and c are the numbers of rows, columns, and channels, respectively. The feature maps fed into the channel-wise attention module are generated by applying a set of convolutional and pooling operations in the contraction path to the input image in different color spaces. The channel attention module used in this work is inspired by [63]: global max pooling and global average pooling are applied to obtain $F_{max}$ and $F_{avg}$, respectively. The pooled features are then concatenated and forwarded through two fully connected layers, which reshape the output features into the required number of channels and are followed by the ReLU and sigmoid activation functions. The resulting channel attention map $M_c \in \mathbb{R}^{1 \times 1 \times c}$, where c is the number of channels, is multiplied with the input feature maps to obtain the required channel-weighted map. Figure 7 shows the structure of the channel attention module.

$$M_c = \sigma(\mathrm{ReLU}(Fc_{64}(\mathrm{ReLU}(Fc_{32}(\mathrm{concat}(F_{max}, F_{avg})))))) \tag{1}$$

where $\sigma$ denotes the sigmoid activation function and $Fc_{32}$ and $Fc_{64}$ denote the two fully connected layers. The final channel attention map $CM_c$ is obtained from:

$$CM_c = M_c \otimes FM_c \tag{2}$$

where $\otimes$ is element-wise multiplication and $CM_c \in \mathbb{R}^{h \times w \times c}$.
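The attention computation described in this subsection can be sketched with NumPy. This is a minimal forward pass under stated assumptions: the weights are random rather than learned, the two fully connected layers are taken to have 32 and 64 units with c = 64 channels, and the function name `channel_attention` is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fm, w1, b1, w2, b2):
    """Weight each channel of an h x w x c feature map by an attention score."""
    f_max = fm.max(axis=(0, 1))               # global max pooling     -> (c,)
    f_avg = fm.mean(axis=(0, 1))              # global average pooling -> (c,)
    pooled = np.concatenate([f_max, f_avg])   # concatenated descriptor -> (2c,)
    hidden = relu(pooled @ w1 + b1)           # first fully connected layer + ReLU
    m_c = sigmoid(relu(hidden @ w2 + b2))     # attention map M_c, one score per channel
    return fm * m_c                           # weighted map CM_c (broadcast over h, w)

rng = np.random.default_rng(0)
c, hidden_units = 64, 32                      # assumed layer sizes
fm = rng.random((20, 28, c))
w1, b1 = rng.normal(0.0, 0.1, (2 * c, hidden_units)), np.zeros(hidden_units)
w2, b2 = rng.normal(0.0, 0.1, (hidden_units, c)), np.zeros(c)
cm = channel_attention(fm, w1, b1, w2, b2)
```

One consequence of applying the sigmoid after a ReLU, as the formula for M_c is written, is that every channel score lies between 0.5 and 1, so channels are attenuated to at most half rather than switched off entirely.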

E. PROPOSED COMPOSITE LOSS FUNCTION
Many loss functions are used to improve the performance of deep convolutional neural networks for semantic segmentation. The aim of the loss function is to evaluate the performance of the network and minimize the training error. Before explaining our proposed composite loss function, we introduce the quantities that are usually employed to formulate the evaluation metrics, namely, True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Pixels in binary skin lesion segmentation belong to one of two classes: foreground (lesion skin) and background (normal skin). Foreground pixels that the model predicts correctly are TP, while foreground pixels that the model predicts incorrectly are FN. Background pixels that the model predicts correctly are TN, and background pixels that the model predicts incorrectly are FP. Our proposed composite loss function is called the Binary Weighted Cross-entropy, Dice, Sensitivity-Specificity (BW-CE-Di-SS) loss function. It consists of three terms combined with binary weights, and its components are defined as follows:

• Cross Entropy Loss: Cross-entropy is the most common and effective measure in binary segmentation tasks. It is derived from the Kullback-Leibler (KL) divergence and measures the performance of a model whose output is a probability value between 0 and 1. Its value increases as the predicted probability (Y) diverges from the actual label (T). In its definition (Eq. 3), $T_{ic}$ is the ground-truth label of pixel i for class c and $Y_{ic}$ is the predicted probability that pixel i belongs to class c; N and C represent the total numbers of pixels and classes, respectively. Ronneberger et al. [29] introduced a Weighted Cross Entropy (WCE) formula (Eq. 4) to handle the class imbalance problem in binary segmentation tasks, in which w is a weight inversely proportional to the class frequencies that is used to penalize majority classes. In our proposed loss, we set the weight w to 0.7. However, because the data in skin lesion segmentation are highly imbalanced, WCE cannot overcome this problem on its own, so it must be combined with other complementary loss functions.

• Dice Loss: Dice loss is used as an alternative to cross-entropy to train 3D U-Net [64] and other network architectures. It is derived from the Sørensen-Dice coefficient, a statistical metric used to evaluate the similarity (overlap) between two samples. This metric ranges from 0 to 1, where a Dice coefficient of 1 denotes perfect and complete overlap between the evaluated samples. Milletari et al. [64] utilize the Dice coefficient in computer vision and define it as the area of overlap divided by the mean of the total number of pixels in both images:

$$Dice = \frac{2\,|T \cap Y|}{|T| + |Y|} \tag{5}$$
Here, $|T \cap Y|$ is the number of pixels common to T and Y, while $|T| + |Y|$ counts all pixels in T plus all pixels in Y. In image segmentation problems, $|T \cap Y|$ is evaluated as the element-wise multiplication of the prediction (Y) and the ground-truth mask (T), followed by summing the elements of the resulting matrix. To evaluate the denominator, some researchers use the regular sum whereas others prefer the squared sum. We tried both and found that the squared sum gave better results. Applying the Dice definition (Eq. 5) to Boolean data and using the definitions of {TP, FP, FN}, we can reformulate Eq. 5 as:

$$Dice = \frac{2\,TP}{2\,TP + FP + FN} \tag{6}$$

Dice loss weighs FPs and FNs equally to achieve a better trade-off between precision and recall. The factor 2 in the numerator compensates for the common pixels being counted twice in the denominator, once from T and once from Y. In the Dice loss function (Eq. 9), $Y_{ic}$ and $T_{ic}$ are the predicted and ground-truth values of pixel i for class c, respectively. Dice loss deals efficiently with situations where there is a great imbalance between the numbers of skin lesion and normal pixels, and it is one of the loss functions that has achieved very good results in semantic segmentation. However, neither WCE nor Dice loss can directly improve the sensitivity and specificity performance metrics.

• Sensitivity-Specificity Loss: To address the class imbalance problem in semantic segmentation, a loss function combining sensitivity and specificity was proposed by Brosch et al. [65]. Sensitivity and specificity are used together to evaluate classification performance in highly unbalanced problems. Sensitivity refers to the ability to detect skin lesion pixels correctly:

$$Sensitivity = \frac{TP}{TP + FN} \tag{10}$$

Therefore, if a test skin image has 100% sensitivity, the model correctly identifies all skin lesion pixels. On the other hand, specificity refers to the ability to detect normal skin pixels correctly:

$$Specificity = \frac{TN}{TN + FP} \tag{11}$$

Therefore, a test skin image with 100% specificity means that all normal skin pixels are correctly identified. Sensitivity (true positive rate) measures the ratio of actual skin lesion pixels that are correctly classified, while specificity (true negative rate) measures the ratio of actual normal skin pixels that are correctly classified. By combining them, the final error measures a weighted sum of the mean squared difference over the lesion pixels (sensitivity) and over the normal pixels (specificity). The final error is formulated in Eq. (12), in which the first term is the sensitivity term and the second is the specificity term.
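For reference, the cross-entropy, weighted cross-entropy, Dice, and sensitivity-specificity losses described above are commonly written in the forms below. This is a hedged reconstruction from the cited works [29], [64], [65] and the symbols defined in this subsection, not a verbatim copy of the paper's equations: the normalization by N, the per-class weight $w_c$, and the balancing weight $r$ are assumptions.

```latex
\begin{align*}
L_{CE}   &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} T_{ic}\,\log Y_{ic} \\
L_{WCE}  &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c\,T_{ic}\,\log Y_{ic} \\
L_{Dice} &= 1 - \frac{2\sum_{i}\sum_{c} Y_{ic}\,T_{ic}}
                     {\sum_{i}\sum_{c} Y_{ic}^{2} + \sum_{i}\sum_{c} T_{ic}^{2}} \\
L_{SS}   &= r\,\frac{\sum_{i}(T_i - Y_i)^{2}\,T_i}{\sum_{i} T_i}
          + (1-r)\,\frac{\sum_{i}(T_i - Y_i)^{2}\,(1-T_i)}{\sum_{i}(1-T_i)}
\end{align*}
```

The Dice form uses the squared sums in the denominator, which is the variant the text reports as giving better results; in the last line, the first fraction averages the squared error over lesion pixels and the second over normal pixels.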
Due to the class imbalance in the skin lesion images, a binary weighted composite loss function is proposed in this paper in the following way.
$$L_{\text{BW-CE-Di-SS}} = 1 \cdot L_{WCE} + 2 \cdot L_{Dice} + 4 \cdot L_{SS} \tag{13}$$

where $L_{WCE}$, $L_{Dice}$, and $L_{SS}$ are defined in Eqs. 4, 9, and 12, respectively. Each term in the proposed loss function contributes to balancing accuracy, true positive rate, and true negative rate. The Dice term gives additional weight to false positives and false negatives, which boosts segmentation performance, and the specificity term further enhances the true negative rate. Our proposed loss function works in harmony with the proposed multi-input CU-Net to segment color skin lesion images in different color spaces.
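A compact NumPy sketch of the composite loss in Eq. (13) is given below. Only the weights 1, 2, and 4, the value w = 0.7, and the squared-sum Dice denominator come from the text; the exact binary WCE form, the balance r = 0.5 in the sensitivity-specificity term, the epsilon guard, and all function names are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-7  # numerical guard against log(0) and division by zero (assumption)

def weighted_cross_entropy(t, y, w=0.7):
    """Binary WCE: w weights lesion pixels, 1 - w background pixels (assumed form)."""
    y = np.clip(y, EPS, 1.0 - EPS)
    return -np.mean(w * t * np.log(y) + (1.0 - w) * (1.0 - t) * np.log(1.0 - y))

def dice_loss(t, y):
    """1 - Dice, with squared sums in the denominator."""
    return 1.0 - 2.0 * np.sum(t * y) / (np.sum(t ** 2) + np.sum(y ** 2) + EPS)

def sensitivity_specificity_loss(t, y, r=0.5):
    """Weighted mean squared error over lesion pixels and over normal pixels."""
    sq_err = (t - y) ** 2
    sens_term = np.sum(sq_err * t) / (np.sum(t) + EPS)
    spec_term = np.sum(sq_err * (1.0 - t)) / (np.sum(1.0 - t) + EPS)
    return r * sens_term + (1.0 - r) * spec_term

def composite_loss(t, y):
    """Eq. (13): binary weights 1, 2 and 4 on the three terms."""
    return (1.0 * weighted_cross_entropy(t, y)
            + 2.0 * dice_loss(t, y)
            + 4.0 * sensitivity_specificity_loss(t, y))

t = np.zeros((8, 8))
t[2:6, 2:6] = 1.0                          # toy ground-truth mask
good = np.where(t == 1.0, 0.95, 0.05)      # confident, correct prediction
bad = np.full_like(t, 0.5)                 # uninformative prediction
```

On this toy mask the confident prediction yields a much smaller composite loss than the uninformative one, which is the behavior a training loss should show; the relative influence of each term would shift with different choices of the assumed w and r.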

IV. EXPERIMENTAL RESULTS
In this section, the segmentation performance of the proposed CU-Net architectures is evaluated using three datasets, namely, ISIC 2017, ISIC 2018, and PH2. The International Skin Imaging Collaboration (ISIC) 2017 dataset contains three subsets. The first (training) subset has 2000 lesion images in JPG format divided into 3 subcategories: 1372 benign images, 374 melanoma images, and 254 seborrheic keratosis (SK) images, with their 2000 corresponding binary mask images in PNG format. The second (testing) subset includes 600 lesion images in JPG format (393 benign, 117 melanoma, and 90 SK images) with their 600 corresponding binary mask images in PNG format. The third (validation) subset includes 150 lesion images in JPG format (78 benign, 30 melanoma, and 42 SK images) with their 150 corresponding binary mask images. To evaluate our proposed models on this dataset, all images are scaled to (160×224). Figure 8 shows samples from the ISIC 2017 dataset. In addition, the class distribution of each skin lesion type is shown in Table 2.

The ISIC 2018 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The training dataset of ISIC 2018 contains 2594 RGB dermoscopic images in JPG format with resolutions ranging from (576×768) to (6748×4499). We divided the training images into 80% (2076 images) for training and 20% (518 images) for testing, resized to (256×256).

The PH2 dataset was created at the Hospital Pedro Hispano with the help of the research group Universidade do Porto, Técnico Lisboa in Matosinhos, Portugal. The PH2 dataset consists of 200 skin lesion images in total: 80 common nevus cases, 80 atypical nevus cases, and 40 melanoma cases. Images in the PH2 dataset were captured under the same conditions, and all images in this dataset are 768 × 560 pixels in size. The segmentation masks of this dataset were drawn by expert dermatologists. In this study, the PH2 dataset is used only for testing because it is too small for deep CNN training; we therefore trained the models on the ISIC 2017 dataset and tested them on PH2. All images are rescaled to (160×224). The class distribution of lesion types for the PH2 dataset is shown in Table 3. Figure 10 shows some image samples from each class of the PH2 dataset.
To make a fair comparison with other state-of-the-art deep learning models, we used the standard testing protocol (train/valid/test) that is commonly used for the ISIC 2017 dataset. The validation subset of the ISIC 2017 database is used to adjust the hyperparameters of the proposed models, while the test subset is used to calculate their evaluation metrics.
The experiments conducted in this paper are implemented using Matlab 2020a running on Windows 10 on a PC with an Intel Core i7 processor, 32 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU. To avoid overfitting, we employed two techniques: (1) L2 regularization in the loss function, and (2) dropout layers. As for the stopping criterion, we tried different numbers of training epochs and chose the one that gave the best results. The number of epochs was 30, and the SGDM optimizer was used to train all proposed models with a learning rate of 0.05 and a mini-batch size of 4. The following experiments are conducted to investigate different aspects of our proposed method. The evaluations of the proposed models are presented using common performance metrics: true positive rate/sensitivity (SEN) defined in Eq. 10, true negative rate/specificity (SPE) defined in Eq. 11, and the Dice coefficient (DIC) defined in Eq. 6. The Jaccard index (JAC) and accuracy (ACC) metrics are defined as:

$$JAC = \frac{TP}{TP + FP + FN}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN are defined in subsection III-E.
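The five evaluation metrics can be computed directly from a pair of binary masks. The sketch below is illustrative rather than the authors' Matlab evaluation code: it uses the standard confusion-count definitions, assumes both classes are present in the ground truth (so no denominator is zero), and the function names are hypothetical.

```python
import numpy as np

def confusion_counts(truth, pred):
    """Count TP, TN, FP and FN pixels for two binary masks of equal shape."""
    truth, pred = truth.astype(bool), pred.astype(bool)
    tp = int(np.sum(truth & pred))
    tn = int(np.sum(~truth & ~pred))
    fp = int(np.sum(~truth & pred))
    fn = int(np.sum(truth & ~pred))
    return tp, tn, fp, fn

def segmentation_metrics(truth, pred):
    """Return SEN, SPE, DIC, JAC and ACC as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(truth, pred)
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "DIC": 2 * tp / (2 * tp + fp + fn),
        "JAC": tp / (tp + fp + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

truth = np.zeros((10, 10), dtype=int)
truth[2:6, 2:6] = 1                      # 16 lesion pixels
pred = np.zeros((10, 10), dtype=int)
pred[3:7, 2:6] = 1                       # same square shifted down by one row
metrics = segmentation_metrics(truth, pred)
```

In this toy case the one-row shift leaves 12 of the 16 lesion pixels overlapping, so accuracy stays high because background pixels dominate while the Jaccard index drops noticeably, which illustrates why overlap metrics are reported alongside ACC for imbalanced masks.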

A. EFFECT OF CHANGING COLOR SPACE ON PROPOSED CU-NET MODELS
The first experiment evaluates the contribution of each color space without using any attention modules or image enhancement. All three models are trained and tested using ISIC 2017 [...] Table 5 for the PH2 dataset shows that using the LUV color space for the input image achieved the best values of the ACC, DIC, and JAC coefficients, while the gray image achieved the best TPR and Lab achieved the best TNR. For DICU-Net, which accepts two different color spaces, Tables 6 and 7 summarize the results of changing the input color spaces using the ISIC 2017 and PH2 datasets, respectively. In this experiment, we tried many color space combinations to find the best ones. For ISIC 2017, Table 6 shows that the XYZ color space outperforms the other color spaces when it is combined with them. For the PH2 dataset, Table 7 shows that the combination of the gray and YIQ color spaces gives better results than using each of them separately.
For the proposed TICU-Net, a combination of three different color spaces is fed to the network. Table 8 shows the effect of changing the three-color-space combination on the evaluation metric values. For the ISIC 2017 dataset, the 'RGB-YCbCr-LUV' color spaces achieved the best ACC and TNR values, while the 'Gray-YCbCr-XYZ' combination achieved the best ACC, DIC, and JAC values. The best TPR is obtained using the 'XYZ-HSV-Gray' combination. For the PH2 dataset, Table 9 shows that the 'Gray-YIQ-XYZ' combination achieved the best ACC, SPE, DIC, and JAC values, while the 'RGB-Lab-Gray' combination achieved the best TPR; good DIC and JAC results are also achieved using 'RGB-XYZ-LUV'.

B. EFFECT OF USING CHANNEL ATTENTION MODULES ON MULTI-INPUT CU-NET
This experiment uses the same color spaces as the previous experiment but adds all channel attention modules to the proposed CU-Net models. Table 10 compares the SICU-Net segmentation results with and without channel-wise attention modules for the ISIC 2017 dataset. In general, adding channel attention modules improves the ACC, DIC, and JAC values in the vast majority of color spaces, with some exceptions for the Gray and YCbCr color spaces. The YIQ color space achieved the best values of the evaluation metrics. For the PH2 dataset, Table 11 shows the SICU-Net segmentation evaluation metrics. The performance of SICU-Net is improved by adding the attention modules in all color spaces, and the YCbCr color space achieves the best ACC, DIC, and JAC values. Tables 12 and 13 show [...] color spaces as shown in Table 13. Using the attention module with the 'LUV-YIQ' color spaces achieves the best ACC, DIC, and JAC results. Table 14 presents the results of using the proposed attention modules with the ISIC 2017 dataset to train and test TICU-Net with the outlined color spaces. With the 'RGB-YCbCr-YIQ' combination, all evaluation metrics increase, and the 'RGB-YIQ-Gray' color spaces achieved the best values for ACC, TNR, DIC, and JAC. The 'Gray-YCbCr-YIQ' color spaces yield the best TPR, and 'XYZ-HSV-Gray' yields the best TNR. Table 15 presents the results of using the proposed attention modules with the PH2 dataset for testing and ISIC 2017 for training. The 'RGB-XYZ-Lab', 'LUV-YCbCr-RGB', and 'LUV-YCbCr-YIQ' combinations increase the values of all evaluation metrics, the 'Lab-LUV-Gray' color spaces achieved the best values for ACC, TNR, DIC, and JAC, and the 'RGB-XYZ-Lab' color spaces yield the best TPR.

C. EFFECT OF NORMALIZING INPUT IMAGES ON THE PROPOSED CU-NET MODELS
This experiment uses the same color spaces as subsection IV-B on the proposed CU-Net models, but with image normalization applied to the input image. The image normalization technique used in this experiment maps the intensity values of each channel of the input image to new values to increase the contrast of the normalized image. In the normalized image, 1% of the values are saturated at the low and high intensities. Using the ISIC 2017 dataset, Table 16 compares the SICU-Net segmentation results with and without image normalization. The table shows that all evaluation metrics are improved; the best ACC, TPR, DIC, and JAC are obtained using the XYZ color space, while the best TNR is obtained using the HSV color space. Table 18 compares the DICU-Net segmentation results with and without image normalization. The results reveal that all evaluation metrics are increased; the best TNR is achieved with the 'RGB-XYZ' color spaces, and the best ACC, TPR, DIC, and JAC values are achieved with the 'Gray-HSV' color spaces. In Table 20, using image normalization with the 'RGB-YCbCr-YIQ', 'Gray-YCbCr-YIQ', and 'XYZ-HSV-Gray' color spaces improved the values of ACC, DIC, and JAC. The best ACC is achieved by 'RGB-YCbCr-YIQ', the best TPR by 'RGB-YIQ-Gray', the best TNR by 'Gray-YCbCr-YIQ', and the best DIC and JAC coefficients by 'XYZ-HSV-Gray'. For the PH2 dataset, Table 17 compares the SICU-Net segmentation results with and without image normalization. The results show that all evaluation metrics are improved; the best ACC, DIC, and JAC are achieved with the YCbCr color space, the best TNR with the YIQ color space, and the best TPR with the RGB color space. Table 19 compares the DICU-Net segmentation results with and without image normalization.
The table shows that the best evaluation metrics are achieved with the 'LUV-YCbCr' color spaces. Table 21 shows the effect of using image normalization on the segmentation results of TICU-Net. Image normalization improves the TPR for most of the outlined color space combinations. The best TNR, DIC, and JAC are achieved with the 'LUV-YCbCr-YIQ' color spaces, the best TPR with the 'RGB-XYZ-Gray' combination, and the best ACC with the 'Lab-LUV-Gray' combination. Figure 11 illustrates examples of the predicted images obtained from the SICU-Net (third row), DICU-Net (fourth row), and TICU-Net (fifth row) models for both the ISIC 2017 and PH2 datasets. The images in the figure are chosen randomly from each class of the two datasets.

D. STUDYING SEGMENTATION PERFORMANCE ACCORDING TO EACH LOSS FUNCTION COMPONENT
This experiment investigates the effect of the proposed composite loss function, which combines the Dice loss, the sensitivity-specificity loss, and the cross-entropy loss. Table 22 shows the effect of applying each loss function separately on the 'XYZ-HSV-Gray' color space combination for the TICU-Net model. The results reveal that the proposed composite loss function outperforms each individual loss function, improving accuracy by 0.62%, Dice by 1.95%, and the Jaccard coefficient by 2.94%.

E. COMPARISON WITH STATE-OF-THE-ART METHODS
This experiment compares our proposed models with other state-of-the-art methods that utilize the same datasets. The compared methods were obtained from [18], [23], [58], [66], [67]. Tables 23 and 24 present the performance of our proposed models alongside other recently developed fully convolutional networks such as FCN [68], U-Net [29], SegNet [69], FrCN [...] The results of some of the state-of-the-art methods, such as [53], [58], [59], rely on a partitioning of the benchmark databases that differs from the one provided by the creators, as shown in the second column of Tables 27 and 28. These differences in data partitioning make the comparison between these methods and our proposed method somewhat unfair. Although our proposed method could not achieve the best performance compared with other recent works, the idea of using multi-input paths with different color spaces can be added to any of the existing models to improve their robustness to color contrast variations.

V. DISCUSSION
The proposed color U-Net (CU-Net) models using single, dual, and triple color inputs are designed to overcome the problem of color variations in skin lesion images. Most existing deep network architectures utilize only a single input path, while this work explores the effectiveness of combining multiple inputs with different color spaces. The proposed CU-Net variants are fed with different color spaces of the input image to exploit distinct features that appear in some color channels of specific color spaces. The proposed models significantly improve the performance of the U-Net semantic segmentation model. The interconnections between the encoder and decoder paths through the attention network enrich the features and hence improve the classification results. Since each channel of the extracted feature maps works as a feature detector, we take advantage of the interrelationship between the channel features to focus on the meaningful content of an input image using the channel attention module. However, one drawback of the proposed method is that the complexity of the network increases as more inputs and their corresponding encoder paths are added. Although our proposed method could not achieve the best performance compared with other recent works, the idea of using multi-input paths with different color spaces can be added to any of the state-of-the-art models to improve their robustness against color contrast variations.
The experiments conducted using three different benchmark datasets reveal some interesting remarks, which can be summarized as follows:
• No specific color space is the best among the others. However, the combination of different color spaces improves the results of various performance metrics.
• Using channel attention modules in the interconnection between the encoder and decoder paths of the proposed CU-Net models improves the results of some evaluation metrics.
• Using image normalization improves the TPR metric, which is considered the most important of all the metrics, for most color-space combinations.
• The proposed composite loss function outperforms each individual loss function.
• The proposed TICU-Net achieves the best values for the Dice and Jaccard coefficients compared with SICU-Net and DICU-Net.

VI. CONCLUSION AND FUTURE WORK
Color contrast variations in dermoscopy images pose a significant obstacle to accurately distinguishing the infected regions from the healthy ones. To address this problem, this paper presented three new convolutional neural network models for skin lesion segmentation. Unlike other existing deep learning models, the proposed CU-Net models address the low contrast and color discrepancy problems of skin lesion segmentation. The three proposed models, single, dual, and triple input color U-Net (SICU-Net, DICU-Net, TICU-Net), are fed by multiple input images with different color spaces. The combination of multiple color spaces not only improves the segmentation results but also increases the robustness of the model to skin lesion color variations. The selection of the color space significantly affects the performance of the proposed color U-Net models. A deep convolutional network with multiple input paths enriches the extracted features and improves on the performance of its single-input counterpart. Moreover, channel-wise attention modules are used to focus on the features of interest extracted from each input path. We also utilized preprocessing by (1) applying color space transformation to choose the optimum color space, and (2) applying a simple image normalization to enhance image contrast. The performance