Accurate Pixel-Wise Skin Segmentation Using Shallow Fully Convolutional Neural Network

Skin segmentation plays an important role in human activity recognition, video surveillance, hand gesture identification, face detection, human tracking and robotic surgery. Accurate segmentation of the skin is necessary to recognize human activity. Skin segmentation is easy to achieve in ideal situations where backgrounds are similar, but it becomes complicated in the presence of skin-like pixels, background illumination, and changes in the environment. Current studies address these problems by incorporating preprocessing stages, which raises the total cost of the system, and these methods remain limited in terms of accuracy and processing speed. In this work, we propose a skin semantic segmentation network (SSS-Net) that captures multi-scale contextual information and refines the segmentation results, especially along object boundaries. Moreover, our network helps to reduce the cost of preprocessing. We performed experiments on five open datasets of human activity recognition for the segmentation of skin. Experimental results show that SSS-Net outperforms the state-of-the-art methods in skin segmentation in terms of accuracy.


I. INTRODUCTION
Skin segmentation aims to detect the regions of human skin in an image. It is an important pre-processing step in various systems and applications, such as hand gesture analysis, face recognition, face tracking and detection, content-based image retrieval, etc. [1]. Skin detection is the process of identifying the pixels of a given image that correspond to human skin. Skin detection is also very helpful to humans performing complex tasks through human-computer interaction. In hand gesture recognition, for example, it helps in recognizing certain actions [2]. In recent years, with the advancements in deep neural networks, networks used for other detection tasks have been adapted as skin detection methods as well [3], [4].
Recognition of human activity is an important area and has received a great deal of attention due to the growing demands of many applications. These include, but are not limited to, identification of individual activity, interaction between multiple persons and analysis of crowd behavior. Recognition of human posture in single-person activity helps detect the nature of the activity. Nonetheless, these tasks are inherently challenging since human poses vary enormously. These problems are compounded when the activity of multiple subjects is involved. This is an area of active interest; for instance, crowd monitoring to detect antisocial behaviour is being tested and deployed extensively [5]. Activity monitoring is also being used in sports, for instance to recognize the actions of players during a tennis game [6]. Human activity recognition depends critically on accurate skin segmentation [7]. This challenge is compounded by the structural variations within a single human's limbs and body parts, which make consistent skin segmentation difficult. The problem becomes significantly more complex with multiple subjects in a frame. As a result, traditional machine learning algorithms fail to detect multiple features at one time where skin segmentation is being used. The Skin Semantic Segmentation Network (SSS-Net) presented in this research addresses these limitations of skin segmentation by capturing multi-scale contextual information and refining the segmentation results, especially along object boundaries.
The associate editor coordinating the review of this manuscript and approving it for publication was Md. Asikuzzaman.
In this paper, SSS-Net is used for skin segmentation, performing semantic labeling of pixels in a pixel-wise classification framework. The contributions of this work are:
1) The task of skin segmentation is modeled as a semantic pixel-wise segmentation problem. For this reason, an SSS-Net with a reduced number of tunable hyperparameters is considered. We believe this work will help bridge the gap between skin segmentation and semantic segmentation;
2) Low-level semantic information is preserved, and the preservation of edge information results in robust detection of skin information;
3) The proposed method provides robust skin detection;
4) A much smaller (in terms of tunable parameters) deep neural network is proposed for skin segmentation that does not require additional pre-processing steps;
5) Low computational time overhead in both the training and testing stages.
This paper is organized as follows. In Section II, relevant literature is discussed. The methodology is described in Section III. Section IV presents the experimental results. This is followed by a discussion of the proposed network in Section V. Section VI concludes the findings of this research.

II. RELATED WORK
Skin detection is used extensively for a variety of applications in image processing and visual computing. Studies on skin detection use a variety of methods, which can be divided into different categories, i.e. thresholding, traditional handcrafted features and deep neural networks. To separate the skin and non-skin areas, these procedures use different image channels. In [8], skin and non-skin areas are detected using two detectors that are based on color channels and thresholding. The thresholds and channels are dynamically selected based on an agreement maximization framework. Thresholding concentrates on selecting a certain region in a color space; if a pixel belongs to that region, it is treated as skin. However, there are several challenges involved in distinguishing skin and non-skin pixels, primarily because background objects can resemble the color of skin, which makes skin detection a very challenging task. Reference [1] proposed a method in which an eye detector is shown to improve the accuracy of skin detection regardless of variations in illumination and ethnicity. In [9], a method is introduced for handling skin-like pixels in the background. The proposed method significantly reduces the detection error, thus reducing the false detection of skin color. Interested readers are referred to [24] and [25] for more information on the selection and weighting. For the detection of skin color, multi-color spaces have been introduced for the skin color model. For instance, [11] performed dynamic skin detection using a multi-color space instead of a single color space. The proposed method improves the precision rate and reduces the error in skin color detection.
As skin detection is an important step in the pre-processing of images, [12] proposed a method that uses a clustering technique to group similar pixels in the image. The proposed method produces good results, with effective skin detection in human images irrespective of ethnicity. Moreover, the proposed method also performs well under illumination changes. Reference [26] proposed a network for improving segmentation results, especially in terms of accuracy for large-scale objects. The network uses several scales that enable it to capture detailed information with increased sensitivity. Reference [27] proposed an object detection algorithm that uses different scales and is effective in detecting small areas as well as occluded ones.
Skin detection plays a very important role in various medical applications of visual computing systems, such as those used for the detection of certain skin-related diseases. Reference [13] proposed an approach for the detection of skin regions in human images using a specific color space. The proposed approach provides promising results and shows a good detection rate. Reference [14] proposed a method for the classification of human skin pixels under varying illumination conditions and shows good results. Reference [16] presents a comparative study of two color spaces for the detection of human skin color and selects a specific threshold for detecting skin color to evaluate these color spaces. The overlap of skin and non-skin pixels is one of the constraints in the detection of human skin. To improve the accuracy of the skin detection process, [18] proposed a method based on a color space that includes the texture features of human skin. In the field of biometric security, palm-prints, whose processing depends on accurate skin detection, are used extensively over other methods. Reference [28] presented a method for the segmentation of palm-prints that achieves more accurate detection than existing methods. Reference [29] addressed a technical issue in non-contact palm-print systems by developing a system on a personal computer. Reference [30] also developed a system for the pre-processing of palm-prints in the contact-less scenario.
To handle the problem of illumination changes causing the background color to resemble skin color, [19] proposed a method that combines two techniques to improve skin detection performance. To improve skin detection, [31] proposed a method that uses a neural network for the detection of skin and body. However, current methods based on machine learning or traditional neural networks have limitations regarding performance under certain illumination conditions. To overcome this problem, deep learning based methods have been introduced. Skin segmentation enables tasks such as hand detection, which may be used in interpreting sign language. Reference [20] developed a deep learning technique for detecting hands in human images with cluttered environments. Reference [21] introduced a deep learning network with a reduced number of parameters for skin detection, which produced good results compared to state-of-the-art networks. References [7], [22], [23] introduced deep learning schemes to handle such problems, leading to better skin detection and segmentation while reducing the error rates. In this work, we propose a skin semantic segmentation network (SSS-Net) for skin segmentation that eliminates the pre-processing steps and uses a reduced number of parameters compared to existing solutions. SSS-Net is able to capture multi-scale contextual information and provides results with sharper object boundaries.

III. NETWORK ARCHITECTURE
A. DESIGNING AND LEARNING
The challenges of skin segmentation are cluttered backgrounds, objects at multiple scales, and small and deformable objects. In this work, we treat skin segmentation as a semantic segmentation problem because the two share these challenges. Therefore, for skin segmentation, we adapted the well-known DeepLabv3+ architecture, which is state-of-the-art in semantic segmentation. The DeepLabv3+ architecture is inherently suited to scenes with cluttered backgrounds. The image is first subjected to residual learning to tackle several challenges of skin segmentation, including the color similarity of foreground and background and skin reflectance variations due to illumination conditions. We start by shedding residual blocks, keeping only four residual blocks in the proposed SSS-Net. The reasons for reducing the number of residual blocks are two-fold. First, to preserve the details of small objects that are otherwise lost in the repetitive convolution process. Reducing the number of layers to preserve object and semantic information for semantic segmentation is supported by previous works [37]. Second, to reduce the number of parameters and the computational load. In order to preserve feature information, downscaling is not introduced in the residual learning process. Contextual information is of utmost importance in skin segmentation due to the deformable nature and small extent of skin regions. Therefore, we employ atrous spatial pyramid pooling (ASPP) to capture image context at multiple scales. Due to its intuitive local feature processing and subsequent fusion, ASPP has proven to be robust in detecting objects at multiple scales as well as efficient in mitigating background clutter in object recognition scenarios [38]. Spatial pyramids have successfully been employed for dense prediction tasks [39], [40] owing to the multi-scale contextual information contained in them. We note that ASPP, if not carefully designed, can miss small skin regions.
Therefore, we experimented with different dilation rates so that our filters cater to small and large skin regions alike. We found in our experiments that limiting the number of residual blocks to four preserves vital semantic information of regions, which, when passed through the specifically designed ASPP module, does not result in the loss of small objects. Table 7 shows the notable differences between our proposed model and the DeepLabv3+ architecture.

1) PROPOSED ENCODER
Table 4 presents the encoder details of SSS-Net. The SSS-Net encoder consists of a total of 4 residual blocks; each block consists of convolution layers in sequence, each followed by separate batch normalization and ReLU activation layers. Every residual block comprises two 3 × 3 convolutions, and to reduce the size of the image, each block is interleaved with a max-pooling operation. A shortcut connection is provided for each residual block, which combines the input with the result of the residual block before applying the ReLU of the second convolution of the block. This connection allows earlier layers to receive a strong gradient signal, which makes training easier for deeper networks. Figure 10 shows a residual block of SSS-Net.
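The shortcut connection described above can be illustrated with a minimal sketch. It is not the SSS-Net implementation: it works on a 1-D signal instead of 2-D feature maps, omits batch normalization and max-pooling, and the function names `conv1d_same` and `residual_block` are hypothetical. It only shows the ordering of operations, with the block input added before the final ReLU.

```python
def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def conv1d_same(signal, kernel):
    """Zero-padded 1-D convolution (cross-correlation) that keeps the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def residual_block(x, kernel1, kernel2):
    """Two convolutions with a shortcut; ReLU follows the addition (assumed ordering)."""
    h = relu(conv1d_same(x, kernel1))         # first convolution + ReLU
    h = conv1d_same(h, kernel2)               # second convolution, no activation yet
    summed = [a + b for a, b in zip(h, x)]    # shortcut: add the block input
    return relu(summed)                       # ReLU applied after the addition


if __name__ == "__main__":
    print(residual_block([1.0, -2.0, 3.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]))
```

With all-zero kernels the block reduces to a ReLU of its input, which is one way to see why the shortcut keeps a direct path for the signal and its gradient.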
Instead of simple convolution, the final residual block uses atrous convolution [41]-[45], which expands the filter's field of view. We used different dilation rates, i.e. rate=2 and rate=4, in the last two blocks. Atrous convolution controls the resolution at which feature responses are computed. In addition, atrous convolution provides a broader context without increasing the computational expense or the number of parameters.
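To make the "broader context at no extra parameter cost" point concrete, the following sketch applies the same three weights at dilation rates 1 and 2 on a 1-D signal. The function name `atrous_conv1d` and the toy signal are assumptions made for illustration; the network itself operates on 2-D feature maps.

```python
def atrous_conv1d(signal, weights, rate):
    """Zero-padded 1-D atrous (dilated) convolution with an odd number of weights.

    The taps are spaced `rate` samples apart, so a 3-tap filter spans
    3 + 2 * (rate - 1) input samples while still using only 3 weights.
    """
    half = len(weights) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + (k - half) * rate   # dilation stretches the tap offsets
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7]
    weights = [1, 0, 1]                          # same 3 parameters for both rates
    print(atrous_conv1d(signal, weights, rate=1))  # neighbours 1 sample away
    print(atrous_conv1d(signal, weights, rate=2))  # neighbours 2 samples away
```

The output length is unchanged in both cases, which mirrors the statement that atrous convolution widens the field of view without down-sampling the feature response.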
As down-sampling is not implemented at the atrous block, atrous spatial pyramid pooling (ASPP) [46], [47] is performed at the same size as the feature response. In SSS-Net, ASPP captures multi-scale contextual information and applies various dilation rates to a sequence of atrous convolutions. These rates are designed to capture longer context. In addition, ASPP integrates image-level features to add global context information. As shown in Figure 11, there are 4 parallel operations in ASPP, consisting of one 1 × 1 convolution and three 3 × 3 convolutions performed with dilation rates 4, 12 and 16. The stride we used for the feature maps is 16.
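As a rough sketch of what these branch settings imply, the snippet below lists the four parallel ASPP operations with the rates given above and computes, for each, the input span covered by the filter and the zero padding that keeps the feature-map size unchanged. The 512 × 512 input size and the helper names are assumptions chosen for illustration; they are not taken from the paper.

```python
# Branch settings from the text: one 1x1 convolution and three 3x3 atrous
# convolutions with dilation rates 4, 12 and 16; feature-map stride of 16.
ASPP_BRANCHES = [
    {"kernel": 1, "rate": 1},
    {"kernel": 3, "rate": 4},
    {"kernel": 3, "rate": 12},
    {"kernel": 3, "rate": 16},
]
OUTPUT_STRIDE = 16


def effective_extent(kernel, rate):
    """Number of input positions spanned by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (rate - 1)


def same_padding(kernel, rate):
    """Zero padding per side that keeps the spatial size unchanged (stride 1)."""
    return rate * (kernel - 1) // 2


def describe_aspp(input_hw):
    """Summarize each branch for a given (height, width) input image."""
    height, width = input_hw
    feature_map = (height // OUTPUT_STRIDE, width // OUTPUT_STRIDE)
    return [
        {
            "kernel": b["kernel"],
            "rate": b["rate"],
            "extent": effective_extent(b["kernel"], b["rate"]),
            "padding": same_padding(b["kernel"], b["rate"]),
            "feature_map": feature_map,
        }
        for b in ASPP_BRANCHES
    ]


if __name__ == "__main__":
    for row in describe_aspp((512, 512)):   # hypothetical input size
        print(row)
```

Because every branch preserves the spatial size, their outputs can be concatenated channel-wise before being passed on to the decoder.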

2) DECODER NETWORK
The decoder of SSS-Net uses a transposed convolution layer to upsample the features coming from the encoder, producing a high-resolution output from a low-resolution input. This is followed by concatenation with the low-level network features of the same resolution. A 1 × 1 convolution with 256 filters is applied to these low-level features to reduce the number of channels, as the low-level features usually have a large number of channels, which makes training the network harder. After concatenation, the features are refined and then upsampled again by a factor of 4 using simple bilinear upsampling. Table 5 presents the decoder details of SSS-Net. The diagram of SSS-Net is shown in Figure 1.

A. EXPERIMENTAL DATA AND ENVIRONMENT
SSS-Net was tested for skin segmentation using five datasets of human activity recognition that are publicly available [8].
The following datasets are used for the task of skin segmentation in this paper:
1) Augmented multi-party interaction (AMI)
2) In-house (SSG)
3) Event detection (EDds)
4) UT-interaction (UT)
5) Laboratoire d'informatique en image et systèmes d'information (LIRIS)
These five datasets contain only a few training images. Therefore, we used data augmentation to artificially increase the amount of training data. Table 2 provides a detailed description of these datasets. Figures 2 to 6 show examples of segmentation results that are predicted correctly by our network for all the datasets. Here, the red color represents the segmented skin area. As shown in these figures, our model is able to detect the skin area correctly in images from datasets that include indoor and outdoor scenes. Some relatively poor examples of segmentation by our network are presented in Figure 7. The network does not give good results for skin areas near hair and beards. Skin-like background pixels are the main cause of false positive errors, while some unfamiliar skin pixels lead to false negative errors. We also compared the performance of SSS-Net with DeepLabv3+ on the EdDs dataset; the results are presented in Table 6, and the visual results are presented in Figure 8. Table 3 compares the segmentation results of our method with other methods on the five datasets. The proposed network was trained and tested on a desktop computer with an Intel(R) Xeon(R) W-2133 CPU at 3.60 GHz, 32 GB RAM, and an Nvidia 2080TI GPU.
TABLE 2. Details of human activity dataset for the evaluation of SSS-Net.

B. DATA AUGMENTATION
An augmentation scheme is used for the training data of the proposed skin semantic segmentation network (SSS-Net), which includes:
1) Rotation
2) Contrast enhancement
In deep neural networks, training depends on the size of the input data, and effective training requires a large amount of data. When the training set is small, the parameters are uncertain and the network is insufficiently trained, which seriously affects its performance. One way to alleviate this limitation is data augmentation, which increases the data size. In this paper, we kept the image size the same as in the dataset. We applied image rotation to these images to generate synthetic images from the original training images. Each image is rotated in steps of 1 degree from 0 to 360 degrees. In this way, we obtained 360 rotated images for each image, and a total of 7,200 images after rotation. To eliminate artifacts in rotated images, we first converted binary images into logical images, then used bi-cubic interpolation when we rotated these images. After rotation, we used contrast enhancement and generated over 1800 images with different contrasts. The data are thus expanded from 20 images to 9000 images. Table 8 presents the data augmentation details.
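The augmentation counts quoted above can be checked with a few lines of arithmetic. The sketch assumes the rotation angles are 0, 1, ..., 359 degrees (360 distinct angles per image) and takes the contrast-enhanced count as exactly 1800, the lower bound stated in the text; the variable names are illustrative only.

```python
def rotation_angles(step_deg=1):
    """Rotation angles in degrees: 0, 1, ..., 359 (assumed; 360 would repeat 0)."""
    return list(range(0, 360, step_deg))


ORIGINAL_IMAGES = 20      # training images before augmentation (from the text)
CONTRAST_IMAGES = 1800    # "over 1800" in the text; treated here as a lower bound

rotated_total = ORIGINAL_IMAGES * len(rotation_angles())
augmented_total = rotated_total + CONTRAST_IMAGES

if __name__ == "__main__":
    print("rotated images:", rotated_total)
    print("total after contrast enhancement:", augmented_total)
```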

C. NETWORK TRAINING
For training SSS-Net, we provided the images to the network without any pre-processing. Among optimizers, a very popular technique used with stochastic gradient descent (SGD) is momentum. Momentum not only uses the gradient of the current step to guide the search, but also accumulates the gradients of past steps to determine the direction of progress. Adam, in contrast, is an adaptive learning rate method that calculates individual learning rates for the various parameters. SGD with momentum appears to find flatter minima than Adam, whereas adaptive methods tend to converge relatively faster to sharper minima. Flatter minima generalize better than sharper minima. Although adaptive optimizers have better training performance, this does not mean higher accuracy on different data. Therefore, SGD with momentum is the most popular deep network optimizer [48]. In this study, we used stochastic gradient descent with momentum (SGDM), with a momentum of 0.9 and an initial learning rate of 1e-3. We used L2 regularization with a weight decay of 0.0005 for training SSS-Net. Our network was trained for 40 epochs with a mini-batch size of 5 images, and the training data were shuffled after each epoch, which accelerates convergence. The learning-rate decay and mini-batch size were determined empirically to minimize the cross-validation loss, and the weights and biases of ResNet18 were used to initialize the proposed network. As we also compared the performance of SSS-Net with DeepLabv3+ on the EdDs dataset, we report that training on the EdDs dataset took 308 minutes for 40 epochs, while testing took 1.5 seconds on CPU and 300 ms on GPU. The details of the training stage are presented in Table 9.
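For readers unfamiliar with the update rule, the following is a minimal sketch of one common formulation of SGD with momentum and L2 weight decay, using the hyperparameters reported above (momentum 0.9, learning rate 1e-3, weight decay 0.0005). The function `sgdm_step` is hypothetical, and the exact update used by the training framework may differ in detail.

```python
def sgdm_step(weights, grads, velocity, lr=1e-3, momentum=0.9, weight_decay=5e-4):
    """One SGD-with-momentum update on flat lists of parameters.

    L2 regularization is folded into the gradient as weight_decay * w, and the
    velocity accumulates past steps so that earlier gradients keep influencing
    the direction of progress.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w           # loss gradient + L2 penalty gradient
        v_next = momentum * v - lr * g_total     # blend past velocity with current step
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = [1.0], [0.0]
    for step in range(3):
        w, v = sgdm_step(w, [0.5], v)            # constant toy gradient
        print(step, w, v)
```

With a constant gradient the step size grows from one iteration to the next, which is the accumulation effect that the momentum term is meant to provide.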
Cross-entropy loss is used as the objective function for network training. This objective function is driven by probabilities, where p stands for the probability. When the obtained estimate for a pixel deviates from the actual desired class, the probability p assigned to the desired class falls toward 0 and the loss grows; the total loss is the combined loss of all the pixels. Inherently, the "non-skin" pixels in each human activity image outnumber the "skin" pixels for the task of skin segmentation. This large imbalance in the number of pixels among classes can lead to critical problems when using the cross-entropy loss as the objective function for network training. This problem can be overcome by class balancing, in which a weight is associated with each class in the loss function. Consequently, classes with high frequency have low weights and classes with low frequency have high weights. Numerous approaches to assigning these weights can be followed. In the considered approach, the class weights were calculated using frequency balancing for the training of the SSS-Net architecture. The weight of each class is obtained by dividing the median of the class frequencies by the frequency of that class, computed over the complete training set.
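The frequency-balancing scheme can be sketched as follows. This is an illustrative reading of the description, not the authors' code: it assumes each class frequency is simply that class's share of all training pixels, and the pixel counts in the usage example are invented.

```python
import math
from statistics import median


def median_frequency_weights(pixel_counts):
    """Class weights = median class frequency / class frequency.

    `pixel_counts` maps class name -> number of pixels in the training set.
    Frequent classes receive weights below 1, rare classes weights above 1.
    """
    total = sum(pixel_counts.values())
    freqs = {c: n / total for c, n in pixel_counts.items()}
    med = median(freqs.values())
    return {c: med / f for c, f in freqs.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean class-weighted cross-entropy over pixels.

    `probs[i]` maps class name -> predicted probability for pixel i and
    `labels[i]` is the true class of pixel i.
    """
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    counts = {"skin": 100, "non_skin": 900}      # hypothetical pixel counts
    w = median_frequency_weights(counts)
    print(w)
    probs = [{"skin": 0.5, "non_skin": 0.5}, {"skin": 0.1, "non_skin": 0.9}]
    print(weighted_cross_entropy(probs, ["skin", "non_skin"], w))
```

In the two-class case the median is the mean of the two frequencies, so the rarer "skin" class always receives the larger weight.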

D. NETWORK TESTING
1) EVALUATION METRICS
For testing SSS-Net, we considered several assessment metrics: recall (R), precision (P) and F-measure (F). The formulas for these metrics are given below:
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
$F = \frac{2 \times P \times R}{P + R}$
where TP represents true positives, FP represents false positives, and FN represents false negatives. Here, FN are the pixels that are skin in the ground truth images but are predicted by the network as non-skin. FP are the pixels that are non-skin in the ground truth images but are wrongly predicted as skin, and TP are the correctly predicted skin pixels.
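A minimal sketch of how these metrics can be computed from a predicted mask and a ground-truth mask is given below, using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F = 2PR / (P + R). The masks are flattened lists with 1 for skin and 0 for non-skin; the function names and the toy masks are assumptions for illustration.

```python
def confusion_counts(pred, truth):
    """Count TP, FP and FN for the skin class (1 = skin, 0 = non-skin)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1]     # toy predicted mask
    truth = [1, 0, 1, 0, 1]    # toy ground-truth mask
    print(precision_recall_f(*confusion_counts(pred, truth)))
```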

V. DISCUSSION
In this paper, we proposed a skin semantic segmentation network (SSS-Net) for the pixel-wise skin segmentation of input images. To improve the network efficiency, the number of layers has been reduced at the encoder level. As skin segmentation is very important for human activity recognition, it must be performed accurately. SSS-Net is able to capture multi-scale contextual information and controls signal degradation. In our network, we employ the ResNet-18 architecture with only 4 residual blocks. For capturing multi-scale contextual information, ASPP (Atrous Spatial Pyramid Pooling) is used in the model. ASPP applies various dilation rates to a sequence of atrous convolutions. These rates are designed to capture longer context. In addition, ASPP integrates image-level features to add global context information. Skin segmentation is more challenging because of the indoor and outdoor image scenes in the datasets. To measure the network efficiency, SSS-Net is evaluated on five open datasets of human activity recognition, using P, R and F as the evaluation metrics. Our network performs very well on these datasets. Experimental results demonstrate the effectiveness of the techniques in our network, showing that it outperforms the state-of-the-art methods, as shown in Table 3.

VI. CONCLUSION
This paper proposed SSS-Net for skin segmentation, which is able to capture multi-scale contextual information and provide results with refined edge boundaries. SSS-Net has fewer layers, which results in a reduced number of parameters, i.e. 7.3 M, which is significantly lower than in other existing networks. Furthermore, SSS-Net does not require any additional pre-processing steps. The uniqueness of this network is its ability to capture multi-scale contextual information. We tested SSS-Net for skin segmentation on the publicly available datasets of human activity recognition (AMI, SSG, EdDs, UT and LIRIS). Since these datasets contain few images, we adopted data augmentation techniques to increase the number of training images. The obtained results show high-quality segmentation, indicating the effectiveness of SSS-Net for skin segmentation.