Enhanced Self-Perception in Mixed Reality: Egocentric Arm Segmentation and Database with Automatic Labelling

In this study, we focus on the egocentric segmentation of arms to improve self-perception in Augmented Virtuality (AV). The main contributions of this work are: i) a comprehensive survey of segmentation algorithms for AV; ii) an Egocentric Arm Segmentation Dataset (EgoArm), composed of more than 10,000 images and covering variations in skin color and gender, among other factors, for which we provide all details required for the automated generation of groundtruth and semi-synthetic images; iii) the use of deep learning for the first time for segmenting arms in AV; iv) to showcase the usefulness of this database, results on different real egocentric hand datasets, including GTEA Gaze+, EDSH, EgoHands, Ego Youtube Hands, THU-Read, TEgO, FPAB, and Ego Gesture, which allow for direct comparisons with existing approaches utilizing color or depth. Results confirm the suitability of the EgoArm dataset for this task, achieving improvements of up to 40% with respect to the original network, depending on the particular dataset. Results also suggest that, while approaches based on color or depth can work in controlled conditions (lack of occlusion, uniform lighting, only objects of interest in the near range, controlled background, etc.), egocentric segmentation based on deep learning is more robust in real AV applications.


INTRODUCTION
Most computer vision applications have traditionally focused on second or third point-of-view (POV), that is, actions that happen while interacting directly or indirectly with a camera, respectively [6]. With the advent of new wearable devices such as GoPro, Microsoft SenseCam, or even some Head Mounted Displays (HMD) useful for immersive applications, research on first-person POV or egocentric vision is attracting attention [7]. The main research lines in egocentric vision can be categorized into:
• Localization of egocentric objects, which usually involves knowing the hand position and recognizing which objects are in contact with the hands. Typical tasks here are recognition [50], detection [32], segmentation [56], tracking, and prediction [63].
• Visual lifelogging, which consists of capturing daily life experiences [8]. Video summarization of people's lives is a related area, which could be used for detecting novel or anomalous events. This line is of special relevance for people with memory loss problems [17].
In this study, we explore egocentric arm segmentation as an essential requirement for enhanced self-perception in Mixed Reality (MR) (see Fig. 1). One of the main problems of immersive environments (IE) is the so-called presence factor: the subjective experience of being in one remote place without moving from the physical place. According to Lee [31], the presence concept can be split into three components: physical, social, and self-presence. In particular, self-presence involves experiencing the representation of one's own genuine self, physically or psychologically manifested, inside a virtual environment.
Considering MR in particular, there is a different way of reaching self-perception. As stated by Milgram and Kishino [43], [49], AV is an MR subcategory of the virtuality continuum that aims to merge the reality surrounding the user (hereinafter local reality) with an IE. This means that, instead of seeing an avatar of the user's body tracking the user's movements, the user is presented with their real body immersed in the IE. The merge of a real and a virtual world can be achieved with the video see-through capabilities of the newest HMD devices such as HTC VIVE Pro, or just by attaching a local camera to the HMD. Hence, human body parts (such as hands, arms, lower body, etc.) or local objects (such as keyboards, smartphones, coffee cups, etc.) can be segmented from the see-through video and merged into the IE. According to the objects being segmented, AV could be used to: i) integrate self-presence and/or awareness of other people to prevent isolation, ii) ease interaction with local objects [41], or iii) both.
Fig. 1. We propose semantic segmentation networks to segment human body parts (in this study whole arms) to get an enhanced self-perception in AV. Left: local reality; center: segmented arms; and right: AV with egocentric arms.
The main segmentation approaches proposed in the literature for AV have been based on color or depth. However, they still show some limitations, such as very complex physical setups [41], limited field of view [29] or poor depth estimation, that prevent AV from reaching its full potential. To overcome these limitations, we explore Semantic Segmentation algorithms (hereinafter Sem-Seg) based on deep learning to segment egocentric arms. Our main motivation for focusing on whole arms and not just hands is to study this problem under real-life conditions. Indeed, arms and not just hands are easily visible when wearing an HMD. Besides, we also hypothesize that seeing your whole arms, and not just your hands, may have a positive impact on the self-presence factor of the experience. Aside from the segmentation challenges pertaining to egocentric vision, the reader should notice that arms contain additional variability factors, such as clothes or skin color, that need to be considered. The proposed work is a continuation of previous work [25], which we significantly extend with the following contributions:
• a comprehensive discussion on segmentation methods for AV, categorized into color-based, depth-based and other approaches.
• an Egocentric Arm Segmentation Dataset, composed of more than 10,000 semi-synthetic images, which is available for research purposes at https://cloud.proinnovation.es/index.php/s/tekqtneGXgrUgFD. We describe the procedure carried out to automatically generate the groundtruth mask.
• a proposal of deep segmentation networks designed to segment egocentric arms. To the best of our knowledge, we are the first to consider them for AV applications and the first to consider whole arms and not just hands.
• a comparison with former approaches based on color or depth, highlighting their pros and cons.
The rest of this article is structured as follows. Section 2 covers related works regarding AV, with an emphasis on the different algorithms that have been proposed to date to segment local reality objects. Section 3 describes the Egocentric Arm Segmentation Dataset and the whole procedure to generate semi-synthetic images while automatically obtaining the segmentation groundtruth. Section 4 presents the Sem-Seg algorithms considered to segment egocentric arms. Then, Section 5 explains the experimental protocol and test datasets considered to conduct the experiments, while Section 6 reports the segmentation results and the comparison with former segmentation approaches used for AV. Finally, Section 7 concludes the paper with some discussions.

PREVIOUS SEGMENTATION APPROACHES FOR AUGMENTED VIRTUALITY
Our aim within this section is to undertake a comprehensive and thorough review of related works regarding how to segment local reality objects. Table 1 lists the most relevant works published in the scientific literature in this regard (for information regarding Sem-Seg from the state-of-the-art, we refer to [22]). We proceed now to describe them based on the segmentation method used.

Color-based
One of the preliminary approaches for segmenting objects from local reality was the chroma-key, similar to the concept applied in television weather forecasts for decades. The idea is simple: given an input video in which the chroma-key color is present, only pixels not sharing this color are retained. Metzger et al. [42], one of the pioneers of the idea of AV, put forward the use of a blue chroma-key to select the user's hands from the local reality. Further, the authors pointed out the importance of having the space uniformly lit to obtain accurate results. Similarly, McGill et al. [41] used a green chroma-key to filter objects from the local reality (see Fig.2 A). The particular task involved typing on a keyboard in a VR environment. For this purpose, they designed a scenario with a green chroma surface where the keyboard was placed. The segmentation was performed in two stages: first, both hands and the keyboard were segmented by discarding all pixels that share the green color; then hand detection was carried out using blob detection and the help of some hand markers to recognize keyboard actions. Although results obtained with this simple method were almost perfect in terms of segmentation, the application itself is very limited if the local reality appearance is constrained to exhibit a certain chroma-key color.
Fig. 2. Examples of the different segmentation approaches proposed in the literature to segment particular objects from the local reality to be blended with VR. From left to right: i) green chroma-key [41], ii) skin detection [9], iii) and iv) depth information [1], [29] and v) edge detection and statistical classifier [18].
Focusing particularly on the hand segmentation problem, researchers also proposed the use of skin detection algorithms to segment hands from local reality [9] (see Fig.2 B). The idea behind this algorithm is the following: the local reality image is first transformed to the HSV color space, and then it is filtered so that only pixels whose values fall within a certain Hue range (µ±σ) are segmented. Although this approach improved on the green chroma-key approach in the sense that local reality is not constrained anymore, false positives may appear for scene elements with a color similar to skin, such as faces, furniture, boxes, etc. In the same work, the lower body part that could be seen from the egocentric view was also segmented with a naive floor subtraction approach. Under the assumption that the floor appearance was uniform, the body was retained by simply filtering out all pixels belonging to the floor.
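The hue-range filter described above can be sketched as follows. This is a minimal illustration, assuming the frame has already been converted to HSV with hue normalized to [0, 1]; the function name `hue_range_mask` and the values of µ and σ are illustrative placeholders, not the values used in [9].

```python
import numpy as np

def hue_range_mask(hue, mu, sigma):
    """Return True where the hue lies within mu +/- sigma.

    hue: array of hue values normalized to [0, 1].
    Hue is circular, so the distance wraps around at 1.0.
    """
    diff = np.abs(hue - mu)
    circular_dist = np.minimum(diff, 1.0 - diff)
    return circular_dist <= sigma

# Illustrative values only: a band of reddish hues around 0.03.
hue = np.array([[0.02, 0.05, 0.33],
                [0.60, 0.98, 0.10]])
mask = hue_range_mask(hue, mu=0.03, sigma=0.06)
```

Because the decision depends on hue alone, any object whose hue falls inside the band is segmented as well, which is the false-positive behavior noted above.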
Likewise, Perez et al. [46] used a YCbCr skin detection algorithm based on red chrominance, adding a transparency alpha layer to the local reality. By using less strict thresholds than those normally used for skin detection, objects with yellow and red tones with high saturation, such as food, were also segmented. This segmentation method allowed them to build a proof of concept of an Immersive Gastronomic Experience using Distributed Reality [60], a new type of Mixed Reality that involves capturing different realities (at least one remote reality in the form of 360° video and a local reality) to foster remote human communication and shared experiences.
There are some inherent limitations in color-based approaches: they require specific physical setups, where no background objects have any of the colors included in the foreground (this is especially restrictive in traditional green chroma), and they are very sensitive to illumination conditions.

Depth-based
Based on the idea of keeping only what lies within a certain depth threshold value (segmentation of a user-centered bubble), Nahon et al. [45] blended into the IE not only the user's own body but also objects from the local reality and even other people. This way, self-presence is increased, and interaction and communication with other objects or people become feasible. Likewise, Lee et al. [29] used depth information to include the user's own body in an immersive cinema experience (see Fig.2 C). They also incorporated interactivity so that the level of user embodiment was adapted to the content or to user preferences. Alaee et al. [1] also incorporated objects which were in the distance range of 10-40 cm, with the aim of interacting with the smartphone in the IE (see Fig.2 D). More recently, Rauter et al. [48] implemented the same idea while estimating depth from the stereo camera of the HTC VIVE Pro. They also performed some post-processing of the estimated foreground mask to address pixels with missing depth values.
Depth-based solutions are relatively simple to implement, due to the affordability of RGB-D sensors. However, such sensors have some limitations: on the one hand, depth estimation is noisy and prone to artifacts when handling near objects, specular materials, non-reachable areas, or shadows [44]; on the other, they have a very narrow field of view which also impairs sense of presence [29].
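As a rough sketch of the depth-based "bubble" idea discussed in this subsection, the snippet below keeps pixels whose depth falls inside a near range. The 10-40 cm default range follows the range reported for [1]; the function name `depth_range_mask` and the convention that a depth of 0 marks a missing measurement are assumptions made for illustration.

```python
import numpy as np

def depth_range_mask(depth_m, near=0.10, far=0.40):
    """Keep pixels whose depth lies in [near, far] metres.

    A depth of 0 is treated as a missing measurement (an assumption;
    the invalid-value convention depends on the sensor).
    """
    valid = depth_m > 0
    return valid & (depth_m >= near) & (depth_m <= far)

depth = np.array([[0.00, 0.05, 0.25],
                  [0.40, 0.80, 0.12]])
mask = depth_range_mask(depth)
```

Pixels with missing depth are simply dropped here, which illustrates why noisy or incomplete depth maps translate directly into holes in the foreground mask.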

Other approaches
Aside from the mainstream segmentation approaches, alternative ones have been proposed. For instance, Desai et al. [18] proposed a method to segment smartphones or tablets based on two stages: 1) edge-based object detection to select the smartphone; and 2) a statistical classifier based on attributed features that decided whether the segmented object was a smartphone (see Fig.2 E). The overall aim was to allow interactivity with these devices while being immersed. This algorithm, however, is not scalable to segment other objects.
Korsgaard et al. [28] conducted an AV experience in which the user had to interact with real food placed in front of them. The merge between both worlds was achieved through head orientation. Every time the head was oriented at a downward angle (where food is normally placed), the local reality was visible, whereas if the user looked straight ahead, the IE became visible. The main limitation of this approach is that full immersion is not achieved, but just an angle-based transition between the IE and the local reality.
Beyond using skin color, there are other attempts, not specifically designed for AV, to segment hands from an image-based point of view. Serra et al. [56] proposed a hand-crafted method for segmenting skin color based on random forest superpixel classification considering light, time and space consistency. Although it may be seen as an evolution of color-based methods, this approach would still fail to segment arms covered by clothes. There are also some attempts to detect [5] or segment hands [59] using deep learning that show the feasibility of adapting existing pre-trained models such as RefineNet or CaffeNet (a slightly modified version of AlexNet). Again, these approaches are focused on segmenting hands, and not whole arms.

EGOARM: EGOCENTRIC ARM SEGMENTATION DATASET
At the time of writing, there are no databases in the literature suitable for egocentric arm segmentation. There exist some databases that are related to egocentric hand detection or segmentation, but not to the whole arm. Therefore, we introduce the Egocentric Arm Segmentation Dataset (EgoArm), which is designed with a wide range of variations to maximize generalization capabilities. Table 2 describes the main characteristics of EgoArm, which contains more than 10,000 images. We highlight that EgoArm includes images of people with different skin color and gender.
Unlike other supervised learning approaches such as classification or regression, in which required labels or groundtruth are just text labels or a few numbers defining bounding boxes, Sem-Seg labels are images where every pixel contains a particular number accounting for the class information. The acquisition of such databases is time-consuming, which represents a major problem that has already been observed by Bandini and Zariffa [6]. To overcome this issue, we propose a semi-automatic way of labelling images (see Fig.3), composed of the following steps:

Acquisition
An Android application was developed in order to record 30 fps videos from the Samsung-S8 frontal camera while the subject is wearing the Samsung Gear VR headset with the smartphone in front of a chroma-key backdrop (see Fig.3A). Unlike other segmentation datasets, we decided to record videos at 720 × 720 in order to target the high resolution requirements of VR applications (Fig.3B). Each session is designed to record videos with a particular configuration in terms of people, scenario, outfit, and sleeve. A recorded assistant ensures that, at each session, videos from the five different arm poses are recorded.
Fig. 3. Procedure to obtain groundtruth and semi-synthetic images: through an Android app installed in the smartphone, images are recorded from the HMD perspective using a chroma-key approach. Subsequently, we applied HSV filtering to obtain the groundtruth images. With the groundtruth image, we select the relevant information from the chroma-key image that will be later combined with a background image to form the final semi-synthetic image.

HSV Filtering
With the recorded chroma-key videos (see Fig.3B), an HSV-based filter (Equation 1) is applied to obtain the foreground images (values are in the range 0-1), with the thresholds h_1, h_2 and s_1 set to 0.22, 0.45, and 0.20, respectively (values obtained by empirical testing). To prevent high similarity between images, one image is selected every 5 frames. Additionally, some morphological operations are applied to delete noisy areas (see Fig.3C).
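The following is a hedged sketch of how such a chroma-key filter could be written. The threshold values are the ones reported above, but the exact form of the rule is an assumption: here a pixel is treated as chroma background when its hue lies in [h_1, h_2] and its saturation is at least s_1, and everything else is kept as foreground. The helper names (`hue_saturation`, `foreground_mask`, `subsample`) are hypothetical, and the morphological clean-up step is not shown.

```python
import numpy as np

H1, H2, S1 = 0.22, 0.45, 0.20  # thresholds reported in the text

def hue_saturation(rgb):
    """rgb: float array (H, W, 3) in [0, 1]. Returns hue and saturation in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    maxc = rgb.max(axis=-1)
    minc = rgb.min(axis=-1)
    delta = maxc - minc
    safe = np.where(delta > 0, delta, 1.0)  # avoid division by zero on gray pixels
    is_r = (maxc == r) & (delta > 0)
    is_g = (maxc == g) & (delta > 0) & ~is_r
    is_b = (delta > 0) & ~is_r & ~is_g
    hue = np.zeros_like(maxc)
    hue = np.where(is_r, ((g - b) / safe) % 6.0, hue)
    hue = np.where(is_g, (b - r) / safe + 2.0, hue)
    hue = np.where(is_b, (r - g) / safe + 4.0, hue)
    sat = np.where(maxc > 0, delta / np.where(maxc > 0, maxc, 1.0), 0.0)
    return hue / 6.0, sat

def foreground_mask(rgb, h1=H1, h2=H2, s1=S1):
    """True on pixels that are NOT chroma-key green (assumed rule, see lead-in)."""
    hue, sat = hue_saturation(rgb)
    is_chroma = (hue >= h1) & (hue <= h2) & (sat >= s1)
    return ~is_chroma

def subsample(frames, step=5):
    """Keep one frame out of every `step` to reduce near-duplicate images."""
    return frames[::step]

# One green backdrop pixel, one skin-like pixel, one gray pixel.
frame = np.array([[[0.1, 0.8, 0.2], [0.8, 0.6, 0.5], [0.5, 0.5, 0.5]]])
mask = foreground_mask(frame)
```

The saturation test is what keeps gray or washed-out pixels, whose hue is ill-defined, from being classified as backdrop.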

Masking
Before creating the semi-synthetic image, the chroma-key image is masked with the groundtruth image to get the area of interest (in this case, the arms; see Fig.3D).

Combination
Semi-synthetic images (Fig.3F) are created by combining background images (Fig.3E) with chroma-key images (Fig.3B) masked with foreground images (Fig.3D). Natural background images are obtained from the MIT Scene Parsing Benchmark [66]. Among the whole set of 20,210 images, we select those with height = width and then resize them to 720 × 720, resulting in a subset of 3,697 different background images. These backgrounds contain indoor scenes related to houses, public spaces and commercial places, as well as outdoor scenes such as landscapes, beaches, mountains, etc. As a final post-processing step, we discarded those pairs of groundtruth and semi-synthetic images with some false positives in the groundtruth. Fig.4 shows examples of the variability of these images.
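The masking and combination steps can be sketched in a few lines. This is an illustration under simple assumptions (a binary mask and images of identical size); the function name `composite` is hypothetical, and any blending at the mask border that the actual pipeline may apply is not modeled.

```python
import numpy as np

def composite(chroma_rgb, background_rgb, mask):
    """Paste masked arm pixels from the chroma-key image onto a background.

    chroma_rgb, background_rgb: arrays of shape (H, W, 3).
    mask: boolean array of shape (H, W), True on arm pixels.
    """
    if chroma_rgb.shape != background_rgb.shape:
        raise ValueError("images must have the same shape")
    return np.where(mask[..., None], chroma_rgb, background_rgb)

chroma = np.full((2, 2, 3), 200, dtype=np.uint8)      # stand-in for Fig.3B
background = np.full((2, 2, 3), 10, dtype=np.uint8)   # stand-in for Fig.3E
mask = np.array([[True, False], [False, True]])       # stand-in for Fig.3C
semi_synthetic = composite(chroma, background, mask)
```

The same mask serves as the segmentation label for the composed image, which is what makes the labelling automatic.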

EGOCENTRIC ARM SEGMENTATION
Accurate and robust arm segmentation is vital to achieve enhanced self-perception in MR. DL-based approaches have been shown to outperform conventional approaches in diverse computer vision tasks if the training data used reflects real-world scenarios. This clearly motivates the development of a DL-based arm segmentation system, which is expected to outperform traditional approaches (see Section 2) in terms of robustness. Convolutional Neural Networks (CNN) have been shown to be the state-of-the-art for supervised classification and detection tasks [26]. CNNs are composed of different types of hidden layers: Convolutional, Rectified Linear Unit, Pooling and Fully Connected (FC). FC layers are the final layers of CNNs that, along with the classification layer, hold the output (having the same size as the number of objects to classify). In 2015, Long et al. proposed Fully Convolutional Networks (FCN): a modification of CNN architectures that reached state-of-the-art performance in Sem-Seg problems. Concretely, they replaced fully connected layers by convolutional ones to preserve the spatial dimension while keeping the class identity information [39]. Another key component, aside from the encoding subnetwork, is the decoding subnetwork, which is placed after the convolutional layers and is in charge of upsampling the class spatial map up to the original input size.

Considered Sem-Seg Networks
Our hypothesis, confirmed also by previous work [53], is that segmentation networks trained for third POV fail when segmenting from egocentric vision. Indeed, egocentric vision has the advantage that the objects tend to appear at the center of the image, but also the challenge of the camera moving with the human body, which creates fast movements and sudden illumination changes.
Due to the relatively small size of EgoArm (in comparison with datasets aimed at training architectures from scratch, such as ImageNet, PASCAL VOC, etc.), we decided to apply transfer learning from existing Sem-Seg architectures. The first Sem-Seg architecture considered was the FCN, proposed by Long et al. and originally trained for the PASCAL VOC 2011 segmentation challenge, which consists of segmenting up to 20 classes categorized into people, 6 different animals, 7 means of transport, and 6 house objects. After having empirically found the best training parameters for fine-tuning the FCN architecture with the EgoArm database, we observed that the output segmentation masks were not accurate enough for the required 720 × 720 resolution.
The next architecture that we considered was DeepLab, originally proposed in 2017 as a new architecture for Sem-Seg. In particular, among the different enhanced networks proposed since then [11], [12], [13], we consider here DeepLabv3+ [13] due to: i) the use of the ResNet pre-trained model, replacing the former VGG-16 pre-trained model and placing 4 of these blocks in cascade; it is characterized by 101 layers and the introduction of shortcut connections [54]; ii) the use of atrous convolution, using upsampled filters that allow dense feature extraction taking context into account, without increasing the number of parameters; iii) the use of the Atrous Spatial Pyramid Pooling module to robustly segment objects at multiple scales; and iv) the use of a decoder module to refine the segmentation results [4], especially along object boundaries. In general, the fact that this architecture is very deep in the encoding subnetwork, and deeper than the existing approaches in the decoding subnetwork, suggested that it could accurately segment egocentric images of high resolution. In order to understand the gain achieved with the fine-tuned network with respect to the original DeepLabv3+ model, we decided to use two different semantic segmentation networks:
• DeepLabv3+: our idea here is to use the original DeepLabv3+ to segment egocentric arms and confirm our hypothesis. This original network was trained using the PASCAL VOC database, so arms are segmented as part of the people class.
• DeepLabv3+ using EgoArm: we apply transfer learning using images from the EgoArm dataset so that the network segments two classes: arms and background. As there are more male than female subjects, and in order to have a more gender-balanced dataset, we discard 4 male subjects, leaving a total of 11,561 images.
We also considered combining EgoArm images with a subset of images from PASCAL VOC containing people. Concretely, we select 4,344 images from the entire PASCAL VOC dataset, resulting in a total of 15,905 images. As we were only interested in the people class, we relabelled all remaining pixels associated with any of the other classes as background. In this case, we have two classes: people and background, where arms are considered part of the people class. However, we did not find improvements for the egocentric segmentation tasks, as images from PASCAL VOC are acquired from a third-POV perspective.
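The relabelling of PASCAL VOC masks into people versus background can be illustrated as below. In the standard PASCAL VOC label maps the person class has index 15 and boundary pixels carry the "void" value 255; whether the void pixels were kept as an ignore label in this work is not stated, so `keep_void` is an assumption exposed as a parameter, and the function name is hypothetical.

```python
import numpy as np

VOC_PERSON = 15   # index of the "person" class in PASCAL VOC label maps
VOC_VOID = 255    # "void" label used for object boundaries in VOC masks

def to_person_vs_background(label_map, keep_void=True):
    """Collapse a VOC label map to two classes: 1 = people, 0 = background."""
    out = np.zeros_like(label_map)
    out[label_map == VOC_PERSON] = 1
    if keep_void:
        # Assumption: preserve void pixels so a loss can ignore them.
        out[label_map == VOC_VOID] = VOC_VOID
    return out

labels = np.array([[0, 15, 7],
                   [255, 15, 20]], dtype=np.uint8)
binary = to_person_vs_background(labels)
```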

EXPERIMENTAL PROTOCOL
The GTEA Gaze+ dataset, which contains 1,115 images, is used as the validation dataset [5]. The main motivation for not using a subset of EgoArm as the validation set is that we aim to validate our results on a real egocentric database. Among the public real egocentric datasets, GTEA Gaze+ is the largest one and the most similar to the arm segmentation task. It contains egocentric arms performing actions in a kitchen, with a very cluttered environment. In this dataset, the groundtruth is related to skin color, and no clothes are present in the images.
When training deep neural networks with the stochastic gradient descent algorithm, several hyperparameters need to be adjusted. An exhaustive set of experiments following grid search strategies has been conducted, monitoring validation performance on GTEA Gaze+. Training has been done using two GTX-1080 Ti GPUs with 12GB RAM each. Batch size was set to 4 due to the large size of the training images (720 × 720). The final training of the DeepLabv3+ EgoArm network was achieved using an initial learning rate of 1e-3, 2 epochs, 7500 as the maximum number of iterations for reducing the learning rate, a final learning rate of 1e-6, and a weight decay of 1e-5.
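To make the reported hyperparameters concrete, the sketch below decays the learning rate from 1e-3 to 1e-6 over 7500 iterations. The text does not state the shape of the decay; a polynomial schedule with power 0.9, a common choice for DeepLab-style training, is assumed here purely for illustration, and the function name is hypothetical.

```python
def poly_learning_rate(step, initial_lr=1e-3, final_lr=1e-6,
                       decay_steps=7500, power=0.9):
    """Polynomial decay from initial_lr to final_lr (assumed schedule shape).

    After decay_steps iterations the rate stays at final_lr.
    """
    step = min(max(step, 0), decay_steps)
    fraction = 1.0 - step / decay_steps
    return (initial_lr - final_lr) * fraction ** power + final_lr

# Learning rate at a few points of the schedule.
schedule = [poly_learning_rate(s) for s in (0, 2500, 5000, 7500, 10000)]
```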

Tests
In order to assess the generalization capabilities of our algorithm, we perform the evaluation on different public datasets (see footnote 3):
• EDSH (groundtruth related to skin color) [33]: EDSH2 and EDSH kitchen (EDSHK) are the test videos of EDSH, and contain indoor and outdoor scenes with large variations of illumination and mild camera motion induced by walking and climbing stairs, with just 1 user. They provide 104 and 197 segmentation masks for the test datasets EDSH2 and EDSHK, respectively.
• EgoHands (groundtruth related to hands) [5]: it contains 48 Google Glass videos of complex interactions between two people playing board games (one with first POV, and the other with third POV). In order to reduce redundancy and computational load, we create a subset of this dataset by selecting 10 images from each of the 48 different videos, resulting in a total of 480 images.
• Ego Youtube Hands (groundtruth related to hands) [38]: it contains 3 egocentric videos of daily activities. Among the entire set of 1,032 images, we create a subset including images showing hands and arms, resulting in a total of 689 frames.
• TEgO database (groundtruth related to skin color) [30]: in order to test the robustness against black skin color, we report results using the test set pertaining to subject B1 (who has black skin color), composed of different subsets of images under different illumination (normal and extreme) and background conditions (vanilla or in the wild).
• THU-Read (groundtruth related to skin color) [57]: initially created for egocentric action recognition from RGB-D data, it contains a subset of 650 images with egocentric actions where the user's arms appear holding different objects. Images are 640 × 480, but their original resolution is lower, so images appear pixelated.
• FPAB, First Person Action Benchmark (no groundtruth available) [23]: a dataset that provides both color and depth images from egocentric views. As its original purpose was to infer hand pose, people are wearing some reference marks on their hands. As the color and depth images were extracted from different sensors at different positions, it was very difficult to create a common groundtruth, so we do not report empirical results, but only visual examples.
• Ego Gesture (groundtruth created in this work and related to whole arms) [64]: the Ego Gesture database contains egocentric color and depth videos acquired with a RealSense SR300. It contains 83 different hand gestures from 50 different subjects and 6 different scenarios (e.g., indoors, outdoors, illumination, static clutter background, dynamic background, walking, etc.). We create a subset of 277 images by selecting approximately one image per subject and scenario. As the groundtruth of this dataset was related to the hand gesture, we manually labelled the segmentation masks of this subset of images, labelling the whole arm as groundtruth.
Notice that the groundtruth of the former datasets is related either to hands or to skin color, but not to the arm concept itself. This discrepancy is addressed in Section 6. Fig. 5 shows the heatmaps of the different datasets, to give an idea of the type of groundtruth and the average position of hands/arms in those datasets.
3. There were also other datasets available in the literature that we discarded for different reasons. For instance, the EPIC-KITCHENS dataset does not provide segmentation masks [16]; the Egocentric Gesture Recognition dataset [10] only provides segmentation masks for chroma-key hand gesture images; the Keyboard Hand Dataset (KBH, [62]) was not found available for research use.

Performance Metric
Empirical results are given in terms of the Jaccard Index, also known as Intersection over Union (IoU), defined for each class i as:

$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \quad i = 1, \dots, k$$

where k is the number of classes (in our case k = 2: arms and background), and TP_i, FP_i and FN_i denote the true positive, false positive and false negative pixels of class i. IoU is computed per class and measures the ratio between the intersection of two segmentation masks (the groundtruth and the predicted one) and their union. Due to the great imbalance of pixels belonging to arm and background per image, we report exclusively the IoU pertaining to the arm class, in the range 0-100%. As the groundtruth of the available datasets does not relate entirely to whole arms, but only to hands or skin color, the reported IoU could be underestimated, so we also report the miss rate,

$$\mathrm{MissRate} = \frac{FN}{FN + TP},$$

also in the range 0-100%.

Table 3 reports the Sem-Seg results in terms of IoU and Miss Rate for the arm class, using three different segmentation algorithms based on color or deep learning, for the test datasets reported in Section 5.1. Table 4 further indicates the IoU reached on each of the 6 scenes of EgoGesture.
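As an illustration of the two metrics from the Performance Metric subsection, the snippet below computes the arm-class IoU and miss rate for a pair of binary masks, both expressed in the range 0-100%. The function name is hypothetical, and returning NaN when a denominator is empty is a convention chosen for this sketch.

```python
import numpy as np

def arm_iou_and_miss_rate(pred, gt):
    """Arm-class IoU and miss rate (both in percent) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.count_nonzero(pred & gt)    # arm pixels correctly predicted
    fp = np.count_nonzero(pred & ~gt)   # predicted arm, labelled background
    fn = np.count_nonzero(~pred & gt)   # labelled arm, missed by the prediction
    union = tp + fp + fn
    iou = 100.0 * tp / union if union else float("nan")
    miss = 100.0 * fn / (fn + tp) if (fn + tp) else float("nan")
    return iou, miss

gt = [[1, 1, 0],
      [0, 1, 0]]
pred = [[1, 0, 0],
        [0, 1, 1]]
iou, miss = arm_iou_and_miss_rate(pred, gt)
```

Note that false positives lower the IoU but leave the miss rate untouched. This is why the miss rate is informative when the groundtruth covers only hands or skin: correctly segmented sleeves count as false positives against such a groundtruth.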

Color Performance
Color-based segmentation is applied using an HSV filtering similar to Equation 1, where Hue values are around the skin color. As can be seen from Table 3, color-based segmentation achieves similar or worse results than the original DeepLabv3+. Concretely, there is a range of absolute improvement from 10.00% to 25.00% when replacing color-based segmentation with DeepLabv3+ for the GTEA Gaze+, EDSH, THU-Read and TEgO datasets. This is to be expected, as the color-based approach relies exclusively on color to make its decision. Performance is also hindered when there are objects in the scene which share the skin color; notice the very poor performance achieved with the GTEA Gaze+ or THU-Read databases due to their yellowish or reddish scene appearance (see Fig.6 A and J, respectively). Also, results reported for the TEgO database show that relying exclusively on color is not an appropriate method for applications involving people of different ethnicities (see Fig.7). For the case of EgoGesture reported in Table 3, there are no observable differences between the color performance and the original DeepLabv3+. However, when assessing those results per scene (see Table 4), we observe that: both color and DeepLabv3+ are severely affected by extreme illumination conditions (Scene3, see Fig.8B); color-based segmentation is more robust than DeepLabv3+ in both dynamic and walking indoor scenarios, where movement can produce some blur effect (around 10.00% average absolute improvement when using color-based rather than DeepLabv3+ for Scene2 and Scene4, see Fig.8A); and DeepLabv3+ outperforms color when good illumination is available (Scene5 and Scene6) or there is a controlled background.
TABLE 3. Segmentation results in terms of Intersection over Union for different egocentric segmentation datasets. The segmentation algorithms considered are: 1) color-based; 2) original DeepLabv3+ using the person class; 3) DeepLabv3+ using the proposed Egocentric Arm Segmentation Dataset. GTEA Gaze+ is our validation dataset. The reader should bear in mind that there is a discrepancy between the available groundtruth (related to hands or skin color) and the arm concept; for this reason, Miss Rate is also reported. Bold indicates best IoU for a given dataset.

Deep performance
Concerning the behavior of the two different deep segmentation networks assessed, we observe the general superiority of the networks using EgoArm in comparison with the original DeepLabv3+ network. This observation validates our hypothesis of the convenience of adapting the network with a database more similar to the real application. We observe slight, moderate or large improvement, depending on the particular dataset. In a high level perspec-tive, this is expected since deep learning algorithms, unlike color-based segmentation, are not only considering color information for the segmentation task, but also complementary information such as shapes, texture or more abstract and complex information.
Slight improvement is observed for EgoHands (6.87% absolute improvement) and no improvement is observed for the Ego Youtube Hand Datasets. In both cases, the overall segmentation results from these two datasets are very poor due to the groundtruth being related just to hands despite the majority of images contain whole arms with or without clothes (see Fig.6 G and F). Also, in the case of EgoHands (see Fig.6 F), the majority of images present both egocentric and third POV arms. In general, third POV arms occupy a larger surface. As the networks trained with EgoArm are focused on egocentric arms, it is logically that there is not a huge improvement with respect to the original DeepLabv3+. In what concerns EgoYoutubeHands images, we assume that the low resolution of these images (384 × 216) along with the very uncontrolled and cluttered environment makes the segmentation very challenging.
Concerning results reported on EDSH2, there is also a slight gain when including EgoArm (6.50% absolute improvement). We believe this is because most of these test images contain just arms without clothes, and because such images are controlled in terms of both the background and the hands (e.g. fingers are normally very well separated). For the more uncontrolled case of EDSHK, there is a larger improvement, especially in terms of MissRate (from 22.91 to 8.10), between the original DeepLabv3+ and the one trained with EgoArm. In the egocentric images from EDSHK in particular, it is very frequent to encounter arms with clothes, which are not considered part of the groundtruth (see Fig.6 D). Therefore, part of the false positive rate is related to the clothed part of the arm.
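To make the effect of hands-only or skin-only groundtruth concrete, the following minimal sketch computes IoU, miss rate, and false positive rate from two binary masks. The function name `segmentation_metrics` and the exact metric definitions (miss rate as the share of groundtruth pixels not detected, false positive rate as the share of background pixels labelled as arm) are assumptions made for illustration; they are not necessarily the definitions used to produce the reported figures.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Return (IoU, miss rate, false positive rate) in percent for binary masks.

    Assumed definitions, for illustration only:
      IoU       = TP / (TP + FP + FN)
      miss rate = FN / (TP + FN)
      FP rate   = FP / (FP + TN)
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.count_nonzero(pred & gt)     # arm pixels correctly segmented
    fp = np.count_nonzero(pred & ~gt)    # background pixels labelled as arm
    fn = np.count_nonzero(~pred & gt)    # arm pixels that were missed
    tn = np.count_nonzero(~pred & ~gt)   # background correctly rejected
    union = tp + fp + fn
    iou = 100.0 * tp / union if union else 100.0
    miss = 100.0 * fn / (tp + fn) if (tp + fn) else 0.0
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    return iou, miss, fpr

# Toy 4 x 4 example: the groundtruth marks only the bare skin (2 x 2 block),
# while the prediction also covers the adjacent sleeve column.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :3] = True

iou, miss, fpr = segmentation_metrics(pred, gt)
print(f"IoU={iou:.2f}%  MissRate={miss:.2f}%  FPRate={fpr:.2f}%")
```

In this toy case no groundtruth pixel is missed, yet the IoU falls to about two thirds purely because the sleeve pixels count as false positives, which is the kind of penalty discussed above for groundtruth that excludes clothes.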
Moderate enhancement is observed for the GTEA Gaze+ validation dataset, THU-Read, and EgoGesture (in the range of 15.00% to 25.00% absolute improvement). As these datasets are purely egocentric, the gain achieved when including EgoArm is more noticeable (see Fig.6 A-B, Fig.6 K-L or Fig.8). Also, in the case of EgoGesture, the fact that egocentric scenes are very clean, with no objects surrounding the arms, prevents additional mistakes. Looking in more depth at the IoU reported per scene in Table 4, the gain achieved with the deep network trained on EgoArm is also observed in all scenes, and notably in outdoor scenarios, which, apart from the aforementioned nature of the scenarios, is probably due to the uniform and good illumination. Lastly, a large increase in performance is observed with the different subsets of the TEgO dataset (in the range of 20% to 40% absolute improvement). The main reason behind it is the diversity of skin colors present in EgoArm.
After visually inspecting the images, we notice that in some cases the DeepLabv3+ EgoArm network generates false positives from background items with similar colors. We deduce that the semi-synthetic images of EgoArm, which combine arms with a wide variety of natural backgrounds, do not always represent real and coherent scenes. This prevents the encoding subnetwork (focused on pixel classification) from optimizing its performance and generates false positives that are later upsampled through the decoding subnetwork and the shortcut connections (see Fig.7 A, where part of the coke is also segmented as arm, or the red items from the kitchen scenes presented in Fig.6). We believe this false positive problem can be overcome by improving the classification performance of the encoding subnetwork [37] and by a more in-depth assessment of which background types are more appropriate for targeting real AV applications.

Comparison with depth
As stated in Section 2, segmentation based on depth implies selecting all objects in the user's surroundings that are below a particular distance threshold. Here we compare the results obtained with depth against color-based and deep-based segmentation, using the subset of EgoGesture described in Section 5.1.
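As a minimal sketch of what thresholding on distance amounts to, the snippet below builds a foreground mask from a depth map. The function name `segment_by_depth`, the 0.8 m threshold, and the convention that invalid depth readings are stored as 0 or NaN are all assumptions made for illustration; real sensors and the datasets used here may encode depth differently.

```python
import numpy as np

def segment_by_depth(depth_m, max_distance_m=0.8):
    """Return a boolean mask of pixels closer than max_distance_m.

    depth_m: 2-D array of distances in metres. Pixels with no valid
    reading are assumed to be stored as 0 or NaN and are never selected.
    The 0.8 m default is a hypothetical arm's-length threshold.
    """
    depth_m = np.asarray(depth_m, dtype=float)
    valid = np.isfinite(depth_m) & (depth_m > 0)
    near = np.zeros(depth_m.shape, dtype=bool)
    # Compare only valid pixels so NaN values never enter the comparison.
    near[valid] = depth_m[valid] < max_distance_m
    return near

# Toy 2 x 3 depth map: an arm at 0.4-0.5 m, a mug at 0.6 m,
# a wall at 2.0 m, and two pixels without a valid reading.
depth = np.array([[0.4, 0.6, 2.0],
                  [0.0, 0.5, np.nan]])
mask = segment_by_depth(depth)
print(mask)
```

Note that the mug at 0.6 m is selected together with the arm: a pure distance threshold cannot tell the two apart, which is why this approach only isolates arms when nothing else lies in the near range.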
It is clearly visible from Table 4 that segmentation based on depth is more uniform across the different scenes than deep or color-based approaches. A slight drop in depth performance is shown in outdoor scenarios (see for instance Fig.8 C), possibly because the light signal (the texture projected in infrared to compute depth through disparity) is much weaker than ambient sunlight. However, despite this generally superior performance, we believe EgoGesture images do not represent real and challenging scenarios concerning depth estimation. EgoGesture depth estimation works well because the scenes used to generate the dataset avoid all the critical scenarios for RGB-D sensors [44]: hands are always within the distance range of the camera (and never closer), no other object is in that range, hands are fully visible from both infrared sensors, and they never cast shadows from the infrared emitter. Moreover, recent studies exploring deep learning to enhance depth maps (also known as depth completion) suggest that there is still considerable room for improvement in this area [36], [40], [65]. Once such improvements are achieved, depth could reliably segment near objects in AV applications.

Computation time
Given a 720 × 720 image, segmenting it with color, depth, and deep approaches takes 2.9 ms, 700 µs, and 74 ms, respectively, using a PC with an Intel Xeon E5-2620 v4 @ 2.1 GHz with 32 GB, powered by 2 GTX-1080 Ti GPUs with 12 GB RAM. Our deep implementation achieves about 15 fps, which is 4 to 6 times slower than would be desirable for a smooth AV system. However, it is within the right order of magnitude, and therefore it is just a matter of algorithm optimization and hardware improvement for the Sem-Seg approach to work in real time. In practice, it would imply either having the HMD attached to a resourceful computer or offloading computation to the edge cloud [19], [34].
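The relation between the per-frame times above and frame rate is simple arithmetic, sketched below under the assumption that frames are processed strictly one after another, with no pipelining or batching. The helper names are hypothetical, and the 60 and 90 fps targets are an assumption obtained by multiplying the quoted "about 15 fps" by 4 and 6.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second when each frame takes latency_ms, run back to back."""
    return 1000.0 / latency_ms

def frame_budget_ms(target_fps):
    """Maximum per-frame time (ms) that still sustains target_fps."""
    return 1000.0 / target_fps

# Per-frame times reported in the text for a 720 x 720 image.
latencies_ms = {"color": 2.9, "depth": 0.7, "deep": 74.0}

for name, ms in latencies_ms.items():
    print(f"{name}: {ms} ms per frame, about {fps_from_latency_ms(ms):.1f} fps")

# Assumed targets: 4x and 6x the quoted "about 15 fps".
for target_fps in (60, 90):
    budget = frame_budget_ms(target_fps)
    speedup = latencies_ms["deep"] / budget
    print(f"{target_fps} fps needs <= {budget:.1f} ms per frame, "
          f"i.e. a {speedup:.1f}x speed-up of the deep approach")
```

Under these assumptions, 74 ms per frame corresponds to roughly 13.5 fps, and the color and depth approaches already fit comfortably within the per-frame budget of either target.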

CONCLUSION
In this study, we have proposed the use of deep learning to segment egocentric human body parts, in particular arms, to enhance self-perception in AV. We have first conducted a thorough survey of existing egocentric segmentation methods, mainly based on color or depth. In order to target the requirements of first POV segmentation for AV, we have created EgoArm, an Egocentric Arm Segmentation dataset composed of more than 10,000 images including variations of gender, arm positions, clothes, indoor and outdoor settings, and skin color, along with a procedure to generate groundtruth automatically. Later, we have explored different semantic segmentation networks to target egocentric arm segmentation. We have reported results on different real egocentric datasets: GTEA Gaze+, EDSH, EgoYoutubeHands, EgoHands, FPAB, THU-Read, TEgO, and Ego Gesture, providing comparisons with color- and depth-based segmentation. Results have proven the effectiveness of EgoArm for arm segmentation, boosting the average IoU from the 25.00% reached with chroma, or from the 31.35% reached with the original DeepLabv3+ network, up to 50.00%. Besides, these segmentation networks are more robust than color-based segmentation at dealing with illumination changes, segmenting clothes or arms with different skin colors, etc.
In comparison with depth, deep-based segmentation algorithms are able to segment the desired objects exclusively, while depth is forced to segment everything below a distance threshold, which is a paradigm that may not apply to all AV applications. Besides, although this is not apparent in the EgoGesture database, there are current challenges in estimating depth that would need to be solved before using it reliably for AV.