Selfie Segmentation in Video Using N-Frames Ensemble

Many camera apps and online video conferencing solutions support instant selfie segmentation or a virtual background function for entertainment, aesthetic, privacy, and security reasons. A good number of studies show that a Deep-Learning-based Segmentation Model (DSM) is a reasonable choice for selfie segmentation and that an ensemble of multiple DSMs can improve the precision of the segmentation result. However, these approaches do not fit well when applied directly to image segmentation in a video. This paper proposes an N-Frames (NF) ensemble approach for selfie segmentation in a video using an ensemble of multiple DSMs to achieve high-performance automatic segmentation. The proposed NF ensemble approach executes only one segmentation model on the current video frame and combines the segmentation results of previous frames to produce the final result. For the experiment, we use four state-of-the-art image segmentation models and a dataset of 81 videos with a single-person view collected from publicly available websites. This paper calculates Intersection over Union (IoU), IoU standard deviation, false prediction rate, Memory Efficiency Rate, and Computing power Efficiency Rate to measure the performance of the segmentation models. The average IoU value of the Two-Frames NF ensemble was 95.1253% and that of the Three-Frames NF ensemble was 95.1734%, whereas the average IoU value of the single models was 92.9653%. The result shows that the proposed approach improves the accuracy of selfie segmentation by more than 2% on average. The cost-efficiency measurement shows that the proposed method consumes as little computing power as single models. To sum up, the proposed approach is as fast as single selfie segmentation models while producing the optimized, improved segmentation results of an ensemble of multiple segmentation models.


I. INTRODUCTION
The increased use of mobile phones has made "selfies" (self-portrait photographs) popular. Popular camera apps support automatic background change of selfie photos, and many online video conferencing solutions support real-time background change functions for aesthetic, privacy, and security reasons. Organizations are using these solutions more and more since many people are working from home because of the COVID-19 pandemic. Selfie segmentation is a crucial technology for enabling such functions.
Selfie segmentation is a subset of image segmentation. Many image segmentation methods have already been studied and developed. Many of them use traditional techniques, such as thresholding, watershed, region growing, clustering using contours and edges, graph cuts, and Markov random fields [1], [2]. A segmented area should be homogeneous and uniform for good image segmentation, but this remains a challenging task [3]. Recently, the Deep-Learning (DL) based Segmentation Model (DSM) has opened a new era of image segmentation [4]. DSMs have made remarkable improvements in speed and accuracy for image segmentation compared to traditional methods [5]-[9]. These models predict segmentation regions using semantic labels for every pixel of the image [10].
In general, an ensemble of multiple methods can improve the performance of image segmentation. The key idea of the ensemble approach is to combine various models to create a more effective model [11]. The collective decision produced by the ensemble can reduce the generalization errors of prediction and improve accuracy [12], [13]. Many studies show that a combination of multiple segmentations can produce better predictive performance than an individual segmentation model [14]-[16].
Video Instance Segmentation (VIS) is a task that classifies objects into pre-defined classes, traces them throughout a video, and segments and localizes the classified objects within the video. Extending instance image segmentation to the video domain is a natural step as the use of video applications increases. Selfie segmentation in a video is a subset of VIS; it recognizes and extracts the human body, normally the upper body, from continuous video frames [17], [18].

A. MOTIVATION
The use of selfie segmentation is increasing these days. However, high-performance, real-time selfie segmentation in a video remains a challenging problem even with recent developments in image segmentation and matting using DSMs [5], [18]-[23]. There are several studies on selfie segmentation using single DSMs. However, these studies focus on a single image; they do not fit image segmentation in a video well, considering the characteristics of a video such as sudden noises or the continuity of adjacent frames. Several studies on the ensemble of multiple DSMs show better image segmentation accuracy than single DSMs. However, the ensemble approach is not satisfactory when directly applied to selfie segmentation in a video because its segmentation speed is slower than that of a single DSM.

B. OUR METHOD
To address these issues, this paper proposes a novel approach for selfie segmentation in a video using an N-Frames ensemble of multiple DSMs to achieve high-performance automatic segmentation. The proposed method is not limited to selfie segmentation but can also be applied to segment other objects in a video. The characteristics of the proposed ensemble approach are as follows:
• It is as fast as a single DSM.
• It can generate optimized results using an ensemble of multiple DSMs.
• It is robust to sudden noises in a video.
• It is suitable for image segmentation of slowly moving objects, such as people, in a video.

C. CONTRIBUTIONS
The following are the major contributions of this study on selfie segmentation in a video.
• A novel N-Frames ensemble method for selfie segmentation in a video is proposed. The experiment result shows that the proposed method is as efficient as a single DSM and satisfactorily resolves the aforementioned challenges of selfie segmentation in a video. To the best of our knowledge, no previous work has used the same approach for selfie segmentation in a video.
• This paper compares four state-of-the-art DSMs, the ensemble of multiple DSMs, and the proposed N-Frames ensemble to measure the performance of selfie segmentation in a video. The DSMs tested in this work were developed recently and show state-of-the-art performance in image segmentation. To the best of our knowledge, no previous work uses the ensemble of these DSMs for selfie segmentation in a video.
• A quantitative experiment with approximately 40,000 test images from 81 videos is conducted to evaluate the segmentation results. The sample size is an important factor in ensuring the precision of the experiment, and the test dataset used in this work is large enough to assure the accuracy of the experiment. Among the related works we surveyed, it is rare to use such a large test dataset for an image segmentation experiment.

D. STRUCTURE
Section II explains the related works on DSMs and human segmentation in a video. Section III introduces our ensemble approach for selfie segmentation in a video. Section IV presents the experiment results and their analysis. Section V presents the overall discussion and insights. Section VI concludes, covering the summary of the experiments and the contributions of this paper.

II. RELATED WORKS
This section explores the detail of related works for image segmentation in a video.

A. IMAGE SEGMENTATION USING DEEP-LEARNING
Long et al. [24] proposed the Fully Convolutional Network (FCN), which has shown good performance in semantic segmentation. Adam Paszke et al. [25] proposed the ENet (efficient neural network) model. It has an optimized deep neural network architecture to perform real-time semantic segmentation on GPU hardware. Huang et al. [26] and Simon Jegou et al. [27] proposed a dense convolutional network (DenseNet) with several densely connected blocks, which has shown excellent results on image classification tasks. The DeepLab [28] architecture uses atrous convolutions with upsampled filters for dense feature extraction and uses a CRF to obtain better localization, especially along object edges. Chen et al. [29] proposed a network architecture called DeepLabv3. It adopts dilated (atrous) convolution for the downsampling layer and upsampled filters for dense feature-map extraction and long-range context capturing. MobileNets [2] is an efficient and lightweight segmentation model for mobile and embedded systems. Mark Sandler et al. [30] introduced MobileNetV2, an improved version of MobileNets. It is based on an inverted residual structure that connects thin bottleneck layers, and it improves the state-of-the-art performance on multiple tasks. Andrew Howard et al. [31] proposed the MobileNetV3-Large and MobileNetV3-Small models using hardware-aware Network Architecture Search and the NetAdapt algorithm, developed to achieve the best semantic segmentation on mobile devices. Due to the performance and efficiency of MobileNetV2 and MobileNetV3, these models are used as backbones with other networks such as DeepLabv3, U-Net, and LR-ASPP for semantic segmentation. Lingyu Zhu et al. [32] proposed an end-to-end portrait segmentation architecture with unique cross-granularity categorical attention and boundary enhancement mechanisms in a unified framework. Song-Hai Zhang et al. [33] proposed a real-time portrait segmentation model called PortraitNet for mobile devices.
The model includes two modules, an encoder and a decoder. PortraitNet utilizes MobileNetV2 as the backbone of the encoder module and a U-shape architecture as the decoder. Sachin Mehta [34] introduced the fast and efficient ESPNet, based on an efficient spatial pyramid (ESP) architecture. It can efficiently perform semantic segmentation of high-resolution images with limited computation, memory, and power. ESPNetv2 [35] is a lightweight architecture for semantic segmentation that can be easily deployed on edge devices. Hyojin Park et al. [36] introduced an extremely lightweight portrait segmentation model, SINet, whose information blocking decoder measures a confidence score and blocks the flow of irrelevant information in the decoder.
Compared to traditional image segmentation methods, these DSMs achieved remarkable performance improvements [5]-[9]. Some studies showed that DSMs can be utilized for selfie segmentation. Warfield et al. introduced an algorithm to combine multiple segmentations and validate image segmentation performance [37]. Rohlfing et al. introduced a shape-based averaging method to combine multiple segmentations and compared it to other ensemble methods [38]. Andrew Holliday et al. applied a compressed-model technique for DL ensembles to the problem of semantic segmentation and achieved real-time speed [39]. D. Marmanis et al. applied an ensemble of multiple DL models using the Fully Convolutional Network (FCN) and achieved excellent segmentation results [40]. Y.-W. Kim et al. proposed an ensemble of multiple heterogeneous DSMs for portrait segmentation and analyzed its efficiency, showing that some combinations of DSMs can achieve higher accuracy than single models while using low memory and computing power [41], [42].

B. IMAGE SEGMENTATION USING ENSEMBLE
In general, an ensemble model can generate optimized results using the combination of multiple machine-learning models or deep-learning models. Several studies applied the ensemble method using multiple DSMs to the image segmentation domain and achieved better performance compared to single DSMs. These works showed that the ensemble of various DSMs can be a reasonable choice for image segmentation applications.

C. IMAGE SEGMENTATION IN VIDEO
J. Li et al. [43] proposed a novel approach to segment an object in a video using a proposal-driven framework. The authors adopted the ResNet model for the proposal and the PSPNet model for object segmentation. Yifan Liu et al. [44] developed a real-time video segmentation method that considers accuracy and temporal consistency. The proposed method conducts per-frame inference using compact networks. The authors included PSPNet18, MobileNetV2, and a lightweight HRNet to verify that the proposed methods can improve segmentation accuracy and temporal consistency without extra computation or post-processing during inference. Mingyu Ding et al. [45] proposed a novel framework for the joint estimation of semantic video segmentation and optical flow. The authors used the original PSPNet and a modified FlowNetS as the baseline networks. Shan Lin et al. [46] proposed a Multi-frame Feature Aggregation (MFFA) module to improve instrument segmentation. It uses temporal and spatial relationships between frame pixels for feature aggregation. The authors applied the proposed approach to real-time instrument segmentation using DeepLabV3+ with ResNet50 and MobileNet as backbone feature extractors. Federico et al. [47] combined offline and online learning approaches; the method segments a particular object instance in a video given one or a few segmentation masks. Several articles studied video portrait segmentation, such as "Automatic Real-time Background Cut for Portrait Videos" by Xiaoyong Shen et al. [48], which segments the person from the background in a video using a global background attenuation model. Monica Gruosso et al. [8] showed the possibility of automatic human recognition and segmentation in a surveillance video system. The authors used the SegNet [49] encoder-decoder Convolutional Neural Network (CNN) model for the experiment. Tairan Zhang et al.
[50] proposed a real-time single-person segmentation framework for video. The framework combines a CNN model and a tracking system using a level-set algorithm. The CNN model obtains a human segmentation result from a specific frame in a video and passes it to the tracking system to capture the human segmentation of the remaining frames.
Extending instance image segmentation to the video domain is a natural step as the demand for image segmentation in a video increases. There are several studies on image segmentation in a video. These studies showed that DSMs can be used for image segmentation in a video, with the deep-learning models optimized for the characteristics of a video. However, the ensemble of multiple DSMs for selfie segmentation in a video is an area that needs further study.
It is evident that DSMs are a reasonable choice for selfie segmentation in a video and that the ensemble of multiple DSMs can improve the precision of the segmentation result. However, these approaches do not fit well when applied directly to image segmentation in a video. For the same reason, many researchers have developed various DSMs with architectures optimized for image segmentation in a video. Considering the advantage of the ensemble model, it is an appropriate attempt to utilize the ensemble of multiple DSMs for image segmentation in a video. This paper proposes a novel ensemble called the "N-Frames ensemble" for selfie segmentation in a video to address the issues that occur when single DSMs or the ensemble of multiple DSMs are applied to image segmentation in a video.

III. APPROACH
This section describes the proposed approach and the experiment environment in detail.

A. ENSEMBLE APPROACH
A single-model segmentation uses only one model to produce the segmented outputs for the input images. An ensemble is a machine-learning technique that incorporates several single models to obtain an optimized prediction result. In general, the ensemble method takes one of two forms: combining heterogeneous models trained on the same dataset, or combining homogeneous models trained on different datasets. The diversity of models is generally believed to be one of the critical performance factors in an ensemble [51]. In this paper, we use pre-trained heterogeneous models for the ensemble.

Fig. 1 shows the conceptual diagram for selfie segmentation in a video using the N-Models (NM) ensemble approach. All segmentation models generate output results for every single frame in a video, and the output results are combined using the ensemble method. In more detail, at a given time t, a video has a Frame_t, and Frame_t is fed to segmentation models Model 1 to Model n to generate segmented outputs Mask 1 to Mask n. Finally, the masks are combined to make the optimized output Mask_t, as shown in Fig. 1.

This paper proposes a novel ensemble approach for selfie segmentation in a video, which we call an N-Frames (NF) ensemble. Fig. 2 shows the conceptual diagram for selfie segmentation in a video using the NF ensemble approach. The NF ensemble method uses a single segmentation model to generate an output result for a single frame in a video. It then combines the output results of previous frames with the current output result using the ensemble method. In detail, at a given time t, a video has Frame_t, and Frame_t is fed to segmentation model n to generate the n-th segmented output mask. The NF ensemble method combines this output mask with the output masks generated from the previous segmentation models n-1, …, 1 to generate the final Mask_t. It rotates each segmentation model in a round-robin way.
For example, if there are n models in total for the ensemble and model n was used for Frame_t at a given time t, the NF ensemble uses model 1 for the next Frame_t+1. It combines the segmented output mask of model 1 with the previous output masks generated from segmentation models n, …, 2 to create the final output Mask_t+1. The benefit of this approach is that only one segmentation model needs to make a segmentation mask for the current Frame_t in a video at a given time t. The advantages of the NF ensemble are as follows:
• The NF ensemble can proceed as fast as a single segmentation model. It requires only one segmentation model at a given time t and combines the segmented output of the current segmentation model with the segmented results of previous segmentation models. The speed of segmentation is the mean speed of all segmentation models involved in the ensemble.
• The NF ensemble enjoys the advantage of the ensemble approach. It combines the segmented results of multiple segmentation models to generate optimized output results for every video frame.
• The NF ensemble is robust to sudden noises in a video. It merges the segmented results of multiple frames, and the effect of fusing multiple frames reduces sudden noises.
• The NF ensemble is suitable for the segmentation of the human body or a slowly moving object in a video. For example, the movement of a human is almost frozen when observed at the millisecond level. Considering a video of 30 FPS, the time slice of each frame is approximately 33 milliseconds, and the difference in human movement between frames is almost negligible.
The disadvantages of the NF ensemble are as follows:
• The NF ensemble requires as much memory space as the total number of segmentation models used in the ensemble.
• The NF ensemble is not suitable for segmenting a video that contains fast-moving objects. As it merges the segmented results of multiple frames, combining numerous segmented results always has a chance of producing a blurring effect. Therefore, if the difference between segmented objects in video frames is high, it will degrade the quality of the final segmentation result.
• The NF ensemble is not suitable for the segmentation of a low-FPS video. Frames of a low-FPS video have a longer time gap and are likely to have a higher difference between frames than those of a high-FPS video.
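The round-robin NF procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the class and method names are illustrative (not the paper's implementation), each model is assumed to expose a `segment(frame)` call returning a 2-D probability mask, and fusion uses per-pixel simple soft voting over the stored masks.

```python
from collections import deque

class NFEnsemble:
    """Minimal N-Frames ensemble sketch. Each model is assumed to expose
    segment(frame) -> 2-D list of foreground probabilities in [0, 1];
    the class and method names are illustrative, not from the paper."""

    def __init__(self, models):
        self.models = models                     # heterogeneous pre-trained DSMs
        self.masks = deque(maxlen=len(models))   # outputs of the last N frames
        self.t = 0

    def process(self, frame):
        # Round-robin: exactly ONE model runs on the current frame.
        model = self.models[self.t % len(self.models)]
        self.masks.append(model.segment(frame))
        self.t += 1
        # Simple soft voting over the masks kept from adjacent frames.
        n = len(self.masks)
        rows, cols = len(self.masks[0]), len(self.masks[0][0])
        return [[1 if sum(m[r][c] for m in self.masks) / n >= 0.5 else 0
                 for c in range(cols)] for r in range(rows)]
```

At each frame only one model's forward pass is paid for, so the per-frame cost matches a single DSM, while the returned mask still fuses up to N model outputs from adjacent frames.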
In contrast to ordinary learning approaches using a single model, an ensemble approach combines the results of the first-level learners and generates the final result from a second-level learner. Among many ensemble methods, averaging and voting are the most commonly used. There are different voting methods, such as majority voting, plurality voting, weighted voting, simple soft voting, and weighted soft voting [51]-[53]. In this paper, a simple soft voting method is used for the ensemble. It treats the individual classifiers equally and averages their outputs. In a set of T individual classifiers {h_1, …, h_T}, classifier h_i generates an l-dimensional vector (h_i^1(x), …, h_i^l(x))^T for an instance x, where h_i^j(x) ∈ [0, 1]. Simple soft voting is then given by (1):

H^j(x) = (1/T) Σ_{i=1}^{T} h_i^j(x)    (1)

In general, a round-robin selection of T models can be presented by (2), where r(t) is the index of the model used at time t:

r(t) = ((t − 1) mod T) + 1    (2)

Equation (3) denotes the round-robin method for a different time frame t − k:

r(t − k) = ((t − k − 1) mod T) + 1, for k = 0, …, T − 1    (3)

where T > 1 and t ≥ T at a given time t.
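Simple soft voting as used here can be illustrated with a minimal pure-Python sketch; the function name and the 0.5 threshold are our assumptions, since the paper does not publish code.

```python
def simple_soft_voting(prob_masks, threshold=0.5):
    """Average T same-shaped 2-D probability masks per pixel,
    H(x) = (1/T) * sum_i h_i(x), then threshold to a binary mask."""
    T = len(prob_masks)
    rows, cols = len(prob_masks[0]), len(prob_masks[0][0])
    return [[1 if sum(m[r][c] for m in prob_masks) / T >= threshold else 0
             for c in range(cols)] for r in range(rows)]
```

Because every classifier contributes with equal weight, a pixel is labeled foreground only when the averaged confidence of all models crosses the threshold, which is what smooths out the disagreement of individual models.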

B. PERFORMANCE MEASUREMENT
The Intersection over Union (IoU) is a metric used to measure the accuracy of image segmentation. IoU is defined as in (4):

IoU = |A ∩ B| / |A ∪ B|    (4)

where A is an image segmentation result, B is a ground-truth image, A ∩ B is the intersection of A and B, and A ∪ B is the union of A and B. The IoU standard deviation is used to measure the prediction variance. The False Negative Rate (FNR) and the False Discovery Rate (FDR) are used for validating the accuracy measurement. FNR measures regions predicted smaller than the ground truth; FDR measures regions predicted larger than the ground truth. FNR and FDR are defined as in (5) and (6):

FNR = FN / (FN + TP)    (5)
FDR = FP / (FP + TP)    (6)

where FN is False Negative, TP is True Positive, and FP is False Positive. The bias error is the amount of difference between the ground truth and the prediction; in general, the ensemble of multiple models can reduce bias. In this paper, FNR+FDR is used to measure the bias error, and |FNR-FDR| is used to measure the variance of prediction. The Memory Efficiency Ratio (MER) and the Computing Efficiency Ratio (CER) are used to measure the cost efficiency of segmentation models. MER indicates the memory efficiency of a model. Since an efficiency ratio can be denoted as cost over gain, MER measures the required memory size to gain accuracy, and it is calculated as in (7).

MER = M / IoU    (7)
where M is the number of parameters and IoU is the accuracy of a given model. CER indicates the computing power efficiency of a model. CER measures required computing power to gain accuracy. CER is calculated using (8) as below: where C is Floating-Point Operations (FLOP), and IoU is the accuracy of a given model.  [33]. It has 224x224 input/output resolution. SN is a pre-trained model of SINet [36]. It has 320x320 input/output resolution. Table I shows the backbone of each model and the accuracy comparison using the verification dataset EG1800+CDI. We experimented with these four pre-trained selfie DSMs to segment the body of a human from video dataset. The tested video dataset was collected from publicly available websites. The dataset is a collection of captured frames from 81 videos having a single person view. The resolution of videos is 480x360 pixels, the duration is 10 seconds ~ 20 seconds, and the frame rate of videos is in the range of 25 to 30 frames per second (FPS). Finally, the test dataset consists of around 40,000 frame images generated from the above mentioned 81 videos. Single models, NM ensemble and NF ensemble models are evaluated using the test dataset. We experimented with all ensembles of four single models. The simple soft voting is the combining method for the ensemble of single models. The experiment was conducted using Python with Keras, PyTorch, OpenCV and TensorFlow libraries on Ubuntu operating system and GeForce GTX 1080 GPU hardware.

IV. EXPERIMENT
This paper uses four DL-based selfie segmentation models, namely MNV2, MNV3, PN, and SN, to compose the proposed ensemble models. The four single models and the proposed ensemble models are tested to evaluate accuracy, variance, and bias errors. IoU is used for accuracy measurement, IoU standard deviation for variance-error measurement, and FNR+FDR for bias-error measurement. |FNR-FDR| is the absolute value of FNR-FDR, which measures the difference between FNR and FDR. MER and CER are also calculated to measure the efficiency of the single models and the proposed ensemble models.

A. SINGLE MODEL RESULT
Table II shows the experimental results of the four single models MNV3, MNV2, PN, and SN. In the table, MNV2 produces the highest IoU value, the lowest IoU standard deviation, the lowest FNR+FDR, and the lowest |FNR-FDR|. This indicates that MNV2 is the most accurate in selfie segmentation among the four single models. The lowest IoU standard deviation indicates that the variance error is the lowest, and the lowest FNR+FDR indicates that the bias error is also the lowest. The lowest |FNR-FDR| means MNV2 is well balanced between false positive and false negative regions. MNV3 shows the highest |FNR-FDR| value and tends to predict false positive regions more than false negative regions. SN is the second best-performing model.

B. TWO-MODELS AND THREE-MODELS NM ENSEMBLE RESULT
The NM ensemble was introduced in Section III-A. To evaluate the NM ensemble, we tested the Two-Models (N=2) NM ensemble and the Three-Models (N=3) NM ensemble. Table IV shows the experiment result of the Three-Models NM ensemble. Three combinations out of four (75%) show improved IoU value, IoU standard deviation, and FNR+FDR over the best single model MNV2. All Three-Models NM ensembles show improved results over any single model that participated in the ensemble. The range of IoU values is between 94.5861% and 96.0695%, and the difference between the maximum and minimum is 1.4834%. For the Two-Models NM ensemble in Table III, the range of IoU values is between 94.0351% and 96.5587%, and the difference between the maximum and minimum is 2.5236%. This indicates that the variance of IoU values of the Three-Models NM ensemble is less than that of the Two-Models NM ensemble.
The same trend can be observed in the IoU standard deviation. The range of IoU standard deviation of the Two-Models NM ensemble is between 1.5433% and 2.2678%, a difference of 0.7245%. The range of IoU standard deviation of the Three-Models NM ensemble is between 1.5445% and 1.7661%, a difference of 0.2216%. This indicates that the variance of the Three-Models NM ensemble is less than that of the Two-Models NM ensemble. Table V shows the differences between the maximum and minimum values of IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR|. Ensemble models show a smaller difference between maximum and minimum than single models, and the Three-Models NM ensemble shows the lowest value in all aspects.

C. TWO-FRAMES AND THREE-FRAMES NF ENSEMBLE RESULT
In this paper, we proposed a novel ensemble approach called an N-Frames (NF) ensemble in Section III-A. The conceptual diagram of selfie segmentation in a video using the NF ensemble is shown in Fig. 2. The idea of the NF ensemble is as follows:
• If the difference between the segmented outputs of adjacent video frames is small enough, then the ensemble of segmented outputs of adjacent video frames using the NF ensemble will produce almost the same result as the ensemble of segmented outputs of a single frame using the NM ensemble.
To verify this idea, the difference of ground-truth segmentation masks among adjacent frames in the test dataset is calculated. Equation (9) below calculates the difference of n consecutive frames in a video:

D(n) = (G_t ∪ G_t-1 ∪ … ∪ G_t-n+1) − (G_t ∩ G_t-1 ∩ … ∩ G_t-n+1)    (9)

where G_t is a ground-truth segmentation mask of Frame_t at a given time t, n > 1, and t ≥ n. This equation can be expressed using bit operations as D(n) = (G_t OR G_t-1 OR … OR G_t-n+1) XOR (G_t AND G_t-1 AND … AND G_t-n+1).

Table VI shows the difference of consecutive 2-frames, 3-frames, and 4-frames in the test dataset. The result shows that the difference of consecutive 2-frames is less than 0.35%, of consecutive 3-frames is less than 0.65%, and of consecutive 4-frames is less than 1%. To evaluate the proposed NF ensemble, the Two-Frames (2F) and Three-Frames (3F) NF ensembles were tested. In Table VII, the Two-Frames NF ensemble improves IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR| compared to single models. The performance of the Two-Frames NF ensemble is almost the same as that of the Two-Models NM ensemble. MNV2+SN achieves the highest IoU value, indicating the most accurate segmentation among Two-Frames NF ensemble models, and shows better accuracy than any single model. Its IoU value is 96.4729%, which is slightly less than the 96.5587% of the MNV2+SN (1F) NM ensemble. MNV3+PN shows significant improvement in IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR| over MNV3 and PN. MNV3+SN also shows a significant improvement in IoU value over any single model. PN+SN has better accuracy than the MNV3 and PN single models and a better IoU standard deviation than MNV3.

Table VIII shows the Three-Frames NF ensemble models. All NF ensemble models show an improvement in IoU value over the MNV3, PN, and SN models. Among the four possible ensemble combinations, three show better IoU accuracy than any single model. The highest IoU value of the Three-Frames NF ensemble is 95.8342%, which is less than the highest IoU value of the Two-Frames NF ensemble, 96.4729%. The difference between these two IoU values is 0.6387%, almost the same as the difference value of 3-frames in Table VI.
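The bitwise form of the frame-difference measure D(n) described above can be sketched directly; representing masks as flat lists of 0/1 and reporting the disagreement as a pixel fraction are our assumptions.

```python
def frame_difference(masks):
    """D(n) = (G_t OR ... OR G_{t-n+1}) XOR (G_t AND ... AND G_{t-n+1}),
    returned as the fraction of pixels on which the n consecutive
    ground-truth masks (flat lists of 0/1) disagree."""
    union = [0] * len(masks[0])
    inter = [1] * len(masks[0])
    for m in masks:
        union = [u | p for u, p in zip(union, m)]
        inter = [i & p for i, p in zip(inter, m)]
    disagreement = [u ^ i for u, i in zip(union, inter)]
    return sum(disagreement) / len(disagreement)
```

A pixel contributes to D(n) exactly when it is foreground in some but not all of the n masks, which is why D(n) bounds the accuracy loss of fusing masks across adjacent frames.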
Table IX shows the average values of IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR| for the single, Two-Models (1F) NM ensemble, Two-Frames (2F) NF ensemble, Three-Models (1F) NM ensemble, and Three-Frames (3F) NF ensemble models. The Two-Models (1F) NM ensemble and Two-Frames (2F) NF ensemble models significantly improve IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR| over single models. The difference in IoU value between the Two-Models (1F) NM ensemble and the Two-Frames (2F) NF ensemble is 0.0615%, which is much less than the difference of 2-frames in Table VI. The Three-Models (1F) NM ensemble and Three-Frames (3F) NF ensemble also show significant improvement in IoU, IoU standard deviation, FNR+FDR, and |FNR-FDR| over single models. The difference in IoU value between the Three-Models (1F) NM ensemble and the Three-Frames (3F) NF ensemble is 0.1933%, which is much less than the difference of 3-frames in Table VI.

The models can be grouped on a chart by accuracy and computing power. Group#2 shows high accuracy with low use of computing power; MNV3+SN (2F), among others, belongs to this group. Group#3 is the lower left part of the chart, showing low accuracy and low use of computing power; the PN and MNV3 models belong to this group. In general, a model that produces high accuracy using low computing power is desirable, and the models of Group#2 belong to this category. The result shows that the proposed NF ensemble models can achieve higher accuracy using lower computing power than other single and NM ensemble models.

Fig. 6 shows the average MER and CER of single models, NM ensemble models, and NF ensemble models. All ensemble models show higher IoU values than a single model. The Two-Frames (2F) NF ensemble and Three-Frames (3F) NF ensemble show lower CER than a single model. This indicates that the proposed NF ensemble models can achieve better accuracy in a more cost-efficient way than single models. However, it is also observed that the memory requirement of the ensemble increases as the number of models in the ensemble increases. Fig. 7 shows the segmentation result of various videos.
The proposed NF ensemble shows better segmentation results than single models. Fig. 8 shows the segmentation result of continuous frames. The proposed NF ensemble shows a stable segmentation result, while single models show irregular segmentation results between adjacent frames. In Table VI, the average difference of segmented output masks across four adjacent frames is less than 1%. It is an important observation that the difference of segmented output masks of adjacent video frames is almost negligible considering the expected performance improvement of segmentation using the NF ensemble. Due to this characteristic, the proposed NF ensemble models show almost equal accuracy to the NM ensemble models. In the Two-Frames (2F) NF ensemble, three combinations out of six (50%) showed higher accuracy and lower bias errors than the best single model MNV2.

F. EXAMPLES OF SELFIE SEGMENTATION
The overall performance of the Two-Frames (2F) NF ensemble is similar to that of the Two-Models (1F) NM ensemble. The difference in average IoU between the Two-Frames (2F) NF ensemble and the Two-Models (1F) NM ensemble is 0.0515%. For the Three-Frames (3F) NF ensemble, three out of four combinations (75%) show improvement in accuracy, variance, and bias errors over the best single model MNV2. The average IoU value of the Three-Frames (3F) NF ensemble is 0.1933% less than that of the Three-Models (1F) NM ensemble, but 2.2081% higher than that of single models. This indicates that the Three-Frames (3F) NF ensemble can perform as accurately as the Three-Models (1F) NM ensemble.
MER and CER are helpful metrics to measure the cost efficiency of single and ensemble models. In general, the use of memory and computing power increases when multiple models are combined to generate ensemble results. However, the result in Table XI shows that it is possible to combine multiple models to produce better accuracy than a single model while having better memory efficiency (MER) and computing power efficiency (CER). In particular, the proposed NF ensemble models show almost the same CER value as a single model. This indicates that the proposed NF ensemble model can produce as high accuracy as the NM ensemble model while using as little computing power as a single model. This is the most significant contribution of this paper.

VI. CONCLUSION
In this paper, we proposed an NF ensemble approach that produces better accuracy, lower variance, and lower bias errors than single DL-based selfie segmentation models. The Two-Models (1F) and Three-Models (1F) NM ensembles and the Two-Frames (2F) and Three-Frames (3F) NF ensemble models were tested and compared with the single models. A simple soft voting method was used to combine multiple DL-based selfie segmentation models. Captured video frames of 81 videos with a single-person view, collected from TED Talks, talk shows, news, etc., were used as the test dataset. Intersection over Union (IoU) for accuracy, IoU standard deviation for variance error, and false prediction rate for bias error were used to measure the performance of the single models and the proposed ensemble models. Memory Efficiency Rate (MER) and Computing power Efficiency Rate (CER) were used to analyze the cost efficiency of the single models and the proposed ensemble models. The experiment result shows that the NM ensemble and NF ensemble produce better accuracy than single models, and the Three-Models combination showed higher accuracy than the Two-Models combination.
The ensemble of multiple DL-based selfie segmentation models improved segmentation performance but required more memory and computing power than single models. However, the efficiency-rate analysis showed that some combinations of single models could achieve better accuracy than single models while having better memory efficiency (MER) and computing power efficiency (CER). The proposed NF ensemble models showed almost the same CER value as single models. This indicates that the proposed NF ensemble approach can produce as high accuracy as NM ensemble models while using as little computing power as single models.