Effective Source-Aware Domain Enhancement and Adaptation for CNN-Based Object Segmentation

In this paper, we propose an effective source-aware domain enhancement and adaptation (SDEA) approach to increase the accuracy of existing convolutional neural network-based (CNN-based) object segmentation methods. We first scoop out the source elements, such as falling-leaves, manhole covers, cirrus clouds, and advertisements, which often cause invalid object segmentation and make the existing object segmentation methods provide unreliable information to ADAS (advanced driver-assistance systems) applications. Secondly, we create a new GTA5-like (Grand Theft Auto V-like) dataset whose scenarios include these source elements. Furthermore, we perform a domain adaptation on the created GTA5-like dataset to generate a photo-realistic GTA5-like dataset, namely GTA<inline-formula> <tex-math notation="LaTeX">$5_{s}^{SDEA}$ </tex-math></inline-formula>. Without the need to relabel the pixel annotations for GTA<inline-formula> <tex-math notation="LaTeX">$5_{s}^{SDEA}$ </tex-math></inline-formula>, we combine GTA<inline-formula> <tex-math notation="LaTeX">$5_{s}^{SDEA}$ </tex-math></inline-formula> with the realistic dataset Camvid to constitute a newly enhanced dataset. After retraining the existing CNN-based object segmentation methods on our enhanced dataset, they achieve substantial segmentation accuracy improvements. Comprehensive experimental results demonstrate the clear accuracy improvement obtained by applying our SDEA approach to the state-of-the-art object segmentation methods FCN (Fully Convolutional Networks), SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet, providing more reliable information to ADAS applications.


I. INTRODUCTION
Recently, developing object segmentation methods using convolutional neural networks (CNNs) [1], [4], [16]-[18], [21], [27], [32] has received great attention in different applications, particularly in ADAS (advanced driver-assistance systems) applications [1], [24]. In ADAS applications, for each image frame, a CNN-based object segmentation method often considers up to nineteen object types, namely road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, and bike. Due to the high object segmentation accuracy demanded in ADAS applications, designing a novel approach to improve the existing CNN-based object segmentation methods is an important task.
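For concreteness, the nineteen object types listed above can be collected into a small Python constant. This is only an illustrative enumeration of the class set named in the text, not code from any particular segmentation framework.

```python
# The nineteen object types considered per image frame in ADAS-oriented
# segmentation, exactly as enumerated in the text above.
ADAS_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bike",
]
```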
The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.

In the past years, by using CNNs, animation-based dataset enhancement, and domain adaptation, some successful object segmentation methods have been developed to increase the segmentation accuracy, providing more reliable information to ADAS applications in lane detection [7], [14], traffic sign recognition [16], departure/collision warning [26], [29], and vanishing point detection [3], [13]. In the next subsection, the related work is introduced.

A. RELATED WORK
Based on the end-to-end fully convolutional network (FCN) model, with 15 convolutional layers in the encoder and three deconvolutional layers in the decoder, an effective FCN-based object segmentation method [18], [25] was proposed. The configuration of the FCN used in that method is shown in Table 1. To achieve higher object segmentation accuracy, Badrinarayanan et al. [1] modified the FCN model by reducing the number of convolutional layers from 15 to nine and replacing the three deconvolutional layers with five upsampling layers. The modified FCN model is called SegNet-basic; Table 2 shows its configuration. Because their source code is available, the object segmentation methods based on FCN and SegNet-basic are included among the comparative methods used to justify the segmentation accuracy improvement merit of our source-aware domain enhancement and adaptation (SDEA) approach.
To increase the segmentation accuracy, researchers have put much effort into capturing realistic images and then labeling each pixel annotation of the captured images to create realistic datasets, such as Camvid [2], Cityscapes [5], KITTI [9], and Urban LabelMe [23]. However, capturing and labeling more realistic images to enhance the datasets is quite expensive and time-consuming. To address this problem, Richter et al. [20] proposed a cheap and effective animation-based approach to create a synthetic GTA5 dataset with 7500 synthetic images. Experimental data demonstrated that by using this synthetic GTA5 dataset together with one realistic dataset, e.g. Camvid, as a hybrid training set, the CNN-based object segmentation accuracy can be improved.
Although Richter et al.'s animation-based dataset enhancement approach [20] can improve the segmentation accuracy relative to the traditional approach, due to the domain shift problem between the realistic dataset, namely Camvid, and the synthetic dataset, namely GTA5, the segmentation accuracy using the hybrid training dataset ''GTA5+Camvid'' to train the CNN-based frameworks is not as good as expected.
To solve this domain shift problem, several domain adaptation approaches were proposed to reduce the gap between the synthetic dataset and the realistic dataset. Tsai et al. [27] proposed a generative adversarial network-based (GAN-based) object segmentation method, called the AdaptSegNet method, which uses adversarial training to align the pixel-level ground truth in the output space. Based on GANs, Lin et al. [17] proposed the Gated-AdaptSegNet method, which uses a foreground adaptation module to separate the foreground and background for improving the segmentation accuracy.
Different from Tsai et al.'s approach [27], Zhang et al. [33] transformed the GTA5 dataset into a photo-realistic dataset, denoted by GTA5$_{s}$, by using the style transfer technique. Experimental data illustrated that using the hybrid dataset ''GTA5$_{s}$+Camvid'' as the enhanced training dataset can increase the object segmentation accuracy. Because their source code is available, the AdaptSegNet and Gated-AdaptSegNet methods are included among the comparative methods used to justify the accuracy improvement achieved by our SDEA approach.

B. MOTIVATION
For convenience, let the FCN-based, SegNet-basic-based, AdaptSegNet-based, and Gated-AdaptSegNet-based object segmentation methods be denoted by FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet, respectively. Based on the enhanced dataset ''GTA5$_{s}$+Camvid'', after training the above-mentioned four object segmentation methods, we found that in the testing step, two sources, namely falling-leaves and manhole covers, often cause invalid road segmentation, while another two sources, namely cirrus clouds and advertisements, often cause invalid sky and building segmentation, respectively. In particular, the invalidly segmented road, sky, and building information may result in improper decisions in ADAS applications. For example, the invalid road segmentation may lead to improper lane detection, invalid traffic sign recognition, and wrong departure/collision warnings; the invalidly segmented sky may lead to incorrect vanishing point detection.
Without the need to relabel the pixel annotations, the above source-aware observation prompted us to develop a novel and effective source-aware domain enhancement and adaptation (SDEA) approach to create a newly enhanced dataset ''GTA5$_{s}^{SDEA}$''. Then, based on our dataset GTA5$_{s}^{SDEA}$, the retrained versions of the four considered object segmentation methods, namely FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet, can achieve higher segmentation accuracy. Note that it is infeasible to apply our SDEA approach to retrain the Mask R-CNN-based object segmentation method [12] because, among the eighty objects considered by Mask R-CNN, only nine, namely person, bicycle, car, motorcycle, bus, train, truck, traffic light, and stop sign, are useful in ADAS applications. Therefore, we do not apply our SDEA approach to the Mask R-CNN model.

C. CONTRIBUTION
To overcome the above-mentioned weakness and limitation existing in the related work, this paper proposes a novel and effective SDEA approach to achieve substantial object segmentation accuracy improvement for the four considered CNN-based object segmentation methods. The three contributions of our SDEA approach are clarified as follows.
In the first contribution, the proposed SDEA approach scoops out the sources, namely the falling-leaves, manhole covers, cirrus clouds, and advertisements, which infrequently or irregularly appear in real situations but often cause invalid object segmentation; the invalid object segmentation information tends to induce incorrect decisions in ADAS applications. Therefore, we propose a source-pasting technique to create a new GTA5-like dataset whose scenarios include these sources. In each GTA5-like image, the additive sources come from a sub-image cut from a realistic image in the dataset ''Camvid''.
In the second contribution, we perform a domain adaptation on our GTA5-like dataset to generate a photo-realistic GTA5-like dataset, called ''GTA5$_{s}^{SDEA}$''. Accordingly, the new hybrid dataset, called ''GTA5$_{s}^{SDEA}$+Camvid,'' is created. Because the originally labeled pixel annotations in GTA5$_{s}$ and Camvid are inherited, the labeling work on GTA5$_{s}^{SDEA}$ can be waived, avoiding the labeling-time overhead. Furthermore, we apply our new hybrid dataset to retrain the four considered object segmentation methods, namely FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet, achieving substantial object segmentation accuracy improvements.
In the third contribution, comprehensive experimental data confirm that our SDEA approach with our newly enhanced dataset ''GTA5$_{s}^{SDEA}$+Camvid'' can substantially improve the object segmentation accuracy of the above-mentioned four CNN-based object segmentation methods. In terms of the mean intersection over union (mIoU) used to measure the segmentation accuracy over the considered nineteen objects, the mIoU gains of our SDEA approach over FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet are 1.1, 3.1, 1.5, and 1.7, respectively, providing more reliable object segmentation information to ADAS applications and supporting more trustworthy traffic decisions.
The rest of this paper is organized as follows. Section II presents our SDEA approach and describes how to build the newly enhanced dataset ''GTA5$_{s}^{SDEA}$+Camvid''. Section III reports the object segmentation accuracy improvement merit of our SDEA approach relative to the four state-of-the-art CNN-based object segmentation methods. Section IV addresses some concluding remarks.

II. THE PROPOSED SDEA APPROACH
We first scoop out the sources causing invalid object segmentation in the testing step. Then, without the pixel-annotation labeling overhead, a source-pasting technique is proposed to create an enhanced version of the dataset ''GTA5$_{s}$,'' called ''GTA5$_{s}^{SDEA}$,'' whose scenarios include these sources. Furthermore, we create a newly enhanced dataset ''GTA5$_{s}^{SDEA}$+Camvid,'' which will be used to retrain the above-mentioned CNN-based object segmentation methods to increase their segmentation accuracy.

A. SCOOP OUT SOURCES CAUSING INVALID OBJECT SEGMENTATION
From observing the object segmentation results in the testing step, we found that invalid segmentation of objects such as road, sky, and buildings is often caused by the sources, namely the falling-leaves, manhole covers, cirrus clouds, and advertisements, because these sources infrequently or irregularly appear in the testing images. In particular, the invalidly segmented roads, sky, and buildings may provide wrong information to ADAS applications in lane detection, departure/collision warning, and vanishing point detection.
Before taking practical examples to explain why the above-mentioned sources cause invalid object segmentation, the loss function Loss(I) used for object segmentation is defined as the pixel-wise cross-entropy

Loss(I) = $-\sum_{c \in C} \sum_{h=1}^{H} \sum_{w=1}^{W} Y_{c}(I)_{h,w} \log S_{c}(I)_{h,w}$ (1)

where I (∈ R^{H×W×3}) denotes the input H × W RGB full-color image; S(I) ∈ R^{H×W×C} denotes the output segmentation, in which C denotes the set of all object classes and S_{c}(I) denotes the resultant H × W binary segmentation map for the object class c ∈ C. When an entry in S_{c}(I) is 1, it indicates that the recognized object class for that pixel equals the object class c; otherwise, it denotes a wrong recognition for that pixel. Y_{c}(I) denotes the ground-truth labeled annotation map for the object class c.
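As a minimal sketch of Eq. (1), the pixel-wise cross-entropy can be computed with NumPy. Here `S` is assumed to hold the per-pixel class probabilities before thresholding, and `Y` the one-hot ground-truth maps Y_c(I); the toy values are hypothetical, not the paper's actual network output.

```python
import numpy as np

def segmentation_loss(S, Y, eps=1e-12):
    """Pixel-wise cross-entropy Loss(I) of Eq. (1).

    S : (H, W, C) predicted per-pixel class probabilities.
    Y : (H, W, C) one-hot ground-truth annotation maps Y_c(I).
    """
    # eps guards against log(0) for confident wrong predictions.
    return float(-np.sum(Y * np.log(S + eps)))

# Toy 1x2 image with two classes and a near-correct prediction.
S = np.array([[[0.9, 0.1], [0.2, 0.8]]])
Y = np.array([[[1.0, 0.0], [0.0, 1.0]]])
loss = segmentation_loss(S, Y)
```

A perfect prediction drives the loss toward zero; mistakes on the rare source regions (leaves, manhole covers) keep it high, which is what retraining on the enhanced dataset reduces.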
Based on the dataset ''GTA5 s +Camvid'' on FCN, in which the photo-realistic dataset GTA5 s is obtained by performing the domain adaptation method ''Photorealistic Image Stylization [15]'' on the synthetic dataset GTA5, we take four practical testing images to explain why the above-mentioned four sources lead to the invalid object segmentation problem, prompting us to propose the SDEA approach to solve this important problem.
As shown in Fig. 1(a), the falling-leaves on the lane, surrounded by a yellow trapezoid, cause an invalid road segmentation, as shown in Fig. 1(b). As shown in Fig. 1(c), the manhole cover on the lane, surrounded by a yellow rectangle, causes an invalid road segmentation, as shown in Fig. 1(d), because the texture of the manhole cover is different from that of the road. As for the cirrus clouds and advertisements shown in Fig. 1(e) and Fig. 1(g), respectively, the invalidly segmented sky and buildings are illustrated in Fig. 1(f) and Fig. 1(h).

B. THE PROPOSED SOURCE-PASTING TECHNIQUE TO CREATE A NEWLY ENHANCED DATASET
In Fig. 1, four invalid object segmentation examples caused by four sources have been demonstrated. Capturing more realistic images with the scenarios containing the considered four sources is a straightforward way to enhance the dataset, but it is expensive and time-consuming. In addition, it is also quite time-consuming to label each pixel annotation of these captured real images. In what follows, without pixel-annotation labeling overhead, we propose a fast and effective source-pasting technique to create a new photo-realistic dataset, in which the scenarios contain these sources coming from the images in ''Camvid,'' and then we combine it with ''Camvid'' to create the newly enhanced dataset.
For easy exposition of the proposed SDEA approach, we first explain how to create a new synthetic GTA5-like image containing falling-leaves. Given a labeled synthetic GTA5 image in Fig. 2(a), we select from the dataset ''Camvid'' one real image containing falling-leaves, as shown in Fig. 2(b). Then, we paste the sub-image containing falling-leaves, which is cut from Fig. 2(b), onto Fig. 2(a), creating the synthetic GTA5-like image shown in Fig. 2(c). In Fig. 2(c), all the pixels in the falling-leaves inherit the original labeled annotations in Fig. 2(b), and all the remaining pixels in Fig. 2(c) inherit the original labeled annotations in Fig. 2(a), waiving the pixel-based labeling overhead.
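The source-pasting step described above can be sketched as a small NumPy routine. The function name `paste_source`, the box coordinates, and the toy label values are illustrative assumptions, not the authors' released implementation; the key point is that the pasted patch carries its Camvid labels with it, so no pixel is relabeled by hand.

```python
import numpy as np

def paste_source(gta_img, gta_lbl, src_img, src_lbl, box, dest):
    """Paste a source sub-image (e.g. falling-leaves) cut from a real
    Camvid frame into a synthetic GTA5 frame, carrying its labels along.

    box  = (top, left, h, w) of the sub-image in the Camvid frame.
    dest = (top, left) paste position in the GTA5 frame.
    """
    t, l, h, w = box
    dt, dl = dest
    out_img, out_lbl = gta_img.copy(), gta_lbl.copy()
    # Pixels inside the pasted patch inherit the Camvid annotations;
    # all other pixels keep the original GTA5 annotations.
    out_img[dt:dt + h, dl:dl + w] = src_img[t:t + h, l:l + w]
    out_lbl[dt:dt + h, dl:dl + w] = src_lbl[t:t + h, l:l + w]
    return out_img, out_lbl

# Toy 4x4 frames: class 0 everywhere in the GTA5 frame, class 7 (a
# hypothetical source class) everywhere in the Camvid patch.
gta_img, gta_lbl = np.zeros((4, 4, 3)), np.zeros((4, 4), dtype=int)
cam_img, cam_lbl = np.ones((4, 4, 3)), np.full((4, 4), 7)
new_img, new_lbl = paste_source(gta_img, gta_lbl, cam_img, cam_lbl,
                                (0, 0, 2, 2), (1, 1))
```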
After performing the style transform on the synthetic GTA5-like image via the domain adaptation method [15], the resultant photo-realistic GTA5-like image is shown in Fig. 2(d). By the same argument, Figs. 2(e)-(h) illustrate the corresponding four snapshots for the manhole cover case. Figs. 2(i)-(l) and Figs. 2(m)-(p) show the corresponding snapshots for the cirrus cloud and advertisement sources with respect to the sky and building, respectively. However, when only the above domain adaptation is used to create the new photo-realistic GTA5-like images, the versatility of the created images is still insufficient. To overcome this disadvantage and automatically generate more photo-realistic GTA5-like images with different styles, we incorporate the influence of the weather, namely sunny, rainy, and overcast days, and of the time period, namely daytime, nighttime, and twilight, into the domain adaptation of our SDEA approach to increase the diversity of the newly created photo-realistic GTA5-like images as quickly as possible.
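A minimal sketch of how the three weather conditions and three time periods named above could be enumerated to diversify the stylized images; the function and the naming scheme are hypothetical, standing in for whatever batch driver generates the nine style variants per image.

```python
from itertools import product

# The weather and time-period influences deployed into the domain
# adaptation, as listed in the text.
WEATHERS = ["sunny", "rainy", "overcast"]
TIMES = ["daytime", "nighttime", "twilight"]

def style_variants(image_id):
    """List the nine (weather, time-period) style variants to be
    generated for one synthetic GTA5-like image."""
    return [f"{image_id}_{w}_{t}" for w, t in product(WEATHERS, TIMES)]

variants = style_variants("gta5like_0001")
```

Each original GTA5-like image thus yields nine stylized versions, all sharing the same inherited pixel annotations.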
By using our proposed SDEA approach, let the newly created photo-realistic GTA5-like dataset be denoted by GTA5$_{s}^{SDEA}$. Consequently, the newly enhanced dataset ''GTA5$_{s}^{SDEA}$+Camvid'' is used to retrain the four considered object segmentation methods, namely FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet, to increase the segmentation accuracy, providing more reliable segmentation information to ADAS applications.
III. EXPERIMENTAL RESULTS

To demonstrate the accuracy improvement merit of our SDEA approach, we apply the proposed new dataset ''Camvid+GTA5$_{s}^{SDEA}$'' with 1722 (= 701+1021) images to train the above-mentioned four CNN-based object segmentation models. For convenience, the four retrained versions of the four object segmentation methods are called FCN$^{SDEA}$, SegNet-basic$^{SDEA}$, AdaptSegNet$^{SDEA}$, and Gated-AdaptSegNet$^{SDEA}$, respectively. For fairness in comparing the segmentation accuracy of all the considered object segmentation methods, we utilize the same testing dataset, which consists of 580 images, of which 500 are randomly collected from the dataset ''Cityscapes'' and 80 are captured from the real urban world and can be accessed from the website [8]. Note that in our two-step SDEA approach, the experimental results indicated that with the dataset ''Camvid+GTA5$^{SDEA}$,'' the accuracy improvement of the first step alone, namely the source-aware domain enhancement step, is incremental relative to the baseline models, whereas with the dataset ''Camvid+GTA5$_{s}^{SDEA}$,'' the accuracy improvement of the full two-step SDEA approach, namely the source-aware domain enhancement and adaptation, is obvious relative to the baseline models.
All experiments are implemented using a desktop with an Intel Core i7-7700 CPU running at 3.6 GHz with 32 GB RAM and an NVIDIA 1080Ti GPU. The operating system is Microsoft Windows 10 64-bit. The program development environment is PyCharm Professional with the Python programming language.

A. OBJECT SEGMENTATION ACCURACY IMPROVEMENT MERIT OF FCN$^{SDEA}$
In the first set of experiments, the ''mIoU'' gain is used to show the average object segmentation accuracy improvement merit of the proposed FCN$^{SDEA}$ method over the FCN method [18].
The metric ''IoU'' is used to measure the segmentation accuracy of one object class, and ''IoU'' is defined by

IoU(object) = |Detected object pixels ∩ Ground-truth object pixels| / |Detected object pixels ∪ Ground-truth object pixels| (2)

In Eq. (2), ''∩'' and ''∪'' denote the ''intersection'' and ''union'' operations, respectively. The metric ''mIoU'' is the mean of the ''IoU'' values over all considered objects. In terms of ''IoU,'' Table 3 tabulates the segmentation accuracy for each of the considered 19 objects; the IoU value of each object is listed below the object field. The mIoU value over all the objects is listed in the final column of Table 3. Table 3 indicates that the mIoU gain of our FCN$^{SDEA}$ method over FCN is 1.1 (= 19.2 - 18.1), demonstrating a clear segmentation accuracy improvement of FCN$^{SDEA}$.
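Eq. (2) and the mIoU average can be sketched directly on integer label maps; the helpers `iou` and `miou` below are illustrative, not the paper's evaluation code, and the toy 2×2 maps are hypothetical.

```python
import numpy as np

def iou(pred, gt, cls):
    """IoU of Eq. (2) for one object class between the predicted and
    ground-truth label maps."""
    p, g = (pred == cls), (gt == cls)
    union = int(np.logical_or(p, g).sum())
    if union == 0:
        return float("nan")  # class absent from both maps
    return float(np.logical_and(p, g).sum()) / union

def miou(pred, gt, classes):
    """Mean IoU over all considered classes, skipping absent ones."""
    vals = [iou(pred, gt, c) for c in classes]
    vals = [v for v in vals if not np.isnan(v)]
    return sum(vals) / len(vals)

# Toy 2x2 label maps with two classes (0 and 1).
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
```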
Besides demonstrating the accuracy improvement in terms of ''mIoU,'' to help readers visualize the improvement achieved by FCN$^{SDEA}$, Fig. 3 depicts the perceptual effects of our SDEA approach. In Fig. 3, we observe that by using our FCN$^{SDEA}$ method, the perceptual effects for the segmented road, sky, and building are much improved relative to the FCN method.
For fairness, based on the same testing dataset with 580 images [8], the average execution times for one testing image on the baseline model FCN and on our model FCN$^{SDEA}$ are reported. Because the CNN configurations of FCN and FCN$^{SDEA}$ are the same and the only difference is the trained weights of the two models, the average execution time required by both models for one testing image is the same, 0.06 seconds.

B. OBJECT SEGMENTATION ACCURACY IMPROVEMENT MERIT OF SegNet-basic$^{SDEA}$
In the second set of experiments, Table 4 tabulates the IoU comparison between the proposed SegNet-basic$^{SDEA}$ method and the SegNet-basic method. In the last column of Table 4, the mIoU gain of our SegNet-basic$^{SDEA}$ method over SegNet-basic is 3.1 (= 24.7 - 21.6), indicating a clear average IoU improvement achieved by our SDEA approach. In addition, as shown in Fig. 4, we observe that by using our SegNet-basic$^{SDEA}$ method, the perceptual effects of the segmented road, sky, and building justify the related IoU improvements.
Based on the same testing dataset [8], the average execution times for one testing image on the baseline model SegNet-basic and on our model SegNet-basic$^{SDEA}$ are reported. Because the CNN configurations of the two models are the same and the only difference is the trained weights, the average execution time required by both models for one testing image is the same, 0.059 seconds.

C. OBJECT SEGMENTATION ACCURACY IMPROVEMENT MERIT OF AdaptSegNet$^{SDEA}$
In Table 5, the mIoU gain of our AdaptSegNet$^{SDEA}$ method over AdaptSegNet is 1.5 (= 34.4 - 32.9), indicating a clear average IoU improvement achieved by our SDEA approach. In Fig. 5, we observe that by using our AdaptSegNet$^{SDEA}$ method, the perceptual effects of the segmented road, sky, and building justify the related IoU improvements.
Based on the same testing dataset, the average execution times for one testing image on the baseline model AdaptSegNet and on our model AdaptSegNet$^{SDEA}$ are the same because the configurations of the two CNN models are the same. Experimental results demonstrated that the average execution time required by both models for one testing image is 0.036 seconds.

D. OBJECT SEGMENTATION ACCURACY IMPROVEMENT MERIT OF Gated-AdaptSegNet$^{SDEA}$
In Table 6, the mIoU gain of our Gated-AdaptSegNet$^{SDEA}$ method over Gated-AdaptSegNet is 1.7 (= 36.4 - 34.7), indicating a substantial average IoU improvement achieved by our SDEA approach. In addition, Fig. 6 illustrates the perceptual effects of the segmented road, sky, and building obtained by our SDEA approach relative to Gated-AdaptSegNet.
Based on the same testing dataset, for one testing image, the execution times required by the baseline model Gated-AdaptSegNet and our model Gated-AdaptSegNet$^{SDEA}$ are the same, 0.042 seconds.

IV. CONCLUSION
We have presented a novel and effective SDEA approach to enhance the accuracy of the CNN-based object segmentation methods FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet. In particular, in the proposed fast source-pasting technique, the labeled pixel annotations covered by the pasted sources inherit the original pixel annotations from the sub-image of the selected ''Camvid'' image, and the labeled pixel annotations of the remaining parts inherit the original pixel annotations of the selected GTA5 image.
The comprehensive experimental results have justified the segmentation accuracy improvement merit and the perceptual effect of our SDEA approach relative to the four CNN-based object segmentation methods FCN, SegNet-basic, AdaptSegNet, and Gated-AdaptSegNet.
Our first future work is to extend our SDEA approach to cover more sources for further improving existing CNN-based object segmentation methods. In addition, we will consider applying our SDEA approach to the spatio-temporal graph convolutional network-based traffic forecasting method [31], which has been successfully used in a public bike-sharing program [30]. Our second future work is to incorporate the time-varying communication time delay issue [6], [28] into the proposed SDEA- and CNN-based object segmentation method to achieve higher segmentation accuracy and meet real-time demands in ADAS applications.

YA-YUN CHENG received the B.S. degree in computer science and information engineering from Tamkang University, New Taipei City, Taiwan, in 2017, and the M.S. degree in computer science and information engineering from the National Taiwan University of Science and Technology, Taipei, in 2020. Her research interests include machine learning and object segmentation.