Accelerator-Aware Fast Spatial Feature Network for Real-Time Semantic Segmentation

Semantic segmentation is performed to understand an image at the pixel level and is widely used in the field of autonomous driving. In recent years, deep neural networks have achieved good accuracy; however, few models offer a good trade-off between high accuracy and low inference time. In this paper, we propose the fast spatial feature network (FSFNet), an optimized lightweight semantic segmentation model using an accelerator, offering high accuracy as well as faster inference speed than current methods. FSFNet employs the FSF and MRA modules. The FSF module has three different types of subset modules to extract spatial features efficiently; they are designed in consideration of the size of the spatial domain. The multi-resolution aggregation (MRA) module combines features extracted at different resolutions to reconstruct the segmentation image accurately. Our approach runs at over 203 FPS at full resolution ($1024\times 2048$) on a single NVIDIA 1080Ti GPU and obtains 69.13% mIoU on the Cityscapes test dataset. Compared with existing models for real-time semantic segmentation, our proposed model retains remarkable accuracy while running over 30% faster than the state-of-the-art model. The experimental results show that our model is well suited to the Cityscapes dataset. The code is publicly available at: https://github.com/computervision8/FSFNet.


I. INTRODUCTION
In recent years, there have been significant advances in the field of visual perception in computer vision; this has played a key role in the development of autonomous driving. This is mostly due to the emergence of deep convolutional neural networks (DCNNs) such as Inception [1]-[3], ResNet [4], and MobileNet [5]-[7]. In particular, semantic segmentation [8]-[11] using deep learning can predict the semantic category of each pixel of an image, enabling autonomous driving systems to fully understand their surroundings, such as roads, cars, pedestrians, and sidewalks, with pixel-level accuracy. The fully convolutional network (FCN) [12] was the first deep learning-based network to implement pixel-level classification; it opened new opportunities to further improve deep semantic segmentation architectures.
The latest deep learning-based semantic segmentation architectures attempt to improve the trade-off between high quality and low computation using three methodologies: reducing the input size, the convolution filter structure, and the encoder-decoder architecture.
The first is to reduce the size of the input image, as in ENet [13], ERFNet [14], SegNet [15], EDANet [16], and ESPNet [17]. This can increase the inference speed; however, these algorithms lose spatial information. The second is using filters with multiple sizes that operate at the same level, as in the Inception [1]-[3] module. The accuracy can be further increased, but this is not good for inference speed owing to the need for complex operations. The third is an encoder-decoder architecture [14], [15], [18], [19] that can be trained end-to-end directly. This architecture has the benefit of reducing the computational load, but few such architectures achieve a fast inference speed with high accuracy. Based on the trends of the aforementioned existing technologies in Table 1, we propose a fast spatial feature network (FSFNet) designed for real-time semantic segmentation. FSFNet uses the original image resolution of 1024 × 2048 as the input to minimize the loss of spatial detail. Next, we designed a lightweight convolution structure using recent efficient operations. Our proposed architecture adopts an encoder-decoder design by considering the trade-off between efficiency and performance when implemented using the NVIDIA TensorRT accelerator.
Compared with existing methods, our FSFNet exhibits good performance and low inference time, as shown in Fig. 1.
Our main contributions are summarized as follows:
• We propose an FSF module having three different subset modules according to the size of the spatial domain. Each subset module is implemented using the NVIDIA TensorRT accelerator. The best combination of convolution layers is found experimentally by comparing efficiency and performance.
• We propose a multi-resolution aggregation module that combines features extracted at branches with different downsample rates to accurately reconstruct the segmentation image with high-level semantic features.
• We propose an optimized deep neural network architecture by performing ablation studies to investigate parameters such as the batch size, channel size, dilation rate, dropout rate, convolution layer sequence, spatial domain, and decoder structure. This helps to build an accelerated encoder-decoder network with high accuracy for real-time semantic segmentation.
• Our FSFNet achieves state-of-the-art inference speed (over 30% faster than the closest manually designed competitor on the Cityscapes Benchmark Suite) while maintaining competitive accuracy. More specifically, we obtain 69.13% mIoU on the Cityscapes test dataset at a speed of 203 FPS using only one NVIDIA 1080Ti GPU.
The rest of the paper is organized as follows. First, we review trends in semantic segmentation in Section II. The proposed FSFNet is presented in Section III, with details of the FSF module, the multi-resolution aggregation module, and the network architecture. The experimental results are presented in Section IV, comparing our model to state-of-the-art models on the Cityscapes, CamVid, and Mapillary Vistas datasets, along with ablation studies of the different components of the proposed model. Finally, Section V concludes this work and discusses future work.

II. RELATED WORK
Semantic segmentation is a dense prediction procedure, and it therefore requires fine-grained localization of class labels at the pixel level. This technique is important in applications such as autonomous driving. In addition, real-time operation is required for practical applications. To achieve this, careful design of CNNs and their architectures is vital.

A. CONVOLUTIONAL NEURAL NETWORK
A landmark CNN, ''AlexNet,'' was developed by Krizhevsky et al. for the image classification challenge. AlexNet [20] proposed the use of a CNN with ReLU activation [21], dropout [22], and overlapping pooling. The next important milestones in convolutional networks were achieved by the series of Inception modules [1]-[3] and the residual network [4]. Inception-v1 adopted convolutions with multiple kernel sizes for more global or local feature extraction. Later, Inception-v2 introduced batch normalization [23] to normalize the value distribution before the next layer so that higher accuracy and faster training speed can be achieved. Inception-v3 improved the computational complexity using the factorization convolution method. ResNet makes it possible to train very deep neural networks. The main idea of ResNet, called the ''identity shortcut connection,'' is to skip one or more layers by simply adding the output from a previous layer to a later layer. The gradients can then propagate backward easily, and these shortcuts act like highways. By doing this, ResNet alleviates the vanishing gradient problem so that the performance of the network can be improved.

B. SEMANTIC SEGMENTATION ARCHITECTURE
FCNs [12] were proposed by Long et al.; this was the first architecture in which the last fully connected layer of a standard CNN is substituted by another convolution layer in order to classify each pixel of the image. However, there are disadvantages such as high computational requirements and lower classification accuracy. Based on the FCN architecture, the latest semantic segmentation research using deep learning has been divided into two development purposes: high quality and high efficiency.
The high-quality studies focus on achieving high accuracy, but do not focus much on inference speed. The high-efficiency studies attempt to improve the trade-off between high quality and computational resources.

1) HIGH-QUALITY SEMANTIC SEGMENTATION
The DeepLab series [24]-[26] was proposed by the Google team for high-quality semantic segmentation tasks. Traditional CNNs often suffer from the loss of spatial information during downsampling and pooling. In response to this problem, DeepLab-v1 proposed atrous convolution to increase the receptive field size while maintaining a higher resolution of the feature maps. Fully connected conditional random fields (CRFs) are used as a post-processor to improve the segmentation performance. DeepLab-v2 introduced atrous spatial pyramid pooling (ASPP) blocks, enabling the use of several atrous convolutions at different dilation rates for a larger field of view. DeepLab-v3 adopted a method to obtain a denser feature map by applying atrous convolution in the ResNet structure. The most recent version, DeepLab-v3+, proposes an atrous separable convolution approach that combines separable convolution and atrous convolution. Another approach to high-quality architectures exploits the relationships between features. PSPNet [27] considers the global context of the image when making local-level predictions. You et al. [28] adopted feature extraction based on bidirectional word vectors to reflect the contextual relationships between pixels. Cai et al. [29] proposed a cross-attention mechanism and graph convolution integration algorithm to generate deep features. Wang et al. [30] applied an attention mechanism to small-sample classification of hyperspectral images to generate deep multiscale convolution features.

2) HIGH-EFFICIENCY SEMANTIC SEGMENTATION
ENet [13] is the first network proposed for real-time semantic segmentation; it reduces the computation of a number of convolution layers using a trimmed encoder-decoder architecture. ERFNet [14] redesigns the non-bottleneck residual module as a composition of a 3 × 1 convolution followed by a 1 × 3 convolution. ESPNet [17], [31] introduces an efficient spatial pyramid (ESP) module, based on the convolution factorization principle, which shares parameters across the image pyramid. BiSeNet [32] decouples the functions of spatial information preservation and receptive field provision into a spatial path (SP) and a context path (CP). The SP preserves spatial information and generates high-resolution features. The CP employs a downsampling strategy to obtain a sufficient receptive field. ICNet [33] develops a cascade feature fusion unit together with cascade label guidance to recover the segmentation prediction.
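To make the factorization idea concrete, the following is a minimal PyTorch sketch of an ERFNet-style residual block in which a 3 × 3 convolution is replaced by stacked 3 × 1 and 1 × 3 convolutions with an identity shortcut; the class name, normalization, and activation placement are illustrative assumptions, not the exact ERFNet implementation.

```python
import torch
import torch.nn as nn

class FactorizedResidualBlock(nn.Module):
    """Sketch of a non-bottleneck-1D block: a 3x3 convolution factorized
    into 3x1 and 1x3 convolutions, with a residual (identity) shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv3x1(x))
        out = self.bn(self.conv1x3(out))
        return self.relu(out + x)  # identity shortcut, as in ResNet
```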

III. FSFNet: FAST SPATIAL FEATURE NETWORK
The FSFNet architecture is designed to achieve low computational cost and high accuracy at real-time inference speed on a single GPU card, evaluated on the Cityscapes dataset [34]. To do this, we first propose the fast spatial feature (FSF) module, which is the core component for feature extraction. Second, we design a multi-resolution aggregation module to upsample the feature maps. Third, we introduce the entire network architecture. The overall structure of the proposed FSFNet is shown in Fig. 2.

A. FSF MODULE
As we aim to improve both inference speed and accuracy, we compared convolution operators, including their direct and indirect metrics, on an NVIDIA 1080Ti GPU with the TensorRT SDK in Table 2. Model size is widely measured using indirect metrics such as the total number of operations, the number of parameters, and floating-point operations (FLOPs), which indicate the complexity of a model. However, these indirect metrics are usually not equivalent to the direct metric, i.e., the inference speed measured on real devices, because of many other factors such as memory access, I/O, and hardware/software optimization. Based on the above considerations, we designed a new variant called the ''FSF module,'' an optimized structure that achieves a trade-off between inference speed and high quality on real devices. Our FSF module has three different types of subset modules, denoted as $F_d$, where $d \in \{8, 16, 32\}$ is the downsample rate applied to each subset of the FSF module, as shown in Fig. 3. We demonstrate the optimized combination of convolution operators using the subset modules of the FSF module at different spatial domain sizes. Increasing the downsample rate reduces the computational cost, but the spatial domain information decreases. To compensate for this weakness, we followed the convention of increasing the number of channels, denoted as $c = \{64, 96, 128\}$, for each subset of the FSF module. In addition, there was a possibility of vanishing and exploding gradients in the FSF module; therefore, we applied residual learning. This eases the optimization of deeper networks and improves the accuracy. To speed up the computation around the convolution layers, we adopt bilinear interpolation. This method can either reduce or enlarge the size of the feature map as a parameter-free layer in the network. Owing to this special design, our model achieves a state-of-the-art inference speed that is over 30% faster than other approaches while maintaining high accuracy on the Cityscapes test dataset.
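The distinction between indirect and direct metrics can be illustrated with a small PyTorch timing sketch like the one below; the helper name is ours and a CUDA device is assumed. Note that the measurements in Table 2 were taken with the TensorRT SDK, not plain PyTorch, so this sketch only conveys the methodology.

```python
import time
import torch
import torch.nn as nn

def measure(layer: nn.Module, x: torch.Tensor, iters: int = 100):
    """Return (indirect metric: parameter count, direct metric: ms per run)."""
    params = sum(p.numel() for p in layer.parameters())
    layer, x = layer.cuda().eval(), x.cuda()
    with torch.no_grad():
        for _ in range(10):              # warm-up before timing
            layer(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
    return params, (time.time() - start) / iters * 1000.0

# Compare a 3x3 convolution against stacked 3x1 and 1x3 convolutions
# on a 128x256 feature map with 64 channels (the F8 spatial domain).
x = torch.randn(1, 64, 128, 256)
print(measure(nn.Conv2d(64, 64, 3, padding=1), x))
print(measure(nn.Sequential(nn.Conv2d(64, 64, (3, 1), padding=(1, 0)),
                            nn.Conv2d(64, 64, (1, 3), padding=(0, 1))), x))
```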

1) $F_8$ MODULE
Inspired by the zoomed convolution in FasterSeg [35], we designed the input feature map to be sequentially processed with bilinear downsampling, a 3 × 3 convolution, another 3 × 3 convolution, and bilinear upsampling. $F_8$ is a subset of the FSF module, and it is applied in the spatial domain with a size of 128 × 256, which corresponds to a downsample rate of 8 relative to the original input size. To design $F_8$, we analyzed the 3 × 3 convolution from the perspectives of direct and indirect metrics. For the indirect metric, the 3 × 3 convolution uses a transformation function $T(k_8): f_{in} \rightarrow f_{out}$ that maps the input feature map $f_{in} \in \mathbb{R}^{h \times w \times c}$ to the output feature map $f_{out} \in \mathbb{R}^{\hat{h} \times \hat{w} \times c'}$, where $h$, $w$, and $c$ denote the height, width, and number of feature channels, respectively. The transformation function is applied with two-dimensional (2D) kernels $k_8 \in \mathbb{R}^{n \times n \times c}$ to extract the feature map for the next layer, where $n$ is the kernel size. Thus, the total number of operations in the 2D convolution layer is $c \cdot c' \cdot n^2 \cdot \hat{h}\hat{w}$. The output height and width are determined by the stride, the step size of the kernel, and the spatial size of the output feature map is calculated using equation (1): $\hat{h} = h/stride, \; \hat{w} = w/stride$ (1). We also compared the parameter numbers and FLOPs of the 3 × 3 convolution with other convolution operators in Table 2. The 3 × 3 convolution has approximately 1.5 times more parameters and FLOPs than the stacked 3 × 1 and 1 × 3 convolutions. However, the direct metric of the 3 × 3 convolution is 1.25 times faster than the 3 × 1 and 1 × 3 convolutions. This means that the 3 × 3 convolution is better optimized than the other convolution layers by the latest cuDNN library. The remaining convolution operators are excluded from the analysis of the $F_8$ module because they are fast but have very low accuracy in this spatial domain. Because of these advantages of the 3 × 3 convolution, we were able to find the optimal sequence of convolutions for the $F_8$ module in the spatial domain with a size of 128 × 256, as shown in Fig. 3. Moreover, the $F_8$ module showed better accuracy and inference speed than the other FSF subset modules in the spatial domain with a size of 128 × 256, as shown in the ablation studies in Tables 6 and 10.
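The following is a minimal PyTorch sketch of the $F_8$ sequence as described above (bilinear downsampling, two 3 × 3 convolutions, bilinear upsampling, plus the residual connection of the FSF module); the class name, channel count, and activation placement are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class F8Block(nn.Module):
    """Sketch of the described F8 sequence: bilinear downsampling ->
    3x3 conv -> 3x3 conv -> bilinear upsampling, with a residual add."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h, w = x.shape[2:]
        out = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                            align_corners=False)      # zoomed (cheaper) domain
        out = self.relu(self.conv1(out))
        out = self.relu(self.conv2(out))
        out = F.interpolate(out, size=(h, w), mode='bilinear',
                            align_corners=False)      # parameter-free restore
        return self.relu(out + x)                     # residual learning
```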

2) $F_{16}$ MODULE
Inspired by the leveraging of skip connections and convolutions with one-dimensional (1D) kernels in ERFNet [14], we designed the input feature map to be sequentially processed with bilinear downsampling, a 3 × 1 convolution, a 1 × 3 convolution, a depthwise convolution, and bilinear upsampling. The $F_{16}$ module is applied after the $F_8$ module, and the spatial domain size is 64 × 128, which corresponds to a downsample rate of 16 relative to the original input size. For the indirect metric analysis of the stacked 1D convolutions, the transformation function $T(k_{16}): f_{in} \rightarrow f_{out}$ maps the input feature map $f_{in} \in \mathbb{R}^{h \times w \times c}$ to the output feature map $f_{out} \in \mathbb{R}^{\hat{h} \times \hat{w} \times c'}$. The transformation function is applied with 1D kernels $k_{16} \in \mathbb{R}^{n \times 1 \times c}$ and $k_{16} \in \mathbb{R}^{1 \times n \times c}$. The total number of operations $O(T, k_{16})$ in the two stacked $n \times 1$ and $1 \times n$ convolutions is $c \cdot c' \cdot 2n \cdot \hat{h}\hat{w}$, where the output feature map size follows the stride of the kernel as in equation (2): $\hat{h} = h/stride, \; \hat{w} = w/stride$ (2).
The inference speed of most convolution operators becomes fast enough in this spatial domain; therefore, we focused on finding a combination of convolution operators with higher accuracy rather than higher inference speed in Table 2. The direct metric of the stacked 1D convolutions is slower than that of the 2D convolution, but this operator has the advantage of extracting the feature map more locally in the vertical and horizontal directions using the 3 × 1 and 1 × 3 filters. This helps to increase accuracy without significantly reducing the inference speed in this small spatial domain. The next indirect metric analysis concerns the depthwise convolution. The total number of depthwise operations $O(T)$ is calculated by equation (3): $O(T) = c \cdot n^2 \cdot \hat{h}\hat{w}$ (3). For the direct metric, the depthwise convolution [36] is 2.1 times faster than the 3 × 3 convolution, and it connects only the corresponding feature channels. In addition, we integrated multi-scale contextual information over pixels using dilated convolution. This can accommodate a wide receptive field compared with the standard convolution without expanding the number of parameters. To sum up our strategy in the $F_{16}$ module: when the spatial domain becomes smaller, we set the convolution filters to be more local and increase the number of channels. The next step is to process the corresponding channels using depthwise convolution with dilation. Based on the results of the experiments, we observe that $F_{16}$ has better performance than the other FSF subset modules for a spatial domain size of 64 × 128, as shown in the ablation studies in Tables 6 and 10.
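A minimal PyTorch sketch of the $F_{16}$ sequence described above follows; the class name and channel count (96, per the $c = \{64, 96, 128\}$ convention) are assumptions, and the dilation rate follows the {1, 2, 4, 8} schedule examined in the ablation study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class F16Block(nn.Module):
    """Sketch of the described F16 sequence: bilinear downsampling ->
    3x1 conv -> 1x3 conv -> dilated depthwise conv -> bilinear upsampling."""
    def __init__(self, channels: int = 96, dilation: int = 1):
        super().__init__()
        self.conv3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        # groups == channels: connects only the corresponding feature channel
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h, w = x.shape[2:]
        out = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                            align_corners=False)
        out = self.relu(self.conv3x1(out))
        out = self.relu(self.conv1x3(out))
        out = self.relu(self.dw(out))
        out = F.interpolate(out, size=(h, w), mode='bilinear',
                            align_corners=False)
        return self.relu(out + x)
```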

3) $F_{32}$ MODULE
Inspired by factorizing into smaller convolutions in Inception [1], we designed the input feature map to be sequentially processed with bilinear downsampling, a 1 × 1 convolution, another 1 × 1 convolution, a depthwise convolution, and bilinear upsampling. The $F_{32}$ module is applied after the $F_{16}$ module, and the spatial domain size is 32 × 64, which corresponds to a downsample rate of 32 relative to the original input size. For the indirect metric analysis of the 1 × 1 convolution, the transformation function $T(k_{32}): f_{in} \rightarrow f_{out}$ maps the input feature map $f_{in} \in \mathbb{R}^{h \times w \times c}$ to the output feature map $f_{out} \in \mathbb{R}^{\hat{h} \times \hat{w} \times c'}$. The transformation function $T(k_{32})$ is applied with pointwise kernels $k_{32} \in \mathbb{R}^{1 \times 1 \times c}$. The total number of operations $O(T, k_{32})$ in the 1 × 1 convolution layer is $c \cdot c' \cdot \hat{h}\hat{w}$, where the output feature map size follows the stride of the kernel as in equation (4): $\hat{h} = h/stride, \; \hat{w} = w/stride$ (4).
The inference speed of the 1 × 1 convolution is faster than that of the 1D and 2D convolutions, with very small parameter numbers and FLOPs, as shown in Table 2. Our strategy is to localize the feature extraction in the tiny spatial domain of size 32 × 64 using 1 × 1 convolutions. We also adopt the depthwise convolution to connect the corresponding feature channels and dilated convolution to adjust the receptive field of the feature points, similar to the $F_{16}$ module. The experimental results show that the $F_{32}$ module in the tiny spatial domain has better accuracy and inference speed than the other subset modules of the FSF module, as shown in Tables 6 and 10.
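For completeness, a minimal sketch of the $F_{32}$ sequence, under the same assumptions as the $F_8$ and $F_{16}$ sketches (names, channel count of 128, and activation placement are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class F32Block(nn.Module):
    """Sketch of the described F32 sequence: bilinear downsampling ->
    1x1 conv -> 1x1 conv -> dilated depthwise conv -> bilinear upsampling."""
    def __init__(self, channels: int = 128, dilation: int = 1):
        super().__init__()
        self.pw1 = nn.Conv2d(channels, channels, 1)   # pointwise, very cheap
        self.pw2 = nn.Conv2d(channels, channels, 1)
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h, w = x.shape[2:]
        out = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                            align_corners=False)
        out = self.relu(self.pw1(out))
        out = self.relu(self.pw2(out))
        out = self.relu(self.dw(out))
        return self.relu(F.interpolate(out, size=(h, w), mode='bilinear',
                                       align_corners=False) + x)
```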

B. MULTI-RESOLUTION AGGREGATION MODULE
We use a multi-resolution aggregation (MRA) module to restore the pixel-wise prediction from the lower-resolution feature maps. The MRA module adopts three strategies to improve inference speed and accuracy. First, we use 1 × 1 filters for all convolutions to reduce local connectivity. The $F_{16}$ and $F_{32}$ modules extract feature maps with minimal local connectivity in the small and tiny spatial domains, so the MRA module restores the spatial domain consistently while keeping local connectivity low with 1 × 1 convolutions. The experimental results show that the 1 × 1 convolution in the MRA module is better than a combination of 3 × 3 and 1 × 1 convolutions, as shown in Table 11. The second strategy is to combine multiple resolutions, because this provides an ensemble effect in the feature maps, and some studies [37], [38] show the good performance of this structure. Third, we adopt bilinear upsampling to restore the spatial domain. This operation has the advantage of being fast and simple to implement because it only uses the four nearest pixel values. The entire process of the MRA module is shown in Fig. 4. The feature maps of the branches from the $F_8$, $F_{16}$, and $F_{32}$ modules are combined and sequentially processed with 1 × 1 convolution, bilinear upsampling, and a concatenation operation. Two branches are first chosen to aggregate the outputs with downsample rates of 16 and 32. The next branch combines the outputs with downsample rates of 16 and 8.
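A minimal sketch of one MRA aggregation step follows; the class name, channel choices, and the way the two steps are chained are our assumptions, based only on the description above (1 × 1 convolution, bilinear upsampling, concatenation, fusing the 1/32 output into the 1/16 branch and then into the 1/8 branch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRABranch(nn.Module):
    """Sketch of one MRA step: reduce local connectivity with a 1x1 conv,
    restore the spatial domain with bilinear upsampling, then concatenate
    with the higher-resolution branch."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, low_res, high_res):
        out = self.conv1x1(low_res)
        out = F.interpolate(out, size=high_res.shape[2:],
                            mode='bilinear', align_corners=False)
        return torch.cat([out, high_res], dim=1)

# Usage: fuse the F32 output into the F16 branch, then into the F8 branch,
# with channel widths (64, 96, 128) matching the FSF subset modules.
f8 = torch.randn(1, 64, 128, 256)
f16 = torch.randn(1, 96, 64, 128)
f32 = torch.randn(1, 128, 32, 64)
mid = MRABranch(128, 96)(f32, f16)        # 1/32 -> 1/16, 192 channels out
out = MRABranch(96 + 96, 64)(mid, f8)     # 1/16 -> 1/8, 128 channels out
```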

C. NETWORK ARCHITECTURE
Inspired by the encoder-decoder architecture of ERFNet [39], we follow an encoder-decoder design, as shown in Table 3. Layers 1 to 5 of our architecture form the encoder, composed of the downsample modules in Table 3. The downsampling operation gradually reduces the spatial domain to extract high-dimensional semantic information. We apply a 3 × 3 convolution with a stride of 2, as shown in Fig. 4. This convolution convolves and downsamples the spatial domain simultaneously, so it is significantly cheaper computationally (reduced to 1/4) than a stride-1 convolution followed by pooling for downsampling.
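As a concrete illustration, a stride-2 downsample block of the kind described above might look as follows in PyTorch; the channel counts (3 in, 16 out) and the BN/ReLU placement are assumptions for the sketch, not the exact layers of Table 3.

```python
import torch.nn as nn

# A 3x3 convolution with stride 2 convolves and halves the spatial domain
# in one pass, computing outputs at only 1/4 of the stride-1 positions.
downsample = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
)
```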
For the channel size, the number of channels increases by the downsample rate, but we limited the size of channels to 64 in order to consider the low computational cost.
The FSF module in the encoder comprises layers 6 to 19 in Table 3. We apply the $F_8$, $F_{16}$, and $F_{32}$ modules in the spatial domains with downsample rates of 8, 16, and 32, respectively, four times each. This extracts the feature map at high inference speed according to the size of the spatial domain. We also include dropout in the FSF module and slightly increase the dropout ratio, as this yields better results in our architecture, as shown in Table 8. Each subset module is connected by a 1 × 1 convolution, named the ''connection convolution'' in our architecture. The connection convolution increases the channel size because the $F_8$, $F_{16}$, and $F_{32}$ modules have different numbers of channels. To apply residual learning in the FSF module, the input and output sizes must match, so the connection convolution adjusts the number of channels without significantly increasing the number of parameters.
The multi-resolution aggregation module of the decoder completes a more sophisticated boundary segmentation in layers 20 to 29 of Table 3. It gradually restores the spatial information lost in the encoder owing to the reduction of the spatial domain. The feature maps are restored in the reverse order of the encoder, so the number of channels at the decoder's downsample rates would be too large; we therefore use the connection convolution for dimensionality reduction. In layer 29, we restore the original 1024 × 2048 image with an eightfold expansion using bilinear upsampling.

IV. EXPERIMENT
In this section, we conduct a set of semantic segmentation experiments on the challenging Cityscapes [34], CamVid [40], and Mapillary Vistas [41] datasets to demonstrate the high accuracy-efficiency trade-off of our proposed FSFNet.

A. EXPERIMENT SETTINGS
1) CITYSCAPES DATASET
The Cityscapes dataset adequately captures the complexity of real-world urban scenes from 50 cities, and consists of training, validation, and test sets of 2975, 500, and 1525 images, respectively. The images have a size of 1024 × 2048 and are annotated with 19 semantic categories. As the labels of the test set are not disclosed, the results were evaluated using an online test server. For accuracy evaluation, we adopted the intersection-over-union (IoU) metric, also referred to as the standard Jaccard index, which is the most commonly used evaluation metric in semantic segmentation, defined in (5), where TP, FP, and FN indicate the numbers of true positives, false positives, and false negatives, respectively: $\text{IoU} = \frac{TP}{TP + FP + FN}$ (5).
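Equation (5) translates directly into code; the following is a small Python sketch (function names are ours) for per-class IoU and the mean over the 19 Cityscapes classes.

```python
import numpy as np

def class_iou(pred: np.ndarray, target: np.ndarray, cls: int) -> float:
    """IoU for one class following equation (5): TP / (TP + FP + FN)."""
    tp = np.sum((pred == cls) & (target == cls))
    fp = np.sum((pred == cls) & (target != cls))
    fn = np.sum((pred != cls) & (target == cls))
    return tp / (tp + fp + fn + 1e-10)  # epsilon guards empty classes

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 19) -> float:
    return float(np.mean([class_iou(pred, target, c) for c in range(num_classes)]))
```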

2) IMPLEMENTATION DETAILS
The inference speed measurements were performed on an NVIDIA GeForce GTX 1080Ti under CUDA 10.0 [42] and cuDNN v7 [43]. We also employed the high-performance inference framework TensorRT v5.1.5 [44], and implemented the model in PyTorch. Our FSFNet is trained using Adam optimization [45] with a batch size of 6 for 200 epochs, a momentum of 0.9, a weight decay of 1e-4, and an initial learning rate of 5e-4. Our architecture is an encoder-decoder network: first, the encoder is trained to extract appropriate feature maps, and then the decoder is attached to restore the image while training the full architecture. We consider two methods to train the encoder: one uses only the Cityscapes dataset, and the other employs pretrained ImageNet weights [46]. When training with only one dataset, we first train the encoder with downsampled (1/32 size) segmentation annotations from the Cityscapes dataset by attaching an extra convolution layer at the end of the encoder. After the encoder is trained, we remove the last layer of the encoder and then attach the decoder, training the full network on the original size of the dataset. For the pretrained ImageNet weights [47], compared with the above method, the encoder is trained on the ImageNet dataset with an average pooling layer added after the last layer of the encoder. Once this modified encoder is trained, we remove the last average pooling layer and attach the decoder to train the full architecture using the Cityscapes dataset. For data augmentation, we applied a random horizontal flip to some images, and performed random translation between -2 and 2 pixels on both axes.
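A minimal sketch of the stated optimization settings is shown below; `model` (the FSFNet encoder-decoder), `train_loader` (a Cityscapes loader), a CUDA device, and the cross-entropy loss are assumptions (the paper does not state its loss function), and the stated momentum of 0.9 is mapped to Adam's beta1.

```python
import torch

# Assumed: `model` is the FSFNet encoder-decoder, `train_loader` yields
# (image, label) batches of size 6 from Cityscapes.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss(ignore_index=255)  # assumed void label

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()
```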

B. ABLATION STUDIES
We conducted a series of experiments to demonstrate the performance of our proposed network. We analyzed the Cityscapes dataset using both quantitative and qualitative approaches. The quantitative experiments are described by the ablation studies, and the qualitative results of FSFNet are shown in Fig. 5.

1) BATCH SIZE
Table 4 shows the ablation study with the batch size increasing from 1 to 8. For batch sizes of 1 to 7, one GPU was used to train our FSFNet architecture, and for a batch size of 8, two GPUs were used because of memory limitations. We also compared the cached memory, which is the memory currently used on the GPU by PyTorch (as can be seen in nvidia-smi). In our FSFNet architecture, mIoU increased with the batch size from 1 to 6 and stagnated for batch sizes of 7 to 8; a batch size of 6 is the most optimized hyper-parameter for our architecture. The reason is that a batch size that is too small may cause high variance because of inaccurate gradient updates. Batch sizes that are too large can make optimization easier, but they may converge to a local minimum or saddle point and incur a high computational cost per iteration.

2) CHANNEL SIZE
We performed experiments to demonstrate how the channel size affects our FSFNet architecture with respect to increasing accuracy and decreasing inference speed. We varied the channel size in the FSF module over {16, 32, 64, 96, 128}. For the analysis, we studied the variation of the parameter numbers, FLOPs, and FPS with the channel size in Table 5. The mIoU is evaluated using the Cityscapes validation set. As the channel size increases, the inference speed decreases and the number of parameters increases. In our work, FSFNet selects the target model based on the criterion in (6), which requires the per-frame inference time to stay within a budget Time of 5 ms: $1000 / \text{FPS} \leq \text{Time}$ (6).
The models with channel sizes of 16, 32, 64, and the combination (64, 96, 128) satisfied our target criterion. Among these, we chose the model with the channel size combination of 64, 96, and 128, which is the most accurate.
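A tiny sketch of the selection criterion in (6) as we have reconstructed it (the exact form of (6) is not preserved in this text, so the function below is an assumption consistent with "Time is 5 ms"):

```python
def satisfies_target(fps: float, time_budget_ms: float = 5.0) -> bool:
    """Assumed form of criterion (6): keep a candidate model only if its
    per-frame inference time, 1000 / FPS in ms, fits the 5 ms budget
    (equivalently, at least 200 FPS)."""
    return 1000.0 / fps <= time_budget_ms

# e.g., the chosen (64, 96, 128) configuration at 203 FPS passes
print(satisfies_target(203))  # True
```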

3) DILATION RATE
We used dilated convolution [48] to enable flexible aggregation of multi-scale contextual information in the FSF module. There are four fast spatial convolution layers at each of the {16, 32} downsample rates. We adopted several dilation rate schedules, namely {not used}, {1, 1, 2, 2}, {1, 2, 3, 4}, and {1, 2, 4, 8}, for the 16 and 32 downsample rates. We compared the setting without dilation against those that gradually increase the dilation rate. Table 7 confirms that the Cityscapes validation result varies by over 2.01% according to the use of dilated convolution. The dilation rate schedule {1, 2, 4, 8}, which has the largest dilation intervals, performs better than the other dilation rates in our architecture.

4) DROPOUT RATE
In this section, we show how a suitable dropout rate can enhance the accuracy of our architecture. In Table 8, we compare the dropout ratios for each FSF module. There are four settings: no dropout at all, a slightly increasing dropout ratio, the same dropout ratio throughout, and a dropout ratio decreasing from large to small. The results show that the use of dropout is effective, with a 1.08% difference depending on whether it is used. In particular, gradually increasing the dropout ratio, starting from 0.01, across the FSF modules shows the best performance. This is because our algorithm increases the number of channels in the FSF module, so the result has a higher risk of overfitting as the network deepens. Thus, the increasing dropout ratio is suitable for our architecture and shows good performance.

5) CONVOLUTION LAYER SEQUENCE
The original batch normalization paper [23] proposes applying batch normalization before the ReLU activation, but some papers [21], [49] report that batch normalization after the ReLU activation results in better performance. This issue remains a topic of debate. We experimented with different orderings of the convolution, batch normalization, and ReLU layers, as shown in Table 9.
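The two orderings under comparison can be sketched as follows in PyTorch (the helper name and channel handling are illustrative):

```python
import torch.nn as nn

def conv_block(channels: int, bn_before_relu: bool) -> nn.Sequential:
    """The two layer orderings compared in Table 9:
    conv -> BN -> ReLU (original BN paper) vs conv -> ReLU -> BN."""
    conv = nn.Conv2d(channels, channels, 3, padding=1)
    if bn_before_relu:
        return nn.Sequential(conv, nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
    return nn.Sequential(conv, nn.ReLU(inplace=True), nn.BatchNorm2d(channels))
```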

6) SPATIAL DOMAIN
Our FSFNet architecture is designed around spatial domain information. We verified that it is effective to apply a different convolution combination for each spatial domain size using quantitative and qualitative experiments. The quantitative experiments are shown in Tables 6 and 10. We obtained the per-class results for different combinations of the FSF module on the Cityscapes test set in Table 6. We compared the results obtained for the various combinations of the three subset modules in terms of FPS, parameter numbers, and FLOPs in Table 10. We observe that the subset modules sequentially processed as $F_8$, $F_{16}$, and $F_{32}$ achieve a better trade-off, with 0.2~2.8% higher mIoU at high inference speed, than other combinations. In addition, the qualitative results show that the $F_8$, $F_{16}$, and $F_{32}$ combination classifies traffic signs, buses, cars, and walls more precisely than applying only one constant subset module of the FSF module, as shown in Fig. 6. The red bounding boxes in each image indicate which combinations of FSF modules classify more accurately.

7) DECODER STRUCTURE
To explore the effect of the decoder, we built alternative decoder structures using only 3 × 3 convolutions and a combination of 3 × 3 and 1 × 1 convolutions, as shown in Table 11. We can see that the 1 × 1 convolution increases accuracy by 1.43% and 2.15%, respectively. In addition, the inference speed of the 1 × 1 convolution is faster than that of the other 1D and 2D convolution operations evaluated in the previous experiments.

C. COMPARISON WITH SOTA
In this section, we compare our FSFNet with state-of-the-art architectures for real-time semantic segmentation. Our architecture achieves the fastest inference speed while remaining highly competitive in terms of accuracy on the challenging Cityscapes dataset. We propose two training strategies, FSFNet and FSFNet (pretrained), trained using Cityscapes only or with ImageNet pretraining, respectively. The FSFNet strategy trains only on the Cityscapes dataset using the encoder-decoder architecture and accomplishes a 68.3% class mIoU and an 86.4% category mIoU in Table 6. The FSFNet (pretrained) strategy pretrains the encoder with ImageNet weights, and the decoder is then trained on the Cityscapes dataset; it attains a 69.1% class mIoU and an 86.5% category mIoU. Both architectures were evaluated on the Cityscapes test set. The individual per-class accuracy results in Table 6 show that our architecture achieves excellent accuracy in all classes. In particular, the road, building, vegetation, sky, and car classes achieve over 90%. Some challenging classes such as wall, fence, and truck score lower than the above-mentioned classes, but compared with other models, these results are still quite high. Based on our results, our FSFNet is the only model that maintains high accuracy while running at 203 FPS on a single GPU. This FPS is over 30% faster than the state-of-the-art model. The evaluation results and inference speeds were obtained under the same conditions using the Cityscapes test set for a fair comparison. ERFNet [14] and ICNet [33], with many parameters, have low inference speed and high accuracy. SegNet [15], with many parameters, has low inference speed and low accuracy. For models with a small number of parameters, such as ENet [13], the inference speed is not very fast and the accuracy is low. Table 12 shows that the accuracy of FSFNet is similar to that of ICNet while using only 10% of the parameters of ICNet and increasing the inference speed by over five times. Moreover, although the input resolution of our model is the original high resolution of 1024 × 2048, our FSFNet is the fastest model, with a small number of parameters and high accuracy. Compared with the DF1-Seg-d8 [50] and FasterSeg [35] models, the mIoU of FSFNet is 2.2% and 2.3% lower, but the inference speed is 48% and 23% faster, respectively. Thus, we can confirm that our FSFNet offers a good trade-off between semantic segmentation accuracy and computational resources among existing state-of-the-art semantic segmentation models.

D. SEGMENTATION RESULTS ON OTHER DATASETS
1) CamVid DATASET
The Cambridge-driving Labeled Video (CamVid) dataset [40] is a collection of videos with street scenes captured from a car driving through Cambridge, UK. The database consists of training, validation, and test sets of 367, 101, and 233 images, respectively. We train our FSFNet at the original input size of 720 × 960 on images involving 11 semantic categories such as sky, building, pole, road, pavement, tree, sign-symbol, fence, car, pedestrian, and bicyclist. Table 13 shows that FSFNet does not reach the highest accuracy compared with other models, but its inference speed reaches 538 FPS. This inference speed is more than eight times faster than that of the other models. This result confirms that FSFNet retains competitive accuracy while achieving a very high FPS.

2) MAPILLARY VISTAS DATASET
The Mapillary Vistas dataset [41] is more challenging because it contains large-scale, high-resolution street-level images. This dataset contains training and validation sets of 18,000 and 2,000 images, respectively, annotated with 66 object categories. We resized the images to 2048 × 1024 using bilinear interpolation to reduce the training time and enable training on one GPU. Only a few methods have attempted to process this dataset in real time. The results in Table 14 show that our FSFNet achieves competitive accuracy and low inference time on high-resolution images.

E. PERFORMANCE RESULT ON AN EDGE DEVICE AND REAL VEHICLE ROADS
We measured the performance of FSFNet on an edge device, the NVIDIA Jetson TX2, and compared it with other models. This device is designed for embedded artificial intelligence and computer vision applications. It contains an NVIDIA Pascal GPU with 256 CUDA cores at a base clock frequency of 854 MHz, and the size of its GPU memory is 8 GB. Table 15 shows the inference speed for different resolutions. FSFNet works on the edge device for real-time semantic segmentation and runs at 130 FPS for 480 × 320 images. This result verifies that FSFNet achieves a much faster inference time than other approaches on edge devices.
In addition, Fig. 7 shows the results of FSFNet, trained on the Cityscapes dataset, applied to real vehicle roads without any additional optimization. This result indicates that our model works well on general real-world driving environment images.

V. CONCLUSION
In this paper, we introduced FSFNet, an optimized lightweight semantic segmentation model using an accelerator. In comparison with previous work, we achieved a significant improvement, with high segmentation accuracy and state-of-the-art inference speed. Meanwhile, our model is optimized for the Cityscapes dataset, and its accuracy on other datasets is lower than on Cityscapes. In future work, we will attempt to improve the accuracy on other datasets by enhancing the two modules, FSF and multi-resolution aggregation. In addition, we will expand these modules, which are currently limited to semantic segmentation, to panoptic segmentation for higher-level visual tasks.