Efficient Fast Semantic Segmentation Using Continuous Shuffle Dilated Convolutions

It is difficult for many semantic segmentation methods to perform useful inference on extremely resource-constrained devices; therefore, this paper proposes an efficient fast semantic segmentation network (EFSNet) that employs a continuous shuffle dilated convolution (CSDC) module and an up-sampling module. First, the number of parameters is reduced by group convolutions and the receptive field is enlarged by dilated convolutions. Second, an up-sampling module is proposed to reduce noise and increase inference speed. Finally, combining the CSDC module and the up-sampling module, EFSNet is built on a fast semantic segmentation network. Our experiments on the CamVid and Cityscapes datasets show that the proposed EFSNet, with only 173 k parameters, obtains mean intersection over union (mIoU) values of 61.1% and 61.9%, respectively. The method runs in real time at 332 frames per second (FPS) and 107 FPS on a single NVIDIA Titan Xp GPU. Compared with several existing methods, our network achieves efficient segmentation with low resource consumption.


I. INTRODUCTION
With the rapid development of deep learning, convolutional neural networks (CNNs) have achieved impressive results in various applications, e.g., classification [1], target detection and tracking [2], [3] and segmentation [4]. Semantic segmentation extracts rich contour feature information from an image by assigning each pixel to a definite category, and it has a wide range of applications. In image beautification, semantic segmentation is an important method for extracting the external contours of people [5]. It is also an important pre-processing step that can replace manual medical image segmentation when extracting regions where lesions may occur [6]. In addition, it plays an important role in road scene understanding, such as road edge segmentation [7]. Semantic segmentation is a challenging computer vision task because it requires precise classification of target pixels of different scales and categories.
Many semantic segmentation methods contain a large number of parameters, resulting in a low inference speed. The most classical method is the fully convolutional network (FCN) [4], the pioneer method for end-to-end semantic segmentation. SegNet [8], a symmetric encoder-decoder structure, uses the early layers of the image classification network VGGNet [9] as the encoder and adopts the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear up-sampling, but the network has many parameters. PSPNet [10] applies a pyramid pooling module to extract low-level features and contextual features at different scales and achieves excellent segmentation performance; however, the forward pass is slow because of the large number of convolution filters. DANet [11] introduced the self-attention mechanism to semantic segmentation, capturing rich contextual relationships for better feature representations with intra-class compactness through a position attention module and a channel attention module, but it requires many matrix multiplication operations. Although the above methods achieve competitive accuracy, they do not offer a good trade-off between accuracy and speed. In some real-world applications, e.g., autonomous driving and intelligent surveillance, a light-weight network is still needed on embedded devices for fast image prediction.
Recently, some light-weight networks have made great contributions to reducing inference time and enhancing segmentation accuracy. ENet [12] uses the bottleneck block [13] and asymmetric convolutions [14] to reduce the feature map dimensions and floating-point operations (FLOPs), achieving good semantic segmentation performance on embedded devices. CGNet [15] uses regular and dilated convolutions to perform feature extraction in parallel, and then obtains a weight vector to guide feature map fusion through a global average pooling layer and a fully-connected layer; it can learn local features and contextual information with few parameters. ESPNet [16] efficiently decomposes a standard convolutional kernel into a point-wise convolution and multiple dilated convolutions in a spatial pyramid structure, achieving large receptive fields and a good trade-off between speed and accuracy. Zhang et al. construct a fast semantic segmentation network named FSSNet [17] from multiple effective blocks to achieve high accuracy with only 0.2 M parameters, but it does not perform well on similar inter-class objects. ShuffleNet [18], a light-weight image classification network, uses point-wise group convolution to first reduce and later increase the feature map dimensions. In addition, to alleviate the accuracy degradation caused by group convolution, it uses a channel shuffle operation to overcome the poor communication between different groups, achieving high classification accuracy with low computational complexity. These light-weight semantic segmentation networks use a small number of parameters to achieve low-budget segmentation, but the resulting accuracy is relatively low.
Therefore, methods that achieve high-precision semantic segmentation with a limited number of parameters are important and interesting. To further reduce parameters and improve accuracy, we propose the efficient fast semantic segmentation network EFSNet, which adopts continuous shuffle dilated convolution (CSDC) to perform fast and accurate segmentation. Unlike CDB [17], our proposed CSDC not only enlarges the receptive field but also reduces the number of FLOPs. It uses group convolution [18] to first reduce and then increase the number of feature map channels, dilated convolution [19] to enlarge the receptive field, and channel shuffle [18] to alleviate the loss of accuracy. Moreover, we construct an up-sampling module that combines feature map information of the same resolution in the encoder and is able to generate prediction images with less noise while avoiding a large number of element-wise additions. Finally, we use continuous ShuffleNet Units [18] to further reduce the number of FLOPs and improve segmentation accuracy.
The main contributions of this paper are as follows: 1) We propose a CSDC module, which uses dilated convolutions with different spatial dilation rates to efficiently enlarge the receptive field.
2) We construct an up-sampling module that improves inference speed and semantic segmentation accuracy.
3) Our network uses only 173 k parameters and achieves good performance on the CamVid and Cityscapes test sets.

II. RELATED WORK
A. HIGH ACCURACY SEMANTIC SEGMENTATION
FCN [4], a pioneering work in image semantic segmentation, replaces the last fully-connected layers with convolutions. SegNet [8], an encoder-decoder architecture, applies max-pooling indices when up-sampling low-resolution feature maps to enhance non-linear transformations. DeepLab [20] uses dilated convolution to enlarge the receptive field for dense labeling, and employs dense conditional random fields to increase segmentation accuracy. PSPNet [10] adopts multiple branch convolutions of different sizes to extract spatial features, then up-samples them to the same size to enrich the details. Zhao et al. [23] use a point-wise spatial attention network to self-adaptively learn contextual information. DANet [11] proposes a dual attention module based on the self-attention mechanism to learn local features and global dependencies. CCNet [24] proposes a criss-cross attention module to utilize the contextual information of all the pixels on its criss-cross path. These methods are effective, but they require high FLOPs because the models have many parameters and lack efficient techniques such as group convolution [18], [21] and depth-wise separable convolution [22].

B. REAL-TIME SEMANTIC SEGMENTATION
Recently, some real-time semantic segmentation networks have received extensive attention because of their high speed and accuracy. SQ [25] is based on the low-latency network SqueezeNet [26], which has high classification accuracy, first reducing and then increasing the channels of the feature map by point-wise convolutions for real-time segmentation. ICNet [27], based on PSPNet [10], uses low-resolution images to rapidly obtain semantic information and high-resolution images to refine the details for high segmentation accuracy. BiSeNet [28] adopts an attention mechanism to capture high-level semantic features, and its feature fusion module fuses high-level semantic features and low-level textures to enrich details. ERFNet [29] employs cascaded factorized convolutions to enhance non-linear capabilities, while using residual connections to enhance gradient propagation. Although these networks achieve high accuracy, they do not work well on mobile devices with low computation budgets, such as mobile phones.

C. LIGHT-WEIGHT NETWORK
Light-weight networks represent a good trade-off between accuracy and model parameters, enabling their deployment on mobile devices. The image classification network ShuffleNet [18], [21] employs the principle of convolution decomposition to reduce parameters by factorizing a standard convolution into point-wise and depth-wise convolutions. Specifically, channel shuffle is used after the point-wise convolution to alleviate the poor communication between different groups caused by group convolution. ENet [12] uses small convolution kernels instead of standard convolutions to build a light-weight semantic segmentation network, making semantic segmentation on embedded devices feasible. CGNet [15] applies a context information guidance module to learn local features and features of different stages, but its segmentation speed is slow for high-resolution images. ESPNet [16] is an efficient semantic segmentation network with a large receptive field that uses point-wise convolution and a spatial pyramid of dilated convolutions to build efficient feature extraction modules. FSSNet [17] uses continuous factorized blocks to extract low-level features and applies a continuous dilated block to enlarge the receptive field. However, FSSNet does not use group convolution; therefore, it still consumes a large number of FLOPs. Our proposed CSDC not only reduces computational complexity but also enlarges the receptive field.

III. OUR APPROACH
In this section, we first describe the EFSNet architecture, which is based on FSSNet [17]. Then, we introduce our proposed CSDC module, up-sampling module and decoder network in detail. Finally, we perform a series of experiments showing that our method reduces computational cost and increases segmentation accuracy.

A. THE STRUCTURE OF EFSNet
We propose the structure of EFSNet shown in Fig. 1 and Table 1. EFSNet is a light-weight encoder-decoder semantic segmentation network, which not only learns multi-scale context but also reduces the computational cost.
The encoder network consists of four parts. First, the initial block [12] reduces the dimension of the high-resolution images for training and prediction. The down-sampling block [17] preserves the salient features and learns the down-sampled feature map through max-pooling operations and spatial convolutions, respectively. The factorization block [17] reduces the parameters by using asymmetric convolutions that decompose a spatial convolution into two smaller convolutions.
TABLE 1. Architecture of our proposed EFSNet. In the ''Type'' column, × n indicates that the module is repeated n times, and r denotes the dilation rate. ''Size'' represents the size of the output feature maps of the module; for example, 512 × 256 × 16 denotes an output feature map resolution of 512 × 256 with 16 channels. C represents the number of classes of the dataset.
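The parameter saving of the factorization block can be checked with simple arithmetic: decomposing a k × k convolution into a k × 1 convolution followed by a 1 × k one cuts the weight count from k² to 2k per input-output channel pair. A minimal sketch (the channel width of 64 is an illustrative assumption, not a value from the paper):

```python
def conv_params(c_in, c_out, kh, kw):
    # Weight count of a convolution layer (bias terms ignored for clarity).
    return c_in * c_out * kh * kw

c = 64
standard = conv_params(c, c, 3, 3)                               # one 3x3 convolution
factorized = conv_params(c, c, 3, 1) + conv_params(c, c, 1, 3)   # 3x1 followed by 1x3
print(standard, factorized)  # 36864 24576
assert factorized < standard  # a 1/3 reduction for k = 3
```

For k = 3 the saving is one third; it grows with larger kernels, since 2k/k² shrinks as k increases.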
We propose the shuffle dilated convolution (SDC) and the continuous shuffle dilated convolution (CSDC), as shown in Fig. 2. To learn multi-scale features and ease the gridding problem, we replace the 3 × 3 convolution filter in the ShuffleNet Unit [18] with dilated 3 × 3 convolutions with different dilation rates. Table 6 illustrates that the SDC module can obtain higher segmentation accuracy with fewer parameters compared to the ShuffleNet Unit.
We form the CSDC module by stacking multiple SDC modules with different dilation rates. The CSDC module learns multi-scale contexts based on the reduce-transform-expand principle, which not only enlarges the receptive field but also requires less computational effort. The hyper-parameter N is the number of stacked CSDC modules and controls the depth of the network; CSDC × N indicates that the CSDC module is repeated N times. Table 6 illustrates that the SDC module can build a deeper network with fewer parameters to improve accuracy.
Since the main purpose of the decoder network is to restore the feature map resolution, a decoder with a low number of parameters can still produce a decent segmentation probability map. The decoder consists of the up-sampling module, the ShuffleNet Unit [18] and a deconvolution [30]. In the up-sampling module, the previously down-sampled feature map and the output of the previous layer are up-sampled by a factor of two and then concatenated to refine the details. The ShuffleNet Unit is then used to further improve the prediction accuracy, as shown in Table 2. In the last layer, a transposed convolution with a 3 × 3 kernel and a stride of 2 generates prediction maps for C classes. When the decoder performs up-sampling and convolution operations, it combines high-level semantic information with low-level details, thus effectively improving segmentation accuracy. EFSNet1 and EFSNet2 are obtained by using different dimensionality reduction factors in EFSNet: except for the initial block, the up-sampling block and the deconvolution layer, the number of bottleneck channels is set to 1/4 and 1/2 of the input channels, respectively. The experiments described in Section IV show that our proposed EFSNet performs better with respect to parameter size and mIoU than similar methods.

B. CONTINUOUS SHUFFLE DILATED CONVOLUTION
The shuffle dilated convolution is based on the ShuffleNet Unit [18] and dilated convolution [19], as shown in Fig. 2 (a). Unlike the ShuffleNet Unit, SDC uses dilated convolutions with dilation rates r = 2^k, k ∈ [0, K − 1], which enlarge the receptive field without additional parameters. The left branch takes the output of the previous layer as the input feature map. In the right branch, 1 × 1 group convolutions are used to reduce and later increase the number of channels. However, group convolution may cause the outputs of a certain group to relate only to the inputs within that group; channel shuffle [18] is used to alleviate this drawback. Spatial convolution is then carried out on each feature map by a 3 × 3 dilated convolution.
By stacking four SDC layers with increasing dilation rates, the CSDC module is obtained, as illustrated in Fig. 2 (b). Compared to CDB [17], we use a different dilation rate strategy and more convolution layers to build a deeper and wider network. Experiments in Section IV show that our CSDC module achieves higher accuracy with fewer parameters.
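The structure described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact implementation: the bottleneck reduction factor, the group count, the normalization placement and the residual merge are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels across groups so that later group
    # convolutions see information from every group.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class SDC(nn.Module):
    """Shuffle dilated convolution unit (a sketch; channel widths
    and group counts are illustrative assumptions)."""
    def __init__(self, channels, dilation, groups=2, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.groups = groups
        # 1x1 group convolution reduces the number of channels.
        self.reduce = nn.Conv2d(channels, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        # 3x3 depth-wise dilated convolution enlarges the receptive
        # field without adding parameters relative to dilation rate 1.
        self.dw = nn.Conv2d(mid, mid, 3, padding=dilation,
                            dilation=dilation, groups=mid, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        # 1x1 group convolution expands the channels back.
        self.expand = nn.Conv2d(mid, channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.act = nn.PReLU(channels)

    def forward(self, x):
        out = self.bn1(self.reduce(x))
        out = channel_shuffle(out, self.groups)
        out = self.bn2(self.dw(out))
        out = self.bn3(self.expand(out))
        return self.act(x + out)  # residual merge, as in the ShuffleNet Unit

def make_csdc(channels, k=4):
    # Stack SDC units with rising dilation rates r = 2^k.
    return nn.Sequential(*[SDC(channels, 2 ** i) for i in range(k)])
```

Because the stride-1 units keep the spatial size, a CSDC stack is a drop-in block: `make_csdc(16)(torch.randn(2, 16, 32, 32))` returns a tensor of the same shape.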

C. UP-SAMPLING MODULE
The feature fusion up-sampling module is divided into two branches, using both the shallow and the high-level features of the encoder, as shown in Fig. 3. In the left branch, a 1 × 1 convolution reduces the number of channels of the high-level feature map, and bilinear interpolation then up-samples the feature map by a factor of two. In the right branch, a 1 × 1 convolution reduces the number of channels of the low-level feature maps from the down-sampling block, and a 2 × 2 deconvolution then doubles the resolution. Finally, a concatenation operation merges the features of the two branches, which helps to reduce noise in the predicted images. Compared with the up-sampling block of [17], our module has fewer parameters. The 1 × 1 convolutions in the first and second up-sampling modules reduce the number of input channels to 1/4 and 1/8, respectively. The module fuses features by concatenation instead of element-wise summation, maintaining similar accuracy with a low inference time.
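A sketch of this two-branch module in PyTorch (the channel split between the branches and the projection widths are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpSamplingModule(nn.Module):
    """Sketch of the two-branch feature-fusion up-sampling module."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        # Left branch: 1x1 convolution shrinks the high-level channels;
        # bilinear interpolation then doubles the resolution.
        self.high_proj = nn.Conv2d(high_ch, out_ch // 2, 1, bias=False)
        # Right branch: 1x1 convolution on the encoder's low-level
        # features, then a 2x2 deconvolution doubles the resolution.
        self.low_proj = nn.Conv2d(low_ch, out_ch // 2, 1, bias=False)
        self.deconv = nn.ConvTranspose2d(out_ch // 2, out_ch // 2,
                                         kernel_size=2, stride=2)

    def forward(self, high, low):
        h = F.interpolate(self.high_proj(high), scale_factor=2,
                          mode='bilinear', align_corners=False)
        l = self.deconv(self.low_proj(low))
        # Concatenation (not element-wise summation) fuses the branches.
        return torch.cat([h, l], dim=1)
```

Both branches end at twice the input resolution, so the concatenated output has `out_ch` channels at double the spatial size.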

D. DECODER NETWORK
The decoder network consists of the up-sampling module, the ShuffleNet Unit [18] and a transposed convolution [30], as shown in Fig. 1. The main function of the decoder is to restore the feature map resolution and output semantic segmentation probability maps; therefore, an asymmetric (smaller) decoder can be used to reduce parameters and inference time. The decoder is light-weight, with fewer than 20 k parameters, when 1 × 1 convolutions are used to decrease and increase the number of channels. In the decoder, we use two continuous ShuffleNet Units instead of the Bottleneck Block, as shown in Fig. 4 (a) and Fig. 4 (b). The ShuffleNet Unit reduces the number of FLOPs while keeping accuracy comparable to that of the Bottleneck Block. The Bottleneck Block, proposed in ResNet [13] to build thin and deep networks, can be divided into two branches. In its right branch, 1 × 1 convolutions first reduce and then increase the feature map channels, and features are extracted by a 3 × 3 convolution, maintaining a low parameter count. Element-wise summation merges the two branch outputs, thus alleviating the vanishing gradient problem. The ShuffleNet Unit performs better than the Bottleneck Block: in Fig. 4 (b), a 1 × 1 group convolution reduces the number of parameters and the computing resource consumption, the channel shuffle increases the information flow between channel groups, and a 3 × 3 depth-wise convolution operates on each channel of the input feature map, which greatly reduces the number of parameters compared to a standard convolution. A comparison of the Bottleneck Block and the ShuffleNet Unit is shown in Table 2. The last layer of the network, a transposed convolution with a 3 × 3 kernel and a stride of 2, generates a segmentation probability map with C channels.
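The channel shuffle step can be made concrete with a toy tensor. With six channels split into two groups, shuffling interleaves the groups, so a subsequent group convolution sees channels from both halves (a minimal sketch of the ShuffleNet-style operation):

```python
import torch

def channel_shuffle(x, groups):
    # Reshape to (batch, groups, channels_per_group, H, W), swap the
    # two channel axes, and flatten back: groups become interleaved.
    n, c, h, w = x.size()
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous()
             .view(n, c, h, w))

# Six channels in two groups: [0 1 2 | 3 4 5].
x = torch.arange(6.0).view(1, 6, 1, 1)
y = channel_shuffle(x, groups=2)
print(y.flatten().tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

After shuffling, every group of a following group convolution receives channels originating from both input groups, which is exactly the information flow the text describes.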

IV. EXPERIMENTS
In this section, we evaluate EFSNet on the common road scene understanding datasets CamVid [31] and Cityscapes [32]. First, we introduce the experimental details and the evaluation metrics. Then, we perform a series of ablation studies on our CSDC module to validate its effectiveness. Finally, we report our evaluation results and compare them with other state-of-the-art methods on the CamVid and Cityscapes test sets.

A. CamVid DATASET
The CamVid dataset is a video set containing five videos for urban scene understanding, with 701 labeled images. Similar to SegNet [8], we separate the images into 367 training, 233 testing and 101 validation images for a fair comparison. There are 11 classes, such as building, tree, sky, road and bicyclist, ignoring a 12th class that contains unlabeled data. Following SegNet [8], we down-sample each image from 960 × 720 to 480 × 360 before training and testing. During training, we randomly crop the training images to 256 × 256 and set the batch size to 10.

B. CITYSCAPES DATASET
The Cityscapes dataset is a large-scale urban scene perception dataset that includes 5,000 high-quality finely annotated images and 20,000 coarsely annotated images collected in 50 cities. The dataset has 19 semantic classes, such as car, person and rider. We use only the finely annotated images to train the model when verifying the effectiveness of the proposed method; they comprise 2,975 training, 500 validation and 1,525 testing images. During training, we first down-sample the original images from 2048 × 1024 to 1024 × 512 and then randomly crop them to 720 × 360 for rapid training. We set the batch size to 16 to acquire better test results.

C. IMPLEMENTATION DETAILS
The proposed method is implemented with the PyTorch 1.0.1 [33] open-source deep learning library on a computer with an Intel i5-6500 3.2 GHz CPU, 16 GB RAM and a GeForce GTX 1080Ti GPU card (CUDA 9.0) to accelerate training. We train the networks with the Adam optimization algorithm [34] with an initial learning rate of 0.0005. The 'poly' learning rate policy is adopted with a power of 0.9, and the momentum and weight decay are set to 0.9 and 0.0004, respectively. The learning rate lr is defined as follows:

lr = lr_init × (1 − iter / max_iter)^power.  (1)

We initialize the network parameters with He initialization and choose PReLU [35] as our activation function. Batch Normalization [36] is used to accelerate deep network training, and the dropout probability is set to 0.2. Our experiments are carried out on the CamVid and Cityscapes datasets. Cross-entropy is used as the loss function. Because of the class imbalance of the training data, different weights are assigned to each class during training, forcing the network to learn more difficult classes, such as person and rider. Following ENet [12], the class weighting scheme is defined as follows:

w_class = 1 / ln(c + p_class),  (2)

where p_class is the pixel frequency of the class and c = 1.02. During training, regular data augmentation is performed on the CamVid and Cityscapes datasets, including random horizontal flipping and image cropping, thus improving the generalization performance of the proposed EFSNet. For the low-resolution CamVid dataset, dilation rates of 1, 2, 5 and 9 are adopted. EFSNet is trained end-to-end from scratch, without pre-training.
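The 'poly' learning-rate policy and the ENet-style class weighting can be sketched as follows (a minimal sketch; the iteration counts and class frequencies are placeholder values, and c = 1.02 is the constant used by ENet):

```python
import math

def poly_lr(base_lr, it, max_it, power=0.9):
    # 'poly' policy: the rate decays smoothly from base_lr to zero.
    return base_lr * (1 - it / max_it) ** power

def enet_class_weight(p_class, c=1.02):
    # ENet-style weighting: rarer classes (small p_class) receive
    # larger weights, pushing the network to learn them.
    return 1.0 / math.log(c + p_class)

assert poly_lr(5e-4, 0, 1000) == 5e-4      # starts at the initial rate
assert poly_lr(5e-4, 500, 1000) < 5e-4     # decays as training proceeds
# A rare class (1% of pixels) outweighs a common one (50% of pixels).
assert enet_class_weight(0.01) > enet_class_weight(0.5)
```

In practice the weights would be passed to the cross-entropy loss, and the schedule would be evaluated once per iteration or epoch.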

D. EVALUATION METRICS
We evaluate our model using standard metrics of network performance, i.e., segmentation accuracy, computational cost and number of parameters. These metrics are widely used in real-time semantic segmentation.
For the CamVid dataset, we evaluate our results using global accuracy, mean intersection over union (mIoU), class average accuracy and per-class accuracy. Let n_ij be the number of pixels of class i predicted to belong to class j, let n_cls be the number of classes, and let t_i = Σ_j n_ij be the total number of pixels of class i. The metrics are defined as follows:

global accuracy = Σ_i n_ii / Σ_i t_i,
class average accuracy = (1 / n_cls) Σ_i n_ii / t_i,
mIoU = (1 / n_cls) Σ_i n_ii / (t_i + Σ_j n_ji − n_ii).

For the Cityscapes dataset, we compare our results with existing state-of-the-art approaches using class-wise IoU, category-wise IoU and instance-level intersection over union (iIoU). Class IoU and category IoU denote the mean IoU over the nineteen classes and the eight categories, respectively. Unlike on the CamVid dataset, we add the instance-level intersection over union (iIoU) to evaluate the segmentation accuracy of individual instances in the labeling:

iIoU = iTP / (iTP + FP + iFN),

where FP is the number of false positive pixels and iTP and iFN denote weighted counts of true positive and false negative pixels, respectively. In contrast to the standard IoU measure, iIoU weights the contribution of each pixel by the ratio of the class average instance size to the size of the respective ground truth instance. We use both FLOPs and frames per second (FPS) to evaluate the computational complexity of a network. FLOPs is an indirect metric [18] widely used to evaluate the computational complexity of algorithms; however, depending on the actual operating platform, networks with similar FLOPs may run at very different speeds.
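The mIoU definition maps directly onto a confusion matrix. A minimal sketch:

```python
def miou(conf):
    # conf[i][j] = number of pixels of class i predicted as class j.
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        t_i = sum(conf[i])                           # all pixels of class i
        pred_i = sum(conf[j][i] for j in range(n))   # pixels predicted as i
        denom = t_i + pred_i - tp                    # union of truth and prediction
        ious.append(tp / denom if denom else 0.0)
    return sum(ious) / n

conf = [[3, 1],
        [1, 3]]
print(miou(conf))  # 0.6: each class has IoU 3 / (4 + 4 - 3)
```

Accumulating one confusion matrix over the whole test set and computing mIoU once at the end is the usual evaluation protocol.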

E. RESULTS ON THE CamVid DATASET
During testing, we compare the global accuracy, the per-class accuracy and the mIoU with existing state-of-the-art approaches. Furthermore, we measure the inference speed of EFSNet by averaging 100 iterations for 480 × 360 inputs on an NVIDIA Titan Xp GPU. Our proposed EFSNet1, with only 93 k parameters, achieves 60.7 % mIoU on the CamVid test set and processes the inputs at 376 FPS. In addition, EFSNet2, with 173 k parameters, obtains 61.1 % mIoU on the test set at 332 FPS.
The experimental results are shown in Table 3, where n/a indicates that the corresponding results were not published. Table 3 demonstrates that our EFSNet1 achieves the best available trade-off between accuracy and efficiency. Among all the approaches, EFSNet1 yields 60.7 % mIoU and 84.8 % global pixel accuracy, obtaining the best scores in 4 of the 11 categories. Regarding efficiency, our method is nearly 7 times faster and uses 317 times fewer parameters than SegNet [8]. Furthermore, compared to FSSNet [17], another fast semantic segmentation network, our method is 2.1 % more accurate, 2.1 times faster and has 2 times fewer parameters.
FIGURE 5. Visual comparison on the CamVid test dataset. From left to right: input images, ground truth, and segmentation outputs from EFSNet1 (ours), EFSNet2 (ours), ESPNet [16], ENet [12] and SegNet [8].
EFSNet2, with 173 k parameters, achieves 61.1 % mIoU, and its visual results are better than those of FSSNet. Fig. 5 shows some visual examples of our method and FSSNet on the CamVid dataset, and Fig. 7 shows that EFSNet produces more accurate results with fewer network parameters than other methods. The figures demonstrate that our method performs better on pole, sign, car and pedestrian; i.e., it correctly classifies objects at different scales. Additionally, our network differentiates larger classes, such as building and road, nearly as well as SegNet.

F. RESULTS ON THE CITYSCAPES DATASET
During testing, we up-sampled the output feature map to 2048 × 1024 using nearest-neighbor interpolation to match the official image resolution requirements. Moreover, we measured the inference speed of our method by averaging 100 iterations for 1024 × 512 inputs on a single NVIDIA Titan Xp GPU. In Table 4, we report the quantitative results compared with several baselines. The experimental results show that EFSNet1 obtains 57.75 % mIoU with only 93 k parameters on the Cityscapes test set and processes the inputs at 128 FPS.
EFSNet2, with 173 k network parameters, outperforms FSSNet [17] in terms of network parameters, class IoU (mIoU), class iIoU, category IoU, category iIoU and inference speed. Specifically, EFSNet2 is 3.11 % more accurate than FSSNet, achieving 61.91 % mIoU, and its visual results are better. Fig. 6 shows some visual examples of our method and FSSNet on the Cityscapes dataset, and further visual comparisons of the proposed EFSNet and other methods are shown in Fig. 8. The results are more accurate than those of FSSNet [17]: for example, the top right picture in Fig. 6 shows an incorrect prediction for the building class, and the lower right picture in Fig. 6 misclassifies a car in the center of the picture as a person, while our method produces accurate results. In addition, EFSNet2 achieves the highest iIoU, indicating that our method segments small categories, such as rider and fence, more accurately.
To investigate the computational complexity, we measured the FLOPs and the FPS of the state-of-the-art networks on an NVIDIA Titan Xp GPU, as shown in Table 5. EFSNet1 requires a computational budget of approximately 130 M FLOPs, which is 2.46 times lower than that of ENet [12], while also being faster.

G. ABLATION STUDY
1) ABLATION STUDY ON CamVid
To quantitatively compare the effectiveness of the CSDC module and the ShuffleNet Unit, we perform an ablation study on the CamVid dataset, as shown in Table 2. For example, Model A denotes that the CSDC module is used in the encoder and the ShuffleNet Unit is used in the decoder. As can be seen from Table 2, Model C is better than the others; 3 of the 11 categories achieve the highest pixel accuracy.
In addition, Model C achieves the highest mIoU of 60.7 %. Compared with the ShuffleNet Unit, the CSDC module enlarges the receptive field and improves the segmentation accuracy, indicating the effectiveness of the proposed CSDC module. Model C has 10 categories with higher pixel accuracy than Model D, and its mIoU and global pixel accuracy are 3.3 % and 1.0 % higher, respectively. Moreover, we find that the classes with significantly improved accuracy, such as sky, pole, sign, fence and car, are comparable to FSSNet, which shows that the CSDC module is effective. However, some classes, such as building, road and bicyclist, have slightly lower pixel accuracy when the CSDC module is used.

2) COMPARISON OF CDB AND CSDC
To evaluate the effectiveness of our CSDC module, we conduct a comparative analysis of the CDB [17] and the CSDC module, as shown in Table 6. For comparison, we separately counted the parameters and FLOPs of the two modules and the parameters of the remaining layers. Table 6 shows that our CSDC module has fewer parameters and uses fewer FLOPs. Although CDB can effectively enlarge the receptive field without extra parameters, it still accounts for half of the total parameters because of its many convolution filters. Compared with CDB, the number of parameters of our CSDC is reduced by approximately 2.7 times, and the FLOPs are also reduced. Moreover, CSDC increases the depth of the network and has better segmentation performance than CDB with its 6 convolutional layers.
The reason for the above difference is the use of group convolution and depth-wise convolution [18]. Group convolution first divides the inputs into several groups and then performs convolution within each group, reducing the number of parameters and the FLOPs. Depth-wise convolution performs convolution on each channel of the inputs, also reducing the number of parameters and the FLOPs. However, several group convolutional layers simply stacked together cause a drop in accuracy because the outputs only come from the feature maps of a certain group [18]; a channel shuffle operation is adopted to solve this problem. To increase the receptive field, we use dilated convolutions with different dilation rates.
TABLE 7. Ablation studies for G, K and N in our encoder network with the CSDC module. Here, G is the number of groups in the point-wise convolution, K is the kernel size of the depth-wise convolution, and N is the depth factor. D denotes dilated convolutions, and ''true'' indicates that dilated convolution is used in the CSDC module.
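The parameter arithmetic behind this difference can be sketched directly. The counts below compare one standard 3 × 3 convolution against a point-wise group convolution / depth-wise convolution / point-wise group convolution stack at equal width; the channel width of 64 and G = 2 are illustrative assumptions, not the paper's exact configuration:

```python
def standard_conv(c_in, c_out, k):
    # Every output channel convolves all input channels with a k x k kernel.
    return c_in * c_out * k * k

def sdc_style(c_in, c_out, k, groups):
    # point-wise group conv + depth-wise k x k conv + point-wise group conv
    pw_reduce = (c_in * c_out) // groups
    dw = c_out * k * k          # one k x k filter per channel
    pw_expand = (c_out * c_in) // groups
    return pw_reduce + dw + pw_expand

std = standard_conv(64, 64, 3)
dec = sdc_style(64, 64, 3, groups=2)
print(std, dec)  # 36864 4672
assert dec < std
```

The dilated variant of the depth-wise convolution has exactly the same weight count, which is why enlarging the receptive field with dilation is free in parameters.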

3) ENCODER NETWORK WITH CSDC MODULE
To rapidly validate the effectiveness of our CSDC module, we evaluated the segmentation accuracy of the encoder on the Cityscapes validation set, as shown in Table 7. By adding a point-wise convolution layer as the last layer of the encoder and then training for 500 epochs with the default parameters, we obtain the output feature map for C classes. We measured the inference speed by averaging 100 iterations for 1024 × 512 inputs on a single NVIDIA Titan Xp GPU. We also conduct experiments with different numbers of groups (G), kernel sizes (K) and stacking times (N) to obtain a good trade-off between accuracy and time consumption. For a fair comparison, the light-weight segmentation network ESPNet-C [16] and the encoder of ESCNet [37] are used as baselines for parameters, mIoU and inference speed on the Cityscapes validation set.

a: GROUPS (G)
The input is divided into G groups, and the number of parameters can be reduced by increasing G. The configuration with G = 2 achieves better validation results than the one with G = 4: although it uses 76.8 k parameters, it obtains a higher mIoU with a similar inference speed. To obtain a good trade-off between accuracy and parameters, we fix G = 2 when validating the other hyper-parameters.

b: KERNEL SIZE (K)
K represents the kernel size of the depth-wise convolutions; increasing K enlarges the receptive field of the convolution kernel. To evaluate the effect of the kernel size, we change the value of K from 3 to 5 and do not use dilated convolutions. Table 7 shows that enlarging the convolution kernel improves the segmentation accuracy but increases the number of parameters and the inference time. Even with the kernel size set to 5 and no dilated convolution, the accuracy is 3.0 % lower and the inference speed is slower than those of the module using a dilated convolution with a kernel size of 3. In addition, the segmentation accuracy of the encoder with a 3 × 3 dilated convolution is improved by 6.9 %.

c: DEPTH FACTOR (N)
N represents the depth of the CSDC module: given the depth factor N, the CSDC module is repeatedly stacked N times in the encoder. In other words, model capacity can be increased by adjusting the depth factor N to deepen the network. The encoder with N = 1 has fewer parameters than the encoder with N = 2; however, its accuracy is greatly reduced because of the limited model capacity.
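Why a dilated 3 × 3 kernel can beat a dense 5 × 5 one follows from the effective receptive field formula, dilation · (k − 1) + 1, shown in this small sketch:

```python
def effective_kernel(k, dilation):
    """Effective receptive field (per side) of a k x k convolution
    with the given dilation rate: dilation * (k - 1) + 1."""
    return dilation * (k - 1) + 1

# A dilated 3x3 kernel (d = 2) covers the same 5x5 window as a
# dense 5x5 kernel, but with 9 weights per channel instead of 25.
print(effective_kernel(3, 2))  # 5
print(effective_kernel(5, 1))  # 5
print(effective_kernel(3, 4))  # 9
```

This is the trade-off the ablation observes: the dilated 3 × 3 variant matches or exceeds the 5 × 5 coverage while carrying fewer weights per channel.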
Unlike the ShuffleNet unit, the CSDC module uses dilated convolutions with different dilation rates, thus lessening the gridding problem and enlarging the receptive field. Multiple CSDC modules in the encoder increase the depth and receptive field of the network while maintaining a favorable inference speed. The above ablation experiments show that the encoder network using the CSDC module achieves the highest accuracy, 57.1 %, on the Cityscapes validation set. More specifically, EFSNet 1 achieves 54.3 % mIoU with 76.8 k parameters, which is approximately 1.0 % higher than ESPNet. Compared with ESCNet [37], EFSNet 2 achieves 57.2 % mIoU with only 154.2 k parameters at an inference speed of 237.5 FPS.
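Putting the pieces together, a minimal PyTorch sketch of a CSDC-style block follows; the channel counts, dilation rate, exact layer ordering, and the residual connection are our assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class CSDCBlock(nn.Module):
    """Sketch of a CSDC-style block: grouped point-wise conv ->
    channel shuffle -> dilated depth-wise conv -> grouped point-wise
    conv, with a residual connection (assumed)."""
    def __init__(self, channels, groups=2, k=3, dilation=2):
        super().__init__()
        self.groups = groups
        self.pw1 = nn.Conv2d(channels, channels, 1, groups=groups, bias=False)
        pad = dilation * (k - 1) // 2  # keep spatial size unchanged
        self.dw = nn.Conv2d(channels, channels, k, padding=pad,
                            dilation=dilation, groups=channels, bias=False)
        self.pw2 = nn.Conv2d(channels, channels, 1, groups=groups, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.pw1(x))
        out = channel_shuffle(out, self.groups)
        out = self.act(self.dw(out))
        out = self.pw2(out)
        return self.act(out + x)

x = torch.randn(1, 64, 32, 64)
y = CSDCBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 64])
```

Stacking this block N times with varying dilation rates is what the depth-factor ablation above explores; varying the rates across the stack is what counters the gridding artifact of a single fixed dilation.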

4) UP-SAMPLING MODULE
To investigate the effect of our up-sampling module, we replaced the up-sampling module of Table 1 with the up-sampling block of FSSNet while fixing the other blocks. The results of the ablation experiments on the Cityscapes validation set and the CamVid test set are shown in Table 8 and Table 9, respectively. Using the up-sampling block of FSSNet in EFSNet 1 yields similar accuracy but increases the number of parameters and the forward propagation time; using it in EFSNet 2 results in a drop in accuracy. Overall, the data show that replacing our module with the up-sampling block of FSSNet reduces the mIoU.

TABLE 8.
Results on the Cityscapes validation set. We quantify the performance of the up-sampling block using the forward inference speed (FPS), the class average accuracy (C), and the mean intersection over union (mIoU). An asterisk indicates that the corresponding up-sampling module is replaced with an up-sampling module in FSSNet.
denotes an up-sampling block of FSSNet [17] and denotes our up-sampling module.

TABLE 9.
Results on the CamVid test set. We quantify the performance of the up-sampling block using the forward inference speed (FPS), the global accuracy (G), the class average accuracy (C), and the mean intersection over union (mIoU). An asterisk indicates that the corresponding up-sampling module is replaced with an up-sampling module in FSSNet.
denotes the up-sampling block of FSSNet [17] and denotes our up-sampling module.

Some visual segmentation results are shown in Fig. 9 and Fig. 10. As can be seen from the second row of pictures in Fig. 9, the network cannot segment trucks and cars well when the up-sampling module of FSSNet is used. Fig. 10 shows that with the up-sampling module of FSSNet, similar objects such as cars and buildings cannot be segmented well in low-contrast pictures, and the prediction images contain some noise.

VI. CONCLUSION
In this paper, we propose EFSNet, a light-weight neural network with few parameters for efficient semantic segmentation. The method is evaluated on the CamVid and Cityscapes datasets, demonstrating good segmentation accuracy at low computation cost and inference time. Ablation studies show that the CSDC module enlarges the receptive field without additional parameters and improves segmentation accuracy, while the up-sampling module reduces segmentation noise and increases the forward propagation speed. Our method achieves a good trade-off between accuracy and efficiency, indicating that it can be applied when computing resources are limited.