Efficient Feature Recombining Network Based on Refining Multi-Level Feature Maps for Semantic Segmentation

Modern approaches to semantic segmentation usually concatenate the feature map of the last convolutional layer with multi-scale features to form the final feature representation, which enables more accurate classification of the pixels in the input image. However, the feature information of the last layer is neither complete nor refined, so the concatenation of the final feature map with the multi-scale feature representations becomes a performance bottleneck. To solve this problem, we propose the Feature Recombining Network (FRNet), which obtains more refined and precise features for semantic segmentation. Our network is composed of a Feature Recombining Module (FRM) and a Modified Pyramid Pooling Module (MPPM). The FRM extracts more detailed and representative features through feature recombination, and the MPPM acquires richer context information than the original Pyramid Pooling Module. Experiments show that both modules are effective in improving segmentation precision and that the MPPM is superior to its predecessor. Based on our proposed network, we achieve 51.9% mIoU on the Pascal Context dataset and 44.75% mIoU on the ADE20K dataset.


I. INTRODUCTION
Semantic segmentation is an important task in computer vision, which aims at assigning each pixel to a definite category. It has a wide range of applications, such as autonomous driving and scene understanding. The Fully Convolutional Network (FCN) [1] is the pioneering method for end-to-end semantic segmentation and forms the basis of state-of-the-art segmentation frameworks. Deep Convolutional Neural Networks (CNNs) [2] capture informative features by stacking convolutional layers. In order to obtain feature representations with large receptive fields, the feature maps must be downsampled several times, so that the feature map of the last convolutional layer acquires rich information about object categories and scene semantics. In FCN, the input image is downsampled 5 times and the last fully-connected layer is replaced with a convolutional layer. Although the final feature map encodes rich semantic information, spatial information is lost.

To overcome the spatial information loss caused by downsampling, SegNet [3] proposes a novel encoder-decoder architecture for semantic segmentation. The encoder extracts high-level feature information, while the decoder recovers spatial details using the transferred pooling indices. Alternatively, Yu and Koltun [4] propose dilated (atrous) convolutions, which expand the effective size of the convolution kernel to obtain a larger receptive field without reducing the spatial resolution. Dilated convolution is a very efficient operation for maintaining the resolution of the final feature map. DeepLab [5] replaces the last two downsampling operations with dilated convolutions to obtain a high-resolution feature map while keeping the receptive field unchanged. As a result, the resolution of the last feature map in DeepLab is only 8 times smaller than the original image, which benefits the final performance. Furthermore, DeepLab constructs the Atrous Spatial Pyramid Pooling (ASPP) module to capture context information at multiple scales by using kernels with multiple dilation rates. Although ASPP captures multi-scale information, neighboring information can be lost, and kernels with large dilation rates are likely to cause gridding artifacts [6].

Alternatively, PSPNet [7] proposes the Pyramid Pooling Module (PPM), which captures multi-scale context information by concatenating the final feature map with features pooled at four pyramid scales. This is effective for capturing multi-scale context. However, pyramid pooling only acquires global context information of different sub-regions of the last-layer feature map and ignores the features of lower layers. Moreover, the final feature representation is formed from the multi-scale representations of the PPM and the last convolutional layer without any further processing of the last feature map, which leads to an imperfect fusion between features. It is therefore necessary to process the feature map of the last convolutional layer further. Based on these observations, we propose our Feature Recombining Network (FRNet), which mainly consists of the Feature Recombining Module (FRM) and the Modified Pyramid Pooling Module (MPPM).
Unlike the PPM, the MPPM obtains context information from the last two feature maps, making the acquired context much richer. The FRM highlights the characteristics of high-level and low-level features respectively, so the final features are more refined and representative and fuse better with the multi-scale features. In order to design the optimal module, we construct two different FRM structures and conduct a series of experiments.
Our contributions in this paper are summarized as follows:
1) We propose the MPPM to capture context information from the last two feature maps and introduce the FRM to extract more refined and representative features from the last three feature maps for semantic segmentation.
2) Our method achieves a good performance of 51.9% mIoU on the Pascal Context dataset and 44.75% mIoU on the ADE20K dataset with ResNet-101 as the backbone.

II. RELATED WORK
Since FCN was proposed, many approaches have been introduced. For example, some methods [8]-[10] utilize CRFs to improve segmentation performance, and DSSPN [11] acquires a concrete expression of information by building a specific neuron graph for each entity. CascadeNet [12] merges different segmentation information by constructing multiple streams. Although the above methods can significantly improve the accuracy of pixel-level classification, they also require a high computational cost.
Furthermore, there are many methods [13]-[16] for object instance segmentation, which produce object detection boxes while simultaneously classifying the pixels of each instance. Although these methods are effective for instance segmentation, they are not optimal for semantic segmentation of the whole scene, where all pixels need to be classified. Current high-performing semantic segmentation methods focus on encoder-decoder structures, multi-scale context, attention mechanisms, etc.

A. ENCODER-DECODER
To better identify pixel-wise segmentation masks, Noh et al. [17] propose the deconvolution network, which recovers more detail by stacking deconvolution and unpooling layers. In addition, Ronneberger et al. [18] introduce the efficient U-Net, which is composed of contracting and expanding paths; each upsampled feature map is concatenated with the corresponding feature map of the contracting path. To acquire more information from multi-level features, RefineNet [19] proposes a new architecture in which all the feature information is exploited via a multi-path refinement network. As the number of layers in a network increases, the receptive field grows larger and larger; theoretically, the receptive field of ResNet [20] is larger than the original image. However, Zhou et al. [21] show that the empirical receptive field of a CNN is much smaller than the theoretical one. Moreover, the Global Convolutional Network (GCN) [22] demonstrates that the receptive field is an important factor in semantic segmentation and tackles the ''classification'' and ''localization'' problems simultaneously. UPerNet [23] proposes a new network based on FPN [15] that extracts more concepts from images through multi-task training. Furthermore, DFN [24] presents a discriminative feature network consisting of a Smooth Network and a Border Network, which address the intra-class inconsistency and inter-class indistinction problems respectively. Although encoder-decoder methods have achieved competitive performance, some disadvantages remain. For example, although the convolution operations during decoding can recover some low-level and spatial information, some information is still lost; adding extra processing modules can mitigate this, but is computationally expensive. In addition, some small objects in the original image may be lost entirely after five downsampling operations, which also degrades the performance of the model.

B. MULTI-SCALE CONTEXT
Spatial Pyramid Pooling (SPP) was first proposed in [25] for visual recognition. SPP produces an output of fixed size regardless of the size of the input image, which not only removes the restriction on input image size but also reduces the disturbance caused by object deformation. For semantic segmentation, ParseNet [26] introduces a module that adds global context information to the features by pooling them into a context vector. DeepLab and PSPNet realize spatial pyramid pooling by adopting parallel atrous convolutions with different rates and by performing pyramid pooling over different grid scales, respectively. Lin et al. [27] propose multi-scale context intertwining for semantic segmentation, which captures effective features by combining feature maps hierarchically. Zhang et al. [28] capture multi-scale information with scale-adaptive convolutions, and DMNet [29] captures multi-scale content adaptively by stacking dynamic convolutional modules in parallel. Although the above methods can obtain the global contextual information of different sub-regions, their processing is focused on the last feature map; the contextual semantic information contained in the low-level features of shallower layers, which is also important for the overall semantics, is ignored. Different from the above methods, our network improves on the original model and provides more complete contextual information for the final segmentation.

C. ATTENTION MECHANISM
Attention-based methods can capture complex contextual information in many ways. SENet [30] proposes an architecture built by stacking Squeeze-and-Excitation blocks, which establish explicit channel-wise feature responses by modeling interdependencies between channels. For semantic segmentation, Zhang et al. [31] propose the Context Encoding Module to capture contextual information from scenes for the individual feature maps. Furthermore, PSANet [32] links each point in a feature map through an attention mechanism. DANet [33] proposes a dual attention module, consisting of a position attention module and a channel attention module, and obtains a more accurate feature representation by summing the outputs of the two. Huang et al. [34] introduce the criss-cross attention module to capture contextual information for each pixel. To capture co-occurrent context information, the aggregated co-occurrent feature (ACF) module is constructed in CFNet [35], which learns a spatially invariant representation across the scene. CPNet [36] proposes a new method to capture two different contextual dependencies (intra-class and inter-class). Although the above methods can model the interdependence of channel or spatial information and thereby improve performance effectively, they also entail many matrix operations. In addition, complicated attention mechanisms tend to suit more complex scenes; they are unnecessary for simple images and may even introduce false dependencies between simple features.

D. GROUP CONVOLUTION
Group convolution and depthwise separable convolution are effective operations for decreasing the number of parameters in convolutional neural networks. To accelerate feedforward execution, Jin et al. [37] introduce flattened convolutional neural networks, in which 3D filters are flattened into one-dimensional filters, greatly improving training speed. Based on the hypothesis that cross-channel and spatial correlations can be decoupled in convolutional neural networks, Xception [38] presents a novel convolution operation: a depthwise convolution followed by a pointwise convolution, called depthwise separable convolution. The Xception architecture obtains slight improvements over Inception-V3 [39]. Howard et al. [40] present a network based on depthwise separable convolutions for mobile vision applications. Additionally, ShuffleNet [41] proposes an extremely efficient convolutional neural network for mobile devices; channel shuffle is its key operation, which allows the next layer's input to be composed of channels from different groups after pointwise group convolution. To take advantage of group convolution, we introduce it into our work: we construct combinations of different group convolutions in our modules and achieve good performance, as sketched below.
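As a concrete reference, the following is a minimal PyTorch sketch of the depthwise separable convolution described above (a depthwise convolution followed by a pointwise one); the channel sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution,
    as popularized by Xception / MobileNet."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_channels makes each filter see exactly one input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels, bias=False)
        # the 1x1 convolution then mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 60, 60)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 60, 60])
```

Setting groups equal to the number of input channels is the extreme case of group convolution; the group convolutions used in our modules sit between this extreme and an ordinary convolution (groups = 1).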

III. PROPOSED METHOD
In this section, we first introduce the architecture of our proposed FRNet. Then, we discuss the Modified Pyramid Pooling Module and the Feature Recombining Module in detail, and we illustrate the differences between the two structures of the FRM.

A. THE STRUCTURE OF FRNet
The structure of FRNet is shown in Figure 1. FRNet is a new semantic segmentation network built on a dilated FCN, in which the spatial resolution of the input image is reduced by only 8 times. FRNet contains a backbone convolutional neural network (CNN), the Modified Pyramid Pooling Module (MPPM), and the Feature Recombining Module (FRM). The backbone CNN can be any classification network without its fully connected layer, such as VGG, ResNet, or Xception; in this paper, the results are based on ResNet. The main function of the backbone is feature extraction. The MPPM captures contextual information from the feature maps of two levels and therefore acquires more complete and effective context from the whole scene. The FRM acquires more detailed and representative features by recombining the last three feature maps, so the final feature representation is not derived from the last feature map alone; the output of the FRM is a concatenation of multi-level feature information, making the features richer and more detailed. Finally, the complete context information and the rich features are combined to produce the final feature representation, as sketched below.
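The data flow just described can be summarized by the following PyTorch-style sketch. The module names, the four-stage backbone interface, and the channel bookkeeping are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class FRNet(nn.Module):
    """High-level sketch of the FRNet data flow (names are illustrative)."""
    def __init__(self, backbone, mppm, frm, feat_channels, num_classes):
        super().__init__()
        self.backbone = backbone   # dilated ResNet, output stride 8
        self.mppm = mppm           # context from the last two feature maps
        self.frm = frm             # recombines the last three feature maps
        self.classifier = nn.Conv2d(feat_channels, num_classes, 1)

    def forward(self, x):
        # c2..c5: feature maps of the four ResNet stages (assumed interface)
        c2, c3, c4, c5 = self.backbone(x)
        context = self.mppm(c4, c5)       # multi-scale, multi-level context
        features = self.frm(c3, c4, c5)   # refined multi-level features
        fused = torch.cat([features, context], dim=1)
        return self.classifier(fused)
```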

B. MODIFIED PYRAMID POOLING MODULE
The MPPM is based on the Pyramid Pooling Module (PPM) of PSPNet, with some adjustments and improvements to the original structure; the module improves performance without increasing computational cost. The difference between the two modules is shown in Figure 2. The PPM processes the last feature map with adaptive average pooling at different pyramid scales, acquiring pooled representations of different sub-regions. The number of pyramid levels is usually set to 4, with pooled feature maps of size 1 × 1, 2 × 2, 3 × 3, and 6 × 6. After pooling, a 1 × 1 convolution is applied at each level to reduce the dimension of the feature maps, followed by upsampling to recover the original size. Finally, the four levels of feature maps are aggregated. As stated in that paper, the PPM captures an effective global contextual prior for semantic segmentation, but the final context information is extracted only from the feature map of the last convolutional layer, which may miss some low-level features. Unlike the PPM, our MPPM can capture additional information from slightly lower-level features. As shown in Section B of Figure 2, we keep the structure of the first three levels, whose global context information comes from the feature map of the last layer, and we extract features from the penultimate feature map using a convolution with a 3 × 3 kernel, a stride of 2, and 2 groups. In addition to extracting more information from low-level features, this change has the advantage of replacing the auxiliary loss and improving the efficiency of the whole network; related details are demonstrated with experimental data in Table 2. For the feature extraction on the penultimate feature map, we tried many downsampling methods, and the results are shown in Table 2. Our experiments show that the MPPM efficiently extracts multi-scale and multi-level context information. A minimal sketch follows.
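The following PyTorch sketch reflects our reading of Figure 2: the 1 × 1, 2 × 2, and 3 × 3 pyramid levels pool the last feature map as in PSPNet, and the remaining branch is the 3 × 3 stride-2 group convolution on the penultimate feature map. The branch channel widths are assumptions for a ResNet-style backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPPM(nn.Module):
    """Sketch of the Modified Pyramid Pooling Module (channel widths illustrative)."""
    def __init__(self, c5_channels=2048, c4_channels=1024, branch_channels=512):
        super().__init__()
        # first three pyramid levels, pooled from the last feature map as in PSPNet
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(size),
                          nn.Conv2d(c5_channels, branch_channels, 1, bias=False),
                          nn.BatchNorm2d(branch_channels),
                          nn.ReLU(inplace=True))
            for size in (1, 2, 3)])
        # 3x3 conv, stride 2, groups 2 on the penultimate map (best setting in Table 2)
        self.low_branch = nn.Sequential(
            nn.Conv2d(c4_channels, branch_channels, 3, stride=2, padding=1,
                      groups=2, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True))

    def forward(self, c4, c5):
        size = c5.shape[2:]
        outs = [F.interpolate(stage(c5), size=size, mode='bilinear',
                              align_corners=False) for stage in self.stages]
        outs.append(F.interpolate(self.low_branch(c4), size=size,
                                  mode='bilinear', align_corners=False))
        return torch.cat(outs, dim=1)
```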

C. FEATURE RECOMBINING MODULE
The goal of the FRM is to make the features extracted from both high and low layers more representative. In a deep neural network, deeper convolutional layers have more channels, and their feature maps often contain more discriminative and complex semantic information. For semantic segmentation, however, it is often necessary to combine this complex semantic information with low-level feature information to classify a pixel. For this reason, we construct the FRM so that the final features carry high-level semantic information and low-level spatial information simultaneously. Next, we describe the FRM in detail.
The input of the FRM is a D_h × D_w × C feature map F extracted by the backbone ResNet-50/101, where D_h and D_w are the spatial height and width of the feature map and C is the number of input channels. We first perform a convolution with a 3 × 3 kernel and a stride of 1 on F to generate a reduced feature map F_1, whose height and width remain the same while the number of channels is halved; the purpose of this operation is to extract the most representative high-level features from the last-layer feature map. Then we perform another convolution, with a 1 × 1 kernel and a stride of 1, on F to obtain a second reduced feature map F_2 (with the same shape as F_1). To make the final feature map also carry more prominent detail, we operate on the third and fourth feature maps: we apply one adaptive average pooling operation and one 1 × 1 convolution to each, then upsample them (bilinear interpolation) to the size of the original feature map, yielding F_3 and F_4. Next, F_2, F_3, and F_4 are concatenated into a new feature map F_5, which contains much more detailed information. Finally, we perform a convolution with a 3 × 3 kernel, a stride of 1, and 2 groups on F_5 to generate a reduced feature map F_6, and F_1 and F_6 are concatenated to form the final feature map of the FRM. A minimal sketch of this structure follows.
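In the sketch below, the input channel counts follow a dilated ResNet (512/1024/2048 for the last three stages); the pooled size of the adaptive average pooling and the reduced channel widths of F_3 and F_4 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRM(nn.Module):
    """Sketch of the Feature Recombining Module, structure B."""
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, pool_size=3):
        super().__init__()
        half = c5_ch // 2
        # F1: 3x3 conv on the last map, channels halved, spatial size kept
        self.conv_f1 = nn.Conv2d(c5_ch, half, 3, padding=1, bias=False)
        # F2: 1x1 conv on the last map, same shape as F1
        self.conv_f2 = nn.Conv2d(c5_ch, half, 1, bias=False)
        # F3/F4: adaptive average pooling + 1x1 conv on the third/fourth maps
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # pooled size is an assumption
        self.conv_f3 = nn.Conv2d(c3_ch, half, 1, bias=False)
        self.conv_f4 = nn.Conv2d(c4_ch, half, 1, bias=False)
        # F6: 3x3 group convolution (groups=2) on the concatenated F5
        self.conv_f6 = nn.Conv2d(3 * half, half, 3, padding=1, groups=2, bias=False)

    def forward(self, c3, c4, c5):
        size = c5.shape[2:]
        f1 = self.conv_f1(c5)
        f2 = self.conv_f2(c5)
        f3 = F.interpolate(self.conv_f3(self.pool(c3)), size=size,
                           mode='bilinear', align_corners=False)
        f4 = F.interpolate(self.conv_f4(self.pool(c4)), size=size,
                           mode='bilinear', align_corners=False)
        f5 = torch.cat([f2, f3, f4], dim=1)
        f6 = self.conv_f6(f5)
        return torch.cat([f1, f6], dim=1)
```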
To find the optimal structure for the FRM, we design a second variant. The difference between the two structures is illustrated in Figure 3. As shown in Section A of Figure 3, in the alternative FRM we perform a simpler operation on the feature maps: instead of the extra convolution that generates F_1, we perform a single convolution on F to obtain a new feature map F_7 with the same size as the original. Finally, we concatenate F_3, F_4, and F_7 into a new feature map F_8 and apply a convolution with a 3 × 3 kernel, a stride of 1, and 4/8/16/32 groups to obtain the final feature map of the FRM, as sketched below. The performance comparison of the two structures is shown in Table 3.
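Structure A can be sketched by reusing the same building blocks as above; again, the channel widths are illustrative assumptions.

```python
class FRM_A(FRM):
    """Sketch of structure A: no separate F1 branch; a single conv keeps the
    last map's spatial size, and one group conv follows the concatenation."""
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, groups=4):
        super().__init__(c3_ch, c4_ch, c5_ch)
        half = c5_ch // 2
        self.conv_f7 = nn.Conv2d(c5_ch, half, 3, padding=1, bias=False)
        # groups is varied over 4/8/16/32 in the Table 3 comparison
        self.conv_out = nn.Conv2d(3 * half, half, 3, padding=1,
                                  groups=groups, bias=False)

    def forward(self, c3, c4, c5):
        size = c5.shape[2:]
        f7 = self.conv_f7(c5)
        f3 = F.interpolate(self.conv_f3(self.pool(c3)), size=size,
                           mode='bilinear', align_corners=False)
        f4 = F.interpolate(self.conv_f4(self.pool(c4)), size=size,
                           mode='bilinear', align_corners=False)
        f8 = torch.cat([f3, f4, f7], dim=1)
        return self.conv_out(f8)
```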

IV. EXPERIMENTS
In this section, we evaluate our approach on the common segmentation datasets Pascal Context [42] and ADE20K [12]. First, we introduce the datasets and the implementation details. Then, to validate the effectiveness of the two modules, we conduct a series of ablation studies with ResNet-50 as the backbone on the Pascal Context dataset. Finally, we report the performance and compare it with state-of-the-art methods on the Pascal Context and ADE20K datasets.

A. PASCAL CONTEXT DATASET
The Pascal Context dataset provides additional annotations for PASCAL VOC 2010 and involves 4,998 training images and 5,105 testing images, with pixel-wise semantic annotations for the whole scene. Following prior works [5], [19], [31], we use 60 labels: 59 common categories plus one background class.

B. ADE20K DATASET
The ADE20K dataset is a common benchmark for scene parsing. It contains 150 object categories, which makes the semantic segmentation task more challenging than on other datasets. There are 20K/2K/3K images for training (train), validation (val), and testing (test).

C. IMPLEMENTATION DETAILS
Our proposed method is implemented in PyTorch 1.1, a widely used deep learning framework. We adopt ResNet-50/101 pre-trained on ImageNet [43] as the backbone. The last two downsampling operations are replaced with dilated convolutions with dilation rates of 2 and 4, so the size of the last-layer feature map is 1/8 of the original image, the same as in [5], [31]. A minimal sketch of this backbone modification follows.
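Recent torchvision versions expose this output-stride-8 modification directly via `replace_stride_with_dilation`; the sketch below illustrates the idea, although the exact construction in a full implementation may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# replace_stride_with_dilation converts the stride-2 convolutions of the last
# two stages into dilated convolutions (rates 2 and 4), so the final feature
# map is 1/8 of the input instead of 1/32.
net = resnet101(pretrained=True,
                replace_stride_with_dilation=[False, True, True])
backbone = nn.Sequential(*list(net.children())[:-2])  # drop avgpool and fc

with torch.no_grad():
    out = backbone(torch.randn(1, 3, 480, 480))
print(out.shape)  # torch.Size([1, 2048, 60, 60]); 480 / 8 = 60
```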
We randomly flip and scale the original images for data augmentation, with the scaling rate set between 0.5 and 2.0, and then crop the input image to 480 × 480. During training, we set the batch size to 16 and adopt SGD as the optimizer, with momentum 0.9 and weight decay 0.0001. We employ ResNet-50 and ResNet-101 as standard backbones, training for 80 epochs on the Pascal Context dataset and 120 epochs on the ADE20K dataset. The initial learning rate is set to 0.001 for Pascal Context and 0.01 for ADE20K. The learning rate of each epoch is obtained by multiplying the initial learning rate by (1 − epoch/total_epochs)^α, where α is 0.9.
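This poly schedule translates into a short helper; the hyperparameters below are the ones stated above for Pascal Context, and the model is a placeholder.

```python
import torch

def poly_lr(initial_lr, epoch, total_epochs, alpha=0.9):
    """Poly policy: lr = initial_lr * (1 - epoch / total_epochs) ** alpha."""
    return initial_lr * (1 - epoch / total_epochs) ** alpha

model = torch.nn.Linear(4, 4)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
for epoch in range(80):  # Pascal Context: 80 epochs, initial lr 0.001
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(0.001, epoch, 80)
    # ... run one training epoch here ...
```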

D. EVALUATION METRICS
We use pixel accuracy (pixAcc) and mean Intersection over Union (mIoU) as evaluation metrics, which are widely used in semantic segmentation and best distinguish the performance of segmentation methods. The metrics are defined as follows:

pixAcc = (Σ_i p_ii) / (Σ_i Σ_j p_ij),

mIoU = (1/k) Σ_i [ p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii) ],

where p_ij denotes the number of pixels of class i predicted to belong to class j, p_ji denotes the number of pixels of class j predicted to belong to class i, p_ii denotes the number of pixels classified correctly, and k denotes the number of object classes. Like the standard competition benchmark [12], we calculate mIoU with the background pixels included.
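These definitions translate directly into a few lines of NumPy; the confusion matrix below is a toy example, not real data.

```python
import numpy as np

def pixacc_miou(conf):
    """Compute pixAcc and mIoU from a k x k confusion matrix `conf`,
    where conf[i, j] counts pixels of class i predicted as class j."""
    p_ii = np.diag(conf).astype(np.float64)
    pix_acc = p_ii.sum() / conf.sum()
    # per-class union: sum_j p_ij + sum_j p_ji - p_ii
    union = conf.sum(axis=1) + conf.sum(axis=0) - p_ii
    iou = p_ii / np.maximum(union, 1)  # guard against empty classes
    return pix_acc, iou.mean()

# toy example with k = 3 classes (background included, as in the paper)
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 2, 46]])
print(pixacc_miou(conf))
```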

E. ABLATION STUDY
To show the effectiveness of our proposed FRNet and to verify the impact of the MPP and FR modules on network performance, we perform a complete ablation study. To make a better comparison between the PPM and the MPPM and to determine the optimal convolution operation on the penultimate feature map, we conduct a systematic experiment on the Pascal Context dataset (with ResNet-50 as the backbone), varying the number of groups and the stride of the convolution. As shown in Table 2, the convolution with a stride of 2 and 2 groups performs best. The table shows that convolution with a stride of 2 outperforms a stride of 1, because a stride-2 convolution is equivalent to a downsampling operation, and the features acquired after downsampling are more representative than the original ones. Moreover, group convolution reduces parameters while slightly improving performance (although setting the number of groups too large degrades it). Notably, PSPNet without the auxiliary loss branch performs worst, and every MPPM variant without an auxiliary loss branch outperforms every PPM variant with one. This is because the PPM extracts context information only from the last-layer feature map and thus does not contain semantic information from different levels; although an auxiliary loss branch helps improve performance, the contextual information of the different levels still remains independent. Different from the PPM, our MPPM merges the features of different levels to obtain more discriminative and accurate features, which is the essential reason why it is superior. This also shows that the convolution on the penultimate feature map can substitute for an auxiliary loss branch.
To explore the FRM further, we design two structures for this module. Structure A is the FRM without the extra convolution on the last-layer feature map, while structure B is the FRM with this additional convolution, performed while concatenating the three feature maps. As shown in Table 3, structure B performs significantly better than structure A, which suggests that the additional convolution extracts more representative features from the last layer. Meanwhile, within structure B, the convolution with 2 groups after the concatenation achieves the best result.

F. RESULT ON PASCAL CONTEXT DATASET
To make a fair comparison with prior work, we calculate the mIoU over 60 classes (including background). As reported in Table 4, our method with ResNet-50 as the backbone performs much better than DeepLabV2 (ResNet-101 backbone with COCO pre-training) and RefineNet (ResNet-152 backbone), even though ResNet-101 and ResNet-152 are much stronger than ResNet-50. Moreover, with ResNet-101 as the backbone, we improve the performance from 51.25% to 51.9% mIoU, outperforming PSPNet (ResNet-101) by 1.1% mIoU. Visual results are shown in Figure 4 and Figure 6.

G. RESULT ON ADE20K DATASET
We evaluate our model on the ADE20K val set and report mIoU in Table 5. As shown in Table 5, our method obtains much better performance than RefineNet (ResNet-152). With the same ResNet-101 backbone, our model outperforms UPerNet [23], DSSPN [11], PSANet [32], SAC [28], and PSPNet. Although our method (ResNet-101) performs slightly worse than PSPNet (ResNet-269), it outperforms PSPNet (ResNet-101) by 1.46% mIoU. Under the same conditions, our method is better than the other state-of-the-art methods, which shows that FRNet is an efficient method for semantic segmentation. Visual results are shown in Figure 5 and Figure 7. As the comparisons show, FRNet extracts more detail than PSPNet, so it classifies smaller objects more accurately, and it also performs better on large objects (sky, streets, buildings, etc.).

V. CONCLUSION
In this paper, we introduce a novel convolutional neural network, FRNet, for semantic segmentation, built from the FRM and the MPPM. Our method is evaluated on the common segmentation datasets Pascal Context and ADE20K and achieves good segmentation accuracy. In addition, we design two versions of the FRM, A and B; experiments show that B is more effective than A, though both outperform the baseline. The ablation study shows that the Feature Recombining Module and the Modified Pyramid Pooling Module each improve the performance of the semantic segmentation task, and that combining the two modules greatly improves the network. In future work, we may embed our modules into lightweight networks.