Filling the Gaps in Atrous Convolution: Semantic Segmentation With a Better Context

The main challenge for scene parsing arises when complex scenes with highly diverse objects are encountered. The objects differ not only in scale and appearance but also in semantics. Previous works focus on encoding multi-scale contextual information (via pooling or atrous convolutions), generally on top of compact high-level features (i.e., at a single stage). In this work, we argue that a rich set of cues exists at multiple stages of the network, encapsulating low, mid and high-level scene details. Therefore, an optimal scene parsing model must aggregate multi-scale context at all three levels of the feature hierarchy, a capability that is lacking in state-of-the-art scene parsing models. To address this limitation, we introduce a novel architecture with three new blocks that systematically aggregate low, mid and high tier features. The heart of our approach is a high-level feature aggregation module that augments sparsely connected atrous convolution with dense local and layer-wise connections to avoid gridding artifacts. Besides, we employ a novel feature pyramid augmentation and semantic refinement unit to generate low- and mid-level features that are mixed with high-level features at the decoder. We extensively evaluate our proposed approach on the large-scale Cityscapes and ADE20K benchmarks. Our approach surpasses many recent models on both datasets, achieving mean intersection-over-union (mIoU) scores of 80.5% and 44.0% on Cityscapes and ADE20K, respectively.


I. INTRODUCTION
Given an image, the goal of semantic segmentation is to assign a category label to each pixel [1], [2]. It is a challenging problem with numerous real-world applications, including autonomous driving, satellite imaging and scene understanding [3]. The semantic segmentation task simultaneously deals with two sub-problems: classification and localization. However, these two sub-tasks are contradictory in nature and largely influence the design principles of the segmentation model. In the case of classification, the models aim to learn higher-level representations to capture the global context and therefore need to be invariant to local image details such as in-plane transformations and deformations [4]. Such abstraction of spatial information is, however, undesirable for the localization task, for which the models need to resolve fine-grained local information with pixel-level accuracy [5]. These two conflicting aspects of the problem mean that a range of complementary features is desired to build a robust model that balances the two extremes based on the given context.
The associate editor coordinating the review of this manuscript and approving it for publication was Lefei Zhang.
Recent well-known models such as PSPNet [7] and DeepLab [8], [9] perform multi-level contextual learning only at a single stage within the deep network hierarchy. Specifically, PSPNet [7] applies pooling operations with different sub-sampling rates, all arranged in parallel, to capture context information. Its pyramid pooling module works only on the features of the last convolutional layer, which generally lack local scene details. The Atrous Spatial Pyramid Pooling (ASPP) module in DeepLabv2 [8] and DeepLabv3 [9] applies parallel atrous convolutions with different dilation rates to extract multi-scale context information; however, it also works on a high-level feature representation. Similarly, [10] uses dense connections between convolution layers with multiple dilation rates and only operates on later-stage features in the deep network. Besides, all the above-mentioned approaches use direct feature upsampling for the final prediction. In this paper, we argue that a simple upsampling operation is insufficient to fully recover the fine details and is sub-optimal for combining multi-stage (mainly low and mid-level) features for the final prediction (see Figure 1).
FIGURE 1. The importance of local and high-level information for semantic segmentation. Qualitative comparison of our approach with the baseline DenseASPP [10]. Since the context aggregation block of [10] only considers the high-level features, it misinterprets local information. This leads to inaccurate segmentation (e.g., first row: confusion between background and object (person); second and third rows: coarse local details of objects (bicycle and chair)). In comparison to [10], we propose to jointly use low, mid and high-level cues in a single framework, resulting in refined and more accurate semantic labeling (fourth column).
One alternative to resolve the above-mentioned limitation was proposed in the 'U-Net' architecture [6], [11], which gradually fuses high-level features from top layers with low-level features from initial layers. Despite being intuitive, this design considers neither mid-level features nor multi-scale context on low and high-level representations. Similar architecture designs have been proposed in GCN [12], DeepLabv3+ [13], PANet [14] and ExfuseNet [15], but with a similar limitation. Some recent models, such as RefineNet [16] and DFN [17], not only design a top-down smooth network but also apply a bottom-up border network to refine the final output. However, the low-level features in these networks are actually converted to mid-level features before they are used for refinement. This results in a loss of sharp details (e.g., around boundaries) and mistakes on small objects. In order to address these issues, we propose to jointly integrate low-level features in the network, alongside mid and high-level cues.

A. CONTRIBUTIONS
We propose an approach with the following main contributions: (a) We propose a feature pyramid based augmentation module to efficiently generate refined low-level features that preserve the local details. (b) For mid-level multi-scale feature fusion, we propose a semantic refinement unit that combines a diverse set of features from the network encoder. (c) The central component of our model is a high-level context aggregation block. It is based on the insight that dilated convolution expands the kernel size by interleaving its weights with zeros, which equates to dropping the intermediate activations in the input feature map. To alleviate this problem, we propose to combine the strengths of dilated (sparse) and wider (dense) kernels, which enhances the discriminative power of the network and avoids unfairly neglecting the local information, as is the case with atrous convolution. We perform experiments on the Cityscapes and ADE20K datasets. Our approach achieves superior results compared to existing methods on both datasets. Specifically, it achieves 80.5% and 44.0% mIoU scores on the Cityscapes and ADE20K datasets, respectively, outperforming the best reported results in [7], [17].

II. RELATED WORK
The seminal work of Long et al. [18] on fully convolutional networks (FCNs) exceeded the performance of all previous hand-crafted [19], [20] and learning-based segmentation methods [21]-[24]. FCN-based segmentation frameworks typically employ (pretrained) image classification networks as the backbone, which use a deep pooling hierarchy to increase the receptive field for better encoding of image semantics. However, the repeated use of pooling layers and strided convolutions drastically reduces the spatial dimensions of the feature maps. When such feature maps are upsampled back to full resolution at the output layer, the produced segmentation results appear coarse, with deteriorated object boundaries [18].
To alleviate this problem, Chen et al. [25] propose to remove the downscaling operator from the last few pooling layers of the backbone and instead use convolution with upsampled filters (a.k.a. atrous convolution [26]), thus producing dense feature maps. The atrous convolution produces feature maps in which all units have the same receptive field size and therefore incorporate the objects' context at a single scale, which is a limitation as objects exist at multiple scales in real-world images.
Inspired by spatial pyramid pooling (SPP) [27], PSPNet fuses features at different pooling pyramid scales. The ASPP network [8] applies atrous convolution with multiple dilation rates in parallel, thus effectively increasing the receptive field as well as capturing object and image context at multiple scales. In certain applications, e.g., autonomous driving and land cover mapping, the receptive field of neurons is required to be very large, which ASPP [8] can only achieve by using large dilation rates. However, as the dilation rate increases, the atrous convolution starts to lose its modeling power and gradually becomes less effective.
DenseASPP [10] has been proposed to incorporate wider context in the ASPP block, especially in cases where dilation rates are high. Although DenseASPP improves context, one limitation of such a design is that only the high-level features from the later layers in a deep CNN are used to model contextual information. This limitation makes [10] and previous network designs inattentive to local scene details. One promising way to counter this effect is through an encoder-decoder structure [11], [29]. The encoder-decoder networks [11], [12], [15], [16], [29]-[32] acquire the semantic detail in the encoder branch and try to recover sharp object boundaries in the decoder branch.

III. OUR APPROACH
The existing network architectures lack a multi-stage and systematic combination of high, mid and low-level image features. Our proposed model design demonstrates that none of the three levels is independently sufficient for optimal multi-scale feature fusion. Specifically, our model introduces novel network designs for all three context-aware blocks in the proposed model, which lead to significantly better performance. The proposed approach outperforms existing methods, which mostly employ high-level features, on both the Cityscapes and ADE20K datasets. Figure 2 illustrates the core components of our framework. These include: (1) the large kernel paralleled with dilation (LKD) block, (2) the refinement unit, and (3) the low-level feature pyramid (LFP). Next, we first describe the functionality of each of these individual components. Later we show how they fit together in the overall architecture to complement each other with different types of information, thus providing accurate semantic segmentation results.

A. LKD BLOCK
The dilated convolution ('algorithme à trous') provides an excellent mechanism to expand the receptive field without any loss of spatial resolution [26]. It enlarges the convolution kernel with a dilation factor d such that the kernel weights are separated by d−1 zeros, hence à trous (with holes). However, dilated convolution suffers from gridding artifacts that hamper the performance on segmentation tasks for two main reasons. First, the receptive field of each kernel has a checkerboard shape, which sub-samples the input units without considering their neighborhood information. This results in a loss of local information, especially when dilation rates are high. Second, the repeated zero patterns in the dilated kernel cause neighboring output points to be computed from mutually exclusive sets of input units. Figure 1 illustrates this problem and highlights the importance of local information.
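The second cause of gridding can be checked with a few lines of Python. The sketch below is only a one-dimensional illustration, and the helper name `input_taps` is ours rather than something from the paper. It lists which input positions feed a given output of a 3-tap kernel dilated by d, and shows that two neighboring outputs share inputs when d = 1 but read disjoint input sets when d = 2.

```python
def input_taps(i, kernel_size=3, d=1):
    """Input indices read by output position i of a 1-D kernel dilated by d."""
    half = kernel_size // 2
    return {i + d * k for k in range(-half, half + 1)}


# Dense kernel (d = 1): neighboring outputs 10 and 11 share input positions.
dense_overlap = input_taps(10, d=1) & input_taps(11, d=1)

# Dilated kernel (d = 2): the same two outputs read mutually exclusive inputs.
dilated_overlap = input_taps(10, d=2) & input_taps(11, d=2)

print(sorted(input_taps(10, d=2)))  # [8, 10, 12]
print(sorted(input_taps(11, d=2)))  # [9, 11, 13]
print(sorted(dense_overlap), sorted(dilated_overlap))  # [10, 11] []
```

With d = 2 the even-indexed and odd-indexed inputs never interact within one layer, which is the one-dimensional analogue of the checkerboard pattern described above.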
Some recent approaches aim to achieve 'degridding' in the dilated convolution outputs. Stacking several dilated convolution layers with fixed [33] or varied [35] dilation rates has been shown to be useful, but at the cost of additional computational complexity. An alternative with low computation cost was proposed in [34], which simply smooths the inputs to make them inter-dependent. In contrast to these approaches, our motivation is different: since the local context can be helpful or misleading for dense local predictions, we want to systematically decide whether and how to use it optimally alongside wider contextual cues. We, therefore, propose to concatenate the feature responses from both sparse and dense kernels with a large receptive field. This not only enhances the effective receptive field but, most importantly, preserves the complementary neighborhood-aware and neighborhood-agnostic features that prove very useful in resolving local ambiguities depending on the given case (see Figure 1 as an example). Consider an input feature x and a convolutional kernel w; the output feature map is given by:
$$y(i) = \sum_{k} x(i + k)\, w(k)$$
Besides y(i), we also obtain a complementary feature map using a sparse kernel that is dilated with a factor d:
$$\hat{y}(i) = \sum_{k} x(i + d \cdot k)\, w(k)$$
The output of the LKD block is the concatenation of both feature responses, i.e., o = y ⊕ ŷ. Note that dense convolutions with large kernels are typically expensive; therefore, we use the asymmetric convolution via the spatial factorization scheme proposed in [28] to reduce the computation cost of the large kernel operation.
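To make the concatenation of dense and sparse responses concrete, here is a minimal one-dimensional sketch in plain Python. It is an illustration under our own assumptions, not the authors' implementation: the function names `conv1d` and `lkd_1d`, the zero padding, the all-ones kernels and the kernel sizes are hypothetical choices, and the real block operates on multi-channel 2-D feature maps.

```python
def conv1d(x, w, d=1):
    """Same-size 1-D correlation of x with kernel w dilated by d (zero padding)."""
    half = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(-half, half + 1):
            j = i + d * k
            if 0 <= j < len(x):
                acc += x[j] * w[k + half]
        out.append(acc)
    return out


def lkd_1d(x, w_dense, w_sparse, d):
    """Return the dense and dilated responses stacked as two channels."""
    y = conv1d(x, w_dense, d=1)        # dense branch: sees every neighbor
    y_hat = conv1d(x, w_sparse, d=d)   # sparse branch: skips d - 1 inputs
    return [y, y_hat]                  # concatenation along a channel axis


# An impulse at position 3 shows which outputs each branch connects to it.
x = [0, 0, 0, 1, 0, 0, 0, 0, 0]
out = lkd_1d(x, w_dense=[1.0] * 5, w_sparse=[1.0] * 3, d=2)
print(out[0])  # dense response: contiguous support around the impulse
print(out[1])  # dilated response: support with holes
```

The dense channel responds at every position near the impulse, while the dilated channel responds only at every second position; keeping both lets later layers choose between neighborhood-aware and neighborhood-agnostic evidence.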

B. REFINEMENT UNIT (RU)
In a task like semantic segmentation, features from all network layers are useful in providing efficient dense predictions. Therefore, several previous works [13], [15]- [17] combine low-level features from earlier layers that contain abundant spatial information with high-level features from deeper layers that are semantically enriched. On one hand, high-level semantic features are helpful in recognizing image regions' labels. On the other hand, low-level features are good at generating sharper object boundaries for high-resolution predictions.
Traditional encoder-decoder architectures (like U-Net [11]) use skip-connections to directly pass features from the lower layers of the encoder to the corresponding higher layers of the decoder. In this paper, we introduce refinement units in the skip pathways to filter the flow of information. The refinement unit effectively exploits information from encoder layers for each spatial region in a top-down manner. Simply put, at each layer of the encoder, the refinement unit allows every subsequent high-level layer to provide contextual feedback for refining the relatively low-level features of the current layer, which are then passed on to the decoder as the mid-level features.
The feature maps $f_e^{i}$ and $f_e^{i+1}$ from consecutive encoder stages may differ in spatial resolution and number of channels. Therefore, we first match their channel dimensions by applying a matching function F(·). The matching function is a small sub-net comprising a 3 × 3 convolution followed by batch normalization and a ReLU non-linear activation. Next, we upsample $f_e^{i+1}$ by a factor of two to match its spatial dimensions with those of $f_e^{i}$. Finally, we perform an element-wise multiplication on these two feature maps to produce the refined features. When high-level features come from multiple encoder blocks (i.e., j = 1, . . . , N), we add the individual refined feature maps together to obtain the final feature responses $R_f$. As shown in Figure 2(d), the refined feature maps ($R_f$) are processed with some convolutional layers (followed by batch normalization and ReLU non-linearity) to further enrich them with mid-level semantics. Finally, these feature maps are passed to the decoder.
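The multiply-then-sum refinement described above can be sketched on single-channel maps in plain Python. This is a simplified illustration under stated assumptions, not the paper's module: the learned matching function F(·) is omitted, nearest-neighbor upsampling stands in for whatever interpolation the authors use, each deeper stage i + j is assumed to be 2^j times smaller than stage i, and all function names are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbor upsampling of a 2-D map by a factor of two."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def multiply(a, b):
    """Element-wise product of two equally sized 2-D maps."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise sum of two equally sized 2-D maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def refine(low, highs):
    """Sum over j of low * upsampled(highs[j-1]); F(.) is omitted here."""
    refined = None
    for j, high in enumerate(highs, start=1):
        up = high
        for _ in range(j):          # assumption: stage i+j is 2**j times smaller
            up = upsample2x(up)
        term = multiply(low, up)    # deeper features gate the shallower ones
        refined = term if refined is None else add(refined, term)
    return refined


low = [[1] * 4 for _ in range(4)]       # stage-i map, 4 x 4
high_1 = [[1, 2], [3, 4]]               # stage i+1, 2 x 2
high_2 = [[10]]                         # stage i+2, 1 x 1
refined = refine(low, [high_1, high_2])
for row in refined:
    print(row)
```

In a real network the result would then go through the convolution, batch normalization and ReLU layers mentioned in the text before reaching the decoder.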

C. LOW-LEVEL FEATURE PYRAMID (LFP)
Natural images exhibit large scale variation of objects. A segmentation method that works well for a particular object scale might perform poorly at other scales. For instance, making predictions on images with large resolution is beneficial for small objects and vice versa. To address the problem of scale variation, several solutions have been proposed, such as combining features from shallower layers with deeper layers [13]-[17] to particularly deal with small objects, using dilated convolutions [8], [10], [26] to increase the receptive field, and independently performing dense prediction at different resolutions to treat objects at each scale differently. Besides our key contributions described in previous sections on aggregating the context of objects at multiple scales, in this section we introduce a novel low-level feature pyramid (LFP) that complements the features in the decoder to further improve the performance of the network. The LFP module takes the original image as input and reduces its spatial dimensions by a factor of 4 using bilinear interpolation. Different from the previous works [36], [40], our LFP does not contain many convolution operations, which helps it retain the true low-level information of the image. The reduced-dimension image is processed with a light-weight convolutional network to obtain low-level features. From this tensor we form a pyramid with three different scales: (a) original-sized low-level features, (b) low-level features downsampled by a factor of 2 using a 3×3 convolution with stride = 2, and (c) low-level features downsampled by …
It can be seen in Figure 2 that the decoder receives information from three different sources: (1) the last LKD block, (2) the refinement units, and (3) the low-level feature pyramid. Due to the presence of several convolutional layers in the refinement units and the backbone network, the features that the decoder receives from the skip pathways are no longer low-level features in the true sense. This is why the addition of pixel-level information via the LFP module yields significantly improved results, as we can see in Figure 4.
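As a quick sanity check on the spatial sizes involved, the following sketch computes the first two pyramid levels for a full-resolution Cityscapes image. It is only an arithmetic illustration: the padding of 1 for the 3×3 stride-2 convolution and the function names are our assumptions, and the third level is left out because its downsampling factor is not stated in the text above.

```python
def conv_out(n, kernel=3, stride=2, pad=1):
    """Output length of a strided convolution along one axis (assumed padding 1)."""
    return (n + 2 * pad - kernel) // stride + 1


def lfp_sizes(height, width, reduce=4):
    """Spatial sizes of pyramid levels (a) and (b) for an input image."""
    a = (height // reduce, width // reduce)    # bilinear reduction by 4
    b = (conv_out(a[0]), conv_out(a[1]))       # 3x3 convolution with stride 2
    return {"a": a, "b": b}


sizes = lfp_sizes(1024, 2048)
print(sizes)  # {'a': (256, 512), 'b': (128, 256)}
```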

D. OVERALL ARCHITECTURE
In Figure 2 we show the schematic of our framework, which is based on the encoder-decoder design. We use ResNet-101 as the network's backbone. In order to maintain a good trade-off between speed and accuracy, we make the spatial resolution of the final feature maps 16 times smaller than the input image resolution. After getting the output feature maps from ResNet, we pass them through a semantic consolidation module. The semantic consolidation module consists of four densely connected LKD blocks. The LKD block makes use of a large kernel convolution [12] paralleled with a dilated convolution [26] to aggregate contextual information without loss of local detail. The dilated convolution is capable of embedding abundant context detail without decreasing the spatial resolution. However, the holes in dilated kernels may eliminate the influence of a pixel's immediate surroundings on it, especially in the case of high dilation rates (e.g., 6, 12, 18, 24). Therefore, we address this issue by adding a complementary large kernel branch in parallel.
Each LKD block in the semantic consolidation module uses a different dilation rate and large kernel size to compute features at different scales. The dilated convolution is performed using a 3×3 kernel with a certain dilation rate, whereas the large kernel convolution is implemented as a cascade of a 1 × k convolution and a k × 1 convolution in order to reduce the parameters and computation. With the dilated convolution we accumulate the influence on each neuron from some of the far-away neurons. With the large kernel convolution, on the other hand, we aggregate responses for each neuron from all neurons in its local neighborhood. Together, the large kernel branch and the dilated convolution branch aggregate local and non-local context. We connect each LKD block with all subsequent LKD blocks in a feed-forward fashion by using dense connections, which is different from the ASPP module [8] that has four independent parallel branches. Therefore, in our network the features from different LKD blocks are well integrated. Moreover, we use the global average pooling operation to obtain a descriptor that incorporates the global context. We upsample this descriptor in order to concatenate it with the output of the last LKD block.
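The savings from the 1 × k and k × 1 cascade, and the reach of a dilated 3×3 kernel, follow from simple counting. The sketch below shows that arithmetic only; the value k = 7, the 256-channel width and the function names are hypothetical examples rather than settings reported by the authors, and bias terms are ignored.

```python
def dense_params(k, c_in, c_out):
    """Weights in a full k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def factorized_params(k, c_in, c_out, c_mid=None):
    """Weights in a 1 x k convolution followed by a k x 1 convolution."""
    if c_mid is None:
        c_mid = c_out              # assumption: intermediate width equals output
    return k * c_in * c_mid + k * c_mid * c_out


def dilated_span(d, kernel=3):
    """Side length of the window covered by a kernel dilated by d."""
    return d * (kernel - 1) + 1


full = dense_params(7, 256, 256)
cascade = factorized_params(7, 256, 256)
print(full, cascade, full / cascade)                  # 3211264 917504 3.5
print([dilated_span(d) for d in (6, 12, 18, 24)])     # [13, 25, 37, 49]
```

The cascade grows linearly in k while the full kernel grows quadratically, so the gap widens as the large kernel gets larger. A 3×3 kernel with rate 24 spans 49 pixels per side yet samples only 9 of them, which is the gap the dense branch is meant to fill.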
Finally we have a decoder in the network that receives high-level features from the last LKD block, low-to-mid level features from the refinement unit and low-level features from the low-level feature pyramid module. The decoder effectively combines all these features to make the final dense prediction of semantic segmentation. Figure 3 illustrates outputs at different stages in the decoder. We can see that the extracted features become progressively more precise and objects that are mixed together are gradually separated, resulting in segments with well defined contours.

IV. EXPERIMENTS
The proposed method aggregates low, mid and high level features at multiple scales in order to provide precise segmentation results on images having small, medium and large size objects depicting scale variations. To show the effectiveness of our framework, we carry out comprehensive experiments on two large-scale diverse datasets: Cityscapes [43] and ADE20K [44]. The quantitative performance is measured in terms of mean class-wise intersection-over-union (mIoU). Next we briefly describe each of these benchmark datasets.
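For readers unfamiliar with the metric, here is a minimal sketch of class-wise mIoU on flattened label lists. It is not the official benchmark evaluation code: the function name, the convention of skipping pixels whose ground truth equals an ignore label (255 here, an assumed value) and the choice to average only over classes that actually occur are our assumptions.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean of per-class intersection-over-union over flattened label lists."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:      # unlabeled pixels do not count
            continue
        if p == t:
            inter[t] += 1          # true positive for class t
            union[t] += 1
        else:
            union[p] += 1          # false positive for predicted class p
            union[t] += 1          # false negative for true class t
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 4))  # IoUs are 1/2, 2/3 and 1, so the mean is about 0.7222
```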

A. DATASETS
1) CITYSCAPES [43]
This dataset was developed specifically for semantic understanding of urban street scenes. It is a large-scale, diverse dataset captured in 50 different cities and comprises 5,000 high-resolution (2048 × 1024) street images with pixel-level annotations, of which 2,975 are for training, 500 for validation and 1,525 for testing. Note that in our experiments, we consider only the 19 categories that have finely annotated labels.
2) ADE20K [44]
It contains 20,210 training, 2,000 validation and 3,000 test images with 150 stuff/object category labels. It is a challenging dataset in which the frequency of objects in images ranges from 5 to 273.

B. IMPLEMENTATION DETAILS
The backbone of our network is ResNet-101 pretrained on ImageNet-1k. Following the works of [9], [28], we replace the first 7 × 7 convolution layer of the backbone with one 3 × 3 convolution of stride 2 and two 3 × 3 convolutions of stride 1. In order to strike the right balance between speed and accuracy, we make the backbone output resolution 1/16 of the input image resolution, which is achieved by using dilated convolutions with small rates in block 3 and block 4 of the ResNet [9].
FIGURE 4. Ablation study: effect of adding different modules in our framework. As we add the modules, the visual appearance becomes more refined and faithful to the ground-truth.
TABLE 1. Ablation study on dilation rates and large kernel sizes.
To perform experiments on the Cityscapes and ADE20K datasets, we employ the 'poly' learning rate policy $lr = base\_lr \times \left(1 - \frac{iter}{total\_iters}\right)^{0.9}$, with base_lr set to 0.015, weight decay to 0.0001 and momentum to 0.9. For data augmentation, we use random flipping and random scaling in the range [0.5, 2]. The batch normalization parameters are trained with decay equal to 0.9997. In the case of the Cityscapes dataset, we use a crop size of 769 × 769 and a batch size of 12, and training is performed for 90K iterations. For the ADE20K dataset, we set the batch size to 24, the crop size to 513 × 513 and the total number of training iterations to 150K.
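The 'poly' schedule is easy to reproduce. The sketch below uses the base learning rate of 0.015 and power 0.9 from the text; the function name `poly_lr` is ours, and how the value is fed to an optimizer is left out.

```python
def poly_lr(iteration, total_iters, base_lr=0.015, power=0.9):
    """'Poly' learning-rate policy: base_lr * (1 - iter / total_iters) ** power."""
    return base_lr * (1.0 - iteration / total_iters) ** power


# Cityscapes setting from the text: 90K training iterations.
total = 90_000
for it in (0, 30_000, 60_000, 90_000):
    print(it, poly_lr(it, total))
```

The rate starts at base_lr, decays smoothly and reaches exactly zero at the last iteration.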

C. ABLATION STUDY
Our network consists of several individual modules: the semantic consolidation module, the refinement units and the low-level feature pyramid. Here, we evaluate the design choices of these modules and their influence on the overall performance. The semantic consolidation module contains four LKD blocks, where each block includes a large kernel branch and a dilated convolution branch. Table 1 shows that the selection of an appropriate size for the large kernel (k) and of the dilation rate for the dilated convolution is important to obtain improved performance. Similarly, the connection mode (i.e., dense, parallel or cascaded) of these four LKD blocks is also important, as shown in Table 2. Based on the experiments, we opt for the parallel layout for the large kernel and dilated convolution branches, and use dense connections for the LKD blocks, as shown in Figure 2.
Table 3 shows the comparison with the baseline (DenseASPP [10]) and the impact of progressively integrating each of our contributions into the baseline. Note that the original DenseASPP is based on DenseNet121 [46]. To make the results comparable, in this paper we use ResNet101 as the base network of the DenseASPP baseline. The baseline [10], in which we use dilation rates of {6, 9, 12, 18}, obtains a mIoU score of 76.77%. By integrating our core modules into the baseline, we achieve a mIoU score of 79.31%, which is a significant improvement of 2.54% over the baseline [10]. Figure 4 illustrates the improvement in segmentation quality obtained by progressively integrating each of our building blocks into the baseline. The semantic consolidation module (SCM), when used alone, yields coarse results. The integration of the refinement units (RU) and the low-level feature pyramid (LFP) allows the network to recover the spatial information, leading to an improvement in segmentation quality, i.e., sharp object boundaries.
Figure 5 shows the comparison of our approach with the baseline DenseASPP [10] on the ADE20K dataset. As shown in Figure 5, DenseASPP [10] yields segmentation results with missing objects (right pillow in row 1), coarse appearance (kids in row 2, chairs in row 3, and coffee table in row 4) and incorrect categories (sofa in row 4), whereas our method performs better in all of these cases. Figure 6 depicts a visual comparison between our method and the DenseASPP baseline [10]. It is apparent that [10] has a tendency to assign false labels to some objects; notice the sidewalk and the car in row 1 and row 2, respectively. In contrast, our network generates relatively accurate segmentation results; compare the fence and overhead pole in row 3 and row 4, respectively.
Table 4 shows the comparison results on the Cityscapes test set. Our method obtains a competitive result, achieving a mIoU score of 80.5%. Furthermore, for some classes such as bus and pole, the proposed method attains a significant performance improvement of 5% over existing approaches. Table 5 reports the results on the ADE20K validation set. The baseline DenseASPP [10] obtains a mIoU score of 41.51%. Our approach provides superior results compared to existing methods, with a mIoU of 44.0%.
FIGURE 6. Our method produces more accurate and sharper results than the DenseASPP [10] baseline. Compare the sidewalk, car, fence, and poles in row 1, 2, 3 and 4, respectively. DenseASPP achieves 76.77% mIoU on the val set, whereas we obtain 79.31%.

D. COMPARISON WITH OTHERS
In Figure 7 we present a comparison of segmentation maps generated by our network and the most popular methods [7], [9], [10] on ADE20K (first 3 rows) and Cityscapes (last 3 rows). It is apparent that our method produces results that are more precise, sharp and faithful to the ground-truth than those of the other competing approaches. For example, compare the detail in the following regions; row 1: side table, sofa cushion, window blinds and plant; row 2: table, screens and desktop PC; row 3: person and window; row 4: bus; row 5: fences and poles; row 6: traffic light, poles and persons in the background.
The proposed framework is effective at segmenting not only large objects (see the accuracy of the bus and train classes in Table 4) but also small objects. A visual illustration is presented in Figure 8, where it is evident that our method segments small objects with much higher accuracy than other competing methods. Note that only the cropped (zoomed-in) results are shown for better visualization.

E. IMPROVEMENTS ON OBJECT BOUNDARIES
In this section, we use the trimap experiment to quantify the accuracy of different methods near object boundaries. Specifically, we first detect the label boundaries on the Cityscapes val set and apply morphological dilation to these boundaries, assigning the void label to the resulting dilated band (called the trimap). We then compute the mean IoU over the remaining pixels. Fig. 9 shows the mIoU results of our method, DenseASPP [10] and ASPP [8]. Note that a trimap width of 0 means that no operation is applied to the labels. We can see that there is a large gap between our method and the other two methods when no trimap is applied. When some boundary information is ignored, the performance of DenseASPP [10] and ASPP [8] improves considerably.
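The label preprocessing for the trimap experiment can be sketched as follows. This is an illustration under our own assumptions rather than the authors' evaluation script: boundaries are detected with 4-connectivity, the dilation uses a square window so that a width of 1 voids only the boundary pixels themselves, the void label value 255 is assumed, and the function names are hypothetical.

```python
IGNORE = 255  # assumed void label value


def boundary_mask(labels):
    """Mark pixels that have a 4-connected neighbor with a different label."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != labels[r][c]:
                    mask[r][c] = True
    return mask


def void_boundary_band(labels, width):
    """Copy of labels where a band of the given width around boundaries is void."""
    out = [row[:] for row in labels]
    if width == 0:                 # width 0: labels are left untouched
        return out
    mask = boundary_mask(labels)
    h, w = len(labels), len(labels[0])
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                # Square dilation of radius width - 1 around each boundary pixel.
                for rr in range(max(0, r - width + 1), min(h, r + width)):
                    for cc in range(max(0, c - width + 1), min(w, c + width)):
                        out[rr][cc] = IGNORE
    return out


labels = [[0, 0, 0, 1, 1, 1],
          [0, 0, 0, 1, 1, 1]]
for width in (0, 1, 2):
    print(width, void_boundary_band(labels, width)[0])
```

The mIoU is then computed as usual while skipping the void pixels, so a wider band removes more of the hard boundary pixels from the evaluation.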

V. CONCLUSION
Existing semantic segmentation approaches generally put strong emphasis on high-level context aggregation, but do not explicitly include low- and mid-level information. Here, we show that an optimal combination of the features present at all three stages leads to a refined recovery of local details. Specifically, we propose a high-level context aggregation block that combines the strengths of dilated (sparse) and wider (dense) kernels to enhance the expressive power of the network and avoid unfairly neglecting the local information. Besides, we employ a light-weight feature pyramid augmentation and a semantic refinement unit to generate low- and mid-level features that are mixed with high-level features at the decoder. The designed modules are flexible and can be easily migrated to other networks. Our extensive experiments on two segmentation benchmarks show compelling improvements over the previous best techniques.