HCNet: Hierarchical Context Network for Semantic Segmentation

Global context information is vital in visual understanding problems, especially in pixel-level semantic segmentation. Mainstream methods adopt the self-attention mechanism to model global context information. However, pixels belonging to different classes usually have weak feature correlation, so modeling the global pixel-level correlation matrix indiscriminately, as the self-attention mechanism does, is extremely redundant. To solve the above problem, we propose a hierarchical context network that differentially models homogeneous pixels with strong correlations and heterogeneous pixels with weak correlations. Specifically, we first propose a multi-scale guided pre-segmentation module to divide the entire feature map into different class-based homogeneous regions. Within each homogeneous region, we design the pixel context module to capture pixel-level correlations. Subsequently, different from the self-attention mechanism, which still models weak heterogeneous correlations in a dense pixel-level manner, the region context module is proposed to model sparse region-level dependencies using a unified representation of each region. Through aggregating fine-grained pixel context features and coarse-grained region context features, our proposed network can not only hierarchically model global context information but also harvest multi-granularity representations to more robustly identify multi-scale objects. We evaluate our approach on the Cityscapes and ISPRS Vaihingen datasets. Without bells and whistles, our approach realizes a mean IoU of 82.8% on the Cityscapes test set and an overall accuracy of 91.4% on the ISPRS Vaihingen test set, achieving state-of-the-art results.


Introduction
Semantic segmentation is a vital part of the visual understanding system. It aims to parse images by assigning a class label to each pixel of an image. Currently, semantic segmentation technology is widely used in various fields such as autonomous driving [4,46] and remote sensing image interpretation [27,28].
Traditional methods mostly adopt machine learning algorithms to perform image segmentation with various techniques, such as thresholding [37], region growing [19], edge detection [1,22], clustering [5,53], etc. Most successful works are based on hand-crafted features, such as HOG [10], SIFT [3], etc. However, with the rise of deep learning, traditional methods relying on feature engineering have gradually been replaced by convolutional neural networks (CNNs) with adaptive feature learning. Block-based semantic segmentation is an early representative CNN-based method. It first extracts regular blocks from the image in a sliding-window manner and performs classification using common CNNs (such as AlexNet [23], VGG [36], GoogLeNet [38] and ResNet [20]). The prediction result for the image block is regarded as the class of its center pixel. For example, Sakrapee et al. [31] exploit CNNs for semantic pixel labelling by cropping multi-resolution image blocks. However, the block-based methods have great redundancy due to repeated feature extraction in overlapping regions.
The situation changed fundamentally after the emergence of the fully convolutional network (FCN). It learns a mapping from pixels to pixels without extracting image blocks. However, due to its fixed geometric structure, the conventional FCN is inherently limited to local receptive fields. This lack of global context information imposes a great adverse effect on its segmentation accuracy. To make up for the above deficiency of FCN, some works obtain global context information from the perspective of multi-scale aggregation. Multiple studies [25,26,33,51] adopt pooling operations to generate multi-resolution features, which are then up-sampled and aggregated for prediction. In addition, other works [7,8] apply dilated convolutions with diverse dilation rates to acquire multi-scale contextual information. However, these multi-scale aggregation methods adopt a non-adaptive extraction process for all pixels, which cannot meet the requirement of position-specific context dependencies.
Recently, some works have focused on using the self-attention mechanism [40] to capture global context information for semantic segmentation. OCNet [48] aggregates object context by computing the correlations of each pixel with all the other pixels. Similarly, DANet [17] and Relational Context-aware Network [30] explore dense pixel-level contextual correlations through the self-attention mechanism in both spatial and channel dimensions. However, we found that the correlation between pixels belonging to different classes is usually weak in these methods, which means that these low-correlation positions have minimal impact on the feature representation of the current position. Therefore, performing dense pixel-level modeling between these pixels gives rise to enormous redundant computation.

(Figure 1 caption: the non-local module models the dense pixel-level correlation between the current position and all other positions indiscriminately. Our method first models the sparse pixel-level context between pixels of the same class; the regional context between different classes is then captured through the proposed region context module.)
To address the drawback of the self-attention mechanism, we propose a hierarchical context network (HCNet) to model global context information. Specifically, pixel-level correlation is still captured between pixels of the same class with strong correlations, while a unified region-level correlation is modeled for heterogeneous pixels with weak correlations. As illustrated in Figure 2, we append two streams, named the context stream and the prior stream, at the end of dilated ResNet. The prior stream is designed to provide the region partition result to the context stream through the proposed multi-scale guided pre-segmentation. The context stream consists of a pixel context module (PCM) and a region context module (RCM). Concretely, the PCM is first proposed to model pixel-level correlations between any two positions within each homogeneous region, as illustrated in Figure 1 (b). Subsequently, instead of performing dense pixel-level modeling between different homogeneous regions as in the self-attention mechanism, we capture the correlations between region representations with the proposed RCM, as illustrated in Figure 1 (c). The region representation is obtained by the proposed region pooling, and the enhanced region representation is restored to the pixel representation by region unpooling. Finally, we aggregate the outputs of the above hierarchical context modules to obtain features with a global representation. In summary, our main contributions are threefold:
• To remedy the redundant modeling of heterogeneous pixels in the self-attention mechanism, we design HCNet to efficiently capture global context information for more accurate semantic segmentation.
• A PCM is proposed to learn pixel-level dependencies within each homogeneous region generated by the proposed prior pre-segmentation. An RCM is designed to model region-level context between different regions with the help of the proposed region pooling and unpooling. Through aggregating fine-grained pixel context features and coarse-grained region context features, HCNet can harvest multi-granularity representations to more robustly identify multi-scale objects.
• The proposed HCNet achieves leading performance on two authoritative segmentation datasets used for autonomous driving and aerial interpretation, including Cityscapes and ISPRS Vaihingen datasets.

Related work
Multi-scale context for segmentation. Fully Convolutional Networks (FCNs) [26] successfully transform semantic segmentation into a per-pixel labeling task by replacing the fully connected layers in DCNNs [20,21,23,36,38] with convolutional ones. Following that, several FCN-based works have been proposed to capture rich contextual information from the perspective of multi-scale aggregation. RefineNet [25], ExFuse [50] and CCL [13] fuse multi-resolution features through an encoder-decoder structure, which achieves the complementation of detail information and semantic information as well as obtaining rich multi-scale context. Correspondingly, PSPNet [51] and the Deeplabv3 series [7,8] obtain abundant contextual information using parallel multi-scale branches with different pooling kernel sizes or dilated convolutions with diverse dilation rates.

Self-attention for segmentation. The self-attention mechanism, first proposed in machine translation [40], has been widely used in computer vision to re-model the feature space according to pixel-level dependencies between each pair of pixels. [42] proposes a self-attention module to model dependencies in the space-time dimension. Due to its outstanding performance in capturing contextual information, the self-attention module has been increasingly applied in various computer vision tasks [17,24,42]. OCNet [48] aggregates object context by computing the similarities of each pixel and all the other pixels, which is essentially equivalent to the self-attention module. Similarly, DANet [17] and Relational Context-aware Network [30] explore contextual dependencies through the self-attention module in both spatial and channel dimensions.
Considering that pixel-level similarities between different classes are commonly insignificant, establishing dense pixel-level dependencies leads to massive redundant relationships and high complexity in time and space. Accordingly, our proposed method captures the dependencies between pixels within each homogeneous region and then models the correlations between different regions. It can hierarchically capture global context information while effectively reducing the computational complexity.

Hierarchical structure. There are many successful applications of hierarchical structures, such as document classification [45], response generation [43] and action recognition [14]. A hierarchical structure is designed to extract contextual information at both word and sentence levels for document classification and response generation [43,45]. In addition, for the action recognition task, [14] divides the human skeleton into five parts according to the physical structure of the human body and then extracts their features respectively, which are fused to produce a final representation of the skeleton sequence at a higher hierarchy.
Our method introduces the hierarchical structure into the semantic segmentation task for the first time, in which we partition the whole feature map into several class-based homogeneous regions and then explore contextual information within and between regions from pixel-level and region-level in a hierarchical structure.

Overview
The overall architecture of our proposed network is shown in Figure 2, which consists of a context stream and a prior stream. To begin with, an input image is processed by dilated ResNet [47] pre-trained on the ImageNet dataset [11] to produce a feature map F with a spatial size of H × W. Considering the importance of global context, we further introduce two hierarchical context modules on top of dilated ResNet in the context stream, namely the PCM and the RCM, to hierarchically capture global context information at pixel-level and region-level. Meanwhile, the prior stream is designed to provide region prior information for the context stream, in which we design a multi-scale guided pre-segmentation strategy to partition the feature map into several class-based homogeneous regions. Finally, the feature map enhanced by global information is up-sampled to the original resolution and fed into a softmax function to obtain the probability of each pixel belonging to each class. The class with the highest probability is taken as the final prediction for the pixel.

Multi-scale guided pre-segmentation
The prior stream aims to partition the input feature map and provide the region partition result for the context stream. At first, we tried to use superpixels to achieve this. However, most superpixel segmentation methods perform unsupervised iterative clustering. On the one hand, the iterative process leads to huge computational complexity. On the other hand, these methods cannot guarantee accurate semantic homogeneous regions, considering the differences in object appearance and the unsupervised optimization process.
Therefore, we propose a multi-scale guided pre-segmentation module, which can flexibly partition features into class-based homogeneous regions according to the supervised guidance of the ground truth. As shown in Figure 3, the input feature map F is fed into three parallel dilated convolutions with dilation rates (1, 3, 5). Each convolution outputs a feature map with 64 channels. Then, the feature maps of the three branches are aggregated through element-wise addition. Finally, a 1 × 1 convolution layer and a softmax function are applied to obtain the affiliated probability prediction Q ∈ R^{N×H×W}, in which N represents the number of classes. During training, we use a prior loss L_prior to supervise the affiliated probability prediction Q.
The proposed multi-scale guided pre-segmentation can generate semantic homogeneous regions while introducing few convolution parameters. In particular, the convolutions with three different dilation rates can integrate multi-scale features to enhance the sensitivity of the proposed pre-segmentation module to multi-scale objects. Moreover, the auxiliary supervision L_prior can directly transfer the gradient to the shallower layers while accelerating the network training process.
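As a concrete illustration, the pre-segmentation branch described above can be sketched in PyTorch as follows. The 64-channel branch width and the dilation rates (1, 3, 5) follow the text; the 3 × 3 kernel size and other details are assumptions:

```python
import torch
import torch.nn as nn

class PreSegmentation(nn.Module):
    """Sketch of the multi-scale guided pre-segmentation module."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # three parallel 3x3 dilated convolutions with rates 1, 3, 5
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5)
        ])
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)  # 1x1 conv

    def forward(self, x):
        # element-wise addition of the three branch outputs
        fused = sum(branch(x) for branch in self.branches)
        # affiliated probability prediction Q of shape (B, N, H, W)
        return torch.softmax(self.classifier(fused), dim=1)
```

During training, Q would be supervised by L_prior against the ground-truth labels.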

Pixel Context Module
Establishing pixel-level dependencies can capture rich contextual information to enhance the feature representation. Different from [17,42], which model dense pixel-level dependencies on the entire feature map, we introduce a relatively sparse PCM to establish pixel-level dependencies within each homogeneous region. As illustrated in Figure 4, taking the affiliated probability prediction Q ∈ R^{N×H×W} obtained by the multi-scale guided pre-segmentation, we first perform argmax to get the explicit region boundary prediction T ∈ R^{H×W}, and then divide the input feature map X into N homogeneous regions {B_i | i = 1, 2, ..., N}. For each region B_i ∈ R^{C×K_i}, in which K_i represents the number of pixels of the i-th region, we capture pixel-level dependencies using the Position-independent Attention Module (PiAM, detailed below) to generate B'_i ∈ R^{C×K_i}. Finally, we aggregate the features {B'_i | i = 1, 2, ..., N} to reconstruct a new feature map X' ∈ R^{C×H×W} according to the explicit region boundary prediction T.
Quantitatively, given the number of regions N, the time and space complexity of the PCM are on the order of O(K_i^2) for the i-th region, i.e., roughly O((HW/N)^2) per region when the regions are of comparable size, instead of O((HW)^2) for the whole feature map. With the help of the explicit region boundary prediction T, our PCM models pixel-level context within each class-based homogeneous region. On the one hand, it can aggregate the features of strongly associated positions in a sparser way to enhance the pixel representation. On the other hand, ignoring the features of weakly associated positions at the pixel level reduces the redundancy of self-attention without affecting the model performance.
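A minimal NumPy sketch of the PCM's region-restricted attention. The learned projections and the scale parameter of PiAM are omitted for brevity, and the softmax normalization here is an assumption; attention simply runs over the raw features of each region:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_context(X, T):
    """Self-attention restricted to each homogeneous region.
    X: (C, H, W) feature map; T: (H, W) region boundary prediction."""
    C, H, W = X.shape
    flat = X.reshape(C, H * W)
    labels = T.reshape(-1)
    out = flat.copy()
    for cls in np.unique(labels):
        idx = np.nonzero(labels == cls)[0]
        B = flat[:, idx]                  # (C, K_i) homogeneous region
        A = softmax(B.T @ B, axis=-1)     # (K_i, K_i) in-region correlations
        out[:, idx] = B @ A.T + B         # enhance, keeping a residual path
    return out.reshape(C, H, W)
```

Because attention is computed per region, each correlation matrix is K_i × K_i rather than HW × HW, which is the source of the complexity reduction discussed above.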

Region Context Module
The PCM only obtains the context between pixels within each homogeneous region. This section further proposes an RCM to capture the context between different regions. By combining the proposed hierarchical PCM and RCM, the global context can be completely constructed while avoiding the redundancy of self-attention methods. As shown in Figure 5, our RCM mainly includes region pooling, region-level attention and region unpooling.
The purpose of region pooling is to achieve the scale conversion from fine-grained pixel representations to representative region representations. Considering that the explicit region boundary prediction T cannot provide sufficient information about the relationship between pixels and each region, we adopt the affiliated probability prediction Q as the mapping index from pixel to region. To be more specific, given the input feature map X ∈ R^{C×H×W} and the affiliated probability prediction Q ∈ R^{N×H×W}, we reshape them to R^{C×HW} and R^{N×HW}, respectively. Then the region representation R ∈ R^{C×N} can be calculated as follows:

R_{i,j} = \frac{\sum_{k=1}^{HW} X_{i,k} Q_{j,k}}{\sum_{k=1}^{HW} Q_{j,k}},

where R_{i,j} represents the feature of the i-th channel in the j-th region. After obtaining the region representations, we apply the proposed PiAM to adaptively capture the region correlations and enhance the region representations. Specifically, we feed the region representations R into PiAM to obtain new region features R' ∈ R^{C×N}. Finally, the region unpooling directly performs matrix multiplication between the region features R' and the affiliated probability prediction Q ∈ R^{N×H×W} to recover the pixel representation X' ∈ R^{C×H×W} as follows:

X'_{i,j,k} = \sum_{n=1}^{N} R'_{i,n} Q_{n,j,k},

where X'_{i,j,k} represents the feature of the i-th channel in row j and column k of the feature map X'.

Previous pooling operations usually aggregate the features of regular regions indiscriminately, while our region pooling selectively aggregates pixel features according to the pixel-region affiliated probability. Therefore, it can effectively deal with irregular regions, and the resulting coarse-grained region features are more representative. Following that, region-level attention can adaptively capture region-level dependencies and update the region representations accordingly. Most importantly, the region unpooling can effectively recover the pixel representation with accurate details through the pixel-region affiliated probability.
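The pooling and unpooling steps above can be sketched in NumPy as follows; normalizing the pooled features by each region's total assignment mass is an assumption:

```python
import numpy as np

def region_pooling(X, Q):
    """Soft-assign pixels to N regions via the probabilities Q.
    X: (C, H, W) features, Q: (N, H, W) probabilities. Returns R: (C, N)."""
    C = X.shape[0]
    N = Q.shape[0]
    Xf = X.reshape(C, -1)                            # (C, HW)
    Qf = Q.reshape(N, -1)                            # (N, HW)
    mass = Qf.sum(axis=1, keepdims=True).T + 1e-8    # (1, N) per-region mass
    return (Xf @ Qf.T) / mass                        # (C, N) region features

def region_unpooling(R, Q):
    """Scatter region features back to pixels: X' = R' Q."""
    C = R.shape[0]
    N, H, W = Q.shape
    return (R @ Q.reshape(N, -1)).reshape(C, H, W)
```

With a one-hot Q (hard assignment), region pooling reduces to a per-region mean and unpooling broadcasts that mean back to the region's pixels; the soft probabilities make both operations differentiable.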

Position-independent attention module
The self-attention mechanism has a talent for capturing the internal correlations of features, which are then used to update the original features. However, previous self-attention modules are usually designed for regular feature maps, while the proposed PCM and RCM require modeling the correlations between elements of irregular feature sets. Here we propose a PiAM for correlation modeling and feature enhancement on irregular feature sets.
As shown in Figure 6, given an input feature set B ∈ R^{C×K}, where C and K represent the number of feature channels and the set length respectively, we first apply two different convolution layers to generate two feature maps O and P, where {O, P} ∈ R^{C/4×K}. Different from the squared-difference form of the Euclidean distance, we calculate the correlation coefficient between any two elements of the feature set through matrix multiplication:

A = O^⊤ P,

where A ∈ R^{K×K} represents the correlation coefficient matrix, and A_{i,j} represents the correlation coefficient between elements i and j. Subsequently, we perform normalizing rank aggregation (NRA) on the matrix A:

\hat{A}_{i,j} = \frac{\exp(A_{i,j})}{\sum_{k=1}^{K} \exp(A_{i,k})}.

This ensures that the statistics of the enhanced features do not change significantly, because the weights over the different feature elements sum to 1. The feature set is then updated through matrix multiplication between B and \hat{A} to enhance the feature representation. At last, the updated feature set is multiplied by a scale parameter α and summed element-wise with the original feature set B to acquire the final output B' ∈ R^{C×K}:

B' = α (B \hat{A}^⊤) + B,

where α is initialized to 0 to ensure stability at the beginning of training. In general, our proposed PiAM can perform feature similarity measurement and enhancement on irregular sets. Compared with previous attention modules for regular feature maps, it can effectively deal with irregular feature sets. Moreover, by ignoring the two squared terms in the expansion of the Euclidean distance and directly calculating the correlation coefficient matrix, the computational cost can be significantly reduced.
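A compact NumPy sketch of PiAM under these definitions. The matrices `Wo` and `Wp` stand in for the two convolution layers, interpreting NRA as a row-wise softmax is an assumption, and α is treated as a fixed scalar rather than a learned parameter:

```python
import numpy as np

def piam(B, Wo, Wp, alpha=0.5):
    """Position-independent attention on an irregular feature set.
    B: (C, K) feature set; Wo, Wp: (C//4, C) projections; alpha: scale."""
    O, P = Wo @ B, Wp @ B                    # (C/4, K) each
    A = O.T @ P                              # (K, K) correlation coefficients
    e = np.exp(A - A.max(axis=1, keepdims=True))
    A_hat = e / e.sum(axis=1, keepdims=True) # NRA: each row sums to 1
    return alpha * (B @ A_hat.T) + B         # enhance with a residual connection
```

Setting alpha to 0 recovers the input unchanged, matching the paper's initialization for stable early training.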

Loss function
Considering the large variation in the number of pixels of each class in the training set, we adopt a weighted cross-entropy loss function L_context to train the proposed model:

\mathcal{L}_{context} = -\sum_{i} w_i\, y_i \log p_i,

where y_i represents the ground truth of the current pixel and p_i is the prediction result of the softmax. w_i represents the weight of the i-th class, which is calculated through median frequency balancing [15]:

w_i = \frac{f_{median}}{f_i},

where f_i is the pixel frequency of the i-th class and f_median is the median of all these frequencies. In addition, we introduce an auxiliary loss L_prior on the output of the multi-scale guided pre-segmentation module in Figure 2, based on the above weighted cross-entropy loss. The total loss is denoted as:

\mathcal{L} = \mathcal{L}_{context} + \lambda \mathcal{L}_{prior},

in which λ is a hyperparameter used to control the weight between L_context and L_prior.
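The two loss components can be illustrated with a small pure-Python sketch; averaging over pixels is an assumption:

```python
import math

def median_frequency_weights(freqs):
    """Median frequency balancing: w_i = f_median / f_i."""
    sorted_f = sorted(freqs)
    n = len(sorted_f)
    median = (sorted_f[n // 2] if n % 2 == 1
              else 0.5 * (sorted_f[n // 2 - 1] + sorted_f[n // 2]))
    return [median / f for f in freqs]

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w_y * log p_y over pixels.
    probs[j] is the softmax vector for pixel j; labels[j] its class index."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / len(labels)
```

Rare classes (small f_i) receive weights above 1, and classes more frequent than the median receive weights below 1, counteracting the pixel-count imbalance.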

Experiments
To evaluate our proposed method, we conduct extensive experiments on the Cityscapes dataset [9] and the ISPRS Vaihingen dataset [16], which differ greatly in spatial distribution and object scale. The spatial distribution is related to the imaging perspective: the former is captured from the front view and the latter from the top view. The scale variation depends on the distance between the object and the camera and on the size of the object itself, where the distance between the objects and the camera in aerial images is approximately constant. For instance, the size of cars varies greatly in autopilot images but is roughly the same in aerial images, as shown in Figure 7. Most importantly, they are the authoritative benchmarks for semantic segmentation in the fields of autonomous driving and remote sensing, respectively.

Datasets
Cityscapes. This is a large-scale dataset for semantic urban scene understanding, which contains 5,000 images with fine annotations and 20,000 images with coarse annotations. The dataset is collected from 50 different cities and includes a total of 30 classes, 19 of which are used for actual training and validation. Note that in our experiments we only use the 5,000 images with fine annotations, which are divided into 2,975, 500 and 1,525 images for training, validation and online testing, respectively.
ISPRS Vaihingen. This dataset consists of 33 airborne tiles of Vaihingen, each about 2500 × 2000 pixels in size. Each of them contains a high-resolution TOP (True Ortho Photo) tile and corresponding DSM (Digital Surface Model) and nDSM (normalized Digital Surface Model) data with a GSD (Ground Sampling Distance) of 9 cm. The TOP file contains three bands corresponding to the IR (near-infrared), R (red) and G (green) bands, respectively. Among these images, 16 tiles are used for training, in which all pixels are classified as impervious surface, building, low vegetation, tree, car or background. The remaining 17 tiles are withheld for testing. Note that we only adopt IRRG images, without the DSM and nDSM data, during training.

Implementation Details
We implement our method in PyTorch. Following [7,51], we initialize the learning rate to 0.01 and adopt the poly learning rate policy, in which the learning rate is multiplied by (1 − iter/total_iter)^0.9 after each iteration. For the optimizer, we use stochastic gradient descent (SGD) with weight decay 0.0005 and momentum 0.9. To ensure the stability of the parameters in the normalization layers, Synchronized BN [34] is adopted to collect the batch normalization statistics over the whole mini-batch. Specifically, all experiments are trained for 200 epochs with batch size 8 on 4 Tesla V100 GPUs with 16 GB of memory per GPU. To avoid overfitting, we employ common data augmentation strategies, including random scaling in the range of [0.5, 2], random horizontal flipping, and random cropping. In particular, rotation at 90° intervals is employed for the ISPRS Vaihingen dataset to simulate changes in flight direction. We set the crop size to 512 × 1024 for the Cityscapes dataset and 768 × 768 for the ISPRS Vaihingen dataset. As for the loss function, the weight λ of the prior loss for the multi-scale guided pre-segmentation is set to 0.8.
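The poly schedule is simple enough to state directly:

```python
def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    """Poly learning rate policy: lr = base_lr * (1 - iter/total_iter)**power."""
    return base_lr * (1.0 - cur_iter / total_iter) ** power
```

The rate starts at base_lr, decays smoothly, and reaches zero at the final iteration.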

Standard Pixel-wise Evaluation Metrics
Cityscapes. To assess performance, the Cityscapes benchmark relies on the standard Jaccard index, commonly known as intersection-over-union (IoU):

IoU = \frac{TP}{TP + FP + FN},

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set.
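The metric can be computed from pixel counts accumulated over the test set, for instance:

```python
def iou(tp, fp, fn):
    """Jaccard index from pixel counts accumulated over the whole test set."""
    return tp / (tp + fp + fn)

def mean_iou(per_class_counts):
    """per_class_counts: list of (TP, FP, FN) tuples, one per class."""
    return sum(iou(*c) for c in per_class_counts) / len(per_class_counts)
```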
ISPRS Vaihingen. Following the evaluation metrics of the ISPRS 2D Semantic Labeling Challenge [16], the per-class F1 score and the overall accuracy (OA) are adopted to evaluate the performance of our proposed model. The F1 score and OA are defined as follows:

F1 = \frac{2 \times precision \times recall}{precision + recall}, \quad precision = \frac{TP}{TP + FP}, \quad recall = \frac{TP}{TP + FN},

OA = \frac{\text{number of correctly classified pixels}}{\text{total number of pixels}}.
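These metrics can likewise be computed from accumulated counts:

```python
def f1_score(tp, fp, fn):
    """Per-class F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def overall_accuracy(correct, total):
    """Fraction of correctly classified pixels over all classes."""
    return correct / total
```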

Scale-sensitive IoU (S-IoU)
However, the above standard evaluation metrics regard the pixel as the basic evaluation element and then count the pixel-wise performance of each class separately. They can only reflect the performance of the model on each class and cannot evaluate the performance of the model on multi-scale objects. To evaluate the performance of models on objects at various scales, previous works [18,39] usually qualitatively divide objects of the same class into specific scales; for example, buses were intuitively regarded as large objects while bicycles were regarded as small objects. Since the size of objects varies greatly with the distance to the camera in natural images, these methods cannot quantitatively reflect the performance of the model on multi-scale objects. Thus, we propose an evaluation metric called scale-sensitive IoU (S-IoU) for quantitative evaluation. It regards each instance object as the basic evaluation unit with the help of the ground truth of the instance segmentation task.
Specifically, for each instance object, we first calculate its scale (area) and then match its mask in the prediction result. Subsequently, the intersection-over-union between the predicted mask and the label mask is calculated:

S\text{-}IoU = \frac{|P \cap G|}{|P \cup G|},

where P and G denote the sets of pixels in the predicted mask and the label mask of the instance, respectively.
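A small sketch of the per-instance computation, representing each mask as a set of pixel coordinates:

```python
def s_iou(pred_mask, gt_mask):
    """Per-instance IoU between predicted and ground-truth pixel sets."""
    pred, gt = set(pred_mask), set(gt_mask)
    return len(pred & gt) / len(pred | gt)

def ms_iou(instances):
    """Average S-IoU over the instances falling in one scale interval.
    instances: list of (pred_mask, gt_mask) pairs."""
    return sum(s_iou(p, g) for p, g in instances) / len(instances)
```

Grouping instances by area before averaging yields the interval metric mS-IoU used in the scale-sensitive evaluation.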

Ablation Study
In our network, the PCM and RCM are used to hierarchically capture global context information at pixel-level and region-level. In this section, we compare results against our baseline, dilated ResNet-101, in terms of standard mIoU.
In Table 1, we evaluate the effectiveness of our proposed context modules on the Cityscapes validation set. It can be seen that employing the PCM alone yields a result of 78.59% in mIoU, which outperforms the baseline by 3.22%. Meanwhile, HCNet with the RCM alone realizes 77.26% in mIoU, a 1.89% improvement over the baseline. It can also be seen that the RCM achieves a smaller improvement than the PCM. This is because capturing only region-level context information deprives the model of pixel-level detailed information. After integrating pixel-level and region-level context, the performance of our hierarchical context network improves to 79.86%, as expected.
To better understand the context modules visually, we provide visualization results for our hierarchical context modules. As shown in Figure 9, we also visualize the pre-segmentation and final segmentation results in columns 3 and 4, with the yellow dashed boxes marking the improved areas. For the PCM, we select one marker on each image and visualize the similarities with all the other positions belonging to the same homogeneous region as a similarity map of size H × W in column 5. In particular, the similarities between positions belonging to different regions are filled with zeros according to the pre-segmentation results. For example, the first marker in image (1) is pre-segmented as car, and the corresponding highlighted areas on the similarity map also belong to the car. Similarly, the second marker is pre-segmented as person, and its corresponding similarity map only responds in areas of the corresponding region. For the RCM, we can obtain the similarities of a certain class-based region with all class-based regions with a shape of 1 × N. Corresponding to the two markers in Figure 9, we visualize the correlations of car and person as histograms in Figure 10. It can be seen that there are specific correlations between different class-based regions, which can further enhance the discriminability of features.
We further exhibit the comparison of the increased FLOPs and GPU memory of HCNet and self-attention relative to the baseline (dilated ResNet-101). As shown in Table 2, HCNet achieves a 1.03% improvement in mIoU compared to self-attention, while significantly reducing the FLOPs and GPU memory by about 40% and 85%, demonstrating that our approach can capture global context information in a much more efficient manner.

Scale-sensitive Evaluation
In Table 3, we compare the performance of our HCNet with the baseline, DANet and PSPNet using our proposed S-IoU. Specifically, we first calculate the scale (measured by area) and S-IoU of all instance objects. Afterwards, all the objects are divided into three scale intervals by area (number of pixels), as shown in Table 3. For ease of comparison, we calculate the interval evaluation metric mS-IoU by averaging the S-IoU of all objects in each interval. Similarly, the overall evaluation metric can be obtained by averaging the S-IoU of all objects. It can be seen that HCNet achieves the best interval and overall performance compared to the other three methods. This confirms that our proposed hierarchical context modules are extremely effective for identifying multi-scale objects.

Benchmark Evaluation
To get optimal performance on the benchmark test set, we use our best model (i.e., HCNet with multi-scale guided pre-segmentation and hierarchical context modules). Additionally, we use multi-scale inference schemes with scales 0.5, 1.0, and 2.0.

Cityscapes Benchmark
In Table 4, we compare against published methods on the Cityscapes test set without using the coarse data. Among these methods, DeepLab-v2, RefineNet and DenseASPP, which utilize multi-scale context aggregation, do not achieve leading performance due to the lack of adaptive feature aggregation for each position. However, DANet, ACFNet, and CCNet, which use self-attention mechanisms, generally outperform the multi-scale methods, reflecting the necessity of adaptively modeling global context information. By combining the PCM and the RCM, HCNet can model global context information in a sparse manner and effectively capture multi-granularity features. Experimental results show that our model achieves the best performance of 82.8%, which is extremely competitive with recent state-of-the-art models. It is also important to stress that our model outperforms other state-of-the-art models by a clear margin in classes with large object sizes (such as bus, train, and wall) and with small object sizes (such as pole and traffic light).

ISPRS Vaihingen Benchmark
We carry out experiments on the ISPRS Vaihingen benchmark to further evaluate the effectiveness of our method. Table 5 shows the quantitative results of our HCNet on the ISPRS Vaihingen test set. ADL 3 [32], which adopts a CNN and uses post-processing schemes such as conditional random fields for classification, obtains an overall accuracy of 88.0%. DST 2 [35], using an FCN variant, and ONE 7 [2], using a SegNet variant, are limited by the local receptive field, achieving overall accuracies of 89.1% and 89.8%, respectively. The most powerful competitors are GSN [41], which uses gated convolution, and DLR 10 [29], which is combined with boundary detection. Although we do not use auxiliary techniques such as conditional random fields and boundary detection, HCNet ranks 1st in both per-class F1 score and overall accuracy compared with other state-of-the-art methods. In particular, our approach achieves 88.6% on the car class (average size 38 × 18 pixels), outperforming the previous best model by a large margin. This is because our proposed HCNet can simultaneously model global context and capture multi-granularity features.

Qualitative Results
In Figure 11, we provide a qualitative comparison between the results of the baseline and our HCNet on the Cityscapes validation set. As shown in the yellow dashed boxes of the first example, the baseline cannot identify the truck owing to its limited local receptive field, while our HCNet makes relatively accurate predictions due to the rich contextual information captured through the hierarchical context modules. Furthermore, the scene in the second row has a highly heterogeneous appearance, which leads to a misjudgment by the baseline. Thanks to the global context information provided by the PCM and RCM, HCNet is able to effectively enhance the discriminability of features for accurate reasoning and prediction. In particular, in the third row, we can see that our model also performs well on small objects. Figure 12 illustrates a few examples of segmentation results on the ISPRS Vaihingen test set. It can be seen that HCNet can accurately segment objects with great differences in appearance (such as buildings) and small-scale objects (such as cars), benefiting from our proposed context modules that can effectively capture contextual dependencies and multi-scale features.

Conclusion
In this paper, we propose a hierarchical context network (HCNet) for semantic segmentation, which can capture global context information more sparsely than the self-attention mechanism. Specifically, based on the region partition prior generated by the proposed multi-scale guided pre-segmentation, the PCM and RCM are designed to hierarchically model global context dependencies at pixel-level and region-level. Meanwhile, through aggregating fine-grained pixel context features and coarse-grained region context features, HCNet can harvest multi-granularity representations to more robustly identify multi-scale objects. Extensive experiments conducted on the Cityscapes and ISPRS Vaihingen datasets prove the effectiveness of our proposed modules. Furthermore, our method also achieves the best multi-scale performance, as evaluated by our proposed scale-sensitive IoU.