MISNet: Multiscale Cross-Layer Interactive and Similarity Refinement Network for Scene Parsing of Aerial Images

Although progress has been made in multisource data scene parsing of natural scene images, extracting complex backgrounds from aerial images of various types and presenting the image at different scales remain challenging. Various factors in high-resolution aerial images (HRAIs), such as imaging blur, background clutter, object shadow, and high resolution, substantially reduce the integrity and accuracy of object segmentation. By applying multisource data fusion, as in scene parsing of natural scene images, we can solve the aforementioned problems through the integration of auxiliary data into HRAIs. To this end, we propose a multiscale cross-layer interactive and similarity refinement network (MISNet) for scene parsing of HRAIs. First, in a feature fusion optimization module, we extract, filter, and optimize multisource features and further guide and optimize the features using a feature guidance module. Second, a multiscale context aggregation module increases the receptive field, captures semantic information, and extracts rich multiscale background features. Third, a dense decoding module fuses the global guidance information and high-level fused features. We also propose a joint learning method based on feature similarity and a joint learning module to obtain deep multilevel information, enhance feature generation, and fuse multiscale and global features to enhance network representation for accurate scene parsing of HRAIs. Comprehensive experiments on two benchmark HRAIs datasets indicate that our proposed MISNet is qualitatively and quantitatively superior to similar state-of-the-art models.

Deep convolutional neural networks (DCNNs) have proven to be effective in many computer vision tasks, such as detection, segmentation, and classification [6], [7], achieving state-of-the-art (SOTA) performance. A DCNN automatically extracts hierarchical feature maps of various objects in an image. It can extract details in shallow layers and complex semantic cues in deep layers of the network. However, existing scene parsing methods for HRAIs often identify only a few categories and process single-source data, thereby limiting their applicability.
In addition to single-source data, scene parsing has benefited from auxiliary aerial image data, such as digital surface models (DSMs) [8] and synthetic aperture radar images [9]. The introduction of multisource data can effectively improve the robustness of the segmentation method [10]. As different forms of spectral data, these types of auxiliary aerial image data capture specific attributes of the same geospatial object and provide different insights for the overall learning of semantic objects [11], [12]. Therefore, complementary information and auxiliary aerial image data in a red-green-blue (RGB) representation can be applied to optimize the performance of scene parsing [9], [13]. We focus on the scene parsing of infrared-red-green (IRRG) images and DSMs, which have been extensively studied using DCNNs [9], [14], [15], [16].
Although progress has been made in recent years, various problems related to scene parsing and auxiliary aerial image data persist. For instance, the high data diversity leads to low interclass variance and high intraclass heterogeneity. Hence, confusion between trees and low vegetation as well as misidentification of human-made objects in urban areas can occur [17]. Moreover, existing methods, particularly those based on deep learning, suffer from two major problems: 1) insufficient spatial information for inference and 2) lack of contextual information. These problems result in poor segmentation around object boundaries and in other difficult areas such as shadowed regions [18].
Over the past few years, numerous studies on DCNNs have been conducted to improve the results of scene parsing in HRAIs. These images typically display complex scenes and cover large areas, posing challenges for scene parsing [10]. Similarly, very-high-resolution (VHR) images, multisource data images, and point clouds increase the complexity of scene parsing. For instance, building roofs can appear to be complex and diverse in urban areas captured in a VHR image. This is a typical issue, in which similar roofs have different spectra, and their imaging is affected by occlusions and shadows. Consequently, accurate labeling remains challenging in the segmentation of VHR aerial images [19].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For accurate segmentation, we introduce DSMs as auxiliary data, whose elevation information allows us to reduce the segmentation problems caused by high objects. Accordingly, we introduce a multiscale cross-layer interactive and similarity refinement network (MISNet) consisting of four modules: 1) feature fusion optimization module (FFOM), 2) multiscale context aggregation module (MCAM), 3) dense decoding module (DDM), and 4) joint learning module (JLM). Considering the limitations of single-source data, we use the FFOM to extract and filter complementary information from multisource fused features. To handle changes in object scales and complex scenes in HRAIs, the novel MCAM captures multiscale background features and guides decoding processing. In the decoder network, the DDM fuses global information and cross-source high-level features layer by layer to filter redundant cues. Finally, the JLM calculates the cosine similarity of the auxiliary output and low-level fused features to refine segmentation weights and improve low-level information.
Our primary contributions can be listed and summarized as follows.
1) We propose the MISNet to learn interactions and continuity between image data and selectively merge complementary information extracted from the IRRG and normalized DSMs (nDSMs) of multispectral images to improve scene parsing.
2) The MISNet adopts an encoder-decoder architecture and contains the FFOM, MCAM, DDM, and JLM. The FFOM relates to cross-source hierarchies and cross-hierarchical continuity. The MCAM fully uses dilated convolutions to guide global context information. The DDM uses dense connections to fuse multiscale features and global guidance information. The JLM refines the decoding features and adds auxiliary supervision for optimization.
3) Comprehensive experimental evaluations show that the MISNet achieves higher scene parsing performance than 10 SOTA methods on the Potsdam and Vaihingen benchmark datasets.

A. Single-Source Scene Parsing
Long et al. [19] presented a novel framework for the scene parsing task. Ronneberger et al. [20] used skip connections to combine shallow features with deeper features and reused low-level features to recover more data. Badrinarayanan et al. [21] introduced an encoder-decoder architecture and applied unpooling with the recorded pooling indices to recover distinct data, such as edges and complex shapes. Chen et al. [22] presented atrous spatial pyramid pooling, in which parallel dilated convolutions with different rates extract multiscale cues. Xu et al. [23] proposed a context extraction architecture based on a high-resolution module to solve the class scale imbalance and uncertain boundary information problems. Mou et al. [24] introduced two simple and effective architecture units, a spatial correlation block and a channel correlation block, to learn and analyze the global correlation between any two spatial positions or feature maps and then generate the relation-augmented feature representation. Aerial images contain a considerable amount of detailed information about ground objects, which results in large intraclass and small interclass variances and makes the images difficult to recognize. Therefore, attention mechanisms were introduced to overcome the locality limitation of convolution. Li et al. [25] proposed an ABCNet based on a bilateral framework, using a context path to capture global contextual cues and a spatial path to retain spatial details, and designed new modules to integrate and enhance features. Zhao et al. [26] presented a scene parsing architecture based on end-to-end attention, in which a pyramid attention pool module introduced the attention principle into the multiscale block for feature refinement. Based on the characteristics of aerial images, Zhao et al. [27] presented a model based on the regional self-attention mechanism, which can mine the relationships between pixels in the surrounding area. This attention module can effectively decrease the noise in the feature maps and the interference of redundant features.

B. RGB-D Scene Parsing
Unlike single-source scene parsing, RGB-D scene parsing incorporates depth features into the RGB features to improve scene parsing accuracy.
In a previous study on RGB-D scene parsing, Couprie et al. [28] first used depth cues in a feature learning method to label the whole scene, laying the foundation for the field of RGB-D indoor scene parsing. Gupta et al. [29] encoded the depth image into three channels, namely, horizontal disparity, height above ground, and angle with gravity, for DCNN feature learning, and found that this encoding yielded better feature learning than the raw depth image.
In existing research, the application of depth images in RGB image scene parsing is relatively mature, but many aspects can still be improved. Lin et al. [30] presented a multibranch DCNN, which segmented available depth images into feature layers with a common resolution, enriched context information with feature cascade, and improved scene parsing performance. Jiang et al. [31] presented the residual encoder and decoder architecture (RedNet) for the indoor scene analysis task. The effective combination of the long skip connection between the decoder and encoder and the short skip connection in the residual unit enables RedNet to achieve efficient performance. Hu et al. [32] proposed the attention complementary network (ACNet), which selectively collects features from the RGB and depth modalities to extract weighted features. Chen et al. [33] presented a separation and aggregation gate (SA-Gate) operation to calibrate RGB features and multistage depth information extraction and aggregated the two alternately. Zhou et al. [34] presented a three-branch self-attention architecture (TSNet), which obtained RGB and depth inputs from the two backbone networks. Seichter et al. [35] presented the efficient scene analysis network (ESANet), which has high robustness and achieves fast inference. Qian et al. [36] presented a gated residual block to effectively fuse RGB and depth signals and achieved excellent segmentation performance through complementary features calculated by the gate mechanism and specific features aggregated by residual blocks. Zhou et al. [37] designed a coattention fusion module that fused RGB and depth information along the channel and spatial dimensions. Zhou et al. [38] presented a collaboration of low-level and high-level cues optimized by depth enhancement and progressive guided fusion networks for indoor scene analysis.

C. Multisource Scene Parsing of HRAIs
Unlike RGB-D scene parsing, multisource scene parsing of HRAIs requires further development for fusing multisource features. However, only a few studies are available on this topic.
Considering the automatic extraction of representative features and dense image classification, Bittner et al. [39] presented a fully convolutional network (FCN), which includes three parallel modules combined at a late stage. The FCN takes three inputs, namely, a panchromatic image, an RGB image, and an nDSM, to combine height and spectral information from different image data. In addition, full-resolution binary building masks are automatically generated, helping propagate fine details from lower to higher layers to obtain accurate building profiles. Peng et al. [40] proposed a dual-branch DP-DCN based on dense connections and FCNs to automatically obtain fine-grained scene parsing maps. The network prevents both gradient explosion as the network deepens and overfitting with scarce labeled aerial image data. It handles the differences between aerial and natural images efficiently. Zheng et al. [41] proposed a gather-to-guide network to improve the fusion of RGB and auxiliary aerial image data. The key part of this model is a gather-to-guide block, including a guider and a gatherer. The gatherer captures complementary cues from the RGB and auxiliary data and generates cross-source descriptors. The guider uses guide weights extracted from the descriptor to calibrate the RGB features by reducing redundancy and noise while retaining feature data. However, during the final feature map fusion, the semantic prediction is obtained directly by simple addition without accounting for the differences between the two kinds of image data, thereby introducing excessive noise and degrading segmentation. In contrast, the proposed MISNet uses feature extraction and optimized filtering to ensure proper multisource interactions and cross-layer continuity; it establishes dependencies between global information and cross-source features for refining multisource complementary features.

A. Overview
The encoder-decoder framework of MISNet is shown in Fig. 1. We use an IRRG image and the corresponding nDSM image as inputs and Res2Net-50 instead of the conventional ResNet as the backbone. In Res2Net, the hierarchical residual connection in a single residual block enables a change in the receptive field at a finer granularity to capture both details and global characteristics [42]. We remove the pooling operation and all the fully connected layers of the original Res2Net, thereby increasing the network performance. We use Res2Net pretrained on the ImageNet dataset [43].
We extract five multilevel features, $R_i$ and $D_i$ ($i = 1, 2, 3, 4, 5$), in source-specific encoders for the IRRG and nDSM images, respectively. The input resolution of the source-specific encoder is $W \times H$. Thus, $H/4 \times W/4$ is the resolution for the first and second layers, and $H/2^m \times W/2^m$ is the resolution for layer $m > 2$. In addition, the number of channels of the features in the $i$th layer is given by $C_i$ ($i = 1, 2, 3, 4, 5$), with $C = [64, 256, 512, 1024, 2048]$.
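As a quick check of the resolution rule above, the following minimal sketch lists the feature shapes of one source-specific encoder for a given input size. The function name `encoder_feature_shapes` is hypothetical and introduced only for illustration; the channel list and the stride rule are taken from the text.

```python
# Channel widths per encoder layer, as listed in the text.
CHANNELS = [64, 256, 512, 1024, 2048]


def encoder_feature_shapes(height, width):
    """Return (channels, H_i, W_i) for layers 1..5 of one source-specific encoder.

    Rule from the text: layers 1 and 2 are at 1/4 resolution,
    and layer m > 2 is at 1/2**m resolution.
    """
    shapes = []
    for layer, channels in enumerate(CHANNELS, start=1):
        stride = 4 if layer <= 2 else 2 ** layer
        shapes.append((channels, height // stride, width // stride))
    return shapes


# Example with the 256 x 256 input size used in the experiments.
for layer, shape in enumerate(encoder_feature_shapes(256, 256), start=1):
    print(f"layer {layer}: C x H x W = {shape}")
```

For a 256 × 256 input, the deepest fused feature therefore has a spatial size of only 8 × 8, which is one reason the decoder needs global guidance and multiscale context.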
For the encoder, we use the FFOM to fuse the two kinds of image data (i.e., IRRG and nDSM images) features of each layer. Then, the MCAM extracts multiscale context features. For the decoder, the MCAM extracts global information for guidance, and the DDM gradually integrates high-level features added to the global cues. Moreover, we add the output of the last DDM as an auxiliary output, and the JLM further refines the initial fused feature to obtain the final scene parsing map. For prediction, we discard the first-layer fused features because they are noisy and may undermine segmentation.

B. Feature Fusion Optimization Module
There are two main problems related to the fusion of the IRRG spectral and nDSM elevation features. One is the inherent morphological differences that cause feature incompatibility, and the other is the noise and redundancy in low-quality elevation data. Inspired by the method in [44], we introduce the FFOM to optimize the compatibility of multisource features and mine spatial information from elevation features. As shown in Fig. 1, the FFOM includes two parts that are detailed in the following.
Feature Extraction and Optimized Filtering: In MISNet, we apply feature extraction and optimized filtering to the multisource data at each layer, as shown in Fig. 1. Let features $R_i$ and $D_i$ represent the feature maps of the $i$th layer ($i = 1, 2, 3, 4, 5$) from the IRRG and nDSM branches, respectively. First, $R_i$ and $D_i$ are fused by elementwise addition. Then, we extract shared information using elementwise multiplication to highlight the similarity between the two kinds of image data, focusing on the common elements between each original feature and the fused feature [45], and thus obtain $R_i^D$ and $D_i^R$ ($i = 1, 2, 3, 4, 5$). The corresponding formulation is as follows:
$$R_i^D = R_i \times (R_i + D_i), \qquad D_i^R = D_i \times (R_i + D_i)$$
where $\times$ and $+$ represent elementwise multiplication and addition, respectively.
The optimized features $R_i^D$ and $D_i^R$ are concatenated along the channel dimension, but simple concatenation causes redundancy. Thus, we use two gating units for feature filtering and purification. The gate unit is depicted in Fig. 1(a) and includes one 3 × 3 convolution and one 1 × 1 convolution connected in series with batch normalization (BN) and a rectified linear unit (ReLU) operation, and the result is activated by a sigmoid function. $G_i^{RD}$ ($i = 1, 2, 3, 4, 5$) denotes the output of the gate unit, which is multiplied by the original features $R_i$ and $D_i$. In the formulation of the gate unit, $\mathrm{Conv}_{n \times n}$ represents a convolution operation with an $n \times n$ kernel, RELU denotes ReLU activation, BN represents batch normalization, and $\sigma$ represents the sigmoid function. Hence, $G_i^{RD}$ can be obtained as follows:
$$G_i^{RD} = \mathrm{Gate}(\mathrm{Cat}(R_i^D, D_i^R))$$
where Cat represents the channelwise concatenation and Gate represents the gate unit.
Finally, we purify and filter the original information through pixel-level multiplication of $R_i$ and $D_i$ by $G_i^{RD}$ and concatenate the two purified features to obtain $a_i$ ($i = 1, 2, 3, 4, 5$).
Feature Guidance Module (FGM): Inspired by the methods in [46] and [47], we introduce the FGM, as shown in Fig. 2, to optimize the compatibility of multisource features. The FGM contains channel and spatial attention mechanisms. Channel attention exploits the relations between feature channels, whereas spatial attention aims to find locations with informative cues.
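The data flow of feature extraction and optimized filtering can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: feature maps are flattened to plain lists, the shared-information step follows our reading of the description (each source multiplied by the elementwise sum), and `toy_gate` is a hypothetical, parameter-free stand-in for the learned gate unit, whose convolution weights cannot be reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_information(r, d):
    """Fuse by elementwise addition, then multiply each source by the fused feature."""
    fused = [ri + di for ri, di in zip(r, d)]
    r_d = [ri * fi for ri, fi in zip(r, fused)]  # plays the role of R_i^D
    d_r = [di * fi for di, fi in zip(d, fused)]  # plays the role of D_i^R
    return r_d, d_r


def toy_gate(r_d, d_r):
    """Hypothetical stand-in for the learned gate unit (assumption, not from the paper).

    It only mimics the interface: two optimized features in, weights in (0, 1) out.
    """
    return [sigmoid(0.5 * (a + b)) for a, b in zip(r_d, d_r)]


def ffom_filter(r, d):
    """Gate the original features and concatenate the two purified results."""
    r_d, d_r = shared_information(r, d)
    g = toy_gate(r_d, d_r)
    purified_r = [ri * gi for ri, gi in zip(r, g)]
    purified_d = [di * gi for di, gi in zip(d, g)]
    # List concatenation stands in for channelwise concatenation.
    return purified_r + purified_d


a = ffom_filter([1.0, 2.0], [3.0, 0.0])
print(a)
```

Because the gate weights lie strictly between 0 and 1, the purified features are attenuated copies of the originals; positions where the two sources agree strongly are attenuated least.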
We apply the FGM to the fused feature $a_i$ produced by feature extraction and optimized filtering, refining it along the channel and spatial dimensions. The FGM is primarily divided into two parts. The first part uses channel attention to perform weighted optimization along the channel dimension of $a_i$. The second part uses spatial attention to perform weighting along the spatial dimension of the features after channel optimization. Let CA and SA denote the channel and spatial attention mechanisms, respectively. In addition, we define the convolution block $\mathrm{CBR}_{n \times n}$, which includes a convolutional layer with an $n \times n$ kernel, BN, and ReLU activation, as follows: $\mathrm{CBR}_{n \times n}(x) = \mathrm{RELU}(\mathrm{BN}(\mathrm{Conv}_{n \times n}(x)))$.
Hence, the feature $b_i$ ($i = 1, 2, 3, 4, 5$) is obtained from $a_i$ through channel attention adjustment, and the fused feature output, denoted as $f_i$, is obtained from $b_i$ through spatial attention. In this formulation, Avg represents adaptive average pooling.
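The ordering of the two attention steps can be illustrated with a small sketch. It assumes parameter-free attention weights (a sigmoid over pooled averages) in place of the learned layers, which are not specified in this text, so `channel_attention` and `spatial_attention` are hypothetical stand-ins that show only the data flow from $a_i$ to $b_i$ to $f_i$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature):
    """One weight per channel from global average pooling (parameter-free stand-in)."""
    return [sigmoid(sum(ch) / len(ch)) for ch in feature]


def spatial_attention(feature):
    """One weight per pixel from the mean over channels (parameter-free stand-in)."""
    n_channels = len(feature)
    n_pixels = len(feature[0])
    return [
        sigmoid(sum(ch[p] for ch in feature) / n_channels)
        for p in range(n_pixels)
    ]


def fgm_sketch(a):
    """a: list of channels, each a flat list of pixel values. Returns (b, f)."""
    ca = channel_attention(a)
    b = [[w * v for v in ch] for w, ch in zip(ca, a)]          # channel-weighted
    sa = spatial_attention(b)
    f = [[w * v for w, v in zip(sa, ch)] for ch in b]          # then pixel-weighted
    return b, f


b, f = fgm_sketch([[0.0, 0.0], [2.0, 2.0]])
print(b)
print(f)
```

Applying channel weighting first means the spatial weights are computed on features whose uninformative channels have already been attenuated.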

C. Multiscale Context Aggregation Module
Considering cross-source fused features, we capture the cross-source complementary information by selectively fusing single-source features using the MCAM (see Fig. 3). Inspired by the method in [48], we obtain multiscale single-source features by adding the fused features of the fifth layer and the two kinds of single-source features. Then, multiscale single-source features with different scales are fused to extract multiscale cross-source features. First, we add the single-source features and fused features to obtain multiscale single-source features $M_{fr}$ and $M_{fd}$ as follows:
$$M_{fr} = f_5 + R_5, \qquad M_{fd} = f_5 + D_5.$$
Then, features $M_{fr}$ and $M_{fd}$ are processed with dilated convolutions at different scales and added together. The multiscale fusion results are concatenated to form the MCAM output. In this formulation, $M_i$ ($i = 1, 2, 3, 4, 5$) represents the summation at scale $i$, ACU represents adaptive average pooling followed by a 1 × 1 convolution and upsampling, $\mathrm{Dconv}_i$ represents a dilated convolution with dilation rate $i$, and $M$ represents the MCAM output.
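To make the effect of the dilation rate concrete, the sketch below implements a one-dimensional dilated convolution and the span covered by a dilated kernel along one axis. A kernel size of 3 is assumed here for illustration; the text specifies only the dilation rates.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(
            sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        )
    return out


# Spans of a size-3 kernel for dilation rates 1..5.
print([effective_kernel_size(3, rate) for rate in range(1, 6)])
# The same three weights sample positions two steps apart when dilation = 2.
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))
```

With the same number of weights, a larger dilation rate covers a wider span, which is how parallel branches with different rates see context at different scales.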

D. Dense Decoding Module
Recent scene parsing methods use an encoder-decoder architecture to generate pixel-level predictions. For instance, in the decoder, DeconvNet [49] uses stacked deconvolution operations to gradually recover a full-resolution prediction. SegNet [21] uses indices from the encoder pooling block to guide the recovery of image resolution in the decoder, and DeepLabV3+ [22] implements a cascaded decoder. Although decoder improvement is being pursued, the limitations on the resolution of fused features and the difficulties of feature aggregation remain challenging to address. In addition, fusing low- and high-level features and resolving the resolution differences between them are difficult problems. High-level features containing rich semantic cues allow object location and background noise elimination, but they lack details such as object contour and texture. Conversely, low-level features can capture spatial details, but are noisy and unsuitable for accurate segmentation.
We propose the DDM with the architecture shown in Fig. 4. It aims to provide a multiscale receptive field with cascaded dilated convolution to handle changes in the object scale during decoding. The DDM only fuses the features of adjacent layers, avoiding interference caused by large resolution differences. Furthermore, it extracts complementary information to enhance multiscale multilevel features.
The output of the MCAM is the input to the first DDM, which proceeds as follows. First, as concatenation causes redundancy in the feature space, we use the spatial attention module to optimize the hierarchical features along the spatial dimension. Then, we introduce three cascaded dilated convolutions to expand the receptive field for $K_1$, $K_2$, and $K_3$, and multiply the results layer by layer to extract common elements, obtaining $e$. Subsequently, a residual connection and upsampling provide the DDM output, $ddm_i$ ($i = 1, 2, 3$). The input of the first DDM is $M$ (i.e., the MCAM output), and the input of the remaining DDMs is the output of the previous decoder block, $Y_{i-1}$ ($i = 2, 3$). For convenience, let $x$ represent the DDM input. In the formulation of the DDM, $\theta$ represents the softmax function and $\mathrm{Up}_n$ represents $n$-times upsampling. Output $Y_1$ of the first decoder block is computed from $ddm_1$, the first DDM output. For the remaining decoder blocks, $Y_i$ represents the output of the $i$th decoder block and $ddm_i$ represents the $i$th DDM output. Moreover, the auxiliary output $Out_{aux}$ is obtained from the output of the last DDM.

E. Joint Learning Module
To suppress noise in low-level detailed features, we present the JLM based on cosine similarity, as shown in Fig. 5. First, we calculate the cosine similarity between the auxiliary output $Out_{aux}$ and the low-level fused feature $f_2$ to identify redundant detail information, yielding the similarity weight $S_{DF}$:
$$S_{DF} = \mathrm{Cos}(Out_{aux}, f_2)$$
where Cos denotes the cosine similarity function. We then use a pair of reverse similarity weights to denoise and enhance the cross-source feature maps of the lower levels. In the JLM, $S_{DF}$ measures the similarity between $Out_{aux}$ and $f_2$, whereas the reverse weight $(1 - S_{DF})$ measures the difference between them. We found that the low-level fused feature contains redundant information or interference noise in the mixed details. Hence, the reverse weight can be multiplied by the low-level cross-source feature to suppress noise in the details. The features refined by the JLM are denoted as $JL$. $Out_{aux}$ and $JL$ are combined to obtain the final full-resolution prediction map $Out_{prim}$ after four-times upsampling followed by a convolution with a 3 × 3 kernel:
$$Out_{prim} = \mathrm{Conv}_{3 \times 3}(\mathrm{Up}_4(JL \times Out_{aux} + Out_{aux})). \qquad (27)$$
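The similarity refinement can be sketched per pixel as follows. Two points are assumptions for illustration only: the refined feature is taken to be JL = (1 − S_DF) × f_2, which is one reading of the description, and Out_aux and f_2 are assumed to have the same number of channels. The final upsampling and 3 × 3 convolution are omitted, so the function returns only the term inside Up_4 in (27).

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two channel vectors; eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def jlm_sketch(out_aux, f2):
    """out_aux, f2: lists of per-pixel channel vectors with equal shapes.

    Returns JL x Out_aux + Out_aux per pixel (before upsampling and convolution).
    """
    combined = []
    for aux_px, low_px in zip(out_aux, f2):
        s_df = cosine_similarity(aux_px, low_px)
        reverse = 1.0 - s_df
        jl_px = [reverse * v for v in low_px]  # assumed form of JL
        combined.append([j * a + a for j, a in zip(jl_px, aux_px)])
    return combined


print(jlm_sketch([[1.0, 1.0]], [[0.0, 2.0]]))
```

Under this reading, pixels where the low-level feature already agrees with the auxiliary output (similarity near 1) contribute almost nothing beyond Out_aux, while dissimilar pixels add low-level detail.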

F. Loss Function
The loss function used by MISNet is the widely applied cross-entropy loss $L$, which supervises both the auxiliary and final outputs:
$$L = -\sum_{i=1}^{Q} V_i \log(P_i)$$
where $Q$ denotes the number of classes, $V_i$ indicates whether class $i$ is consistent with the sample class (1 if so and 0 otherwise), and $P_i$ denotes the predicted probability that the sample belongs to class $i$. The total loss comprises the primary loss $L_{prim}$ and the auxiliary loss $L_{aux}$, which jointly supervise the model output.
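A minimal sketch of the joint supervision for a single pixel is given below. The weighting between the two terms is not stated in this text, so the equal weighting (`aux_weight=1.0`) is an assumption, and the function names are hypothetical.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy for one sample: -sum_i V_i * log(P_i)."""
    return -sum(v * math.log(p + eps) for v, p in zip(one_hot, probs))


def total_loss(one_hot, probs_prim, probs_aux, aux_weight=1.0):
    """Primary plus auxiliary loss; aux_weight = 1.0 is an assumption."""
    return cross_entropy(one_hot, probs_prim) + aux_weight * cross_entropy(
        one_hot, probs_aux
    )


label = [0, 1, 0]  # the sample belongs to the second class
print(cross_entropy(label, [0.2, 0.5, 0.3]))
print(total_loss(label, [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]))
```

Only the probability assigned to the true class enters the sum, so the loss decreases as either output becomes more confident in the correct class.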

IV. EXPERIMENTAL RESULTS AND ANALYSIS
This section describes the experimental setup and results of quantitative and qualitative evaluations and an ablation study.

A. Datasets
The Vaihingen and Potsdam benchmark datasets from the ISPRS 2-D Semantic Labeling Contest [40] were used to validate the MISNet. These benchmark datasets contain six scene parsing classes: buildings (blue), impervious surfaces (white), clutter/background (red), low vegetation (cyan), trees (green), and cars (yellow). The Vaihingen benchmark dataset contains 33 VHR images with a ground sampling distance of 9 cm/pixel and a resolution of 2500 × 2000. Each HRAI provides the IRRG bands and the corresponding nDSM [18]. As the dataset is publicly available, we considered the settings used in the contest. Specifically, 17 ground-truth images were used as the test set, five HRAIs (ID 11, 15, 28, 30, and 34) were used as the validation set, and the remaining 11 HRAIs were used as the training set.
The Potsdam benchmark dataset contains 38 HRAIs with a ground sampling distance of 5 cm/pixel and a resolution of 6000 × 6000. Each HRAI provides the IRRG bands and the corresponding nDSM. The ground-truth images of the Potsdam dataset were used as the test set, and the remaining seven HRAIs (ID 2_11, 2_12, 4_10, 5_11, 6_7, 7_8, and 7_10) were used as the validation set.

B. Performance Measures
To quantitatively verify the effectiveness and robustness of the proposed MISNet, we selected intersection over the union (IoU), mean intersection over the union (mIoU), class accuracy (Acc), and mean class accuracy (mAcc) as the performance measures [50], [51], [52].
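These measures can all be derived from a confusion matrix. The sketch below assumes the usual definitions, with class accuracy taken as the fraction of a class's ground-truth pixels that are predicted correctly; the text does not spell out the formulas, so this is an assumption.

```python
def per_class_metrics(conf):
    """conf[t][p] counts pixels of true class t predicted as class p.

    Returns (ious, accs), one value per class.
    """
    n = len(conf)
    ious, accs = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true class c, predicted otherwise
        fp = sum(conf[t][c] for t in range(n)) - tp  # predicted c, true class differs
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        accs.append(tp / (tp + fn) if tp + fn else 0.0)
    return ious, accs


def mean(values):
    return sum(values) / len(values)


confusion = [[8, 2],
             [1, 9]]
ious, accs = per_class_metrics(confusion)
print("IoU:", ious, "mIoU:", mean(ious))
print("Acc:", accs, "mAcc:", mean(accs))
```

IoU penalizes both missed pixels and false alarms for a class, whereas class accuracy ignores false alarms, so the two measures can rank methods differently on rare classes such as cars.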

C. Implementation Details
All experiments were conducted using the PyTorch framework on a 12 GB NVIDIA TITAN Xp GPU. In the encoder, the pretrained Res2Net-50 [42] is used for initialization. The input size of the proposed method is 256 × 256. The input images and ground truth are augmented by cropping, random scaling, counterclockwise rotation by 90°, 180°, or 270°, and vertical and horizontal flipping [53], [54], [55].
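The rotation and flipping steps must be applied identically to an image and its ground truth so that the labels stay aligned. The following minimal sketch shows this on small 2-D lists; cropping and random scaling are omitted, and the function names and the 0.5 flip probability are hypothetical choices for illustration.

```python
import random


def rotate90_ccw(grid):
    """Rotate a 2-D list counterclockwise by 90 degrees."""
    return [list(row) for row in zip(*grid)][::-1]


def flip_horizontal(grid):
    """Mirror left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror top to bottom."""
    return [list(row) for row in grid[::-1]]


def augment_pair(image, label, rng):
    """Apply one random rotation and random flips identically to image and label."""
    turns = rng.choice([0, 1, 2, 3])  # 0, 90, 180, or 270 degrees
    for _ in range(turns):
        image, label = rotate90_ccw(image), rotate90_ccw(label)
    if rng.random() < 0.5:
        image, label = flip_horizontal(image), flip_horizontal(label)
    if rng.random() < 0.5:
        image, label = flip_vertical(image), flip_vertical(label)
    return image, label


img = [[1, 2], [3, 4]]
print(rotate90_ccw(img))
print(augment_pair(img, img, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when debugging alignment between images and labels.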
During the training stage, we apply the Adam method for optimization. The initial learning rate is 0.0001 and is reduced to 0.1 times its value every 20 epochs; the weight decay is 0.0005 and the momentum parameter is 0.9. It took approximately 100 epochs for the proposed MISNet to converge.
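Assuming that the factor of 0.1 every 20 epochs refers to the learning rate, the schedule is a plain step decay, sketched below. The function name and default arguments are hypothetical; the numerical values are those reported in the text.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.1, step=20):
    """Step decay: multiply the learning rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


for epoch in (0, 19, 20, 40, 60, 80, 99):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.1e}")
```

Over roughly 100 epochs, this schedule lowers the learning rate four times, so the final epochs run at a rate about four orders of magnitude below the initial value.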
Quantitative Experiments: Tables I and II list the obtained Acc, mAcc, IoU, and mIoU for the evaluated methods. The measurements for the proposed MISNet in the multiclass labels indicate excellent performance, demonstrating the effectiveness of our proposal. In general, single-source scene parsing is inferior to the multisource approach. Therefore, auxiliary data nDSM are introduced to capture 3-D spatial information for scene parsing of HRAIs, and the scene parsing performance is significantly improved.
Qualitative Experiments: To further evaluate the performance of our proposed MISNet, we show the results of scene parsing in Fig. 6. Owing to space limitations, we chose HRCNet [23] as the representative single-source approach and selected all multisource approaches for comparison. The results in rows 4 and 6 in Fig. 6 show that the multisource approaches using nDSM data can easily extract staggered buildings, whereas the single-source approach identifies large impervious surface areas as buildings. The results in rows 1 and 7 show that the proposed MISNet is superior to the other approaches, which tend to confuse similar areas, such as low vegetation and trees; moreover, MISNet can easily extract small objects, such as cars hidden under the shadows of buildings. Furthermore, the segmentation produced by our method has sharper contours, as shown in rows 2 and 3.

E. Ablation Studies
We performed ablation studies on the Potsdam and Vaihingen benchmark datasets to investigate the contribution of the different modules to this approach, as listed in Table III.
1) Effectiveness of FFOM: Based on the backbone network, the FFOM is added to the encoding part. The DDM and JLM are replaced by a simple convolution and upsampling operation in the decoding part, where the convolution kernel is 1 × 1. The convolution adjusts the number of channels and discards the interactivity between cross-layer features, so that decoding blocks and fused features are concatenated in series layer by layer to obtain the final prediction map. The cyan low vegetation and green trees in the scene parsing prediction maps shown in Fig. 7 are similar in color. Without the FFOM to fuse IRRG and nDSM, trees are easily misclassified as low vegetation. In addition, in the hyperspectral image, the top area of the building resembles the shaded area. The addition of the FFOM makes the contour of the building flat and smooth.
2) Effectiveness of MCAM: Based on the backbone network, we use simple addition instead of the FFOM in the encoder to merge the two kinds of image features and retain the MCAM. In the decoder, we use convolution and upsampling instead of the DDM and JLM to obtain the final output. Fig. 8 shows that if the global multiscale context information processing of the MCAM is omitted, objects of different scales cannot be completely segmented; for example, details such as the zigzag texture of a building cannot be accurately segmented. Further, with the addition of the MCAM, some areas of the impervious surface can be segmented. As shown in Table III, the MCAM facilitates accurate scene parsing.
3) Effectiveness of DDM: Based on the backbone network, we replace the FFOM with simple elementwise addition of the corresponding levels of the nDSM and IRRG streams, add the DDM in the decoding part, retain the decoding operation in the original model, and remove the JLM. Fig. 9 shows that the segmentation becomes significantly worse after the DDM is removed. We use the DDM to integrate and refine multiscale features from top to bottom, resolving segmentation errors and incomplete segmentation, such as confusion of low vegetation and trees and unclear segmentation of jagged textures of buildings.

4) Effectiveness of JLM: Based on the backbone network, the FFOM is replaced by simple elementwise addition in the encoding part; 1 × 1 convolution and upsampling are used in the decoding part to replace the original DDM, and the JLM and its related operations are retained. Fig. 10 shows that without the JLM, the areas of the trees and impervious surfaces could not be completely and correctly segmented, and segmentation errors occurred, such as confusion between impervious surfaces and buildings. The JLM thus improves the integrity and correctness of segmentation.

V. CONCLUSION
We propose the MISNet, a cross-source interaction model that exploits the dependence between two kinds of aerial image data in different convolution layers. The FFOM performs cross-source fusion and guides interactive features layer by layer. The MCAM establishes a transition between the encoder and decoder, and the DDM refines and denoises details in low-level contextual features based on high-level semantics. To further improve the MISNet, the JLM based on similarity learning is introduced, and auxiliary supervision is added to optimize the scene parsing performance. Experimental results indicate that the proposed MISNet is superior to 10 SOTA approaches in terms of various evaluation measures on two benchmark datasets.