MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation

Methods based on convolutional neural networks have improved the performance of biomedical image segmentation. However, most of these methods cannot efficiently segment objects of variable sizes or train on the small, biased datasets that are common in biomedical use cases. While methods that incorporate multi-scale fusion exist to address the challenges arising from variable object sizes, they usually use complex models better suited to general semantic segmentation problems. In this paper, we propose a novel architecture called Multi-Scale Residual Fusion Network (MSRF-Net), which is specially designed for medical image segmentation. The proposed MSRF-Net is able to exchange multi-scale features of varying receptive fields using a Dual-Scale Dense Fusion (DSDF) block. Our DSDF block can exchange information rigorously across two different resolution scales, and our MSRF sub-network uses multiple DSDF blocks in sequence to perform multi-scale fusion. This allows the preservation of resolution, improved information flow, and propagation of both high- and low-level features to obtain accurate segmentation maps. The proposed MSRF-Net captures object variability and provides improved results on different biomedical datasets. Extensive experiments demonstrate that the proposed method outperforms cutting-edge medical image segmentation methods on four publicly available datasets. We achieve dice coefficients of 0.9217, 0.9420, 0.9224, and 0.8824 on the Kvasir-SEG, CVC-ClinicDB, 2018 Data Science Bowl, and ISIC-2018 skin lesion segmentation challenge datasets, respectively. We further conducted generalizability tests and achieved dice coefficients of 0.7921 and 0.7575 on CVC-ClinicDB and Kvasir-SEG, respectively.

lesion assessment, such as polyps in the colon, to inspect whether they are cancerous and remove them if necessary. Thus, the segmentation results can help to detect missed lesions, prevent diseases, and improve therapy planning and treatment. A significant challenge in medical imaging is the requirement for large, high-quality labeled and annotated datasets, which are a key factor in the development of robust algorithms for automated medical image segmentation tasks.
The manual pixel-wise annotation of medical image data is very time-consuming, requires collaboration with experienced medical experts, and is costly. During the annotation of regions in medical images (for example, polyps in still frames), guidelines and protocols are set, based on which the experts perform the annotations. However, discrepancies may exist among experts, e.g., when deciding whether a particular area of a lesion is cancerous or non-cancerous. Additionally, the lack of standard annotation protocols for various imaging modalities and low image quality can influence annotation quality. Other factors such as the annotator's attentiveness, the type of display device, the image-annotation software, and data misinterpretation due to lighting conditions can also affect the quality of annotations [4]. An alternative to manual image segmentation is an automated, computer-aided, segmentation-based diagnosis-assisting system that can provide a faster, more accurate, and more reliable solution to transform clinical procedures and improve patient care. Computer-aided diagnosis will reduce the experts' burden and also reduce the overall treatment cost. Due to the diverse nature of medical imaging data, computer-aided diagnosis based segmentation models must be robust to variations in imaging modalities [5].
In the past years, Convolutional Neural Network (CNN) based approaches have overcome the limitations of traditional segmentation methods [6] in various medical imaging modalities such as X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), endoscopy, wireless capsule endoscopy, dermatoscopy, and in high-throughput imaging like histopathology and electron microscopy. Modern semantic and instance segmentation architectures are usually encoder-decoder based networks [7], [8]. The success of deep encoder-decoder based CNNs is largely due to their skip connections, which allow the propagation of deep, semantically meaningful, and dense feature maps from the encoder network to the decoder sub-networks [9], [10]. However, encoder-decoder based image segmentation architectures have limitations in the optimal depth and design of the skip connections [11]. The optimal depth of the architecture can vary from one biomedical application to another. The number of samples in the training dataset also limits the complexity of the network. The design of skip connections is sometimes unnecessarily restrictive, demanding the fusion of same-scale encoder and decoder feature maps. Moreover, traditional CNN methods do not make use of hierarchical features.
In this paper, we propose a novel medical image segmentation architecture, called MSRF-Net, which aims to overcome the above-discussed limitations. Our proposed MSRF-Net maintains high-resolution representations throughout the process, which is conducive to achieving high spatial accuracy. The MSRF-Net utilizes a novel Dual-Scale Dense Fusion (DSDF) block that performs dual-scale feature exchange, and a sub-network that exchanges multi-scale features using the DSDF block. The DSDF block takes two different scale inputs and employs residual dense blocks that exchange information across the two scales after each convolutional layer of their corresponding dense blocks. The densely connected nature of the blocks allows relevant high- and low-level features to be preserved for the final segmentation map prediction. The multi-scale information exchange in our network preserves both high- and low-resolution feature representations, thereby producing finer, richer, and spatially accurate segmentation maps. The repeated multi-scale fusion helps enhance the high-resolution feature representations with the information propagated by the low-resolution representations. Further, the residual layers allow redundant DSDF blocks to die out, so that only the most relevant extracted features contribute to the predicted segmentation maps.
Additionally, we propose adding a complementary gated shape stream that can leverage the combination of high- and low-level features to compute shape boundaries accurately. We have evaluated the MSRF-Net segmentation model on four publicly available biomedical datasets. The results demonstrate that the proposed MSRF-Net outperforms State Of The Art (SOTA) segmentation methods on most standard computer vision evaluation metrics.
The main contributions of this work are as follows: 1) Our proposed MSRF-Net architecture is based on a DSDF block that comprises residual dense connections and exchanges information across multiple scales. This allows both high-resolution and low-resolution features to propagate, thereby extracting semantically meaningful features that improve segmentation performance on various biomedical datasets. 2) MSRF-Net computes multi-scale features and fuses them effectively using DSDF blocks. The residual nature of the DSDF block improves gradient flow, which improves training efficiency, i.e., it reduces the need for large training datasets.
3) The effectiveness of MSRF-Net is demonstrated on four public datasets: Kvasir-SEG [12], CVC-ClinicDB [13], 2018 Data Science Bowl (DSB) Challenge [2], and ISIC 2018 Challenge [14], [15]. We conduct a generalizability study of the proposed network for which we trained our model on Kvasir-SEG and tested on the CVC-ClinicDB and vice versa. The experimental results and their comparison with established computer vision methods confirmed that our approach is more generalizable.

A. Medical image segmentation
Long et al. [16] proposed a Fully Convolutional Network (FCN) that included only convolutional layers for semantic segmentation. Subsequently, Ronneberger et al. [17] modified the FCN into the encoder-decoder U-Net architecture for the segmentation of HeLa cells and neuronal structures in electron microscopy stacks. In the U-Net [17], low- and high-level feature maps are combined through skip connections. The high-level feature maps are processed by the deeper layers of the encoder network and propagated through the decoder, whereas the low-level features are propagated from the initial layers of the network. This may cause a semantic gap between the high- and low-level features. Ibtehaz et al. [18] proposed adding convolutional units along the skip connections to reduce the semantic gap. Oktay et al. [19] proposed an attention U-Net that used an attention block to alter the feature maps propagated through the skip connections. Here, the previous decoder block's output was used to form a gating mechanism that prunes unnecessary spatial features passing through the skip connections and keeps only the relevant ones. In addition, various other extensions of the U-Net have been proposed [10], [11], [20]-[22]. To incorporate global context information for the task of scene parsing, PSPNet [23] generated hierarchical feature maps through a Pyramid Pooling Module (PPM). Similarly, Chen et al. [24] used Atrous Spatial Pyramid Pooling (ASPP) to aggregate global features. Later, the same group proposed the DeepLabV3+ [25] architecture, which used skip connections between the encoder and decoder. Both of these networks have been widely used by the biomedical imaging community [26]-[28].
Hu et al. [29] proposed SE-Net, which pioneered channel-wise attention. The Squeeze and Excitation (S&E) block models interdependencies between channels and derives a global information map that helps emphasize relevant features and suppress irrelevant ones. FED-Net [30] incorporated these S&E blocks in its modified U-Net architecture. Kaul et al. [31] incorporated both types of attention, i.e., spatial and channel-wise, in their proposed FocusNet. Jha et al. [20] modified ResUNet [32] by adding ASPP, an S&E block [29], and attention mechanisms to further boost the performance of the network. Takikawa et al. [33] proposed Gated-SCNN, which pioneered the idea of a gated shape stream to generate finer segmentation maps by leveraging the shape and boundaries of the target object. The shape stream was recently also employed by Sun et al. [22] to capture the shape and boundaries of the target segmentation map for medical segmentation problems. Fan et al. [34] devised a parallel partial decoder (PraNet) that aggregates high-level features to generate a guidance map estimating a rough location of the region of interest. The guidance map in PraNet is then used with a reverse attention module to extract finer boundaries from the low-level features. Kim et al. [35] modified the U-Net architecture and added additional encoder and decoder modules. In their UACANet, saliency maps computed by a prediction module are used to compute foreground, background, and uncertain-area maps for each representation. The relationship between the representations is computed and used by the next prediction module. A detailed summary of advances in deep-learning based methodologies for medical image segmentation can be found in [36]-[38].

B. Residual dense blocks
Dense connections are a unique approach for improving information flow and keeping a collection of diversified features. Architectures based on dense connections are characterized by each layer receiving inputs from all previous layers. Various medical image segmentation methods [11], [39]-[42] leverage the diversified features captured by such dense connections to improve segmentation performance. Guan et al. [39] modified the U-Net architecture by substituting standard encoder-decoder units with densely connected convolutional units. Zhou et al. [11] conceived an architecture where the encoder and decoder are connected through dense and nested skip pathways for efficient fusion between the encoder and decoder feature maps. Zhang et al. [40] proposed Residual Dense Blocks (RDBs) to extract local features via densely connected convolutional layers. Additionally, their architecture connects each previous RDB to all subsequent RDBs, with a final global fusion through 1 × 1 convolutions to maintain global hierarchical feature extraction. In ResUNet [41] and Residual Dense U-Net (RD-U-Net) [42], RDBs are included in a standard U-Net based architecture to make use of hierarchical features. Dolz et al. [43] proposed HyperDense-Net, which introduced a two-stream CNN designed to process each modality in a separate stream for multi-modal image segmentation. Dense connections are used across layers of the same path and also between layers of different paths, thereby increasing the capacity of the network to learn more complex combinations of the different modalities.

C. Multi-scale fusion
Maintaining a high-resolution representation of the image is important for segmentation architectures to precisely capture the spatial information and give accurate segmentation maps [44]. Rather than recovering such representations from low-level representations, multi-scale fusion can help exchange high- and low-resolution features throughout the segmentation process. Wang et al. [44] demonstrated that such exchange of features improves the flow of high-resolution features and can potentially lead to a more spatially accurate segmentation map. They achieved this by processing all resolution streams in parallel, maintaining a representation at each resolution, and performing feature fusion across all resolution scales.
The earlier works by Ronneberger et al. [17] and Badrinarayanan et al. [45] used skip connections to concatenate high-resolution feature representations at each level with the upscaled features in the decoder to preserve both high- and low-resolution feature representations. Zhao et al. [23] used pyramid pooling to perform multi-resolution fusion, while Chen et al. [24] used ASPP with multiple atrous convolutions at different sampling rates. Similarly, Yang et al. [46] used densely connected atrous convolutional layers in their DenseASPP network to gather multi-scale features with a large range of receptive fields. Lin et al. [47] proposed ZigZagNet, which fused multi-resolution features by exchanging information in a zig-zag fashion between the encoder and decoder. Wang et al. [48] proposed Deeply-Fused Nets, which fuse intermediate representations, allowing receptive fields of varying sizes. Additionally, the authors used same-sized receptive fields derived from two other base networks to capture different characteristics in the extracted features. Deep fusion was further studied in [44], [49], [50].

D. Our approach
To address the challenges of the existing approaches, we introduce a DSDF block that takes features at two different scales as input. While propagating information within the same resolution, the DSDF block also performs cross-resolution fusion. This establishes a dual-scale fusion of features that inherits both high- and low-resolution feature representations. An encoder network feeds the feature representations into the MSRF sub-network, which consists of multiple DSDF blocks and thereby performs multi-scale feature exchange. Later, decoder layers with skip connections from our sub-network and a triple attention mechanism process the fused feature maps together with the shape stream. It is to be noted that the fusion strategy is interchangeable, i.e., low-to-high resolution and vice versa. Figure 2(a) shows the MSRF-Net, which consists of an encoder block, the MSRF sub-network, a shape stream block, and a decoder block. The encoder blocks contain squeeze and excitation modules, and the MSRF sub-network processes the low-level feature maps extracted at each resolution scale of the encoder. The MSRF sub-network incorporates several DSDF blocks. A gated shape stream is applied after the MSRF sub-network, and decoders consisting of triple attention blocks are used in the proposed architecture. A triple attention block has the advantage of using spatial and channel-wise attention along with spatially gated attention, where irrelevant features from the MSRF sub-network are pruned. Below, we briefly describe each component of MSRF-Net.

A. Encoder
The encoder blocks (E1-E4) in Figure 2(a) comprise two consecutive convolutions followed by a squeeze and excitation module. The S&E block increases the network's representational power by computing the interdependencies between channels. During the squeezing step, global average pooling aggregates each feature map across its spatial dimensions. In the excitation step, a collection of per-channel weights is produced to capture channel-wise dependencies [29]. At each encoder stage, max pooling with a stride of two is used to downscale the resolution, and dropout is utilized for model regularization.
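The squeeze-and-excitation computation described above can be sketched as follows. This is a minimal numpy illustration with random stand-in weights for the two learned fully connected layers, not the authors' Keras implementation:

```python
import numpy as np

def squeeze_excite(x, reduction=4, seed=0):
    """Minimal numpy sketch of a squeeze-and-excitation (S&E) block.

    x has shape (H, W, C). The two fully connected layers use random
    stand-in weights; in the real network they are learned."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = x.mean(axis=(0, 1))
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid
    w1 = rng.standard_normal((C, C // reduction))
    w2 = rng.standard_normal((C // reduction, C))
    s = np.maximum(z @ w1, 0) @ w2
    s = 1.0 / (1.0 + np.exp(-s))   # per-channel weights in (0, 1)
    # Rescale each channel of the input feature map
    return x * s
```

Each channel of the output is the corresponding input channel scaled by a single learned coefficient, which is what lets the block emphasize or suppress whole feature maps.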

B. The DSDF block and MSRF sub-network
Maintaining the resolution throughout the feature encoding process helps produce semantically richer and spatially more accurate segmentation maps. The DSDF block helps exchange information between scales, preserves low-level features, and improves information flow while maintaining resolution. The block has two parallel streams for two different resolution scales (Figure 1(a)). If we let a 3 × 3 convolution followed by a LeakyReLU activation be represented by the operation CLR(·), then each stream has a densely connected residual block with five CLR operations in series. The output feature map M_{d,h} of the d-th CLR operation is computed from the high-resolution input X_h as:

M_{d,h} = CLR(X_h ⊕ M_{1,h} ⊕ · · · ⊕ M_{d−1,h} ⊕ TC(M_{d−1,l})),   (1)

where ⊕ is the concatenation operation and h indicates that the CLR operation is on the higher-resolution stream of the DSDF block. TC(·) denotes a transposed convolutional layer with a 3 × 3 kernel size and stride of 2, which processes M_{d−1,l} before concatenation. Similarly, for the lower-resolution stream, the output of the d-th CLR operation is denoted by M_{d,l} and represented as:

M_{d,l} = CLR(X_l ⊕ M_{1,l} ⊕ · · · ⊕ M_{d−1,l} ⊕ SC(M_{d−1,h})),   (2)

where SC(·) is a convolutional layer with a kernel size of 3 × 3 and stride of 2, which processes M_{d−1,h} before concatenation. In Equations 1 and 2, 1 ≤ d ≤ 5.
Initially, X_h (or M_{0,h}) and X_l (or M_{0,l}) are the higher- and lower-resolution stream inputs, respectively. The output of each CLR has k output channels, the growth factor, which regulates the amount of new features the layer can extract and propagate further in the network. Since the growth factor varies for each scale, we only use two scales at once in the DSDF block to reduce the model's computational complexity and make training feasible. Further, local residual learning is used to improve information flow, and residual scaling with a scaling factor 0 ≤ w ≤ 1 is used to prevent instability [51], [52]. The final output of the DSDF block can be written as (see Figure 1(a)):

X̂_r = w · M_{5,r} + X_r,   (3)

where r ∈ {h, l} denotes the resolution, with h indicating the high-resolution representation and l the low-resolution representation. Next, we present the MSRF sub-network, which comprises several DSDF blocks to achieve a global multi-scale context using the dual-scale fusion mechanism. As shown in [40], our approach has a contiguous memory mechanism that retains multi-scale feature representations, since the input of each DSDF block is passed to every subsequent DSDF block in the same resolution stream.
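The dense connectivity with cross-scale exchange and the scaled residual described above can be illustrated with a small numpy sketch. Here the 3 × 3 convolutions are replaced by random channel-mixing maps, and nearest-neighbour upsampling / average pooling stand in for the transposed and strided convolutions; it is a connectivity illustration under those assumptions, not the trained block:

```python
import numpy as np

def clr(x, out_ch, rng):
    """Stand-in for conv 3x3 + LeakyReLU: a random channel-mixing map
    (spatial mixing omitted for brevity)."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.1
    y = x @ w
    return np.where(y > 0, y, 0.01 * y)

def upsample2(x):
    """Nearest-neighbour stand-in for the transposed convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2(x):
    """2x2 average pooling stand-in for the stride-2 convolution."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def dsdf(x_h, x_l, k=16, w=0.4, depth=5, seed=0):
    """Dual-Scale Dense Fusion connectivity sketch.

    Each stream's d-th step sees all of its own previous features plus
    the rescaled last feature of the other stream, and the scaled block
    output is added back to the inputs (residual scaling)."""
    rng = np.random.default_rng(seed)
    feats_h, feats_l = [x_h], [x_l]
    for d in range(depth):
        in_h = np.concatenate(feats_h + [upsample2(feats_l[-1])], axis=-1)
        in_l = np.concatenate(feats_l + [downsample2(feats_h[-1])], axis=-1)
        feats_h.append(clr(in_h, k, rng))
        feats_l.append(clr(in_l, k, rng))
    return w * feats_h[-1] + x_h, w * feats_l[-1] + x_l
```

Note that the residual addition requires the stream inputs to carry k channels, which is why the growth factor also sets the input width in this sketch.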

(Algorithm 1, which lists the layer-wise DSDF feature exchanges and residual updates of the MSRF sub-network, is not reproduced here.)

In Algorithm 1, defining the inputs of the MSRF sub-network amounts to demarcating all the resolution-scale pairs and feeding them into their respective DSDF blocks. We start with the first layer, where each layer consists of four resolution scales, with H and L representing high-resolution and low-resolution sets of features, respectively, and the respective block indices denoted by ĥ and l̂. The DSDF(·) function performs feature fusion across scales in the DSDF block, where X_{ĥ,p} and X_{l̂,q} are jointly computed from the scale pair (p, q). A separately marked X denotes the feature exchange in the centre DSDF block. Already after the fourth layer of the MSRF sub-network, we have effectively exchanged features across all scales and attained global multi-scale fusion (see the red rectangular block in Figure 1(b)). We can observe that X_{0,r}, ∀r ∈ {1, 2, 3, 4}, transmits its features to all parallel resolution representations through multiple DSDF blocks. Using this method, we exchange features globally in an effective way, even when the number of resolution scales is greater than 4. As in the DSDF block, the output of the last layer of the sub-network is again scaled by w and added to the original input of the MSRF sub-network.
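The claim that a few layers of pairwise DSDF exchanges achieve global multi-scale fusion can be checked with a small reachability computation. The pairing schedule below, alternating the outer pairs with the centre pair, is an assumed reading of Figure 1(b):

```python
def pair_schedule(layers=4):
    """Assumed scale-pair schedule: even layers fuse the outer pairs
    (0,1) and (2,3); odd layers fuse the centre pair (1,2)."""
    return [[(0, 1), (2, 3)] if t % 2 == 0 else [(1, 2)]
            for t in range(layers)]

def reachable(layers=4, num_scales=4):
    """Propagate 'which input scales influence each stream' through the
    schedule; a DSDF fusing (a, b) merges their influence sets."""
    influence = [{s} for s in range(num_scales)]
    for pairs in pair_schedule(layers):
        for a, b in pairs:
            merged = influence[a] | influence[b]
            influence[a], influence[b] = set(merged), set(merged)
    return influence
```

Running `reachable(layers=4)` shows every stream influenced by all four input scales, matching the paper's observation that global fusion is attained within the first few layers of the sub-network.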

C. Shape stream
We have incorporated the gated shape stream [33] in MSRF-Net for shape prediction (see the shape stream block in Figure 2(a)). The DSDF blocks can extract relevant high-level feature representations that include important information about shape and boundaries, which can be used in the shape stream. Similar to [22], we define S_l as the shape stream feature maps, where l is the layer index, and X as the output of the MSRF sub-network. Bilinear interpolation is used so that X matches the spatial dimensions of S_l. The attention map α_l at the gated convolution is computed as:

α_l = σ(C_{1×1}(S_l ⊕ X)),   (4)

where σ(·) is the sigmoid activation function and C_{1×1}(·) denotes a 1 × 1 convolution. Finally, S_{l+1} is computed as S_{l+1} = RB(S_l × α_l), where RB represents a residual block with two CLR operations followed by a skip connection. The output of the shape stream is concatenated with the image gradients of the input image and merged with the original segmentation stream before the last CLR operation. This is done to increase the spatial accuracy of the segmentation map.
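A minimal numpy sketch of the gated convolution in Equation 4, assuming the 1 × 1 convolution is applied to the concatenated maps (the weights here are random stand-ins for the learned kernel):

```python
import numpy as np

def gated_conv(s, x, seed=0):
    """Sketch of the shape stream's gated convolution: a 1x1 conv over
    the concatenation S_l (+) X produces a sigmoid attention map alpha
    that gates the shape features. s, x: (H, W, C) feature maps."""
    rng = np.random.default_rng(seed)
    cat = np.concatenate([s, x], axis=-1)              # S_l (+) X
    w = rng.standard_normal((cat.shape[-1], 1)) * 0.1  # 1x1 conv stand-in
    alpha = 1.0 / (1.0 + np.exp(-(cat @ w)))           # (H, W, 1), in (0, 1)
    return s * alpha                                    # gated shape features
```

Because alpha lies in (0, 1), the gate can only attenuate shape features, letting the segmentation features decide where boundary evidence should survive.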

D. Decoder
The decoder blocks (D2-D4) have skip connections from the MSRF sub-network and the previous decoder output (denoted D−), except for D2, where the previous-layer connection is the MSRF sub-network output of E4 (Figure 2(a)). In the decoder block (Figure 2(b)), we use two attention mechanisms. The first applies channel and spatial attention, whereas the second uses a gating mechanism. We use an S&E block to calculate the channel-wise scale coefficients, denoted by X_{αse}. Spatial attention is calculated in the same top stream, where the input channels C are reduced to 1 using a 1 × 1 convolution. The sigmoid activation function σ(·) scales the values between 0 and 1 to produce an activation map, which is stacked C times to give X_{αs}. The output of the spatial and channel attention can be represented as:

D_{sc} = (X ⊗ X_{αse}) ⊗ (X_{αs} + 1),   (5)

where ⊗ denotes the Hadamard product, and X_{αs} is increased by a magnitude of 1 to amplify the relevant features determined by the activation map. We also use the attention gating mechanism of [19]. Let the features coming from the MSRF sub-network be X and the output of the previous decoder block be D−; then the gated attention output can be calculated as:

D_{AG} = Ω(σ(Ψ(θ(X) + φ(D−)))),   (6)

where θ(·) is a convolution operation with stride 2, kernel size 1 × 1, and G output channels; φ(·) is a convolution operation with stride 1 and kernel size 1 × 1 applied to D−, also giving G channels; and Ψ(·) is a convolution with a 1 × 1 kernel applied to the combined features from θ(·) and φ(·), making the output channel dimension equal to 1. Finally, σ(·) is applied to obtain the activation map, on which the transposed convolution operation Ω(·) is applied.
D_{AG} captures the contextual information and identifies the target regions and structures of the image. D̂_{AG} = D_{AG} ⊗ X allows the irrelevant features to be pruned and the relevant target structures and regions to be propagated further. The final output of the triple attention decoder block (i.e., the combination of channel, spatial, and gated spatial attention) is D_α = D_{sc} ⊕ D̂_{AG}, which is then followed by two CLR operations.
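The gated attention of Equation 6 can be sketched as follows. Strided slicing stands in for the stride-2 θ(·) projection and nearest-neighbour upsampling for the transposed convolution Ω(·); all weights are random stand-ins, and the ReLU between the addition and Ψ(·) is an assumption borrowed from the standard attention gate of Oktay et al.:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, d_prev, g=8, seed=0):
    """Numpy sketch of the decoder's gated attention.

    x: (H, W, C) skip features from the MSRF sub-network.
    d_prev: (H/2, W/2, C') previous decoder output D-.
    Returns the pruned skip features D_AG (x) X."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    theta = x[::2, ::2] @ (rng.standard_normal((C, g)) * 0.1)          # stride-2 theta
    phi = d_prev @ (rng.standard_normal((d_prev.shape[-1], g)) * 0.1)  # phi on D-
    q = np.maximum(theta + phi, 0)                                     # assumed ReLU
    alpha = sigmoid(q @ (rng.standard_normal((g, 1)) * 0.1))           # psi -> 1 channel
    alpha_up = alpha.repeat(2, axis=0).repeat(2, axis=1)               # Omega stand-in
    return x * alpha_up
```

Since the upsampled activation map lies in (0, 1), the gate suppresses skip features wherever the previous decoder stage found no supporting context.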

E. Loss computation
We use the binary cross-entropy loss L_BCE defined in Equation 8 and the dice loss L_DCS defined in Equation 9:

L_BCE = −Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],   (8)

L_DCS = 1 − (2 Σ_i y_i ŷ_i) / (Σ_i y_i + Σ_i ŷ_i),   (9)

where y_i is the ground truth value and ŷ_i is the predicted value at pixel i.
The sum of the two loss functions, L_comb = λ_1 L_BCE + λ_2 L_DCS, is used for gradient minimization between the predicted maps and the labels, while only L_BCE is used for the shape stream. Here, we set λ_1 and λ_2 to 1. For the shape stream loss, predicted edge maps and ground truth edge maps are used during computation. Deep supervision is also used to improve the flow of gradients and regularization [53]. Thus, our final loss function can be represented as:

L = α L_comb + β_1 L^{DS0}_comb + β_2 L^{DS1}_comb + γ L_SS,   (10)

where L^{DS0}_comb and L^{DS1}_comb represent the losses of the two deep supervision outputs (see Figure 2(a)) and L_SS is the loss computed for the shape stream. We set α = 1, β_1 = 1, β_2 = 1, and γ = 1 for our experiments.
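The combined loss L_comb = λ_1 L_BCE + λ_2 L_DCS can be sketched in numpy as below; this is a per-image sketch, and the smoothing constant in the dice term and the clipping epsilon are assumptions for numerical stability:

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy averaged over pixels (clipped for stability)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(y, p, smooth=1.0):
    """Dice loss: 1 minus the (smoothed) dice overlap of prediction and label."""
    inter = (y * p).sum()
    return 1.0 - (2 * inter + smooth) / (y.sum() + p.sum() + smooth)

def combined_loss(y, p, lam1=1.0, lam2=1.0):
    """L_comb = lam1 * L_BCE + lam2 * L_DCS, with both weights set to 1."""
    return lam1 * bce_loss(y, p) + lam2 * dice_loss(y, p)
```

A perfect prediction drives both terms to (near) zero, while the dice term keeps gradients informative on the heavily class-imbalanced masks typical of lesion segmentation.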

IV. EXPERIMENTAL SETUP

A. Dataset
To evaluate the effectiveness of MSRF-Net, we have used four publicly available biomedical imaging datasets: Kvasir-SEG [12], CVC-ClinicDB [13], 2018 Data Science Bowl [2], and ISIC-2018 Challenge [14], [15]. The details of the datasets, the number of training and testing samples used, and their availability are presented in Table I. All of these datasets consist of images and their corresponding ground truth masks. An example from each dataset can be found in Figure 3. The chosen datasets are commonly used in biomedical image segmentation. The main reason for choosing datasets of diverse imaging modalities is to evaluate the performance and robustness of the proposed method.

B. Evaluation metrics
Standard computer vision metrics for medical image segmentation, such as the Dice Coefficient (DSC), mean Intersection over Union (mIoU), recall, precision, and Frames Per Second (FPS), have been used for the evaluation of our experimental results. The standard deviations for DSC, mIoU, recall, and precision are also provided. Additionally, we conduct a paired t-test between the DSC achieved by our proposed MSRF-Net and the DSC attained by other SOTA methods, and report the resulting p-values.
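For reference, the reported overlap metrics can be computed from a thresholded prediction as follows; this is a minimal numpy sketch where the 0.5 threshold and the ε guard against empty masks are assumptions:

```python
import numpy as np

def seg_metrics(y, p, thr=0.5, eps=1e-7):
    """DSC, IoU, precision, and recall for one binary segmentation.

    y: ground truth mask (0/1), p: predicted probabilities in [0, 1]."""
    pred = (p >= thr).astype(float)
    tp = float((pred * y).sum())          # true positives
    fp = float((pred * (1 - y)).sum())    # false positives
    fn = float(((1 - pred) * y).sum())    # false negatives
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }
```

The dataset-level numbers in the tables would then be means and standard deviations of these per-image values over the test split.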

C. Implementation details
We have implemented the proposed architecture using the Keras framework [54] with TensorFlow [55] as the backend. All experiments were conducted on an NVIDIA DGX-2 machine using NVIDIA V100 Tensor Core GPUs. The Adam optimizer was used with a learning rate of 0.0001, and dropout regularization with p = 0.2 was applied. The scaling factor for our DSDF blocks and MSRF sub-network was set to w = 0.4. The growth factor k is set to 16, 32, and 64 for the resolution scale pairs in the DSDF blocks. For Kvasir-SEG and 2018 DSB, the images are resized to 256 × 256. ISIC-2018 images are resized to 384 × 512, and images from CVC-ClinicDB are resized to 384 × 288. We used a batch size of 16 for Kvasir-SEG and 2018 DSB, eight for CVC-ClinicDB, and four for the ISIC-2018 Challenge dataset. We empirically set the number of training epochs to 200 for all datasets. We used 80% of each dataset for training, 10% for validation, and the remaining 10% for testing. Data augmentation techniques such as random cropping, random rotation, horizontal flipping, vertical flipping, and grid distortion were applied. Note that we used the open-source code provided by the respective authors for all baseline comparisons. The proposed model is available at https://github.com/NoviceMAnprog/MSRF-Net.

A. SOTA method comparisons
In this section, we present the comparison of our MSRF-Net with other SOTA methods.
1) Comparison on Kvasir-SEG
Early detection of polyps, before they potentially develop into colorectal cancer, can improve the survival rate [58]. Therefore, we have selected two popular colonoscopy datasets for our experiments. The first is Kvasir-SEG. We report the quantitative evaluation of MSRF-Net in Table II and qualitative results in Figure 3. From the quantitative results, we observe that our method outperforms all other SOTA methods on all metrics. It achieves a 1.39% improvement in DSC compared to PraNet [34] and a 3.39% improvement in mIoU compared to DeepLabv3+ with the Xception backbone [25]. Our method also achieves improvements of 1.70% in precision and 1.04% in recall compared to DeepLabv3+ with the Xception backbone and U-Net [17], respectively. The network's ability to segment polyps can be observed by comparing the predicted masks with the ground truth (Figure 3).
2) Comparison on CVC-ClinicDB
CVC-ClinicDB is the second colonoscopy dataset used in our experiments. The quantitative results in Table III show that our approach surpasses other SOTA methods and achieves a DSC of 0.9420 ± 0.0804, a 1.76% improvement in DSC over the best-performing HRNetV2-W48 [44]. We report a mIoU of 0.9043 ± 0.1009 and a recall of 0.9567 ± 0.0620, which are improvements of 1.44% in mIoU and 2.82% in recall over the SOTA combination of ResUNet++ and conditional random field [57] and over UACANet-S [35], respectively. Additionally, MSRF-Net achieves a precision of 0.9427, which is competitive with the best-performing DoubleU-Net [10]. Our method produces prediction masks with nearly the same boundaries and shape of the polyp as the ground truth masks (Figure 3).

3) Comparison on 2018 Data Science Bowl
Finding nuclei in cells across a large variety of microscopy images is a challenging problem. We experiment with the 2018 Data Science Bowl challenge dataset. Table IV compares the results of the proposed MSRF-Net with those of related approaches. MSRF-Net obtains a DSC of 0.9224 ± 0.0538, a mIoU of 0.8534 ± 0.0870, a recall of 0.9402 ± 0.0734, and a precision of 0.9022 ± 0.0601, outperforming the best-performing ColonSegNet [5] in most metrics (see Table IV). From the qualitative results in Figure 3, we observe that the predicted masks are visually similar to the ground truth masks.

4) Comparison on ISIC-2018 Skin Lesion Segmentation challenge
An automatic diagnosis tool for skin lesions can help in the accurate detection of melanoma, a commonly occurring cancer whose early detection is associated with survival rates of up to 99% [59]. The quantitative results for the ISIC-2018 challenge are shown in Table V. Our method achieved a DSC of 0.8824 ± 0.1602, a mIoU of 0.8373 ± 0.1818, a recall of 0.8893 ± 0.1889, and a precision of 0.9348 ± 0.1488. We observe improvements of 0.43% in DSC and 1.37% in mIoU over DeepLabv3+ with the MobileNet backbone [24], as well as a 0.63% improvement in recall over the same model. Our results are comparable with DoubleU-Net [10], which reports the highest DSC of 0.8938. The higher recall shows that our method is more medically relevant, which is considered a major strength of our architecture [60]. From Figure 3, we can observe that our method can accurately segment skin lesions of varying sizes.

B. Generalization study
To assess generalizability, we have trained our model and other SOTA methods on one dataset and then tested them on a new, unseen dataset that comes from a different institution, consists of a different cohort population, and was acquired using different imaging protocols. To this end, we have used Kvasir-SEG, collected at Vestre Viken Health Trust in Norway, for training, and tested the trained models on CVC-ClinicDB, which was captured at Hospital Clinic in Barcelona, Spain. Similarly, we conducted this study in the opposite setup as well, i.e., training on CVC-ClinicDB and testing on Kvasir-SEG.

1) Generalizability results on CVC-ClinicDB
Table VI shows the generalizability results of the MSRF-Net model trained on Kvasir-SEG and tested on CVC-ClinicDB. Despite the two datasets being acquired using different imaging protocols, MSRF-Net obtained an acceptable DSC of 0.7921 ± 0.2564, mIoU of 0.6498 ± 0.2729, recall of 0.9001 ± 0.2980, and precision of 0.7000 ± 0.1572. We observe that our MSRF-Net performs better than other SOTA methods in terms of DSC. HRNetV2-W48 [44] obtained a competitive DSC of 0.7901.

2) Generalizability results on Kvasir-SEG
Similarly, we present the results of the models trained on CVC-ClinicDB and tested on Kvasir-SEG in Table VII. Our model achieves a DSC of 0.7575 ± 0.2643, mIoU of 0.6337 ± 0.2815, recall of 0.7197 ± 0.2775, and precision of 0.8414 ± 0.2731, outperforming other SOTA methods in DSC and mIoU. The second-best performing method is PraNet [34], with a DSC of 0.7293 ± 0.3004 and a mIoU of 0.6262 ± 0.3128. Our method outperforms PraNet [34] by 2.82% in DSC and 0.75% in mIoU, but PraNet [34] records the highest recall of 0.8007.

C. Ablation study
We conducted an extensive ablation study on Kvasir-SEG. Specifically, we ablated the impact of the MSRF sub-network, the scaling mechanism used in the network, the number of DSDF blocks used, the impact of the MSRF sub-network on shape prediction in the shape stream (Section V-C), and the effect of removing the shape stream and deep supervision from the MSRF-Net architecture. Table VIII shows the quantitative results of our ablation study. First, we removed the MSRF sub-network entirely, which resulted in the lowest DSC of 0.8771. Adding back a subset of the original MSRF sub-network (the red dotted region in Figure 1(b)) raised the DSC to 0.8986. Next, we removed the DSDF blocks with second- and third-scale inputs (the middle DSDF blocks in Figure 1(b), i.e., layers three and five) from the original MSRF sub-network, achieving a DSC of 0.9013. To further investigate the contribution of our MSRF sub-network, we removed the shape stream, achieving a DSC of 0.9194, which is comparable to the highest DSC of 0.9217 reported by the original MSRF-Net configuration. Disabling the triple attention mechanism in the decoder block yielded a DSC of 0.9067 ± 0.1834, and further removing deep supervision resulted in a lower DSC of 0.8988. We also report the effect of using a combination of dice loss and binary cross-entropy loss in L_comb (Equation 10), used to supervise MSRF-Net during training. Setting L_comb = L_BCE gave a DSC of 0.9059, while setting L_comb = L_DCS gave a DSC of 0.8861. A similar trend was observed for the other metrics.
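The loss combination ablated above can be sketched as follows. This is a minimal NumPy version of a dice plus binary cross-entropy objective; the exact weighting and any deep-supervision terms of Equation 10 may differ, and the function names are our own:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels; pred holds probabilities."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, eps=1e-7):
    """Soft dice loss: 1 - DSC, computed on probabilities rather than hard masks."""
    inter = np.sum(pred * target)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def combined_loss(pred, target):
    """Sum of the two terms, the setting that scored best in the ablation."""
    return bce_loss(pred, target) + dice_loss(pred, target)
```

The dice term directly optimizes region overlap, while the BCE term provides well-behaved per-pixel gradients; their combination is a common choice for imbalanced segmentation targets.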

VI. DISCUSSION
Multi-scale fusion methodologies have been studied previously; however, they have some disadvantages. For example, U-Net [17] uses skip-connections for feature fusion, but the resulting combination of features suffers from a semantic gap, since it combines low-level features of the encoder with high-level features of the decoder. U-Net++ [11] performs low-to-high feature fusion to overcome this problem, but high-to-low feature fusion remains lacking. Deeplabv3+ [25] fuses pyramid features but does not maintain high-resolution representations. Similar to our approach, HRNet builds upon the multi-scale feature fusion process by adding repeated feature fusion while keeping high-resolution representations; however, its fusion modules consist of a larger number of trainable parameters, and informative low-level features are also lost during the segmentation process [61]. The disadvantages stated above can cause Deeplabv3+ [25] and HRNetV2 [62] to perform considerably worse on the 2018 Data Science Bowl challenge, where finer segmentation maps are required for a high DSC. The results on Kvasir-SEG, CVC-ClinicDB, and ISIC-2018 also show similar performance gaps between the proposed method and other multi-scale fusion methods (see Tables I-IV).
The proposed MSRF-Net uses DSDF blocks (arranged as described in Algorithm 1) to attain global multi-scale fusion while increasing the frequency of multi-scale fusion operations at a lower computational complexity than HRNetV2 [62]. The DSDF blocks themselves enable effective feature fusion between high- and low-resolution scales through continuous feature exchange across the two scales. Additionally, their residual structure permits the relevant high- and low-level features to be propagated efficiently, enabling the proposed MSRF-Net to effectively capture the variability in size, shape, and structure of the region of interest. We observe that the residual, densely connected nature of the DSDF blocks and their arrangement allows our proposed MSRF-Net to achieve the highest DSC of 0.9217 and mIoU of 0.8914 on Kvasir-SEG (see Table II). Similarly, we report the highest DSC, mIoU, and recall of 0.9420, 0.9043, and 0.9567, respectively, on CVC-ClinicDB (see Table III). The ability of MSRF-Net to recognize smaller and finer cell structures in the 2018 Data Science Bowl is evident in Table IV, where we report the best DSC of 0.9224. Additionally, we report the best mIoU and recall on the ISIC-2018 skin lesion dataset, and our DSC is competitive with DoubleU-Net. We present the training loss of MSRF-Net on Kvasir-SEG with respect to the number of epochs in Figure 6(a); the model converges steadily from around epoch 75.
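The dual-scale exchange at the heart of a DSDF block can be illustrated with a highly simplified sketch. In the actual block, the exchanged features feed densely connected residual convolution layers; here, plain resampling and addition stand in for them, and all function names are ours:

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution via 2x2 average pooling (H and W must be even)."""
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def upsample(x):
    """Double spatial resolution via nearest-neighbour repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def dual_scale_exchange(high, low):
    """One exchange step: each stream receives a resampled copy of the other.

    Repeating this exchange (as the DSDF blocks do) lets high-resolution
    detail and low-resolution context flow in both directions.
    """
    new_high = high + upsample(low)
    new_low = low + downsample(high)
    return new_high, new_low
```

Stacking several such blocks in sequence, with the outputs of one pair of scales feeding the next, gives the global multi-scale fusion performed by the MSRF sub-network.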
In practical clinical environments, the performance of deep learning based segmentation methods decreases due to differences in imaging protocols and patient variability. Models that generalize across multi-center datasets are therefore more desirable in a clinical setting [63]. MSRF-Net achieves the highest DSC of 0.7921 when trained on Kvasir-SEG and tested on CVC-ClinicDB (see Table VI). Similarly, MSRF-Net achieves the highest DSC of 0.7575 and mIoU of 0.6337 when trained on CVC-ClinicDB and tested on Kvasir-SEG (see Table VII); HRNetV2-W48 [62] was competitive with our method. These results suggest that our proposed MSRF-Net is more generalizable, which can be attributed to our multi-scale fusion exploiting features at different scales and preserving class-representative features.
We performed an ablation study (see Table VIII) to demonstrate that the combination of relevant high- and low-level multi-scale features obtained by the MSRF sub-network is instrumental in recognizing the shape and boundaries of the target object, which boosts segmentation performance. To verify the contribution of the MSRF sub-network, we disabled the entire MSRF sub-network while keeping every other component of the network intact and trained the model. Table VIII shows that when the MSRF sub-network is removed from the proposed MSRF-Net, the DSC drops by 4.46%. This performance degradation illustrates that the MSRF sub-network contributes to the network: the combination of high- and low-resolution feature representations of varying receptive fields extracted by the MSRF sub-network contributes significantly towards improving the model's performance. We also ablated whether multi-scale fusion was suitable for the entire network. "Sub-network without DSDF" refers to the removal of the DSDF blocks with second- and third-scale inputs (the middle DSDF blocks in Figure 1(b), i.e., layers three and five). Table VIII shows the result when global multi-scale fusion is absent from the network: we observe a 2.04% performance drop in DSC. It is therefore evident that the multi-scale fusion used in the MSRF sub-network improves performance. To study the impact of the number of DSDF blocks on segmentation performance, we reduced the number of DSDF layers from six (ours) to three, i.e., only the red rectangular block in Figure 1(b) is used. Even though this still enables the exchange of global multi-scale feature representations, our results in Table VIII show that reducing the number of DSDF blocks decreases the DSC by 2.31%.
The sub-network without scaling in Table VIII demonstrates the influence of the scaling factor w in the network (see Equation 3). For this experiment, we did not scale the output of the DSDF block by a constant before adding it to the block's input. A drop of 0.80% in DSC was observed when the features were not scaled. Furthermore, our empirical experiments with different values of w (see Figure 6(b)) show that the optimal choice of w is 0.4.
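The scaling experiment above corresponds to dropping the constant w from the residual addition. A one-line sketch of the scaled connection, assuming (per Equation 3) that the block output is scaled by w before being added to the block input; the function name is illustrative:

```python
import numpy as np

W = 0.4  # scaling factor found optimal in the ablation (Figure 6(b))

def scaled_residual(x, dsdf_output, w=W):
    """Add the DSDF block output back to its input, scaled by w.

    Scaling the residual branch keeps each block's contribution small,
    which helps stabilize training of deep residual stacks.
    """
    return x + w * dsdf_output
```

Setting w = 1 recovers the unscaled variant ablated in Table VIII, which lost 0.80% DSC.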
We designed a variant model in which the MSRF sub-network is placed after the shape stream in MSRF-Net. Here, we keep the number of parameters the same for both models (i.e., MSRF-Net and the variant) to analyze the impact of the MSRF sub-network on the shape stream. The qualitative results (see Figure 5) show that MSRF-Net can delineate more precise and more spatially accurate boundaries than the variant model. The variant model fails to recognize the boundaries of the target structure because it is deprived of the multi-scale features extracted by the MSRF sub-network, which validates our choice of placing the MSRF sub-network before the shape stream block. Only a minor drop of 0.23% in DSC is seen when no shape stream is applied, and this configuration still outperforms most SOTA methods.
We also investigated the impact of our triple attention block by disabling the mechanism before training MSRF-Net, and, in a separate experiment, we disabled deep supervision while training MSRF-Net. Both experiments showed performance drops compared to our proposed MSRF-Net (1.50% in the former and 2.29% in the latter, on the DSC metric). We also evaluated the impact of the combination of L_BCE and L_DCS used in L_comb (see Section III-E). For this, we trained MSRF-Net with L_comb = L_DCS and then with L_comb = L_BCE. With L_comb = L_DCS + L_BCE, we obtained an increase of 3.56% in DSC, 4.68% in mIoU, 0.59% in recall, and 4.90% in precision compared to the L_comb = L_DCS setting. A similar trend was observed for L_comb = L_BCE (see Table VIII).
MSRF-Net clearly shows the strength of fusing low- and high-resolution features through the DSDF blocks and the MSRF sub-network. In addition, the complementary inclusion of the scaling factor, deep supervision in the encoder block, and triple attention in the decoder block brought further improvements. In Figure 4, we show qualitative results for suboptimal cases. The qualitative results show poor performance for oblique samples in the polyp datasets. Similarly, the model also failed for extremely low-contrast images in the 2018 Data Science Bowl and for scattered, similar-looking patches in ISIC-2018.
VII. CONCLUSION
In this paper, we proposed the MSRF-Net architecture for medical image segmentation, which takes advantage of multi-scale resolution features passed through a sequence of DSDF blocks. Such densely connected residual blocks with dual-scale feature exchange enable efficient feature extraction with varying receptive fields. Additionally, we have shown that the features from the DSDF blocks are better suited to capture a target object's entire shape and boundaries, even for objects of variable size. Our experiments revealed that MSRF-Net outperforms several SOTA methods on four independent biomedical datasets. Our investigation using cross-dataset testing to evaluate the generalizability of MSRF-Net confirmed that the model produces competitive results in such scenarios. We also identified some challenges of the proposed method, such as failure cases on extremely low-contrast images. For future work, we plan to investigate the identified challenges further and adjust the design of the network to address these challenging cases.