FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation

The increasing availability of large clinical and experimental datasets has led to many important contributions in the area of biomedical image analysis. Image segmentation, which is crucial for any quantitative analysis, has attracted particular attention. Recent hardware advances have enabled the success of deep learning approaches. However, although deep learning models are trained on large datasets, existing methods do not use the information from different learning epochs effectively. In this work, we leverage the information of each training epoch to prune the prediction maps of the subsequent epochs. We propose a novel architecture called the Feedback Attention Network (FANet), which unifies the previous epoch's mask with the feature map of the current training epoch. The previous epoch's mask is then used to provide hard attention to the learned feature maps at different convolutional layers. The network also allows predictions to be rectified iteratively at test time. We show that our proposed \textit{feedback attention} model provides a substantial improvement on most segmentation metrics across seven publicly available biomedical imaging datasets, demonstrating the effectiveness of FANet. The source code is available at \url{https://github.com/nikhilroxtomar/FANet}.


I. INTRODUCTION
Image segmentation is one of the most studied problems in computer vision, where the main goal is to assign each pixel of an image to a specific class instance. The pixels may belong to arbitrary objects such as cars or humans in natural scene data [1], satellite data in remote sensing [2], [3], or cancerous areas or cells in biomedical imaging data [4]. Substantial progress has been made in biomedical imaging, where various modalities exist, such as X-ray, Computerized Tomography (CT), Magnetic Resonance Imaging (MRI), endoscopy imaging, fundus imaging, Electron Microscopy (EM), and histology imaging. While Machine Learning (ML) methods usually provide improved performance over traditional computer vision methods, most of them require ground-truth labels from domain experts, which are often scarce and may not represent enough variability in biomedical imaging data. This can limit ML models to sub-optimal predictions. Furthermore, existing methods for semantic segmentation are based on a single-step prediction process that does not allow them to rectify their own predicted segmentation masks. These networks are thus constrained to a single set of learned weights that may not be enough to capture the inter- and intra-class differences present in biomedical imaging data. In this work, we introduce an iterative approach that can refine the segmentation masks from previous mask predictions in a few iterative steps. This iterative process enables the network to steer towards an improved feature representation by exploiting attention derived from the previous mask, unlike the classically used one-step segmentation methods [1], [5]. Aggregating these results over a few iterations thus provides improved segmentation masks (see illustration in Figure 1).
Recent developments in Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and attention modules have improved automated methods in biomedical image analysis. Widely used supervised end-to-end CNN methods require a large and diverse training dataset to avoid overfitting. RNNs can preserve model compactness and can be used effectively for segmentation tasks in resource-constrained settings via iterative updates on the internal states of network layers [6]. However, they are known for their memory-bandwidth-bound computation and model complexity [7]. Additionally, spatial visual attention mechanisms used for image captioning of natural scene images [8] and for medical image segmentation [9] have shown improvements in both model convergence and performance metrics. An attention mechanism allows networks to focus on a concrete class instance, thereby penalizing non-specific regions. Our model is thus inspired by the success of both the visual attention mechanism and the recurrent learning paradigm.
A mask-guided contrastive attention model was used by Song et al. [10] to deal with background clutter. Unlike classical training mechanisms, and motivated by the work of Song et al. [10], we propose to propagate the sample-specific mask output from the previous epoch to the successive epoch in a recursive fashion. Such a feedback mechanism can provide prior information that helps to learn sample variability, thereby enabling effective training on diverse datasets. Here, iterative prediction can be used to prune the predicted masks during inference (see Figure 1). This allows the network to learn both local and global features that can rectify the mask output from the learned weights. Unlike Test-Time Augmentation (TTA) [11], where different transforms are utilized to mimic sample representations and data diversity, we embed mask rectification into the training process. To our knowledge, the Feedback Attention Network (FANet) is the first deep learning model that incorporates the ability to self-rectify its predictions without requiring heavy transformations, ensemble strategies, or prior sample-specific knowledge. FANet uses a single end-to-end trainable network that allows information propagation during both training and test time.

Fig. 1: Semantic segmentation using our FANet architecture. Otsu thresholding is used for generating the initial mask used during the 0th iteration. The predictions are then iteratively updated with the predicted mask. It can be observed that the results converge already at the 2nd iteration. The corresponding feature maps before and after feedback attention at the last decoder layer of our FANet are shown as color images on the right.
A feedback mechanism during training is central to our novel FANet approach for semantic segmentation. The predicted map of each sample from the previous epoch, unified with the current-state feature map, is used to provide attention. FANet applies this attention mechanism at different feature scales in the network, allowing it to capture variability in image samples. Additionally, our residual block with a Squeeze-and-Excitation (SE) layer allows us to improve channel interdependencies, which can be critical for tackling image quality issues. The main contributions of this work can be summarized as follows: 1) Feedback attention learning - A novel mechanism to utilize the variability present in each training sample. The mask outputs are propagated from one epoch to the next to suppress unwanted feature clutter. 2) Iterative refining of prediction masks - Using feedback information helps to refine the predicted masks during training as well as inference. During testing, we iterate over the input image and keep updating the input mask with the predicted mask for up to 10 iterations (empirically set).
3) Embedded run-length encoding strategy - The binary mask outputs of each sample are efficiently compressed before being propagated to the next epoch. This provides a memory-efficient mechanism for passing sample-specific masks. 4) Systematic evaluation - Experiments on seven vastly different biomedical datasets suggest that FANet outperforms other state-of-the-art (SOTA) algorithms. 5) Efficient training - FANet achieves near-SOTA performance with far fewer training epochs.
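Contribution 3 relies on run-length compression of the binary masks passed between epochs. The paper does not specify the exact encoding, so the following is only a minimal plain-Python sketch of one possible (value, count) scheme with the hypothetical helpers `rle_encode` and `rle_decode`:

```python
def rle_encode(mask):
    """Run-length encode a flat binary mask as (value, count) pairs."""
    runs = []
    for v in mask:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, c) for v, c in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the flat mask."""
    out = []
    for v, c in runs:
        out.extend([v] * c)
    return out
```

For a strictly binary mask, storing only the counts together with an agreed-upon starting value would compress the representation further.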

II. RELATED WORK
In this section, we summarize relevant advances in medical image segmentation and feedback attention networks. We also highlight recent contributions to iterative refinement methods for image segmentation.

A. Biomedical image segmentation
Most modern CNN-based semantic segmentation architectures are based either on the Fully Convolutional Network (FCN) [12] or on an encoder-decoder architecture such as U-Net [5], originally designed for cell segmentation. Various modifications of these networks have been proposed, both for semantic segmentation of natural images [13], [14] and for biomedical image segmentation [9], [15]-[20]. In general, the encoder captures low-level to high-level features of the image content using multiple convolutions, whereas the decoder part of the network obtains the prediction masks through multiple upsampling or deconvolution operations. Methods like PSPNet [13] and DeepLab [1] incorporate convolutional feature maps of varying resolutions to segment both small and large objects effectively. While PSPNet uses a pyramid pooling module, DeepLab uses Atrous Spatial Pyramid Pooling (ASPP) for encoding multi-scale contextual information. Both PSPNet- and DeepLab-based architectures have been widely used in the medical imaging community for biomedical image segmentation [21], [22].

B. Feedback attention networks
Visual attention has been widely used in computer vision for pose estimation [23], object detection [24], and image segmentation [25], [26]. Chu et al. [23] incorporated a multi-context attention method into their end-to-end eight-stack hourglass CNN, where each sub-network of the hourglass generated a multi-resolution attention map. Attention mechanisms [27], [28] have also been utilized to place explicit focus on the target region in medical imaging. Schlemper et al. [28] proposed a novel attention gate model that automatically learns to focus on target structures of varying shapes and sizes by suppressing irrelevant features and highlighting salient features for the specified medical image segmentation task. Attention U-Net [9] used a gated operation in the U-Net architecture to focus on target abdominal regions of CT datasets. A feedback mechanism for attention using two U-Net architectures with shared weights was used for cell segmentation [29], [30]. The latter used a standard U-Net architecture, with the second U-Net incorporating a ConvLSTM [31] to store the feature map (input-to-state) from the first U-Net. However, feedback is only applied within the same epoch via state-to-state transitions. In contrast, our approach utilizes a feedback mechanism that propagates information from the previous epoch to the current epoch through an attention mechanism. We employ the predicted masks from the previous epoch as hard attention to prune the segmentation output.

C. Iterative refinement for segmentation
An iterative refinement of the segmentation mask, by feeding the input image and the predicted segmentation mask to a modified U-Net architecture, was proposed by Mosinska et al. [32]. The authors used an iterative refinement pipeline to enhance the quality of the predicted segmentation mask. Similarly, iterative updates of the latent space and minimization of a Structural Similarity Index Measure (SSIM) loss were used to refine the predicted segmentation maps at test time in [33]. Recently, iterative refinement strategies have also been used for pose estimation [34], [35], applying consecutive modules that refine the predictions with a loss function evaluating the output of each module. These iterative refinement processes show improved predictions and are able to handle domain shifts and object shape variability without requiring very deep networks [33]. However, a major bottleneck of these methods is the large number of iterations required for model convergence. Unlike these methods, our proposed FANet provides attention to the specific region-of-interest and can prune the predicted segmentation masks in fewer than ten iterations without requiring any optimization scheme.

III. METHOD
In this section, we describe the components of the proposed FANet architecture. The overall design along with the proposed feedback attention learning mechanism is illustrated in Figure 2.

A. SE-Residual block
Deeper networks improve model performance significantly, but an increase in depth can cause vanishing or exploding gradient problems [43]. To deal with this, we take advantage of shortcut connections between layers in the residual learning paradigm. Our SE-Residual block uses two 3×3 convolutions and an identity mapping, where each convolution layer is followed by a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) non-linear activation function. The identity mapping connects the input and the output of the convolution layers (Figure 2 a).
Similar to the work of Hu et al. [44], we add an SE layer to the residual block. The SE layer acts as a content-aware mechanism that re-weights each channel to create robust representations. It thereby allows the network to become more sensitive to significant features while suppressing irrelevant ones. This is accomplished in two steps. First, the feature maps are squeezed using global average pooling to obtain a global description of each channel. The squeeze operation results in a feature vector of size n, where n is the number of channels. In the second step, excitation, this feature vector is fed through a two-layer feed-forward neural network, where the number of features is first reduced and then expanded back to the original size n. This n-sized vector represents the weights of the original feature maps and is used to scale each channel.
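The squeeze, excitation, and scaling steps described above can be sketched in plain Python. This is a hypothetical `se_recalibrate` helper operating on nested lists; a real implementation would use tensor operations, and the bottleneck weights `w1`, `w2` would be learned rather than supplied:

```python
import math

def se_recalibrate(feature_maps, w1, w2):
    """Squeeze-and-Excitation over a list of 2D channels (nested lists).

    w1: bottleneck weights reducing n channels to r hidden units;
    w2: expansion weights mapping r back to n. Both are illustrative only.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_maps]
    # Excitation: reduce with ReLU, expand, then sigmoid to get channel weights.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    scales = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    # Scale: re-weight every channel by its computed importance.
    return [[[x * s for x in row] for row in ch]
            for ch, s in zip(feature_maps, scales)]
```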

B. MixPool block
The proposed MixPool block, shown in Figure 2 (b), is used in multiple layers of our FANet architecture. This block facilitates the flow of sample-wise feedback information between consecutive epochs, providing hard attention to the features learned by the SE-Residual block. The layer focuses on the relevant features in both the contraction path and the expansion path. The 'hard' attention map consists of the values 0 and 1, i.e., attention is given to a specific region only, unlike soft attention, where a probability map is estimated. The advantage of hard attention is that it keeps only the important features and ignores irrelevant ones: during the element-wise multiplication, values of the input feature map multiplied by 0 become 0, leaving only the essential features for further operations. Further advantages of such methods are their computational speed, scalability, and ease of interpretation [8], [45]. The input mask used during training is compressed using run-length encoding to reduce the memory footprint.
As shown in Figure 2 (b-c), the feature maps F_l from the SE-Residual blocks in each layer are first passed through a 3×3 convolution followed by BN and a ReLU activation function. Then, we apply a 1×1 convolution and a sigmoid activation function σ(·), thresholded at 0.5, to obtain the binary mask M'_l that contributes to the spatial attention map generation:

M'_l = 𝟙[ σ(Conv_{1×1}(ReLU(BN(Conv_{3×3}(F_l))))) > 0.5 ]

Secondly, we apply appropriate max-pooling to the input mask M_l (from the previous epoch) and resize it to the size of the spatial attention map M'_l. A union operation is then applied between the resized mask and the spatial attention map. This ensures that we obtain the features from both the feedback and the spatial attention maps, creating a new unified spatial attention map. Next, an element-wise multiplication is applied between the unified mask and the original feature map, which suppresses the irrelevant features and enhances the important ones. The enhanced and the original feature maps are then each followed by a 3×3 convolution, BN, and a ReLU. These operations improve the network's ability to learn the non-linearity in the model prediction.
Finally, we concatenate the outputs of both activation functions, which constitutes the output of our MixPool block:

F_l^out = φ(F_l) ⌢ φ(F_l ⊗ (M'_l ∪ M_l))

where φ(·) denotes a 3×3 convolution followed by BN and ReLU, ⌢ denotes the concatenation operator, ⊗ is element-wise multiplication, and ∪ represents the union operation.
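The core of the MixPool attention, uniting the previous-epoch mask with the current binary spatial attention map and then multiplying element-wise, can be illustrated with a small plain-Python sketch. The `mixpool_attention` helper is hypothetical and operates on 2D lists; the surrounding convolutional stages are omitted:

```python
def mixpool_attention(feature_map, spatial_attention, prev_mask):
    """Hard attention: take the union of the previous-epoch mask and the
    binary spatial attention map, then multiply the feature map by it
    element-wise, zeroing out features outside the attended region."""
    unified = [[1.0 if (a > 0 or m > 0) else 0.0
                for a, m in zip(a_row, m_row)]
               for a_row, m_row in zip(spatial_attention, prev_mask)]
    return [[f * u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(feature_map, unified)]
```

Because the unified map is binary, features outside both the feedback mask and the attention map are suppressed exactly to zero, which is what makes this a hard rather than soft attention.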

C. Proposed FANet architecture
The block diagram of FANet is illustrated in Figure 2 (c). It uses an encoder-decoder design common to many semantic segmentation architectures. We combine the strengths of a residual network enhanced with SE (the SE-Residual block) and the MixPool block, which facilitates the attention on and propagation of information from both the current learning state and the previous epoch. We implement a recurrent learning mechanism in both encoder and decoder layers that allows us to achieve efficient segmentation. The MixPool block uses the previous segmentation map (passed as an input mask through RLE encoding), which contains information from prior training, to improve the semantic representation of the feature maps.
We first use Otsu thresholding [46] to generate an initial input mask for training the proposed architecture. The variability in the input mask is refined over the training epochs, and the model learns over time to prune the input (previous-epoch) masks together with the learned, semantically meaningful features. To achieve this, we use the novel MixPool block, which takes the input mask and applies hard attention over the subsequent input feature maps. The hard attention enables the network to highlight semantically meaningful features for the target region-of-interest throughout the network. The network thus not only learns to predict feature maps but also strengthens a joint pruning mechanism that depends on the input mask. As a result, the devised network is able to rectify the predicted segmentation maps in an iterative fashion, unlike conventional methods, which lack such a pruning ability. This provides a strong rationale for our work, which goes beyond single-step inference by offering the capability of refining prediction maps.
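For reference, Otsu's method selects the threshold that maximizes the between-class variance of the grayscale histogram. A stdlib-only sketch follows; the paper presumably uses a library implementation, so this is purely illustrative:

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity threshold maximizing between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg, w_bg, best_t, best_var = 0.0, 0, 0, -1.0
    for t in range(levels):
        w_bg += hist[t]               # background pixel count
        if w_bg == 0:
            continue
        w_fg = total - w_bg           # foreground pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mu_bg = sum_bg / w_bg
        mu_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var > best_var:            # keep the threshold with max variance
            best_var, best_t = var, t
    return best_t
```

The initial binary mask is then simply `pixel > threshold` for every pixel of the grayscale image.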
The proposed network architecture is a Fully Convolutional Neural Network (FCNN) consisting of four encoder and four decoder blocks. The encoder takes the input image, downsamples it gradually, and encodes it in a compact representation.
Then, the decoder takes this compact representation and tries to reconstruct the semantic representation by gradually upsampling it and combining the features from the encoder. Finally, we receive a pixel-wise categorization of the input image. Both the encoder and the decoder are built using the SE-Residual block, and an additional concatenation of the original resolution feature representation in the encoder is added at each resolution scale. This mechanism minimizes the loss of feature representations during downscaling and upscaling processes.
Each encoder block starts with two SE-Residual blocks, which consist of two 3×3 convolutions and a shortcut connection, known as an identity mapping, connecting the input and output of the two convolution layers. Each convolution is followed by BN and a ReLU activation function. The output of the second SE-Residual block acts as a skip connection for the corresponding decoder block. It is then followed by the MixPool block, which uses the previous epoch's segmentation mask to provide hard attention over the incoming feature maps. This process is repeated for each of the downscaled layers.
Each decoder block starts with a 4×4 transpose convolution that doubles the spatial dimensions of the incoming feature maps. These feature maps are concatenated with the feature maps from the corresponding encoder block through skip connections. The skip connections help to propagate information from the upper layers, which is sometimes lost due to the depth of the network. They are followed by two SE-Residual blocks, which help to mitigate the vanishing gradient problem. The MixPool block, which utilizes the segmentation mask from the previous epoch, is then applied, creating hard attention over the learned feature maps. Next, we concatenate the feature maps from the last decoder block and the segmentation mask from the previous epoch. Finally, we apply a 1×1 convolution with a sigmoid activation function. Its output is used both to minimize the training loss, using a combined binary cross-entropy and dice loss, and to generate the segmentation masks, which are stored as a run-length encoded compression for each sample and propagated to the next epoch. The RLE is updated after each epoch. Just as the network learns to adapt its weights in iterative training, this mechanism is also utilized at test time: as shown in Figure 1, test results are pruned within a few iterations. Unlike many methods in the literature [32], [33], we utilize the same network without any complementary loss function optimization.
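The test-time rectification described above is, in essence, a fixed-point iteration on the mask. A minimal sketch, where `model` stands in for the trained network's forward pass taking an image and the current mask (the 10-iteration budget matches the empirically set value from the contributions):

```python
def iterative_inference(model, image, init_mask, iterations=10):
    """Feed each prediction back as the input mask for a fixed number of
    iterations, letting the network rectify its own segmentation output."""
    mask = init_mask
    for _ in range(iterations):
        mask = model(image, mask)
    return mask
```

A toy stand-in for `model` shows the idea of convergence to a stable mask:

```python
toy = lambda image, mask: [min(v + 1, 3) for v in mask]
refined = iterative_inference(toy, None, [0, 0])   # converges to [3, 3]
```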

1) Dataset and Evaluation Metrics:
To evaluate the proposed architecture, we selected seven datasets that capture different segmentation tasks in biomedical imaging. The details of each dataset can be found in Table I. The datasets contain images of organs and lesions acquired under different imaging protocols. For the retinal vessel segmentation task, we use the DRIVE and CHASE-DB1 datasets. These two datasets target diseases of the retinal vessels, such as retinopathy, retinal vein occlusion, and retinal artery occlusion. The third medical imaging dataset is ISIC 2018, a dermoscopy dataset useful in the diagnosis of skin cancer. It contains a wide variety of skin cancer images of different sizes and shapes, which helps in a better understanding of the disease. We further include the Kvasir-SEG and CVC-ClinicDB colonoscopy datasets. These contain image frames extracted from different colonoscopy interventions and focus on colorectal polyps, which are precursors of cancer in the colon and rectum; their early detection greatly increases the chance of avoiding lethal cancer. In addition, we include two datasets from biological imaging aimed at understanding cellular processes: the 2018 Data Science Bowl and the EM dataset. The 2018 Data Science Bowl dataset contains images with a large number of variably shaped nuclei acquired from different cell types, magnifications, and imaging modalities, and is designed for automated nuclei segmentation. Similarly, the EM dataset contains transmission EM images of the neural structures of the Drosophila nerve cord and is aimed at automated segmentation of neural structures. All experiments on these datasets are conducted on the same train, validation, and test splits as provided by the previously published works reported in this paper.
To evaluate SOTA deep learning methods and our proposed FANet, we use standard evaluation metrics that include the Dice Coefficient (DSC) (a.k.a. F1), mean Intersection over Union (mIoU), precision, and recall. We additionally calculate specificity for those datasets where this metric was previously used for benchmarking.
2) Implementation details: All training is performed on a Volta 100 GPU on an NVIDIA DGX-2 system using the PyTorch 1.6 framework. For test inference, we use an NVIDIA GTX 1050 Ti GPU for our method and all SOTA methods in this paper, as this hardware is widely available. Our model is trained for 100 epochs (empirically set) using an Adam optimizer with a learning rate of 1e-4 for all experiments, except for the Digital Retinal Images for Vessel Extraction (DRIVE) and CHASE-DB1 datasets, where the learning rate was adjusted to 1e-3 due to the small size of the training sets. Datasets were chosen such that the efficiency of our model could be compared to the SOTA methods. A combination of binary cross-entropy and dice loss is used as the loss function. The ReduceLROnPlateau callback was used to monitor the learning rate and adjust it to obtain optimal training performance.

3) Ablation study: In order to evaluate the strength of our proposed FANet architecture, we perform a thorough ablation study. For this, we use all seven datasets and evaluate several metrics for the baseline (FANet without MixPool), the baseline with MixPool, and the combination of baseline, MixPool, and feedback (proposed).
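The combined binary cross-entropy and dice loss used for training can be sketched as follows for flattened probability maps. Equal weighting of the two terms and the `eps` smoothing constant are assumptions; the paper does not state these details:

```python
import math

def bce_dice_loss(pred, target, eps=1e-7):
    """Combined BCE + dice loss over flattened probability/label lists.
    Equal weighting of the two terms is assumed for illustration."""
    n = len(pred)
    # Binary cross-entropy, averaged over pixels (eps avoids log(0)).
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / n
    # Soft dice loss: 1 minus the dice overlap of prediction and target.
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return bce + dice
```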

B. Results
Below we present quantitative results on seven different biomedical imaging datasets and compare with corresponding SOTA methods.
1) Results on Kvasir-SEG: Kvasir-SEG [36] is a publicly available polyp segmentation dataset acquired from clinical colonoscopy procedures and has been widely used for algorithm benchmarking. We trained our model and compared it with recent SOTA methods on Kvasir-SEG. A comparison with widely accepted segmentation methods using different backbones (see Table II) shows that our approach improves performance over the SOTA methods (on the same train-test split). Our FANet outperforms all the SOTA methods on almost all metrics. While outperforming U-Net and its variants, FANet achieves an F1 score of 0.8803, which is 1.6% and 3.57% better than the most accurate DeepLabv3+ with a ResNet101 backbone and the recent HRNet, respectively.

2) Results on CVC-ClinicDB dataset: CVC-ClinicDB is another commonly used dataset for colonoscopy image analysis. The FANet architecture outperforms all the SOTA methods on this dataset by a large margin, with an F1 of 0.9355, mIoU of 0.8937, recall of 0.9339, and precision of 0.9401 (see Table III). FANet achieves the best trade-off between recall and precision compared to the ResUNet-based architectures [18], [47]. The strength of FANet can be observed in the large improvement of 23.17% in recall and 5.24% in precision over the SOTA ResUNet++ [18]. The recall suggests that our method is more clinically preferable than the SOTA, as a higher recall is desired in systems used for clinical diagnosis [51].
3) Results on 2018 Data Science Bowl: Cell nuclei segmentation in microscopy imaging is a common task in biological image analysis [38]. We use the publicly available 2018 Data Science Bowl (DSB) challenge dataset and compare our results with the SOTA methods. Table IV shows that FANet produces an F1 of 0.9176, mIoU of 0.8569, and recall of 0.9222, with an improvement of 2.02% in F1 with respect to the SOTA UNet++ [16] and a 28.15% improvement in recall compared to the best-performing DoubleU-Net [19]. In general, FANet achieves the best trade-off between precision and recall compared to the SOTA methods, resulting in the highest F1 score (0.9176). The qualitative results on the 2018 DSB also show that FANet produces high-quality segmentation masks for cell nuclei with respect to the ground truth (see Figure 3).

4) Results on ISIC 2018 dataset:
Skin cancer is one of the most commonly diagnosed cancers in the US. Early detection of melanoma can substantially improve the five-year survival rate [54]. Table V presents the results on the ISIC 2018 dataset; specificity and precision were also recorded. From the qualitative results in Figure 3, we can see that the input mask produced by Otsu thresholding shows under-segmentation, which is improved significantly by FANet. The masks produced by FANet have smooth boundaries.

5) Results on DRIVE dataset:
The automated segmentation of vessels in fundus images can assist in the diagnosis and treatment of diabetic retinopathy. The quantitative results on the publicly available DRIVE dataset are presented in Table VI. The proposed FANet achieves an F1 score of 0.8183, mIoU of 0.6927, recall of 0.8215, and precision of 0.8189. The proposed method achieves an improvement of 4.24% in recall over the SOTA IterNet [57]. Although the F1 of IterNet is 0.35% higher than that of FANet, its recall is relatively lower, and other metrics such as mIoU and precision were not reported. For our proposed FANet, the precision of 0.8189 is well balanced with the obtained recall. The higher recall produced by FANet shows that our method is more clinically relevant. The quality of the segmentation masks in Figure 3 demonstrates the efficiency of FANet.
6) Results on CHASE-DB1 dataset: CHASE-DB1 is the second retinal image segmentation dataset used to evaluate our method. For this dataset, there is no official training and test split; we used 20 images to train our model and 8 images to test, as reported in the work of Li et al. [57]. From Table VII, we can observe that our method achieves the highest F1 of 0.8108, mIoU of 0.6820, and the highest recall of 0.8544. FANet achieves an improvement of 3.67% in recall compared to the SOTA DenseBlock-UNet.

7) Results on EM dataset: The EM dataset aims to foster automated ML algorithms for the segmentation of neural structures, so that the difficulties of manual labeling can be resolved. Table VIII shows the quantitative results on the EM dataset. The proposed FANet obtains an F1 of 0.9547, mIoU of 0.9134, and a recall of 0.9568.

Fig. 3: Qualitative results of FANet on the seven biomedical image segmentation datasets. The initial "input mask" is generated using Otsu thresholding. The "output mask" is the predicted segmentation mask from the FANet model.

C. Qualitative results
The qualitative results on all seven datasets are presented in Figure 3. It can be observed that for the colonoscopy datasets (Kvasir-SEG and CVC-ClinicDB), even though the initial input mask covers the entire image, our model is able to prune it and provide accurate masks. The same can be observed for the two retinal vessel segmentation datasets, DRIVE and CHASE-DB1: our model is able to segment the challenging retinal vessels, including small retinal vessel bifurcations, and its output closely resembles the ground truth mask. For the 2018 DSB, ISIC 2018, and EM cell data, the input masks are again finely rectified, achieving close-to-ground-truth results with the proposed FANet model.

D. Ablation study
In this section, we ablate our model architecture and present extensive experimental results on the effectiveness of the proposed FANet. To evaluate the contribution of the MixPool block and the feedback, we created the following configurations: 1) Baseline (B1): FANet without the MixPool block, i.e., with no feedback mechanism or iterative pruning; the MixPool block is required to provide feedback, as it unifies the attention from the network feature map and the input mask (refer to Figure 2). 4) Full model (B4): the complete FANet architecture, with the MixPool block in all encoder and decoder blocks and the feedback (iterative pruning) mechanism used during inference. Table IX presents the ablation results for these four configurations on all seven datasets. Below we provide a detailed analysis of the different architectural settings and validate them with the four network configurations (B1-B4): 1) Effectiveness of MixPool block: The MixPool block is an essential part of the proposed FANet architecture. It uses the previously predicted mask as attention to improve the semantically meaningful features and allows higher-level abstractions. The effectiveness of the MixPool block can be evaluated by comparing the network configurations B1 and B4.
From the experiments in Table IX, we can conclude that B4 outperforms B1 on all datasets. On the F1 metric, B4 shows an improvement of 2.87% on the Kvasir-SEG dataset, 1.89% on CVC-ClinicDB, 0.55% on the 2018 Data Science Bowl dataset, 0.84% on the ISIC 2018 dataset, 0.11% on the DRIVE dataset, 2.92% on the CHASE-DB1 dataset, and 0.03% on the EM dataset. These performance gains are significant and thus demonstrate the effectiveness of the MixPool block in the proposed FANet.

TABLE IX: Detailed ablation study of the FANet architecture. FLOPs are given in GMac. "Rec" stands for recall, "Prec" for precision, "Spec" for specificity, "Acc" for accuracy, and "Param" for the total number of parameters. B1-B4 denote the different network configurations.

3) Significance of feedback during evaluation: The proposed architecture uses the feedback information (the input mask) during evaluation as well.

E. Algorithm efficiency
We have analyzed algorithm efficiency in terms of the number of parameters, flops, and inference time for the SOTA methods and FANet (see Table X). During the architectural design, we limited the number of trainable parameters to minimize the computational cost of our model. The proposed FANet has only 7.72 million parameters and 94.75 GMac flops, i.e., FANet has the fewest parameters and flops compared to the other, deeper architectures. However, our inference time is higher than that of the baseline networks. This is due to the novel MixPool block in FANet, which introduces additional operations, such as the element-wise multiplication with the readout of the RLE-encoded mask, resulting in a larger computational time. Nevertheless, in terms of FPS per iteration, FANet still runs above 60 (see Table IX). In FANet, the MixPool block facilitates attention and the propagation of information from the current learning epoch and that of the previous epoch, which helps achieve a performance boost (refer to Table IX). To verify the efficiency of the MixPool block, we compared our network with and without it in Table IX. It is evident that removing the MixPool block reduces the overall performance on all datasets.
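As a side note on the mask readout mentioned above: the previous epoch's binary mask is stored run-length encoded, presumably to keep the per-image storage overhead small. A minimal, self-contained sketch of such an encoder/decoder (our own helper names, not the released code):

```python
def rle_encode(mask):
    """Run-length encode a flat binary sequence as [value, run_length] pairs."""
    runs = []
    for v in mask:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the flat binary sequence."""
    return [v for v, n in runs for _ in range(n)]

mask = [0, 0, 1, 1, 1, 0, 1]
assert rle_decode(rle_encode(mask)) == mask   # lossless round trip
print(rle_encode(mask))  # [[0, 2], [1, 3], [0, 1], [1, 1]]
```

Decoding such a representation at every MixPool readout, rather than keeping dense masks in memory, is one plausible source of the extra per-iteration cost discussed above.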

F. Extended ablation study
We have performed an extended ablation study to demonstrate the architectural effectiveness of the proposed FANet. Here, we begin with an experimental verification of the MixPool block by removing certain of its components. From Table XI, we observe a performance drop when the feature map $F_l$ is not used during feature concatenation: F1 drops by 22.76% and mIoU by 26.3%. In order to justify the use of the different components, we performed experiments on (i) removing the $F_l$ layer in the MixPool block (Figure 2 (b)), (ii) removing and adding SE-Residual blocks in FANet (Figure 2 (c)), and (iii) a series concatenation of FANet in contrast to the iterative mechanism. Further, we modified the FANet architecture by adding three SE-Residual blocks and again observed a decrease in performance: F1 drops by 2.33% and mIoU by 2.45%. Next, we added one more SE-Residual block, and a severe performance drop can be observed: F1 drops by 6.45% and mIoU by 7.36%. Our proposed architecture uses iterative pruning; however, we also experimented with an alternative strategy of concatenating four FANets together in series. In this experiment, we observed a drop of 1.21% in F1 and 1.49% in mIoU, with nearly a four-fold increase in the number of trainable parameters.

V. DISCUSSION
While deep learning semantic segmentation has been widely implemented, to the best of our knowledge, only direct inference strategies have been published to date. In this work, we utilize a segmentation map pruning mechanism that demonstrates a clear advantage over the current SOTA models due to its ability to self-rectify the predicted mask during the evaluation process (see Table II-Table VIII). This process of self-rectification, or iterative pruning, helps to improve the performance of the proposed FANet architecture. The improvement is due to the feedback provided by the input mask in the MixPool block, which is further validated by our two ablation studies (Table IX and Table XI). Furthermore, the joint configuration of the mask and the feature embeddings allows the network to learn a better feature representation of the target regions and to adjust its weights depending on the input mask. This establishes an effective pruning mechanism, enabling the input mask to be steered in the direction of the relevant learned features of the network. Additionally, it can capture the variability in the datasets (e.g., shape distributions, surface morphology, etc.), allowing the network to rectify the predicted/input masks.

Table IX shows the complete ablation study of the MixPool block in the FANet architecture. In this ablation, we provide experimental results with (proposed network, B4) and without the MixPool block (B1).

Figure: For each dataset, we have included three diverse images. The provided heatmaps demonstrate the impact of the weights for the different networks. Here, the red and yellow regions in the heatmap refer to the most important features, and the blue regions refer to areas of lesser importance. From the heatmaps, it can be observed that FANet has a better feature representation than the other baseline networks for most of the datasets. $F_l$ represents the input feature map in the MixPool block (refer to Figure 2).
Here, B1 refers to the "no feedback mechanism" configuration, as no MixPool block is applied. In the proposed FANet, we require the MixPool block to provide feedback through a unified attention mechanism that takes into account the network feature map and the input mask from the previous epoch (refer to Figure 2 (b)). For the MixPool block without feedback (i.e., B2), we provide the attention from the generated feature map and the input mask, but we do not perform the iterative pruning during the evaluation. Thus, even though the B2 and B4 networks have the same number of parameters (7.72 million), removing the feedback mechanism affects the algorithm's performance (see Table IX). The FANet architecture also uses SE-Residual blocks, which serve as a self-attention mechanism on the feature channels by performing global average pooling followed by a multi-layer perceptron, allowing the network to explicitly model the interdependencies between feature channels. Further, our network introduces a spatial attention mechanism, and multiple SE-Residual blocks allow it to learn complex non-linear feature interdependencies (see Table XI for different combinations of SE-Residual blocks). Other ablation experiments, such as the series concatenation of FANet and the removal of the $F_l$ layer in the MixPool block (Table XI), showed that the proposed FANet achieves the highest performance. This justifies the importance of the different components integrated in the proposed FANet architecture. The qualitative results in Figure 4 further demonstrate the effectiveness of our network over different configurations, for example, removing the $F_l$ layer in the MixPool block or using only one SE-Residual block. It can also be observed that FANet produces more apparent segmentation maps, i.e., regions that are easily distinguishable from the background, than the SOTA methods.
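The SE-Residual channel attention described above (global average pooling followed by a small MLP that rescales each channel) can be sketched as follows. This is an illustrative NumPy version with hypothetical names (`se_block`, `w1`, `w2`) and an arbitrary reduction ratio, not the exact block used in FANet:

```python
import numpy as np

def se_block(feat, w1, w2):
    """Squeeze-and-Excitation channel attention (illustrative sketch).
    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) MLP weights,
    where r is the channel-reduction ratio."""
    squeeze = feat.mean(axis=(1, 2))              # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze)        # ReLU bottleneck (excitation MLP)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid channel weights in (0, 1)
    return feat * scale[:, None, None]            # reweight each channel

C, r = 8, 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, 6, 6))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = se_block(feat, w1, w2)
print(out.shape)  # (8, 6, 6)
```

Because the sigmoid keeps every channel weight in (0, 1), the block can only attenuate channels relative to one another, which is what lets it model interdependencies between feature channels without changing the feature map's shape.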
With the introduction of iterative pruning in our FANet architecture, we also introduce a new hyperparameter: the number of iterations during evaluation. We empirically established an optimal number of 10 iterations, which is the same for all datasets. For this, we plotted a graph (Figure 5) showing the iterative pruning on images from the different datasets. From the graph, it can be observed that there is a significant improvement from iteration 1 to 5, whereas from iteration 5 to 10 the improvement is minor to negligible. Thus, we used a maximum of 10 iterations during the evaluation. Iterative pruning over the input image increases the inference time; however, this process allows us to refine the predicted segmentation masks, unlike most current methods. For a better trade-off between efficiency and accuracy, we advise using a smaller number of iterations. We plot the F1 score for images from the different datasets over five evaluation iterations, and Figure 5 shows that our proposed FANet already benefits from just two iterations. Additionally, we used an NVIDIA GTX 1050 Ti (released in 2016) for inference, so a more recent, higher-performance GPU would provide better inference times.
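The test-time feedback loop described above can be summarized in a few lines. The helper below is our own sketch (the `predict` callable stands in for a trained FANet forward pass), paired with a toy model that mimics the diminishing returns after the first few iterations:

```python
def iterative_pruning(image, mask, predict, iterations=10):
    """Test-time refinement: feed the previous prediction back as the mask
    input for a fixed number of iterations (10 in the paper's evaluation)."""
    for _ in range(iterations):
        mask = predict(image, mask)
    return mask

# Toy stand-in for the network: each call moves the "mask" halfway toward a
# fixed target, so gains shrink geometrically, as in Figure 5.
target = 1.0
toy_predict = lambda img, m: m + 0.5 * (target - m)
refined = iterative_pruning(None, 0.0, toy_predict, iterations=10)
print(round(refined, 4))  # 0.999
```

With this toy model, two iterations already recover 75% of the gap and ten recover over 99.9%, illustrating why a smaller iteration budget is a reasonable efficiency/accuracy trade-off.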

VI. CONCLUSION
With the FANet architecture, we proposed a novel approach for biomedical image segmentation that can self-rectify the predicted masks. By introducing a feedback mechanism, we achieved an improvement on seven publicly available biomedical datasets when compared with existing SOTA methods. Our approach requires far fewer epochs for training and is well-suited to diverse biomedical imaging datasets. The feedback mechanism integrated in the FANet design effectively acts as hard attention that is combined with the existing feature maps to boost the strength of the feature representations. The experimental results demonstrate that the proposed architecture achieves accurate and consistent segmentation results across several biomedical imaging datasets despite its simple and straightforward network design. The ablation study also reveals that FANet requires less training time to achieve near-SOTA performance. In the future, we will use a contrastive learning approach to further improve the performance of FANet and test it on additional multimodal biomedical images.