UNet Deep Learning Architecture for Segmentation of Vascular and Non-Vascular Images: A Microscopic Look at UNet Components Buffered With Pruning, Explainable Artificial Intelligence, and Bias

The biomedical image segmentation (BIS) task is challenging due to variations in organ type, position, shape, size, scale, orientation, and image contrast. Conventional methods lack accurate and automated designs. Artificial intelligence (AI)-based UNet has recently dominated BIS. This is the first review of its kind that microscopically addresses UNet types by complexity, stratifies UNet by its components, examines UNet in a vascular vs. non-vascular framework, presents the key for segmentation challenges and corresponding UNet-based architecture solutions, and finally interfaces three facets of AI: pruning, explainable AI (XAI), and AI bias. PRISMA was used to select 267 UNet-based studies. Five classes were identified and labeled as conventional UNet, superior UNet, attention-channel UNet, hybrid UNet, and ensemble UNet. We discovered 81 variations of UNet by considering six kinds of components, namely encoder, decoder, skip connection, bridge network, loss function, and their combination. Vascular vs. non-vascular UNet architecture was compared. AP(ai)Bias 2.0-UNet was identified in these UNet classes based on (i) attributes of UNet architecture and its performance, (ii) explainable AI (XAI), and (iii) pruning (compression). Five bias methods, namely (i) ranking, (ii) radial, (iii) regional area, (iv) PROBAST, and (v) ROBINS-I, were applied and compared using a Venn diagram. Both vascular and non-vascular UNet systems were dominated by sUNet classes with attention. Most of the studies showed little interest in XAI and pruning strategies. None of the UNet models qualified as bias-free. There is a need to move from paper to practice for clinical evaluation and settings.


I. INTRODUCTION
Segmentation is the branch of computer vision that has evolved in cycles over the last 50 years [1], from region-based [2] to boundary-based techniques such as parametric snakes [3] or geometric level sets [4], and most recently to artificial intelligence (AI)-based frameworks [5], [6], [7]. It was exactly 20 years ago that the geometric level set paradigm for segmentation was introduced by Sethian and extended to medical imaging by Suri [8], [9].
The concept of level set image segmentation was based on the paradigm of traversing the zero-level curves using partial differential equations (PDE) and clamping the final boundaries at the high-gradient edge points [10], [11]. Level set-based geometric curves suffered from the need to initialize the segmentation curves, which was later automated [12]; however, they still needed several speed functions or regularization terms to prevent bleeding of the boundaries of the segmented organs [13]. This bleeding was due to factors such as local noise, incomplete shape information, and poor digitization or acquisition of target organs in images, leading to lower Jaccard index or Dice similarity scores [14]. Thus, there was a clear need for knowledge-based innovation.
The fundamental concept of deriving knowledge by fusing the image features extracted from the training database with the corresponding gold standard, and later applying it in a classification or characterization framework, was tapped by the machine learning framework [15], [16]. A mammoth body of literature exists in various applications under the class of computer-aided diagnosis [17], [18], [19], [20] and covers several medical imaging modalities such as magnetic resonance [21], computed tomography [22], and ultrasound [23]. Although such training models are powerful, they suffer from (a) ad-hoc training feature extraction and (b) the need for training optimization frameworks to prevent gridlock due to noise while using gradient search for local minima [24]. Thus, under the class of AI, machine learning (ML) prevailed for some time until, recently, the power of automated feature extraction for segmentation was sought in the deep learning (DL) framework [6], [20], such as UNet [25]. A typical flowchart of UNet-based diagnosis is shown in Figure 1. The UNet-based segmentation is designed in a pruning paradigm (removing redundant weights during propagation) and visualized using explainable AI (XAI), with performance evaluated using the mean alignment index (MAI), where an expert rates the error between the AI output and the ground truth (GT).
While UNet-based DL foundation models have been evolving for the last five years, the concepts of (i) scale, size, shape, orientation, and position, (ii) filter sizes, (iii) feature extraction using encoders, (iv) image reconstruction using decoders, (v) fusion of low-level and high-level features via skip connections, (vi) dimensions of image modalities, and finally, (vii) coverage of the spectrum from vascular (angiography) to non-vascular paradigms are not well presented [26]. While there is a very limited number of UNet-based review articles, which we will discuss in the benchmarking subsection of the discussion section, here we present a classic framework to better understand these black boxes, which are adopted in a plug-and-play fashion. Further, the computer vision industry relentlessly cherry-picks the hybridization process and tailors UNet components for niche applications. While the computer vision industry gallops along wearing its own blinkers, there is a need to explore the scientific validation paradigm using explainable AI (XAI), which seems to have been left behind. Further, these training models can be large in storage and slow in speed, and thus demand pruning of redundant weights. Lastly, since AI is prone to bias due to (i) poor data collection, (ii) low-performing models, and (iii) overemphasis on the accuracy of these UNet models through memorization, the concept of generalization needs to be revisited in an "unseen AI" framework.
Thus, this study offers special attention to (i) showing the variations of UNet-based DL in five innovative categories, namely conventional UNet (cUNet), superior UNet (sUNet), attention-channel UNet (acUNet), hybrid UNet (hUNet), and ensemble UNet (eUNet); (ii) identifying eight sUNet types based on scale, parallel connection, cascade connections, integration of probability maps, role of residual models, role of feedback systems, context derivation, high-dimensional inputs, and finally loss function designs; (iii) understanding the 81 variations in UNets due to six types of variations in the fundamental cUNet, arising from changes in the encoder, decoder, skip connection, bridge network, and loss function; (iv) the role of UNet in vascular vs. non-vascular segmentation paradigms in medical imaging: its architectural characteristics, differences, and similarities; (v) introducing the "key for segmentation challenges and corresponding architecture solutions"; (vi) scientific validation using XAI; (vii) the pruning paradigm to reduce storage size and improve speed; (viii) biases in UNet-based DL architectures; and finally, (ix) recommendations ensuring the mapping between the segmentation type and UNet variations.
The hypothesis of this study is that variations in UNet components, such as the encoder, decoder, skip connection, bridge network, loss function design, or the input to the UNet itself, improve the performance of the UNet-based segmentation system. The performance can be expressed in the form of speed, accuracy, receiver operating characteristic (ROC) curves, or other performance metrics. Such a variational UNet system can lead to better designs for XAI and bias handling, and to possible new innovations in compressing the model size. This is the first study of its kind to advance the understanding of these concepts, drawing on the authors' experience while demonstrating the architectural variations for vascular and non-vascular applications in the healthcare industry.

II. SEARCH STRATEGY AND STATISTICAL DISTRIBUTION
The statistical distribution of the literature is necessary to understand the types of UNet in vascular and non-vascular paradigms, the variations in the components of UNet, the participation of feature extraction methods, the types of performance parameters and their frequency in the selected studies, pruning models for storage reduction, XAI techniques for UNet, and bias in AI-based solutions. Thus, we adopted the PRISMA model for the selection of studies on UNet, XAI, pruning, and bias assessment [27], [28], [29], [30], [31], [32], [33], [34]. This section is therefore divided into two parts: section II.A discusses the study selection criteria and section II.B presents the statistical distributions.

A. PRISMA MODEL
The selection and searching of the studies for this review were conducted using the PRISMA model. The keywords used for the search were "UNet for vascular studies", "UNet for non-vascular studies", "UNet variations for segmentation", "UNet-based segmentation", "UNet-based segmentation of coronary artery using IVUS", "UNet-based segmentation of carotid artery", "UNet-based segmentation of aorta or aortic artery", "UNet-based segmentation of peripheral artery", "UNet-based segmentation of retinal scans or fundus images", "UNet-based segmentation of brachial images", "UNet-based segmentation of brain, liver, kidney, knee, prostate, COVID-19 lesions", "AI-Bias in UNet", "Pruning methods for UNet", "Storage size of UNet-based", "Explainable-AI for UNet-based segmentation methods", and combinations of these. The search platforms used were Science Direct, IEEE Xplore, PubMed, and Google Scholar. The PRISMA flow chart for the selected studies is shown in Figure 2. An exhaustive search resulted in a total of 2,672 studies. The three criteria used for exclusion were (a) non-relevant studies, (b) articles removed after search and screening of the studies, and (c) records rejected due to insufficient data. Applying these exclusion criteria removed 2,307, 88, and 10 studies, shown as E1, E2, and E3 (Figure 2), leaving 267 studies for the final analysis. The important scientific knowledge from these final studies was gathered, and the statistical classification was drawn. The architectures, their features, UNet classification, bias estimation, explainable AI, and pruning were used for the analysis [35].
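As a quick consistency check on the PRISMA counts above, the short Python sketch below subtracts the three exclusion counts from the total number of identified studies. Only the four counts come from the text; the variable names are illustrative.

```python
# Counts quoted in the text; the variable names are illustrative only.
total_identified = 2672
excluded = {"E1": 2307, "E2": 88, "E3": 10}

# Studies remaining after all three exclusion steps.
included = total_identified - sum(excluded.values())
print(included)  # 267
```

The result agrees with the 267 UNet-based studies reported in the abstract.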

B. STATISTICAL DISTRIBUTION
The statistical distribution was analyzed to characterize the aspects or features of UNet-based DL systems. The distribution was computed for publications per year (Figure 3 (a)), field of view or application (Figure 3 (b)), UNet types used among the systems (Figure 3 (c)), performance parameters used (Figure 3 (d)), vascular vs. non-vascular (Figure 3 (e)), and UNet variation types (Figure 3 (f)).

III. STRATIFICATION OF UNET ARCHITECTURES

A. BASIC UNET ARCHITECTURE AND ITS COMPONENTS
UNet-based DL has recently dominated the medical image segmentation industry in nearly all body imaging modalities, harnessing the power of automated feature extraction and reconstruction of desired shapes. In 2015, Ronneberger et al. [25] first introduced UNet as an approach to image segmentation, benchmarked against standard conventional segmentation approaches. This UNet architecture is shown in Figure 4. The main components of the UNet architecture are the encoders, decoders, bridge network, skip connections, loss function criteria, and the process of binary conversion (the so-called softmax layer). This historical innovation of down-convolution and up-convolution, combined with max pooling, i.e., the ability to pick the highest-level relevant information or features, accelerated the move towards automated feature extraction [25]. The ability to transfer feature information from the encoder to the decoder phase of the UNet-based DL model retains the desired features during shape reconstruction (decoder phase). In contrast to level set-based geometric curves, UNet-based DL does not need manual placement of the initial curves. However, it requires a gold standard for training the UNet-based DL models. Thus, it is supervised in nature, which is the focus of this study. Although it began as only a minor improvement in image segmentation, it has now come to dominate the computer vision, image processing, and artificial intelligence industries.
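To make the interplay of encoder, bridge network, skip connection, and decoder concrete, the following minimal Python sketch traces feature-map shapes through a UNet-like network. It is a bookkeeping illustration only: the function name `unet_shapes`, the use of "same"-padded convolutions, the base width of 64 channels, and the depth of four levels are assumptions made for this example and are not taken from the reviewed studies.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through a UNet-like network.

    Assumes 'same'-padded convolutions, 2x2 max pooling in the encoder,
    and 2x up-convolution in the decoder (illustrative assumptions).
    """
    encoder = []
    channels, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((channels, h, w))  # features kept for the skip connection
        h, w = h // 2, w // 2             # max pooling halves the spatial size
        channels *= 2                     # the next level doubles the channels
    bridge = (channels, h, w)             # bridge network (bottleneck)

    decoder = []
    for skip_channels, _, _ in reversed(encoder):
        h, w = h * 2, w * 2               # up-convolution doubles the spatial size
        channels //= 2                    # and halves the channels
        decoder.append({
            "concat": (channels + skip_channels, h, w),  # skip features appended
            "out": (channels, h, w),                     # after the decoder convolutions
        })
    return encoder, bridge, decoder


encoder, bridge, decoder = unet_shapes(256, 256)
print(bridge)                # (1024, 16, 16)
print(decoder[-1]["out"])    # (64, 256, 256)
```

The decoder output returns to the input resolution, and each decoder stage sees both the up-sampled features and the encoder features of matching size, which is the role the text assigns to the skip connection.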

C. SUPERIOR UNET TYPES - A SPECIAL NOTE
The most widely adopted UNet class observed in our study was sUNet, which evolved from cUNet through added variations. These sUNets evolved based on the applications of the individual studies. We have taken special care to categorize sUNet into eight distinct types that integrate concepts such as (i) scales (sUNet.Scale) [57], (ii) parallel connection of convolutions (sUNet.Par) [57], (iii) cascading (or tandem connection) of convolutions (sUNet.Cascade) [60], (iv) integration of probability maps for boundary extraction (sUNet.Bndy) [65], (v) tailoring of the fundamental cUNet by residual network (ResNet) models (sUNet.Res) [59], [70], [75], [76], (vi) introducing a feedback system to improve cUNet performance (sUNet.Feed) [58], (vii) deriving the contextual encoder network information during the down-sampling process (sUNet.Context) [74], (viii) change in dimensionality from 2-D to 3-D (sUNet.Dim) [59], [60], and (ix) adjustment of the loss function while up-sampling during the reconstruction process (sUNet.Loss) [76]. The components of UNet that were changed are the encoder (E), decoder (D), skip connection (SC), bridge network (BgN), and loss function (LF). The sUNet tree with variations in the E, D, SC, and BgN building blocks is displayed in Table 2 for the vascular and non-vascular frameworks. For sUNet, the maximum variation is in the encoder (E) component for both vascular and non-vascular applications. A more detailed analysis of vascular vs. non-vascular will be presented in section VI.

IV. ANALYZING UNET COMPONENTS: A MICROSCOPIC LOOK
It is vital to understand the "components of the UNet architecture", which are responsible for processing the image data for the objective of either (i) segmentation (S) of medical organs or (ii) joint segmentation and classification (JSC) of the disease. Each component of the UNet architecture has a unique role in handling the complex nature of the image data. These UNet components are used either independently or jointly to meet the objectives effectively. Thus, we have divided the variations of the UNet architecture into six component types, namely (i) encoder; (ii) decoder; (iii) skip connection; (iv) bridge network; (v) loss function; and (vi) miscellaneous UNet design. The miscellaneous type covers designs in which the components are used in their entirety or in which several components of UNet are altered together. Note that each component has its own function in handling the shape, position, size, and scale of objects in the image domain. Table 3 presents in-depth coverage of the variations for each of the components of the UNet, which are now discussed below.
The loss function can be described mathematically as given in Eq. 1, where $\alpha_{BCE}$ represents the BCE loss function, $a_i$ represents the classifier's probability utilized in the AI model, $x_i$ represents the input gold standard label 1, and $(1 - x_i)$ represents the gold standard label 0.
$$\alpha_{BCE} = -\left[ x_i \times \log a_i + (1 - x_i) \times \log (1 - a_i) \right] \quad (1)$$
Here × represents the product of the two terms. The Dice loss is named after the Dice-Sørensen coefficient, a statistic developed in the 1940s to evaluate the similarity between two samples. When X is the predicted segmentation of the input image and Y is the target or ground truth image, the Dice loss employed in this manuscript can be represented as given in Eq. 2.
$$\text{Dice loss} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|} \quad (2)$$
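The two loss functions described above can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions of binary cross-entropy and Dice loss on flat binary masks; the function names, the clamping constant `eps`, and the convention for two empty masks are assumptions for this example, not details from the reviewed studies.

```python
import math


def bce_loss(x, a, eps=1e-12):
    """Binary cross-entropy for one gold standard label x in {0, 1}
    and one predicted probability a."""
    a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(x * math.log(a) + (1 - x) * math.log(1 - a))


def dice_loss(pred_mask, target_mask):
    """Dice loss, 1 - 2|X ∩ Y| / (|X| + |Y|), for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    if total == 0:
        return 0.0  # both masks empty: treated as perfect overlap (a convention)
    return 1.0 - 2.0 * intersection / total


# A confident correct prediction costs less than an uncertain one.
print(bce_loss(1, 0.9) < bce_loss(1, 0.5))  # True
# One overlapping pixel out of |X| = 2 and |Y| = 1 gives 1 - 2/3.
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))
```

In practice the per-pixel BCE values are averaged over the image, and the two losses are often combined; the weighting between them is a design choice that varies across studies.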
The Inception block is another kind of modification of the UNet architecture; it contains various convolutional and pooling layers stacked together, thereby improving the results and reducing computation costs [164]. Inception networks have improved gradually through newer variants and have outperformed other structures (Figure 7 (a)). The other modification is the transpose convolution [46], [64], [71], [113], [115], which is the opposite of convolution: it up-samples a feature map rather than down-sampling it.
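The relationship between convolution and transpose convolution is easiest to see in one dimension. The sketch below is an illustrative pure-Python example (the function names and the toy kernel are assumptions, not taken from the cited works): a strided convolution shortens the signal, while a transpose convolution with the same kernel and stride restores the original length, although not the original values.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (unpadded) 1-D convolution without kernel flipping,
    as commonly used in deep learning."""
    k = len(kernel)
    n_out = (len(signal) - k) // stride + 1
    return [
        sum(signal[i * stride + j] * kernel[j] for j in range(k))
        for i in range(n_out)
    ]


def conv_transpose1d(signal, kernel, stride=1):
    """1-D transpose convolution: each input value scatters a scaled copy
    of the kernel into the output, so the output is longer than the input."""
    k = len(kernel)
    out = [0.0] * ((len(signal) - 1) * stride + k)
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


down = conv1d([1, 2, 3, 4], [1, 1], stride=2)  # length 4 -> length 2
up = conv_transpose1d(down, [1, 1], stride=2)  # length 2 -> length 4
print(down, up)
```

In a UNet decoder the kernel weights of the transpose convolution are learned, which lets the network choose how to spread each low-resolution feature over the up-sampled grid.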

V. ADVANCED UNET TYPES
Due to challenges in time complexity and the large number of parameters in deep learning models, recent advances have addressed these issues. We have characterized them as "advanced UNet types". The three most important advanced UNets that have prominently dominated the UNet industry are Half-UNet, AM-UNet, and Efficient-UNet, discussed in V.A, V.B, and V.C, respectively.

A. HALF-UNET
Half-UNet was proposed by Lu et al. [165] and is built on three key innovations, all geared towards reducing complexity while retaining feature extraction performance comparable to the original UNet. These three ideas were conceptually labeled as (i) unification of channels, i.e., the number of channels in each layer is the same, together with the reduction of the decoders to a single-bar decoder, i.e., optimization of the architecture while ensuring performance approximately equivalent to cUNet; (ii) full-scale feature fusion, in which feature maps of different scales obtained from the contracting path (encoders) are fused using an "addition operation" after upsampling; and (iii) the Ghost module for reducing the complexity of convolution. The need for channel unification arose from the complexity of the UNet and UNet3+ models. In these models, the number of channels is doubled after every downsampling step. In UNet3+, because of the unequal number of channels, a 3 × 3 convolution operation is added after every max pool operation to unify the channel numbers, hence increasing the required number of parameters and floating-point operations (FLOPs). In contrast, in Half-UNet, the number of channels of all feature maps is unified, which reduces the number of filters in the convolution operation and facilitates feature fusion in the decoder, because the decoder does not need the 3 × 3 convolution. This can be seen in Figure 8, where the decoder layers are replaced by a single stacked decoder, which receives its input from the bottleneck and subsequent inputs via skip connections. This reduces the complexity of Half-UNet. The second important feature of Half-UNet is the full-scale feature fusion. Note that the original UNet and UNet3+ use the concatenation operation for feature fusion.
FIGURE 7. (a) Inception block [97], [102]. (b) Transpose convolution block [66], [88], [95], [137], [139].
The concatenation operation is a good choice because it provides better results, but it also takes more memory and time, and hence the complexity increases. He et al. [166] proposed ResNet, which uses the addition operation as a feature fusion method. In this operation, the authors perform identity mapping and add its output to the output of the stacked layers. This operation does not increase the dimension of the image but increases the information in each dimension. It also does not increase the number of parameters and, as a result, does not increase complexity. This concept is used in Half-UNet and is shown in Figure 9, where the ⊕ sign signifies the merging of the skip connections, which are fused in a single decoder. The third architectural feature of Half-UNet is the Ghost module design (Figure 10 (a) and Figure 10 (b)). The idea behind it is the reduction of convolution complexity (Figure 10 (a)). Deep convolutional neural networks [166], [167], [168] often consist of many convolutions, which results in massive computational costs. Although recent works such as MobileNet [169], [170] and ShuffleNet [171] have introduced depth-wise convolution or the shuffle operation to build efficient CNNs using smaller convolution filters (fewer floating-point operations), the remaining 1 × 1 convolution layers still occupy considerable memory and FLOPs. The idea behind the Ghost module is to generate more feature maps while using cheap operations, i.e., a smaller number of operations. The parameters and FLOPs of a convolution operation can be calculated as:
$$\text{Params} = (k * k * C_{in} + 1) * C_{out}$$
$$\text{FLOPs} = 2 * k * k * C_{in} * C_{out} * H_{out} * W_{out}$$
where $k$ is the kernel size, $C_{in}$ is the number of input channels, $C_{out}$ is the number of output channels, $H_{out}$ is the height of the output maps, $W_{out}$ is the width of the output maps, and $*$ represents the arithmetic product.
Han et al. [172] proposed the Ghost module to generate more feature maps while using cheap operations. In the Ghost module (s = 2, where s represents the reciprocal of the proportion of intrinsic feature maps), half of the feature maps are generated by a convolution operation and the other half by depth-wise separable convolution; the two halves are finally concatenated to form an output of the same dimension as the input. For example, if the image size is 128 × 128, the convolution is 3 × 3, and both the input and output channels are 64, then the required number of parameters and FLOPs are 36.92K and 1.208G, respectively, whereas with the Ghost module they are only 18.78K and 0.61G. Therefore, the Ghost module is used in Half-UNet.
FIGURE 9. Half-UNet architecture [165].
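The worked example above can be checked with a few lines of Python. The sketch below is a hedged reconstruction: it assumes one bias term per output channel in the parameter count, a factor of two in the FLOPs count (one multiplication and one addition per weight), and a depth-wise 3 × 3 convolution as the cheap operation of the Ghost module. These assumptions and all function names are illustrative; under them, a standard convolution comes to roughly 36.9K parameters and 1.21 GFLOPs, and the Ghost module with s = 2 to roughly 18.8K parameters and 0.61 GFLOPs.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel (assumed convention)."""
    return (k * k * c_in + 1) * c_out


def conv_flops(k, c_in, c_out, h_out, w_out):
    """Multiply and add counted separately, hence the factor 2 (assumed)."""
    return 2 * k * k * c_in * c_out * h_out * w_out


def ghost_cost(k, c_in, c_out, h_out, w_out, s=2):
    """Cost of a Ghost module: an ordinary convolution produces c_out / s
    intrinsic maps, and a depth-wise convolution (one input channel per
    filter) produces the remaining maps."""
    intrinsic = c_out // s
    cheap = c_out - intrinsic
    params = conv_params(k, c_in, intrinsic) + conv_params(k, 1, cheap)
    flops = (conv_flops(k, c_in, intrinsic, h_out, w_out)
             + conv_flops(k, 1, cheap, h_out, w_out))
    return params, flops


std_params = conv_params(3, 64, 64)
std_flops = conv_flops(3, 64, 64, 128, 128)
ghost_params, ghost_flops = ghost_cost(3, 64, 64, 128, 128)

print(std_params, std_flops)      # 36928 1207959552
print(ghost_params, ghost_flops)  # 18784 613416960
```

Both costs drop by roughly half, as expected for s = 2, because the depth-wise branch is almost free compared with the ordinary convolution.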
Advantage and Application of Half-UNet: As seen in the previous discussions, the variants of UNet improved model performance without altering the U-shaped model architecture. In Half-UNet, the encoder and decoder are simplified. Half-UNet takes advantage of the unification of channel numbers, full-scale feature fusion, and the Ghost module. The authors compared Half-UNet with UNet and its variants and obtained similar segmentation accuracy, while the parameters and FLOPs were reduced by 98.6% and 81.8%, respectively, compared to UNet.
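The full-scale feature fusion by addition used in Half-UNet can be illustrated with a small pure-Python sketch. It assumes single-channel square feature maps whose sizes differ by integer factors and uses nearest-neighbour upsampling; these simplifications and the function names are assumptions for illustration, not the authors' implementation.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_by_addition(maps):
    """Upsample every map to the size of the largest one, then add them
    element-wise. The result has the same shape as one map, whereas
    concatenation would stack the maps and multiply the channel count."""
    target = max(len(m) for m in maps)
    resized = [upsample_nearest(m, target // len(m)) for m in maps]
    return [
        [sum(m[r][c] for m in resized) for c in range(target)]
        for r in range(target)
    ]


full_scale = [[1, 1], [1, 1]]  # map from a shallow encoder level
half_scale = [[2]]             # map from a deeper level, half the size
print(fuse_by_addition([full_scale, half_scale]))  # [[3, 3], [3, 3]]
```

Because addition requires every map to have the same number of channels, this fusion only works once the channel numbers are unified, which ties the two Half-UNet ideas together.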

B. AM-UNET
This class of advanced UNet was again designed to simplify the complexity of the UNet paradigm. AM-UNet is a lightweight and scalable solution that has achieved state-of-the-art accuracy. It reduces the complexity and the time required for segmentation. Albishri et al. [173], [174] proposed an automatic optimized UNet-based 3D segmentation model named automated mini-UNet (AM-UNet), shown in Figure 11, designed as an end-to-end process for human brain claustrum (CL) segmentation. CL segmentation is challenging due to its thin, sheet-like structure, the heterogeneity of its image modalities and formats, imperfect labels, and data imbalance. In AM-UNet, the authors reduced the model size to half by removing the last two layers of the original UNet (five vs. three) and expanding the bottleneck layer of the segmentation model (Figure 11). The system consisted of three steps: preprocessing, segmentation, and postprocessing. In the preprocessing step, the 3D MRI volumes are converted into a series of 2D slices, and regions of interest are created. In the second step, the region of interest is selected from the chosen 2D slices and the segmentation of the CL is conducted. In the third step, data augmentation and normalization are applied to the images for 3D reconstruction. The postprocessing step was used to ensure high prediction accuracy for 3D claustrum segmentation.
Applications: The authors predicted that AM-UNet would be useful in vascular (e.g., brain and cardiology) and non-vascular (e.g., lung, liver, and kidney) medical image segmentation because it can segment structures even when they are very thin or heterogeneous and when the data have imperfect labels and class imbalance.

C. EFFICIENT-UNET
Another advanced UNet, in which the encoders were drastically altered to reduce the computational burden, is Efficient (Eff)-UNet. This innovation was motivated by unstructured environments, such as Indian road and driving conditions, for which such segmentation paradigms are well suited. It was reported to be far superior to conventional DL and CNN models for semantic scene segmentation.
Baheti et al. [175] proposed an architecture called Efficient (Eff)-UNet that combines the compound-scaled Eff-Net as the encoder for feature extraction with a decoder that has the same function as in the original cUNet for reconstructing the fine-grained segmentation map. The combination of high-level feature information and low-level spatial information was important for precise segmentation.
Tan et al. [176] proposed a novel compound scaling method that uniformly scales the network depth, width, and resolution for improved performance based on a fixed set of scaling factors. A new architecture called EfficientNetB0 was designed initially and scaled up to generate a family of Eff-Nets by the compound scaling method. There are eight variants of the EfficientNets, namely EfficientNetB0 to EfficientNetB7. Scaling the network systematically improves model performance by balancing the compound coefficients of the architecture width, depth, and image resolution. The basic building block of the Eff-Net architecture is the mobile inverted bottleneck convolution (MBConv) [170] with squeeze and excitation (SE) optimization, shown in Figure 12. The shortcut connections in MBConv link the thin bottleneck layers and are based on an inverted residual structure. Lightweight depth-wise convolutions are used in the intermediate expansion layer as a source of nonlinearity to filter features.
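The compound scaling rule can be sketched as follows. The base coefficients 1.2, 1.1, and 1.15 for depth, width, and resolution are the values reported in the EfficientNet paper rather than in this review, and the function names are illustrative; the released B1 to B7 models use rounded, hand-set multipliers, so this is a sketch of the principle and not a recipe for reproducing them.

```python
# Base coefficients reported for EfficientNet compound scaling.
# They are not stated in this review and are included as an assumption.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_growth(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 3), round(width, 3), round(resolution, 3),
          round(flops_growth(phi), 2))
```

With these coefficients the cost roughly doubles for each unit increase of the compound coefficient, which is what makes a single scaling knob practical.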
The best-performing model, EfficientNetB7, outperforms other state-of-the-art CNNs in terms of accuracy on ImageNet. It is also 8.4× smaller and 6.1× faster than the best existing CNN [176]. The network architecture of EfficientNetB7 is shown in Figure 13. It can be divided into seven blocks based on filter size, striding, and the number of channels. The authors of Eff-UNet used EfficientNetB5 and EfficientNetB7 as the encoder with a UNet decoder and achieved the best performance with EfficientNetB7. They proposed using EfficientNet as the encoder in the contracting path instead of a conventional set of convolution layers. The decoder module is similar to that of the original UNet. The Eff-UNet is shown in Figure 14. The number of levels, the resolution and number of channels of each feature map, and the detailed architecture of the blocks in the encoder can be found in Figure 13.
Advantage and Application of Efficient-UNet: The main advantage of Eff-UNet is its ability to offer strong semantic segmentation in an unstructured environment. Both vascular and non-vascular medical images are sometimes unstructured, so Eff-UNet can be useful in both cases.

VI. UNDERSTANDING VASCULAR AND NON-VASCULAR APPLICATIONS
One of the innovations of this study is to compare and contrast UNet architecture in vascular vs. non-vascular applications.
We have attempted this comparison in four different UNet classes (sUNet, acUNet, eUNet, and hUNet). Further, this section also presents the similarities and differences between the vascular vs. non-vascular architectures. All the above analysis is discussed in graphical representation format. Sections VI.A-VI.D discuss vascular vs. non-vascular architectures. Section VI.E presents the UNet characteristics for vascular vs. non-vascular applications. Section VI.F presents the key for segmentation challenges and architecture solutions for vascular and non-vascular paradigm. Finally, the section concludes with similarities and differences between vascular and non-vascular architectures.
The sUNet architecture is designed to obtain better segmentation results. The sUNet architecture for the non-vascular paradigm is presented by Pezzano et al. [57] (Figure 15 (a)).
A multiple convolution block (MCL) is added along with the max pooling layer, which is a modification of the encoder layer of the UNet system (shown in light blue). It has four layers, of which three mainly convolve the input image, while the fourth copies the input using the identity function. All four layers are then concatenated (represented as "cat"), up-sampled (represented as "up"), and convolved once (represented as "conv") (Figure 15 (a)). The number of filters is doubled in each layer. The key factors of this architecture and study are: (i) use of a loss function with a parameter for maximizing sensitivity; (ii) addition of MCL; (iii) a mask calculation formula used for refining the input by removing the unitary values only; (iv) a post-processing procedure used for reducing false positives and increasing specificity; (v) two additional levels of depth in the network; and (vi) an extensive validation.
FIGURE 13. Architecture of EfficientNetB7 with MBConv as basic building blocks. The overall architecture can be divided into seven blocks, which are shown in different colors. The basic building block of the network is MBConv (mobile inverted bottleneck convolution). Each MBConvX block is shown with the corresponding filter size, and X = 1 and X = 6 denote the standard ReLU and ReLU6 activation functions, respectively [175]. Copyright 2020, IEEE.

2) VASCULAR
The sUNet architecture for the retinal-based vascular application is shown in Figure 15 (b). Chen et al. [66] introduced the patches convolution attention-based transformer UNet (PCAT-UNet) architecture. This architecture has a PCAT block (Figure 15 (c)) in the encoder for local feature extraction, along with feature grouping attention modules (FGAM) (Figure 15 (d)) for global information extraction, to obtain more detailed feature maps with multiscale characteristics. Note that in the encoder the size of the image is halved at each level, while in the decoder it is doubled. The architecture also involves attention between different patches and pixels, which in turn reduces the computation and increases the input resolution. The encoder extracts spatial and semantic information through down-sampling. A dropout block is added to suppress over-fitting during training. Overall, the architecture improves segmentation sensitivity and achieves good segmentation performance.

B. AC-UNET ARCHITECTURE

1) NON-VASCULAR
The acUNet architecture is essentially the addition of an attention channel block as a fundamental block in any of the parts, such as the encoder, decoder, or skip connection. It is used in a variety of applications, including liver [71], tumor [71], lung [50], and neuron segmentation [177]. Generally, it is placed in the skip connection to obtain a better transfer of the features extracted in the encoder to the decoder layer. Figure 16 (a) shows the fundamental architecture of the acUNet. Wang et al. [127] added a convolution block attention module (CBAM) to the architecture (light blue). The addition was made after each convolution layer and up-sampling layer, and it provides a better segmentation effect. The feature maps were generated at the last layer and then max pooled and average pooled to obtain spatial context descriptors. These descriptors were fed into a shared network called a multi-layer perceptron (MLP), and the final feature vectors were merged using a summation process. The "Res connect" (depicted by solid lines), also called a jump connection, is used in each convolution layer and up-sampling layer (for the same dimensions) and provides better segmentation results (Figure 16 (a)).
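The pooling, shared MLP, and summation steps described above correspond to the channel attention branch of CBAM, which can be sketched in pure Python. The sigmoid gating at the end follows the published CBAM design and is not stated in the text; the function names and the toy MLP weights are hypothetical and chosen only to make the example easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shared_mlp(vector, w1, w2):
    """Two-layer perceptron with ReLU; the same weights are applied to
    both pooled descriptors, which is what makes it 'shared'."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature_maps, w1, w2):
    """feature_maps: list of channels, each a flat list of pixel values.
    Returns the re-weighted channels and the attention weights."""
    max_desc = [max(channel) for channel in feature_maps]                 # max pooling
    avg_desc = [sum(channel) / len(channel) for channel in feature_maps]  # average pooling
    merged = [
        m + a  # element-wise summation of the two MLP outputs
        for m, a in zip(shared_mlp(max_desc, w1, w2), shared_mlp(avg_desc, w1, w2))
    ]
    weights = [sigmoid(z) for z in merged]
    scaled = [[w * v for v in channel] for w, channel in zip(weights, feature_maps)]
    return scaled, weights


# Toy example: two channels, one hidden unit, hypothetical weights.
features = [[1.0, 3.0], [0.0, 0.0]]
w1 = [[1.0, 0.0]]    # hidden unit reads only the first channel
w2 = [[1.0], [0.0]]  # only the first output channel uses the hidden unit
scaled, weights = channel_attention(features, w1, w2)
print([round(w, 3) for w in weights])
```

The informative first channel receives a weight close to one, while the empty second channel stays at the neutral value of 0.5, which shows how the module emphasizes some channels over others.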

2) VASCULAR
In the vascular acUNet architecture (Figure 16 (b)), the self-attention mechanism in a CNN-Transformer hybrid network was implemented by Shen et al. [133]. It helps the system learn the correlation between any two pixel-wise feature maps. In addition, a residual attention block (bottom left) was constructed to improve feature extraction. The third block added was the squeeze-excitation (SE) block, which builds a more efficient multi-head attention process by focusing on effective weights and neglecting invalid weights of the attention heads. This process is carried out by the SE block in the SE transformers. The SE transformer is the length of value vectors in the transformer layers. The SE transformer massively decreases the weights of weakly correlated embedding vectors. It helps in distinguishing strong and weak correlation vectors, and hence helps in focusing on vascular connectivity image patches (Figure 16 (b)). The detailed structure of the convolution block attention module (CBAM) is shown in Figure 16 (c).

C. H-UNET ARCHITECTURE
1) NON-VASCULAR
The hUNet architecture has been applied to liver [131] and brain tumor segmentation [131] and combines two UNet types. For the non-vascular domain, the architecture shown here, labelled RA-UNet, has two attention modules, namely attention module 1 and attention module 2 (Figure 17 (top)) [131]. It was used to generate segmentation maps for liver and brain tumors [131].

2) VASCULAR
The hUNet for the vascular paradigm uses SegNet-UNet+, which is the combination of SegNet and UNet+ (Figure 17 (bottom)) [53], [141], and VGG-UNet [178] and ResNet-UNet [179]. The same input images were given to SegNet and UNet+ separately and the outputs were obtained. Finally, the outputs from both UNets were merged and supplied to the SoftMax layer of the overall system.
Such a system has application for the cardiovascular field such as carotid segmentation for carotid ultrasound. Recently, a non-UNet based segmentation paradigm was attempted using an encoder-decoder combination [98], [180]. There have been non-AI based methods, so-called conventional strategies based on the scale-space paradigm [181], [182], [183], [184].
The eUNet stands for ensemble UNet architecture. It combines two different UNet types, two processes, and more than one classifier. Here, the non-vascular eUNet paradigm in Figure 18 (top) [146] addresses applications such as kidney segmentation and renal mass localization. Two UNets were used: the first was trained for kidney segmentation and the second for renal mass localization or identification (Figure 18 (top)).

2) VASCULAR UNET DESIGNS CHARACTERISTICS
David et al. [122] designed a UNet system that takes differently scaled image patches as input to each contracting layer, the idea being to learn more multiscale information. To obtain more spatial features, the authors used dense blocks. The color retinal images were first preprocessed to create enhanced grey images. The image patches around the vessel pixels were then retrieved and reused to improve the UNet architecture. Du et al. [102] extracted features from an input image using inception multiscale convolution and dense block convolution, respectively, and then fused these features for use in the subsequent network. The inception network enhanced the ability to extract features of the thin vessels. The DenseNet was introduced to enhance the reuse of extracted features through dense connectivity. It effectively reduced the vanishing gradient problem, enhanced the feature transfer, and reduced the loss of feature information. In another vascular network (SA-UNet), Guo et al. [139] used structured dropout convolution to avoid overfitting, together with spatial attention. The spatial attention module (SAM) was introduced in the bottleneck as a part of the convolutional block attention module for classification and detection. Huang et al. [103] introduced an SE block to promote useful features and suppress less valuable ones, and also introduced dropout to avoid overfitting. Jin et al. [186] proposed a triple attention UNet (3AUNet) that combines spatial attention, channel attention, and context attention. Spatial attention allows the segmentation network to find the blood vessel region that needs attention, thereby suppressing noise. Channel attention makes the expression of features more diverse and highlights the feature channels with key information, while context attention helps in guiding the attention.
Xiao et al. [105] introduced the ResNet weighted attention mechanism so that the model pays attention only to the target ROI area and discards the irrelevant noisy background. The authors introduced the contrast limited adaptive histogram equalization (CLAHE) operation as a preprocessing step to enhance the image contrast. Zhang et al. [106] used multiscale pyramid blocks and a deep supervision concept. Pyramid scale aggregation blocks (PSAB) were used in both the encoder and decoder sections to reduce the loss of information during scaling. For the PSABs in the encoder, scaled input images were added as extra inputs. For the PSABs in the decoder, scaled intermediate outputs were supervised by the scaled segmentation labels. He et al. [187] proposed a semi-supervised 3D fine renal artery segmentation framework, DPA-DenseBiasNet, which combines deep prior anatomy (DPA), a dense biased network (DenseBiasNet), and hard region adaptation loss (HRA). Through the dense biased connection, the DenseBiasNet fuses multi-receptive field and multi-resolution feature maps to handle large intra-scale changes. This dense biased connection also provides a dense information flow and a dense gradient flow, so that the training is accelerated and the accuracy is enhanced. DPA features extracted from an autoencoder (AE) are embedded in DenseBiasNet to cope with the challenge of large inter-anatomy variation and thin structures.

3) NON-VASCULAR UNET DESIGNS CHARACTERISTICS
Chahal et al. [88] proposed an automatic segmentation model based on UNet and Xception for the prostate regions in MRI scans. The authors used one convolution and 12 separable convolutions in the contracting path. Separable convolution gives similar performance while being much more efficient, using far fewer parameters and fewer floating-point operations (FLOPs). In the decoder phase, the authors used residual and transpose convolution. Chen et al. [89] proposed a 2D bridge network with a combination of ReLU and e-ReLU functions for deeper networks. To bridge the networks, the authors used a concatenation operation that guarantees the information flow and better feature fusion by merging the features at different encoder and decoder levels. In the skip connection, the authors used addition to avoid redundancy and to combine low-level features with high-level semantic features. The combination of ReLU and e-ReLU functions was introduced to improve segmentation performance. He et al. [83] proposed the HF-UNet, which had two complementary branches for two tasks, with a novel attention-based task consistency learning block to communicate at each level between the two decoding branches. Therefore, HF-UNet could learn the shared representations hierarchically for different tasks while simultaneously preserving the specificities of the learned representations for each task. Liu et al. [92] proposed an improved 2D UNet model that integrated the squeeze-and-excitation (SE) layer for prostate cancer segmentation. The SE layer was used to extract only the important features. A dropout block was used to avoid overfitting. Machireddy proposed an attention-based UNet for prostate segmentation. The attention mechanism preserves only the regions of the feature maps relevant for malignancy detection.
The attention mechanism was incorporated in the form of attention gates integrated into the UNet architecture before feature concatenation. The attention gate takes input from the encoder via skip connections, together with information from the layer just below. A dropout rate of 50% was also introduced to avoid overfitting. Xiangxiang et al. [84] proposed a UNet with eight layers and a residual block for prostate segmentation. Residual blocks were used to solve the problem of degradation. Vacacela et al. [188] proposed a prostate segmentation method based on two UNets, one global and one local. The global UNet segmented the whole prostate gland, while the local UNet segmented the central gland. Umapathy et al. [189] proposed a cascaded multi-residual UNet (MRes-UNet) for prostate segmentation. The first MRes-UNet predicts the mask for the prostate gland. The detected prostate mask was concatenated to the input image. The second MRes-UNet CNN used this multi-channel data to predict the central gland within the prostate. The residual block was introduced to avoid the problem of vanishing gradients. In the skip connection, instead of using concatenation, the authors introduced feature addition to avoid redundancies in feature maps.
Zhang et al. [85] proposed Z-Net, which contained five pairs of Z-block and decoder Z-block with different sizes and numbers of feature maps assembled in a way similar to that of UNet. The proposed architecture can capture more multilevel features by using concatenation and dense connectivity. Zhu et al. [93] proposed a cascading UNet for prostate segmentation.
Step 1 consisted of segmentation of the whole prostate gland (WPG), while step 2 consisted of another identical network to segment the peripheral zone (PZ). According to the segmented result in step 1, an image that contains the WPG area was passed as an input to the next UNet (as part of the step 2), which segmented the PZ area. Based on the above discussions, we conclude the following similarities and dissimilarities broadly.
Zeng et al. [190] proposed a 3D UNet with multi-level deep supervision, because the 3D UNet allows segmentation of 3D volumes with high accuracy and performance, and multi-level deep supervision mitigates the potential vanishing gradient problem during training.

5) DIFFERENCES BETWEEN VASCULAR AND NONVASCULAR UNET PARADIGMS
(i) It is observed that in the retinal vascular case, multiscale input was preferred [102], [122], but in the non-vascular prostate case, multiscale input was not preferred (Figure 23, Figure 24, Figure 25, Figure 26); (ii) It is observed that the bottleneck used different mechanisms in the vascular case: David et al. [122] introduced a dense block in the bottleneck, Guo et al. [139] introduced spatial attention, Jin et al. introduced a context aggregation block, and Zhang et al. [106] introduced the PSAB block. In the non-vascular case, no such change to the bottleneck was observed (Figure 21, Figure 22, Figure 23, and Figure 24). Visualizations of UNet classification results in the vascular and non-vascular applications are shown in Figure 27 and Figure 28.

6) SEGMENTATION CHALLENGES: ARCHITECTURE SOLUTIONS-KEY
a: VASCULAR PROBLEMS AND CORRESPONDING UNET VARIATIONS AS A SOLUTION
with the ''reference type''. We conclude that when an image has large intra-scale changes, large inter-anatomy variation, and thin structures, a dense network can be used [102], [106], [114], [122], [187]. This is because a dense network fuses multi-receptive field and multi-resolution feature maps. If an image contains a lot of noise, an attention-based UNet can be used because the attention gate selects the relevant part and suppresses the irrelevant part [49], [102], [139], [186]. If the image contains thin blood vessels, a UNet with the inception block can be used because the inception block applies convolutions at multiple scales or sizes, so it extracts more features while using fewer parameters [102]. If the image contains thin blood vessels, a UNet with multiscale input can also be used because multiscale input provides a way to learn more multiscale data [102], [106], [122]. If the image has low contrast, a residual block can be used to extract more features. This is because the residual block helps in deepening the network and therefore gives better feature extraction [108]. At a certain depth, however, saturation occurs, and adding further layers causes degradation due to gradient loss. The residual UNet [53] overcomes this problem and performs well at low contrast because it adds skip connections. Through a skip connection, the feature map is passed from a previous layer into a later layer. This preserves the feature map better and improves performance when going deeper into the network.
Li et al. [192] proposed a deep learning framework based on a low-order residual network to detect low-contrast defects. In particular, a low-order feature extraction module is designed to effectively extract target features with low contrast and small size. The size of the convolution kernel directly affects the receptive field of the model. The kernels used in AlexNet (2012) were very large, for example 11 × 11 and 5 × 5. At first, it was considered that the receptive field increases with the enlargement of the convolution kernel, so that more picture information and better features can be acquired. However, large convolution kernels lead to a large increase in computational cost, which hinders increasing the model depth and reduces computational performance. Therefore, in VGG and Inception networks, a stack of two 3 × 3 convolution kernels is preferred to one 5 × 5 convolution kernel, and the parameters are reduced from 26 (5 × 5 × 1 + 1) to 19 (3 × 3 × 2 + 1). Thus, 3 × 3 kernels are widely used in various models. A 1 × 1 convolution kernel covers only a single pixel and adds no neighborhood context, so it is generally not used for feature extraction. However, for low-contrast features, a 3 × 3 convolution kernel may inhibit the expression of some features at the beginning of training. Therefore, convolution kernels of size 3 × 3 are used in the feature extraction part, while low-order residual blocks with convolution kernels of size 1 × 1 are used to enrich the features to be extracted. Although a 1 × 1 kernel sees no neighboring pixels, it can effectively retain the feature information of a low-contrast defect target that is only one pixel in size, without disturbance from the neighborhood pixels. Table 7 below shows the challenges in the segmentation of non-vascular type and the corresponding UNet-based solution.
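The kernel-size arithmetic quoted above can be checked with a short sketch. It follows the counting used in the text (single-channel kernels plus one bias term) and adds the standard receptive-field formula for stacked stride-1 convolutions. The function names are illustrative only and do not come from Li et al. [192].

```python
def stacked_kernel_weights(kernel_size: int, num_layers: int = 1) -> int:
    """Weights in `num_layers` stacked k x k single-channel kernels (bias excluded)."""
    return kernel_size * kernel_size * num_layers


def stacked_receptive_field(kernel_size: int, num_layers: int = 1) -> int:
    """Side length of the receptive field of stacked stride-1 k x k convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# Counting used in the text: kernel weights plus one bias term.
params_one_5x5 = stacked_kernel_weights(5, 1) + 1  # 5 * 5 * 1 + 1
params_two_3x3 = stacked_kernel_weights(3, 2) + 1  # 3 * 3 * 2 + 1

print(params_one_5x5, params_two_3x3)
print(stacked_receptive_field(5, 1), stacked_receptive_field(3, 2))
print(stacked_receptive_field(1, 1))  # a 1 x 1 kernel sees only the pixel itself
```

Two stacked 3 × 3 kernels reach the same 5 × 5 receptive field as a single 5 × 5 kernel with fewer weights, which is the trade-off the paragraph describes. Under the convention that counts the pixel itself, a 1 × 1 kernel has a receptive field of one pixel, that is, no neighborhood context.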
We conclude that if variations in terms of dynamic range, voxel size, position, field-of-view as well as anatomical appearance are present, then more than one UNet [85], [89], [91], [188] with dense block is suitable. For example, the ZNet (Zhang et al. [85]) is capable of capturing more features in a multi-level fashion by using concatenation and dense connection. Attention-based UNet [83], [92] was suitable when image contains large background noise as the attention gate is capable of extracting relevant parts and ignoring irrelevant ones. When we need segmentation of the prostate gland and peripheral zone in one pass, then we can use cascaded MRes-UNet [189], as the first MRes-UNet predicts the mask for the prostate gland. The detected prostate mask is then concatenated to the input image. The second MRes-UNet CNN uses this multi-channel data to predict the central gland within the prostate. The peripheral zone is identified using the central gland prediction as an exclusion mask within the prostate prediction.
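The mask handling in the cascaded MRes-UNet description can be illustrated with a minimal NumPy sketch. It covers only the two data-flow steps named in the text: stacking the first-stage prostate mask with the image as a multi-channel input, and using the central gland prediction as an exclusion mask. The networks themselves are not modelled, and the function names and toy arrays are hypothetical.

```python
import numpy as np


def second_stage_input(image, prostate_mask):
    """Stack the image and the first-stage prostate mask as two input channels."""
    image = np.asarray(image, dtype=np.float32)
    mask = np.asarray(prostate_mask, dtype=np.float32)
    return np.stack([image, mask], axis=0)  # shape (2, H, W)


def peripheral_zone_mask(prostate_mask, central_gland_mask):
    """Peripheral zone = prostate pixels that are not central gland pixels."""
    prostate = np.asarray(prostate_mask, dtype=bool)
    central = np.asarray(central_gland_mask, dtype=bool)
    return prostate & ~central


# Toy 4 x 4 example with hand-made masks (not real predictions).
image = np.arange(16, dtype=np.float32).reshape(4, 4)
prostate = np.zeros((4, 4), dtype=bool)
prostate[1:4, 1:4] = True          # 3 x 3 prostate region
central = np.zeros((4, 4), dtype=bool)
central[2, 2] = True               # single central gland pixel

stacked = second_stage_input(image, prostate)
peripheral = peripheral_zone_mask(prostate, central)
print(stacked.shape, int(peripheral.sum()))
```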

VII. EXPLAINABLE AI IN VASCULAR AND NON-VASCULAR PARADIGM
DL has dominated the field of image segmentation in both vascular and non-vascular areas. Our study has shown the role of five kinds of UNet in both categories, filled with innovative designs demonstrating superior performance against the conventional models. While the engineering mission of design and performance meets the objectives, the black-box nature of DL leaves unanswered ''Wh'' questions, such as what, why, or even how the DL systems performed and met the objectives. Such challenges are categorized as a subfield of AI, called ''explainable AI (XAI)'' [193], [194]. Several studies have been published in XAI, but they are limited in the field of vascular and non-vascular applications for segmentation utilizing UNet variations. The need for XAI is even more important when building relationships, correlations, or links between the quantified vascular segments and different kinds of clinical outcomes [16], [195], [197]. There are two reasons for this limited uptake: (a) XAI emerged less than seven years ago (2015), and (b) some of the tools, like Shapley Additive Explanations (SHAP) [198], [199] and UMAP [200], are not integrated with the DL packages that are typically adopted in the computer vision industry. The European general data protection regulation (GDPR) has elaborated on the role of fairness, privacy, transparency, and explainability in the DL paradigm [201]. Since XAI incorporates the feedback loop, the customized seven steps of DL are shown in Figure 30, consisting of DL training, quality assurance (QA), installation/deployment, prediction, cross-validation-based testing (A/B test), monitoring, and debugging. The few limited UNet-XAI systems are briefly summarized here [202], [203], [204]. The heatmaps produced by Grad-CAM have been used for XAI in several applications (Figures 31-32) [205], where the generated heatmaps are thresholded to compute the lesions, which are then compared against the gold standard [179], [202], [206], [207].

A. A NOTE ON EXPLAINABLE ARTIFICIAL INTELLIGENCE
Since DL applications have outperformed humans in many tasks, including picture and speech recognition and recommendation systems, they have attracted a lot of attention. These applications, however, are not reliable or comprehensible. DL models are frequently viewed as opaque, difficult-to-understand black boxes with complicated underlying mechanisms. People cannot trust them because they do not provide reasons for their choices or predictions. Moreover, depending on the application, errors made by artificial intelligence algorithms could be fatal. More specifically, a mistake in an autonomous vehicle's computer vision system could cause a collision, while in the medical field, patient lives depend on these choices. Explainable AI (XAI) enters the scene to address these problems. Machine learning models perform as a black box (Figure 33 (a)), i.e., the model only predicts the results but is not able to answer the ''wh family'' of questions, such as: why do you do that? why can I trust you? why not something else? when do you succeed? when do you fail? Figure 33 (b) compares a deep learning model and an explainable model with the help of an example. In the figure, we want to predict whether a particular object is a car or not. The deep learning model (a 2D convolutional neural network) gives a prediction score of 0.89 that the object is a car but does not explain why this is a car or how the prediction was made. The explainable model explains that the object has wheels and lights, and also presents the visual features obtained from the model, so that the user understands why the particular object is a car.

B. IMAGE SEGMENTATION USING UNET WITH XAI MODEL
Some deep learning models, such as GPU-Net [208] and CA-Net [113], attempt to explain their predicted outcomes; however, most UNet models still lack explanations. Chaterjee et al. [209] proposed a unified, flexible, and scalable interpretability and explainability pipeline named TorchEseGeta (Figure 34). The proposed architecture provides post-hoc interpretability and explainability methods, incorporates libraries related to interpretability and explainability such as LIME, SHAP, and TorchRay, and extends them to 2D and 3D deep learning models for images. The authors used the segmentation models from the DS6 paper [210], namely UNet, UNet-MSS (multi-scale supervision), and UNet-MSS with deformation. Vessel segmentation was chosen to evaluate the proposed architecture on segmentation models. Dasanayka et al. [211] proposed an architecture (Figure 35) for brain tumor analysis using MRI and whole slide images (WSI). The proposed architecture is divided into three steps. The first step was an MRI segmentation module in which a variational autoencoder (VAE) 3D UNet [212] was used. The input for this step was 3D MRI volumes and the output was segmented 3D MRI volumes. The second step was an MRI classification module, for which DenseNet was used because DenseNet classifies accurately with fewer parameters. For interpretability, Grad-CAM was included in this step. The third step was a WSI classification module. Feature extraction was carried out by a pre-trained ResNet50 model. The output of the ResNet50 model was a feature vector of size 1024 × 1 for each patch, which was then sent to the classification phase carried out by a model composed of densely connected layers. Melching et al. [213] proposed a model, ParallelNets, shown in Figure 36 (a). In the ParallelNets architecture, the original UNet was fused with a fully connected neural network (FCNN) at the bottleneck. Crack tip segmentation was performed by the UNet and the crack tip position was estimated by the FCNN regressor.
The authors employed the Grad-CAM interpretability approach, as illustrated in Figure 36 (b), to test interpretability. The neural network's internal features were gathered during the forward pass of input data and aggregated by weighting them with the average-pooled gradients computed during the backward pass.
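The Grad-CAM aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Grad-CAM weighting step, assuming that the layer activations and the gradients of the target score with respect to them have already been captured (for example with framework hooks). It is not the code of Melching et al. [213], and the toy arrays are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps into a heatmap, Grad-CAM style.

    activations, gradients: arrays of shape (C, H, W) taken from the same
    layer during the forward and backward pass, respectively.
    """
    activations = np.asarray(activations, dtype=np.float64)
    gradients = np.asarray(gradients, dtype=np.float64)
    # One weight per channel: the spatially average-pooled gradient.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1]


# Toy example: two 2 x 2 feature maps; only the second receives gradient.
acts = np.array([[[1.0, 1.0], [1.0, 1.0]],
                 [[0.0, 1.0], [2.0, 3.0]]])
grads = np.array([[[0.0, 0.0], [0.0, 0.0]],
                  [[2.0, 2.0], [2.0, 2.0]]])
heatmap = grad_cam(acts, grads)
print(heatmap)
```

A heatmap of this kind is usually upsampled to the input resolution and, as noted earlier in this section, thresholded before comparison with a reference annotation.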
Poudel et al. [214] proposed a novel architecture based on Eff-UNet [176] that focuses on precise segmentation of polyps. The architecture is shown in Figure 37.
The architecture is divided into two modules. The first module is a UNet encoder that uses EfficientNet as a backbone and provides different semantic-level details at different stages; in the second module, the decoder combines the spatial information from multiple stages and finally predicts the segmentation mask. Zhang et al. [215] proposed an attention UNet, an interpretable classification model that can generate high-resolution localization feature maps for the predicted class. This model adopts an up-sampling-concatenation-convolution structure to create a fine-grained segmentation map and uses attention pooling over the prior mask to bridge segmentation with classification. The authors integrated this model with Grad-CAM for explainability. The structure is shown in Figure 38. Sun et al. [216] proposed a novel architecture, SAUNet (shape attentive UNet), for interpretable medical image segmentation, shown in Figure 39 (a). The proposed architecture comprises two streams. The first is the texture stream, which has a structure similar to UNet, but the encoders are replaced by dense blocks and the decoders are replaced by the proposed dual attention decoder block, as shown in Figure 39 (b). The second is the shape stream, which has gated convolutional layers and residual layers. The gated convolutional layers are used to fuse shape features with texture features, and the residual layers are used to fine-tune the shape features, as shown in Figure 39 (a).

C. APPLICATION
All medical image applications, whether vascular or non-vascular, need explainability and interpretability. In view of this, UNet with explainability (XAI) opens a new horizon in the medical field for both vascular and non-vascular applications.

VIII. PRUNING STRATEGIES IN UNET-BASED DEEP LEARNING
While UNet-based DL has provided a gold mine of segmentation solutions, the inherent ''deep approach'' in neural networks has created a bottleneck in model generation. Because of the many epochs, the large number of training iterations per epoch, and the heavy computation in the several layers of the UNet encoder and decoder phases, both storage space and training time increase for UNet-based DL. This poses a threat to real-time processing, especially in healthcare frameworks. The computer vision industry has provided alternatives, such as graphical processing units (GPU) and supercomputers; however, this is a game in which the ''rich get the highest,'' and many talented researchers struggle to get access to such hardware. Thus, the computer vision field has now started looking into methods that can improve training model storage and speed. This strategy banks on the optimization of hyperparameters during the deep learning process, where the objective is to ''shave'' the unwanted weights in deep neural networks as the deep cycles churn. Analytical methods cannot be used to determine a neural network's weights. Instead, the weights must be found using the empirical optimization method of stochastic gradient descent. The optimization problem that stochastic gradient descent attempts to solve for neural networks is difficult, and the space of solutions (sets of weights) may contain both excellent solutions (known as global optima) and easy-to-find but low-skill ones (called local optima). The ''learning rate'', also known as the step size, is the measure of how much the model is altered at each step of this search process. It is possibly the most crucial hyperparameter to adjust for a neural network in order to get optimal performance on a given problem.
The learning rate interacts with many other aspects of the optimization process, and the interactions may be nonlinear. Nevertheless, in general, smaller learning rates require more training epochs, whereas larger learning rates require fewer. Further, smaller batch sizes are better suited to smaller learning rates given the noisy estimate of the error gradient. A robust strategy may be to first evaluate the performance of a model with a modern version of stochastic gradient descent with adaptive learning rates, such as Adam, and use the result as a baseline. Then, if time permits, one can explore whether improvements can be achieved with a carefully selected learning rate or a simpler learning rate schedule. Here, optimization of hyperparameters means optimization of stochastic gradient descent and the Adam optimizer. The computer vision industry has now started using ''evolutionary algorithms'' to optimize these hyperparameters, such as (i) differential evolution (DE), (ii) genetic algorithm (GA), (iii) particle swarm optimization algorithm (PSO), and (iv) whale optimization algorithm (WO) [207], [217]. It has been shown recently that such optimization methods can be embedded in deep learning frameworks such as (i) fully convolutional network (FCN) and (ii) SegNet. There has been no attempt to fuse such evolutionary methods with UNet-based DL for vascular applications, but we foresee this in the near future. Therefore, we have attempted to summarize the pruning methods. The current pruning literature has been classified into three categories: (i) channel pruning (so-called filter pruning), (ii) network pruning, and (iii) hybrid pruning. The main principle of channel pruning is to cut down the filters at an early stage of the AI model design [122], [123], [124], [125], [126], [218], [219], [220], [221], [227]. We also call it early pruning.
In network pruning, we remove the neurons of the network that are low in weight [228], [229], [230], [231], [232], [233], while in hybrid pruning, we fuse the process of weight reduction using temporal and spatial information [234], [235], [236].

A. CHANNEL PRUNING METHOD
Channel pruning (Figure 40), sometimes referred to as filter pruning, uses algorithms to identify the crucial and superfluous filters in the model [237]. The model's redundant filters are eliminated without compromising quality. There are two types of filter pruning techniques. One is unstructured, in which individual weights are removed, and the other is structured, in which convolutional channels are removed [227]. Channel weights from all layers are reduced to their smallest sums when non-sequential layers are assessed [238]. Other channel pruning methods remove from the network the convolutional inputs that have the least influence on the model output [227].
FIGURE 40. Adaptive channel pruning model [239]. FC: fully connected network.
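As an illustration of structured (channel-level) pruning, the sketch below ranks the filters of one convolution layer and keeps the strongest ones. The ranking criterion, the L1 norm of each filter, is an assumption chosen for simplicity. It is one common heuristic and is not stated to be the criterion used in [227], [237], or [239]. The function name and `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def prune_filters_l1(weights, keep_ratio=0.5):
    """Structured pruning of a conv layer by filter L1 norm.

    weights: array of shape (out_channels, in_channels, kH, kW).
    Returns the reduced weight tensor and the indices of the kept filters.
    """
    weights = np.asarray(weights)
    scores = np.abs(weights).sum(axis=(1, 2, 3))       # one score per filter
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    strongest = np.argsort(scores)[::-1][:n_keep]      # highest scores first
    keep = np.sort(strongest)                          # preserve original order
    return weights[keep], keep


# Toy layer: 4 filters of shape (1, 3, 3) with different magnitudes.
layer = np.stack([np.full((1, 3, 3), v) for v in (1.0, 3.0, 0.1, 2.0)])
pruned, kept = prune_filters_l1(layer, keep_ratio=0.5)
print(pruned.shape, kept)
```

In a full network, removing output filters from one layer also requires removing the matching input channels of the next layer, which this single-layer sketch does not show.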

B. NETWORK PRUNING METHOD
Network (weight) pruning methods provide condensed representation and seek to create a small and faster model. This pruning strategy's fundamental tenet is to trim weights using lp-norm regularization. Additionally, if the weights are not essential, the model's accuracy can be maintained without them [240]. In order to find the low-contributing weights that may be either trimmed or fine-tuned, a specified threshold is taken into consideration.
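The thresholding step mentioned above can be sketched as simple magnitude pruning: weights whose absolute value falls below a chosen threshold are set to zero. This sketch covers only the thresholding, not the lp-norm regularized training that would precede it. The threshold value and function names are hypothetical.

```python
import numpy as np


def prune_weights(weights, threshold):
    """Zero out weights whose magnitude is below `threshold`.

    Returns the pruned weights and the boolean mask of surviving weights.
    """
    weights = np.asarray(weights, dtype=np.float64)
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask


def sparsity(mask):
    """Fraction of weights that were removed."""
    return 1.0 - float(np.mean(mask))


w = np.array([[0.5, -0.01],
              [0.02, -0.8]])
w_pruned, w_mask = prune_weights(w, threshold=0.1)
print(w_pruned, sparsity(w_mask))
```

The mask can be reapplied after each fine-tuning step so that pruned weights stay at zero while the remaining weights are adjusted.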

C. HYBRID PRUNING METHOD
Hybrid pruning (Figure 41) is a combination of more than one pruning technique, either (a) weight pruning with filter pruning or (b) coarse-grained channel pruning with fine-grained weight pruning [241].

IX. BIAS IN UNET-BASED DESIGNS FOR VASCULAR AND NON-VASCULAR APPLICATIONS
A. RANKING-BASED RISK OF BIAS SCORE METHOD
There were 54 vascular and 56 non-vascular studies in our cohort that used UNet-based architecture. For each study, 35 AI-based attributes were created, for a total of 1,890 and 1,960 attributes corresponding to vascular and non-vascular diseases, respectively. These UNet-based features were initially qualitative and were then quantified by assigning a score between 0 and 5, based on the nature of the attributes, by AI scientists with 10 years' experience [33], [34], [242], [243], [244], [245]. A study's aggregate score is the sum of all attribute values for that study. Using the ranking method, the mean values (Table 8 and Table 9) of the 110 UNet-based investigations ranged from 2.7 (left) to 1.1 (right) for the vascular and from 2.0 (left) to 1.1 (right) for the non-vascular studies. The higher the mean value, the lower the risk-of-bias (RoB). Hence, the studies were arranged in the order of low-, moderate-, and high-bias, according to the decreasing order of their aggregate scores. For the non-vascular UNet-based studies, the low-moderate (LM) cutoff was 2.6 and the moderate-high (MH) cutoff was 2.0, determined by using the intersection of the ''cumulative plot and the mean plot curve of the studies'' (Figure 42 (bottom)). Similarly, for the vascular studies, the LM cutoff was 3.1 and the MH cutoff was 1.9 (Figure 42 (top)). According to the ranking score graph, most of the studies had a moderate bias: in the non-vascular framework, the scores ranged from 1.8 to 1.3 (in decreasing order, left to right) and accounted for 35 studies (59%), and in the vascular framework they ranged from 1.9 to 1.5 and accounted for 30 studies (46%).
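The ranking procedure can be sketched as follows. The sketch assumes that the LM and MH cutoffs are applied to the per-study mean score and that a score equal to a cutoff falls into the less biased class. Both points are assumptions, since the text does not state them. The study labels and scores are invented for illustration, and only four attributes are shown instead of 35.

```python
from statistics import mean


def rank_studies(scores_by_study, lm_cutoff, mh_cutoff):
    """Rank studies by aggregate score and label their risk of bias.

    scores_by_study: dict mapping a study label to its attribute scores (0-5).
    Returns a list of (label, aggregate, mean, bias_class), highest score first.
    """
    rows = []
    for label, scores in scores_by_study.items():
        avg = mean(scores)
        if avg >= lm_cutoff:
            bias = "low"
        elif avg >= mh_cutoff:
            bias = "moderate"
        else:
            bias = "high"
        rows.append((label, sum(scores), avg, bias))
    # A higher aggregate score means a lower risk of bias, so sort descending.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical studies, using the non-vascular cutoffs quoted in the text.
toy_scores = {
    "S2": [2, 2, 3, 2],
    "S3": [1, 1, 2, 1],
    "S1": [3, 3, 3, 3],
}
ranking = rank_studies(toy_scores, lm_cutoff=2.6, mh_cutoff=2.0)
for row in ranking:
    print(row)

# Attribute totals quoted in the text: studies x 35 attributes.
print(54 * 35, 56 * 35)
```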
Note that the studies with higher normalized mean values in the AI attributes were considered as low-bias. These low-bias studies showed more innovation in the design for vascular diagnosis. On the contrary, the tail-enders showed low AI attribute mean scores (high-bias) and were not clinically substantial compared to low-bias or moderate-bias studies. We will discuss the analysis of the studies between the three quantitative and innovation methods in the next section.

B. RADIAL-BIAS MAP METHOD
Since the UNet technology applied for vascular and non-vascular diagnosis spans different stages, such as demographics, architecture, performance evaluation, and clinical application, the strengths of different DL attributes were determined (A1 to A35 in Table 10) in these stages (called clusters). The number of DL attributes in the clusters was 9, 7, 12, and 7, respectively. For estimating the strengths of AI attributes, we used a pictorial representation of the ''spokes and wheel model'' in 360 directions, where each spoke represents the product of the weight of the attribute times the radius of  (256)). (iii) Calculate the sum of spoke lengths corresponding to the four clusters (say $C_1$, $C_2$, $C_3$, and $C_4$). (iv) Calculate the sum of the top two and bottom two clusters (say $A$ and $B$). (v) Compute $£_{\mathrm{radial}} = |A - B|$, the absolute difference between $A$ and $B$. (vi) Compute the normalized bias value $£^{\mathrm{norm}}_{\mathrm{radial}} = £_{\mathrm{radial}} / \alpha$, where $\alpha$ is the total number of AI attributes. The weight matrix (Tables 10 and 11) presents the weights of the AI attributes based on the experience and judgment of AI professionals. In all, each study has 49 attributes corresponding to every 7.3 (∼360/49) degrees. A Bezier spline curve is then fitted through the endpoint of each spoke to obtain a smooth curve.
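Steps (iii) to (vi) of the radial-bias computation can be sketched as below. The spoke lengths are taken as given inputs because the earlier steps are not fully recoverable from the text. Interpreting the ''top two'' clusters as the first two clusters (C1 + C2) and the ''bottom two'' as the last two (C3 + C4) is an assumption, as is using the 35 attributes of clusters A1 to A35 for α. The function name and example values are hypothetical.

```python
def radial_bias(spoke_lengths, cluster_sizes=(9, 7, 12, 7)):
    """Radial bias and its normalized value from per-attribute spoke lengths.

    spoke_lengths: one length per AI attribute, ordered cluster by cluster.
    cluster_sizes: number of attributes in each of the four clusters.
    """
    if len(spoke_lengths) != sum(cluster_sizes):
        raise ValueError("spoke count must match the total cluster size")
    # Step (iii): sum of spoke lengths per cluster (C1..C4).
    sums, start = [], 0
    for size in cluster_sizes:
        sums.append(sum(spoke_lengths[start:start + size]))
        start += size
    c1, c2, c3, c4 = sums
    # Step (iv): assumed grouping, top two = C1 + C2, bottom two = C3 + C4.
    a, b = c1 + c2, c3 + c4
    # Step (v): absolute difference; step (vi): divide by attribute count.
    radial = abs(a - b)
    alpha = len(spoke_lengths)
    return radial, radial / alpha


# Hypothetical study in which every spoke has unit length.
radial, normalized = radial_bias([1.0] * 35)
print(radial, normalized)
```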
Since the curve has four sectors (corresponding to four clusters), the radial-bias map resembles butterfly wings, as shown in Figure 43 (right), laid out in an 8 × 7 grid representing the 56 non-vascular UNet-based DL studies, and in Figure 43 (left), laid out in a 9 × 6 grid representing the 54 vascular UNet-based DL studies. These studies are arranged from low to high bias, where the bias of each study is given in the corner of the radial-bias map (the name of the bias map is ''Sn-Name:BiasValue'', for example, ''S18-Che:10'', where ''18'' represents the study number, ''Che'' is the first three letters of the last name of the first author of the study, and ''10'' represents the normalized value of the bias). Note that the following is the sequence of AI attributes for each of the four clusters (A1 to A35 in Tables 10 and 11). The AI demographic cluster (A1-A9) consisted of (i) total patients, (ii) family history, (iii) type of risk factors, (iv) body mass index, (v) ethnicity, (vi) hypertension, (vii) smoking, (viii) data type, (ix) magnetic resonance imaging, (x) CT, (xi) X-ray, (xii) PET, (xiii) US, (xiv) multicenter, (xv) application, (xvi) field of view (FOV), and (xvii) UNet type. The second cluster (A10-A16) of AI-based attributes consists of the nine architecture parameters used in the DL study. These are the (i) encoder layer, (ii) decoder layer, (iii) convolution type, (iv) max-pooling type, (v) whether a loss function (LF) was used or not, (vi) LF type, (vii) optimizer type, (viii) filter size, and (ix) bridge network type. The third cluster (A17-A28) of attributes includes the performance evaluation parameters, such as (i) number of PE parameters, (ii) sensitivity, (iii) specificity, (iv) accuracy, (v) precision, (vi) F1-score, (vii) P-value, (viii) Hamming loss, (ix) Dice coefficient, (x) Jaccard index, (xi) Matthews correlation coefficient (MCC), (xii) positive predictive value (PPV), and (xiii) Hausdorff surface distance (HSD).
The fourth and last cluster (A29-A35) consists of ten benchmarking and clinical validation attributes. These include the (i) statistical analysis, (ii) power analysis, (iii) scientific validation, (iv) benchmarking, (v) hazard analysis, (vi) survival analysis, (vii) paired t-test, (viii) Kruskal-Wallis test, and (ix) FDA approval.

C. REGIONAL-BIAS AREA METHOD
The regional-bias area (RBA) was estimated by evaluating the difference between the area of the best-performing DL attributes and the area of the worst-performing DL attributes. Figure 44 (left) displays the RBA for every vascular study, while Figure 44 (right) shows it for the non-vascular studies, in increasing order of the area of bias (white region). In each of the studies, bias is depicted as ''Sn-Name:BiasValue''. For example, in the non-vascular application ''S18-Che:149'', ''18'' represents the study number, ''Che'' is the first three letters of the last name of the first author of the study, and ''149'' represents the normalized value of the bias. The greater the white shaded area, the greater the bias.

D. ROBINS-I
This bias estimation approach assesses non-randomized studies by comparing them against an idealized randomized trial. RoB is studied using three intervention components, namely ''Pre-Intervention,'' ''During Intervention,'' and ''Post-Intervention.'' These three components are further divided into seven distinct aspects, namely, (i) bias due to confounding (total patients, risk factor, and demographic), (ii) bias in selection of participants (image modality, multicenter, and data type), (iii) bias in classification of interventions (UNet type, model layers, conv. type, loss type, and optimizer), (iv) bias due to deviations from intended interventions (application and benchmarking), (v) bias due to missing data (FOV Application), (vi) bias in measurement of outcomes (accuracy, Dice, Jaccard, MCC, and HSD), and (vii) bias in selection of the reported result (statistical analysis, scientific validation, and XAI) (Tables 12 and 13). We used the ROBINS-I tool on a total of 54 and 56 studies of the vascular and non-vascular domains, respectively, with low-bias cut-offs of 3.1 and 2.9 and moderate-high bias cut-offs of 2.6 and 2.6 for vascular and non-vascular, respectively. We found that 29.62%   56) were high bias, for vascular and non-vascular, respectively (Figure 45 (a) and (b)).

E. PROBAST
PROBAST is a tool for assessing the risk of bias in clinical prediction model studies and was primarily designed for use in reviews. It uses predictors classified into four domains, namely participants (demographic, image modality, multicenter, and data type), predictors (model layers, conv. type, loss type, and optimizer), outcome (accuracy, Dice, Jaccard, MCC, and HSD), and analysis (scientific validation, statistical analysis, benchmarking, and XAI) (Tables 14 and 15).
We used the PROBAST tool on the same set of 54 and 56 studies of the vascular and non-vascular domains, respectively, with low-bias cut-offs of 3.1 and 3.0 and moderate-high bias cut-offs of 2.4 and 2.5 for vascular and non-vascular, respectively. We found that 25.92% (14 out of 54) and 12.5% (7 out of 56) were low bias, 40.74% (22 out of 54) and 41.07% (23 out of 56) were moderate bias, and 33.33% (18 out of 54) and 33.92% (19 out of 56) were high bias, for vascular and non-vascular, respectively (Figure 46 (a) and (b)).
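The cut-off-based stratification used for ROBINS-I and PROBAST can be illustrated with a short Python sketch. It is not the authors' code: the function names, the toy scores, the assumption that a higher score means lower bias (suggested by the low-bias cut-off being the larger of the two values), and the choice to treat a score equal to a cut-off as belonging to the better category are all assumptions made for illustration.

```python
from collections import Counter


def bias_category(score, low_cutoff, moderate_high_cutoff):
    """Map one study score to 'low', 'moderate', or 'high' bias.

    Assumption: higher scores mean lower bias, and boundaries are inclusive.
    """
    if score >= low_cutoff:
        return "low"
    if score >= moderate_high_cutoff:
        return "moderate"
    return "high"


def bias_distribution(scores, low_cutoff, moderate_high_cutoff):
    """Return {category: (count, percentage)} over all study scores."""
    counts = Counter(
        bias_category(s, low_cutoff, moderate_high_cutoff) for s in scores
    )
    total = len(scores)
    return {
        name: (counts[name], 100.0 * counts[name] / total)
        for name in ("low", "moderate", "high")
    }


# Hypothetical study scores; the cut-offs are the PROBAST vascular values.
toy_scores = [3.4, 3.1, 2.8, 2.4, 2.0, 1.5]
distribution = bias_distribution(
    toy_scores, low_cutoff=3.1, moderate_high_cutoff=2.4
)
```

The percentages reported in the text follow the same arithmetic, for example 14 out of 54 studies is about 25.9% and 23 out of 56 studies is about 41.07%.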

F. ANALYSIS OF BIAS
This section presents the Venn diagram (VD), which displays the relationship among the five adopted methods (RBM vs. RBA vs. RBS vs. PROBAST vs. ROBINS-I) for RoB. Figure 47

A. PRINCIPAL FINDINGS
Medical image segmentation in particular has undergone a wave of revolution in the last five years. Since the introduction of cUNet, there have been nearly 1000 UNet publications. However, these black boxes are still not understood well enough for physicians to be comfortable and confident in adopting them in clinical settings. We therefore went one level deeper, offering the following novelties: (i) understanding the statistical distribution after PRISMA-based study selection; (ii) segregating the UNet and its variations into five clear classes (cUNet, sUNet, acUNet, hUNet, and eUNet), giving their distinguishing characteristics along with their applications; (iii) explaining the latest and most powerful features of deep learning, such as convolution, max/average pooling, and 81 critical modifications in encoder, decoder, skip connection, and classification frameworks to get the best low-level and high-level features; and (iv) providing a key that maps segmentation challenges to architecture solutions. Further, we link these UNet extraction paradigms with the novel AI necessities of (v) pruning, (vi) explainable AI, and (vii) AI bias, which are also our novel and unique contributions. Finally, (viii) our review covers the state-of-the-art references with powerful vascular and non-vascular applications.

B. BENCHMARKING
The main coverage in 2020 was the modification of the basic cUNet, invented in 2015. Since then, we have had inventions related to the UNet series, leading to UNet+, UNet++, UNet+++, ZNet, TNet, and WNet (UNet+UNet) [85], [144], [157]. There are five major components in this small segmentation structure: encoder, decoder, skip connection, bridge network, and a termination layer that accomplishes either segmentation or classification. Such a UNet structure can be changed using a bag of tools such as filter deck (channels), convolution, max pooling, ReLU, and fusion of classifiers for handling the spatial and temporal information to compensate for variations in scale, space, position, and size in 2D/3D. Keeping the above paradigms in mind, we have only taken the critical UNet reviews from 2020, 2021, and 2022 (Table 16). Note that there was no review of UNet between 2015 and 2020. One reason was the ''Inertia of Education'', which caused a lag of five years after UNet was first invented in 2015. Note that Liu et al. [177], Siddique et al. [246], and Du et al. [247] were the first three UNet reviews, all in 2020, and since then, the problem-solving tools have evolved along with the challenges. Furthermore, previous methods such as level sets and classifiers were either user-interactive or computationally expensive and still not fully automated.
Even though the cUNet took a sophisticated turn by undergoing modifications in the five components, we would like to emphasize that the black box nature of AI remains unaddressed. Further, in the spirit of generality, trained AI models are typically large in size and storage, which makes them nearly impossible to install on edge devices like the Raspberry Pi or Jetson Nano, and these are the devices of the future. The third aspect is that AI models are susceptible to bias due to data size, balance of classes, inconsistencies and missing values in data, image acquisition protocol variations, architecture compatibility with the input data sets, and cross-validation protocols in seen and unseen AI models. None of these three major aspects of AI models is discussed in the above reviews, which is a limitation of those reviews. We have taken special care to address the above issues in our review. Note that, besides addressing the above issues, we classify the UNet series into five categories, namely cUNet, sUNet, acUNet, hUNet, and eUNet, in a special way by studying all types of applications, modalities, and architectures. The importance of these changes is discussed along with their origin.
The UNet reviews published in 2021 were specifically focused on (i) radiation therapy planning [248], (ii) breast tumor cell nuclei segmentation [249], and (iii) detection and segmentation of tumors in orthopedics applications [250]. Thus, these reviews do not offer strong comparisons with other techniques due to lack of generalization. In 2022, the following review articles were published, namely, Punn et al. [251], Yin et al. [252], He et al. [253], and Wu et al. [254]. Punn et al. [251] covered the same set of UNets (inception, ensemble, attention) for all the different modalities, which does not offer design variations but rather problems specific to imaging modality and organ segmentation. In Yin et al. [252], the focus was also on the UNet. However, the authors introduced the variety due to the transformer-based UNet, where the encoder changed to a CNN cascaded with the 12 layers of the transformer. He et al. [253], unlike the other UNet reviews, offered a hybrid UNet. This includes the integration of conditional generative adversarial networks (Seg-cGAN) with UNet+. Thus, their innovation was to use cGAN for pattern enhancement and to introduce a regularization paradigm for capturing context-based image features for segmentation. Wu et al. [254] summarized the different UNet methods developed for microscopic image analysis, along with a comparison of the UNet techniques used in other studies.

C. RECOMMENDATION
The study offers the following set of recommendations: (i) Segmentation complexity and UNet selection: The segmentation complexity and image dimensionality should be considered when selecting a UNet type. Based on scale, shape, position, size, and the noise characteristics, it is recommended to choose an architecture where the UNet components are altered by several different attention mechanisms; (ii) UNet hyperparameters optimization: The choice of hyperparameters plays an important role in UNet optimization. Therefore, it is recommended to optimize the UNet architecture in an iterative paradigm. The hyperparameters include the number of layers in the UNet, filter size during convolution, learning rate, batch normalization, number of epochs, number of iterations per epoch, and the design of the loss functions; (iii) Explainability of UNet-based AI system: All clinical systems using UNet should be explainable or interpretable based on standardized paradigms such as Grad-CAM [179], LIME, and SHAPLEY [255], [256]; (iv) Clinical Evaluation and Scientific Validation: The UNet architecture should be clinically evaluated and scientifically validated using previously unseen AI paradigms. This requires the model to be trained on data different from the test data; (v) Reduction of AI-bias: For a low-bias design of an AI system using UNet, all the attributes such as data demographics, architecture components, optimization parameters, and scientific/clinical validation must be taken into account [32], [34], [242], [243], [244], [257]; (vi) Generalization vs. Memorization: The UNet-based design must be generalizable both in terms of data size and the variability in the data type. A big data framework can be adopted during training, with diversity in the database and the best gold standard during supervised learning; (vii) Miscellaneous: In healthcare databases, demographics such as comorbidity, and data acquisition must be carefully designed to have the least impact on AI bias and performance.
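Recommendation (ii) on iterative hyperparameter optimization can be pictured with the following Python sketch. It is a minimal illustration, not a method taken from the reviewed studies: the candidate values in `search_space`, the helper names, and in particular `toy_evaluate` (a placeholder standing in for training a UNet and returning a validation score such as the Dice coefficient) are all hypothetical.

```python
from itertools import product

# Hypothetical candidate values for some of the hyperparameters listed above.
search_space = {
    "num_layers": [3, 4, 5],
    "filter_size": [3, 5],
    "learning_rate": [1e-3, 1e-4],
    "batch_norm": [True, False],
    "epochs": [50, 100],
}


def iterate_configs(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(space, evaluate):
    """Evaluate each configuration in turn and keep the highest-scoring one."""
    best_config, best_score = None, float("-inf")
    for config in iterate_configs(space):
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Placeholder score; a real study would train and validate a UNet here."""
    return -abs(config["num_layers"] - 4) - abs(config["learning_rate"] - 1e-4)


best_config, best_score = select_best(search_space, toy_evaluate)
```

An exhaustive grid is used here only because it is the simplest iterative scheme to read; in line with recommendation (iv), the score passed back by the evaluation function should come from data that was not used for training.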

D. SHORT NOTE ON UNSUPERVISED UNET PARADIGM
Although the focus of this study was purely and squarely on supervised UNet frameworks, one cannot ignore the upcoming wave of innovation in unsupervised learning or self-learning techniques [258], [259], [260], [261], [262]. The fundamental difference between the supervised and unsupervised paradigms is the incorporation of a pseudo gold standard in the form of another observation that is similar to the original datasets whose segmentation needs to be determined. Such a pseudo-observation is typically adopted for training the model, exactly the way the gold standard is [164], [263], [264], [265], [266].

E. STRENGTH, WEAKNESS, AND EXTENSION
The system allows selection of the appropriate UNet, given the segmentation challenge. The study provided a set of five types of UNet (cUNet, sUNet, acUNet, hUNet, and eUNet) based on the evolution in the components of the UNet to handle the complexities of the segmentation process. Thus, the selected UNet is able to appropriately configure the components of the UNet, given the change in shape, size, scale, and position in the images. Further, the study exclusively studied 81 configurations which altered the cUNet, leading to a powerful UNet system for vascular and non-vascular applications. The strength of the system was to study similarities and differences between the vascular and non-vascular UNet-based applications. The review also provides a comparison between the vascular and non-vascular segmentation frameworks. Furthermore, the review also gives an insight into pruning, interpretability, and bias. This was the first time such a UNet paradigm was demonstrated.
Even though UNet has opened the door for most segmentation challenges, there is a price to pay when it comes to speed, storage, portability on edge devices (Raspberry Pi and Jetson Nano), and time complexity. As the number of layers in the UNet increases, encoder paradigms change from conventional to residual or dilated convolutions, skip connections are embedded with classifiers such as LSTM or RNN, and decoder alterations for fusing the output lead to complex loss functions; all of these increase the computation time, demanding more powerful processors such as GPUs or multithreaded architectures.
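The storage cost of deeper encoders can be made concrete with a back-of-the-envelope Python sketch. It is a rough illustration under stated assumptions, not a measurement from the reviewed studies: it assumes a conventional encoder layout with two 3 × 3 convolutions per level, a single-channel input, 64 filters at the first level that double at each subsequent level, bias terms included, and 32-bit floating-point weights. The decoder, skip connections, and bridge network are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def encoder_params(depth, in_channels=1, base_filters=64, kernel=3):
    """Parameter count of an encoder with two convolutions per level.

    Assumed layout: filters double at every level, starting at base_filters.
    """
    total = 0
    c_in = in_channels
    for level in range(depth):
        c_out = base_filters * 2 ** level
        total += conv_params(kernel, c_in, c_out)   # first convolution
        total += conv_params(kernel, c_out, c_out)  # second convolution
        c_in = c_out
    return total


def storage_megabytes(n_params, bytes_per_param=4):
    """Approximate storage assuming 32-bit floating-point weights."""
    return n_params * bytes_per_param / 1e6


# Parameter count and approximate storage for encoder depths 1 to 5.
growth = {d: encoder_params(d) for d in range(1, 6)}
sizes_mb = {d: storage_megabytes(n) for d, n in growth.items()}
```

Because the number of filters doubles at each level, every added level contributes roughly four times as many parameters as the one before it, which is one reason deeper variants are harder to deploy on edge devices without pruning.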
While we have seen nearly 300 UNet variations (some not included in this review) in a very short span of less than half a decade, the process of innovation has just begun in the computer vision field comprising segmentation and classification. By modifying UNet components with stochastic image processing infrastructure, such dynamic growth has a high potential [17].

XI. CONCLUSION
The need for UNet-based stratification into several unique classes was deemed necessary, and this study demonstrated five unique paradigms, namely cUNet, sUNet, acUNet, hUNet, and eUNet. sUNet was the superior UNet, which underwent several waves of iterations to handle position, shape, and object scales in 2-D and 3-D image segmentation. The focus of the study was purely on vascular vs. non-vascular applications. A thorough investigation was conducted to study certain attention blocks that modified the conventional UNet architectures, leading to stable and superior performances. Further, this is the only study of its kind that introduces explainable AI and pruning and evaluates AP(ai)Bias 2.0-UNet, further benchmarking with the (i) ranking, (ii) butterfly, (iii) regional area, (iv) PROBAST, and (v) ROBINS-I methods. Most of the studies paid little attention to XAI and pruning strategies. Also, a key mapping segmentation challenges to architecture solutions was provided. While the UNet-based strategy has dominated the field of image segmentation, more practical aspects of moving the UNet from paper to practice need to be the focus for better application in clinical settings.