UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy. On Synapse, our UNETR++ sets a new state of the art with a Dice Score of 87.2%, while being significantly more efficient with a reduction of over 71% in both parameters and FLOPs, compared to the best method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.


Introduction
Volumetric (3D) segmentation is a fundamental problem in medical imaging with numerous applications, including tumor identification and organ localization for diagnostic purposes [13,16]. The task is typically addressed by utilizing a U-Net [28] like encoder-decoder architecture, where the encoder generates a hierarchical low-dimensional representation of a 3D image and the decoder maps this learned representation to a voxel-wise segmentation. Earlier CNN-based methods use convolutions and deconvolutions in the encoder and the decoder, respectively, but struggle to achieve accurate segmentation results, likely due to their limited receptive field. In contrast, transformer-based methods are inherently global and have recently demonstrated competitive performance at the cost of increased model complexity.
Recently, several works [12,13,36] have explored designing hybrid architectures to combine the merits of both local convolutions and global attention. While some approaches [13] use a transformer-based encoder with a convolutional decoder, others [12,36] aim at designing hybrid blocks for both the encoder and decoder subnetworks. However, these works mainly focus on increasing the segmentation accuracy, which in turn substantially increases the model sizes in terms of both parameters and FLOPs, leading to unsatisfactory robustness. We argue that this unsatisfactory robustness is likely due to their inefficient self-attention design, which becomes even more problematic in volumetric medical image segmentation tasks. Further, these existing approaches do not capture the explicit dependency between spatial and channel features, which can improve the segmentation quality. In this work, we aim to simultaneously improve both the segmentation accuracy and the model efficiency in a single unified framework.

Contributions: We propose an efficient hybrid hierarchical architecture for 3D medical image segmentation, named UNETR++, that strives to achieve both better segmentation accuracy and efficiency in terms of parameters, FLOPs, and inference speed. Built on the recent UNETR framework [13], our proposed UNETR++ hierarchical approach introduces a novel efficient paired attention (EPA) block that efficiently captures enriched inter-dependent spatial and channel features by applying both spatial and channel attention in two branches. Our spatial attention in EPA projects the keys and values to a fixed lower-dimensional space, making the self-attention computation linear with respect to the number of input tokens. On the other hand, our channel attention emphasizes the dependencies between the channel feature maps by performing the dot-product operation between queries and keys in the channel dimension. Further, to capture a strong correlation between the spatial and channel features, the weights for queries and keys are shared across the branches, which also aids in controlling the number of network parameters. In contrast, the weights for values are kept independent to enforce learning complementary features in both branches.

Figure 1. Left: Qualitative comparison between the baseline UNETR [13] and our UNETR++ on Synapse. We present two examples containing multiple organs. Each inaccurately segmented region is marked with a white dashed box. In the first row, UNETR struggles to accurately segment the right kidney (RKid) and confuses it with the gallbladder (Gal). Further, both the stomach (Sto) and left adrenal gland (LAG) tissues are inaccurately segmented. In the second row, UNETR struggles to segment the whole spleen and mixes it with the stomach (Sto) and portal and splenic veins (PSV). Moreover, it under- and over-segments certain organs (e.g., PSV and Sto). In comparison, our UNETR++, which efficiently encodes enriched inter-dependent spatial and channel features within the proposed EPA block, accurately segments all organs in these examples. Best viewed zoomed in. Additional qualitative comparisons are presented in Fig. 4 and the supplementary material. Right: Accuracy (Dice score) vs. model complexity (FLOPs and parameters) comparison on Synapse. Compared to the best existing nnFormer [36], UNETR++ achieves better segmentation performance while significantly reducing the model complexity by over 71%.
We validate our UNETR++ approach by conducting comprehensive experiments on five benchmarks: Synapse [19], BTCV [19], ACDC [1], BRaTs [24], and Decathlon-Lung [30]. Both qualitative and quantitative results demonstrate the effectiveness of UNETR++, leading to better performance in terms of segmentation accuracy and model efficiency compared to the existing methods in the literature. On Synapse, UNETR++ achieves high-quality segmentation masks (see Fig. 1, left) with an absolute gain of 8.9% in terms of Dice Score, while significantly reducing the model complexity with a reduction of 54% in parameters and 37% in FLOPs, compared to the baseline UNETR [13]. Further, UNETR++ outperforms the best existing nnFormer [36] method with a considerable reduction in terms of both parameters and FLOPs (see Fig. 1, right).

Related Work
CNN-based Segmentation Methods: Since the introduction of the U-Net design [28], several CNN-based approaches [2,14,37,38] have extended the standard U-Net architecture for various medical image segmentation tasks. In the case of 3D medical image segmentation [8,10,11,25,31], the full volumetric image is typically processed as a sequence of 2D slices. Several works have explored hierarchical frameworks to capture contextual information. Milletari et al. [25] propose to use 3D representations of the volumetric image by down-sampling the volume to lower resolutions for preserving the beneficial image features. Çiçek et al. [8] extend the U-Net architecture to volumetric segmentation by replacing the 2D operations with their 3D counterparts, learning from sparsely annotated volumetric images. Isensee et al. [16] introduce a generalized segmentation framework, named nnUNet, that automatically configures the architecture to extract features at multiple scales. Roth et al. [29] propose a multi-scale 3D fully convolutional network to learn representations from varying resolutions for multi-organ segmentation. Further, several efforts in the literature have been made to encode holistic contextual information within CNN-based frameworks using, e.g., image pyramids [35], large kernels [26], dilated convolutions [6], and deformable convolutions [20].

Transformer-based Segmentation Methods: Vision transformers (ViTs) have recently gained popularity thanks to their ability to encode long-range dependencies, leading to promising results on various vision tasks, including classification [9] and detection [4]. One of the main building blocks within the transformer architecture is the self-attention operation that models the interactions among the sequence of image patches, thereby learning global relationships. A few recent works have explored alleviating the complexity issue of the standard self-attention operation within transformer frameworks [7,18,23,32]. However, most of these recent works
mainly focus on the classification problem and have not been studied for dense prediction tasks. In the context of medical image segmentation, a few recent works [3,17] have investigated pure transformer designs. Karimi et al. [17] propose to divide a volumetric image into 3D patches, which are then flattened to construct a 1D embedding and passed to a backbone for global representations. Cao et al. [3] introduce an architecture with shifted windows for 2D medical image segmentation. Here, an image is divided into patches and fed into a U-shaped encoder-decoder for local-global representation learning.
Hybrid Segmentation Methods: Other than pure CNN or transformer-based designs, several recent works [5,13,21,31,34,36] have explored hybrid architectures to combine convolution and self-attention operations for better segmentation. TransFuse [34] proposes a parallel CNN-transformer architecture with a BiFusion module to fuse multi-level features in the encoder. MedT [31] introduces a gated position-sensitive axial-attention mechanism in self-attention to control the positional embedding information in the encoder, while the ConvNet module in the decoder produces a segmentation model. TransUNet [5] combines transformers and the U-Net architecture, where transformers encode the embedded image patches from convolution features and the decoder combines the upsampled encoded features with high-resolution CNN features for localization. DS-TransUNet [21] utilizes a dual-scale encoder based on the Swin transformer [22] to handle multi-scale inputs and encode local and global feature representations from different semantic scales through self-attention. Hatamizadeh et al. [13] introduce a 3D hybrid model, UNETR, that combines the long-range spatial dependencies of transformers with the CNN's inductive bias into a "U-shaped" encoder-decoder architecture. The transformer blocks in UNETR are mainly used in the encoder to extract fixed global representations, which are then merged at multiple resolutions with a CNN-based decoder. Zhou et al. [36] introduce an approach, named nnFormer, that adapts the Swin-UNet [3] architecture. Here, convolution layers transform the input scans into 3D patches, and volume-based self-attention modules are introduced to build hierarchical feature pyramids. While achieving promising performance, the computational complexity of nnFormer is significantly higher compared to UNETR and other hybrid methods.
Our Approach: As discussed above, most recent hybrid approaches, such as UNETR [13] and nnFormer [36], achieve improved segmentation performance compared to their pure CNN- and transformer-based counterparts. However, we note that this pursuit of increasing the segmentation accuracy by these hybrid approaches comes at the cost of substantially larger models (both in terms of parameters and FLOPs), which can further lead to unsatisfactory robustness. For instance, UNETR achieves favorable accuracy but comprises 2.5× more parameters, compared to the best existing CNN-based nnUNet [16]. Moreover, nnFormer obtains improved performance over UNETR but further increases the parameters by 1.6× and the FLOPs by 2.8×. Furthermore, we argue that these aforementioned hybrid approaches struggle to effectively capture the inter-dependencies between feature channels to obtain an enriched feature representation that encodes both the spatial information as well as the inter-channel feature dependencies. In this work, we set out to collectively address the above issues in a unified hybrid segmentation framework.

Method
Motivation: To motivate our approach, we first distinguish two desirable properties to be considered when designing a hybrid framework that is efficient yet accurate.

Efficient Global Attention: As discussed earlier, most existing hybrid methods employ a self-attention operation having quadratic complexity in terms of the number of tokens. This is computationally expensive in the case of volumetric medical segmentation and becomes more problematic when interleaving window attention and convolution components in hybrid designs. Different from these approaches, we argue that computing self-attention across feature channels instead of the volume dimension is expected to reduce the complexity from quadratic to linear with respect to the volumetric dimension. Further, the spatial attention information can be efficiently learned by projecting the spatial matrices of the keys and values into a lower-dimensional space.

Enriched Spatial-channel Feature Representation: Most existing hybrid volumetric medical image segmentation approaches typically capture the spatial features through attention computation and ignore the channel information in the form of encoding the inter-dependencies between different channel feature maps. Effectively combining the interactions in the spatial dimensions and the inter-dependencies between the channel features is expected to provide enriched contextual spatial-channel feature representations, leading to improved mask predictions.
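To make the complexity argument above concrete, consider a Synapse-sized volume; the token count n below follows the (4, 4, 2) patch resolution used later in the paper, while the projected dimension p is a hypothetical value chosen only for illustration:

```python
# Token count for a 128x128x64 volume with (4, 4, 2) patches (Synapse-like setting).
n = (128 // 4) * (128 // 4) * (64 // 2)   # 32 * 32 * 32 = 32768 tokens

p = 64                                    # hypothetical projected dimension, p << n

quadratic = n * n                         # entries in a standard n x n attention map
linear = n * p                            # entries when keys/values are projected to p

print(n, quadratic // linear)             # 32768 tokens, 512x fewer attention entries
```

Even for this moderate input size, projecting the keys and values shrinks the attention map by two to three orders of magnitude, which is the efficiency the spatial branch relies on.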

Overall Architecture
Fig. 2 presents our UNETR++ architecture, comprising a hierarchical encoder-decoder structure. We base our UNETR++ framework on the recently introduced UNETR [13] with skip connections between the encoders and decoders, followed by convolutional blocks (ConvBlocks) to generate the prediction masks. Instead of using a fixed feature resolution throughout the encoders, our UNETR++ employs a hierarchical design where the resolution of the features is gradually decreased by a factor of two in each stage. Within our UNETR++ framework, the encoder has four stages, where the first stage consists of patch embedding to divide the volumetric input into 3D patches, followed by our novel efficient paired-attention (EPA) block. In the patch embedding, we divide each 3D input (volume) x ∈ R^(H×W×D) into non-overlapping patches x_u ∈ R^(N×(P1,P2,P3)), where (P1, P2, P3) is the resolution of each patch and N = (H/P1)×(W/P2)×(D/P3) denotes the length of the sequence. Then, the patches are projected into C channel dimensions, producing feature maps of size (H/P1)×(W/P2)×(D/P3)×C. We use the same patch resolution (4, 4, 2), as in [36]. For each of the remaining encoder stages, we employ downsampling layers using non-overlapping convolution to decrease the resolution by a factor of two, followed by the EPA block.

Figure 2. Overview of our UNETR++ approach with hierarchical encoder-decoder structure. The 3D patches are fed to the encoder, whose outputs are then connected to the decoder via skip connections, followed by convolutional blocks to produce the final segmentation mask. The focus of our design is the introduction of an efficient paired-attention (EPA) block (Sec. 3.2). Each EPA block performs two tasks using parallel attention modules with shared keys-queries and different value layers to efficiently learn enriched spatial-channel feature representations. As illustrated in the EPA block diagram (on the right), the first (top) attention module aggregates the spatial features by a weighted sum of the projected features in a linear manner to compute the spatial attention maps, while the second (bottom) attention module emphasizes the dependencies in the channels and computes the channel attention maps. Finally, the outputs of the two attention modules are fused and passed to convolutional blocks to enhance the feature representation, leading to better segmentation masks.
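The patch-embedding step above can be sketched with a strided 3D convolution; the single input channel, the stage-1 width C = 32, and the use of `nn.Conv3d` are our illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

# Sketch of patch embedding: a non-overlapping Conv3d with kernel = stride =
# patch resolution (4, 4, 2) projects the volume into C channel dimensions.
C = 32                                      # hypothetical embedding dim for stage 1
patch = (4, 4, 2)
embed = nn.Conv3d(1, C, kernel_size=patch, stride=patch)

x = torch.randn(1, 1, 128, 128, 64)         # (B, channels, H, W, D), Synapse-like
feat = embed(x)                             # (1, C, H/4, W/4, D/2) = (1, 32, 32, 32, 32)
tokens = feat.flatten(2).transpose(1, 2)    # (1, N, C) with N = 32*32*32 = 32768
print(feat.shape, tokens.shape)
```

Flattening the three spatial axes into one token axis gives the sequence of length N that the EPA blocks operate on.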
Within our proposed UNETR++ framework, each EPA block comprises two attention modules to efficiently learn enriched spatial-channel feature representations by encoding the information in both spatial and channel dimensions with a shared keys-queries scheme. The encoder stages are connected with the decoder stages via skip connections to merge the outputs at different resolutions. This enables the recovery of the spatial information lost during the downsampling operations, leading to a more precise output prediction. Similar to the encoder, the decoder also comprises four stages, where each decoder stage consists of an upsampling layer using deconvolution to increase the resolution of the feature maps by a factor of two, followed by the EPA block (except the last decoder). The number of channels is decreased by a factor of two between every two decoder stages. Consequently, the outputs of the last decoder are fused with convolutional feature maps to recover the spatial information and enhance the feature representation. The resulting output is then fed into 3×3×3 and 1×1×1 convolutional blocks to generate the voxel-wise final mask predictions. Next, we present our EPA block in detail.

Efficient Paired-Attention Block
The proposed EPA block performs efficient global attention and effectively captures enriched spatial-channel feature representations. The EPA block comprises spatial attention and channel attention modules. The spatial attention module reduces the complexity of the self-attention from quadratic to linear. On the other hand, the channel attention module effectively learns the inter-dependencies between the channel feature maps. The EPA block is based on a shared keys-queries scheme between the two attention modules, so that they are mutually informed in order to generate a better and more efficient feature representation. This is likely due to learning complementary features by sharing the keys and queries but using different value layers.
As illustrated in Fig. 2 (right), the input feature maps x are fed into the channel and spatial attention modules of the EPA block. The weights of the Q and K linear layers are shared across the two attention modules, while a different V layer is used for each attention module. The two attention modules are computed as:

X̂_s = SA(Q_shared, K_shared, V_spatial),
X̂_c = CA(Q_shared, K_shared, V_channel),

where X̂_s and X̂_c denote the spatial and channel attention maps, respectively. SA is the spatial attention module, and CA is the channel attention module. Q_shared, K_shared, V_spatial, and V_channel are the matrices for shared queries, shared keys, spatial value layer, and channel value layer, respectively.

Spatial Attention: In this module, we strive to learn the spatial information efficiently by reducing the complexity from O(n²) to O(np), where n is the number of tokens and p is the dimension of the projected vector, with p ≪ n. Given a normalized tensor X of shape HWD×C, we compute the Q_shared, K_shared, and V_spatial projections using three linear layers, yielding Q_shared = W^Q X, K_shared = W^K X, and V_spatial = W^V X, each of dimension HWD×C, where W^Q, W^K, and W^V are the projection weights for Q_shared, K_shared, and V_spatial, respectively. Then, we perform three steps. First, the K_shared and V_spatial layers are projected from HWD×C into lower-dimensional matrices of shape p×C. Second, the spatial attention maps are computed by multiplying the Q_shared layer by the transpose of the projected K_shared, followed by a softmax to measure the similarity between each feature and the rest of the spatial features. Third, these similarities are multiplied by the projected V_spatial layer to produce the final spatial attention maps of shape HWD×C. The spatial attention is defined as follows:

X̂_s = softmax((Q_shared · K_proj^T) / √d) · Ṽ_spatial,

where Q_shared, K_proj, and Ṽ_spatial denote the shared queries, projected shared keys, and projected spatial value layer, respectively, and d is the size of each vector.
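A minimal PyTorch sketch of such a linear spatial-attention branch follows; the layer names, the n→p projection implemented with a linear layer over the token axis, and all sizes are illustrative assumptions rather than the released UNETR++ code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch of the EPA spatial branch: keys/values are projected from
    n = H*W*D tokens down to p tokens, so attention costs O(n*p) rather
    than O(n^2). Names and shapes are illustrative assumptions."""
    def __init__(self, dim, n_tokens, p=64):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)   # would be shared with channel branch
        self.k = nn.Linear(dim, dim, bias=False)   # would be shared with channel branch
        self.v = nn.Linear(dim, dim, bias=False)   # branch-specific value layer
        self.proj_k = nn.Linear(n_tokens, p)       # n -> p projection of keys
        self.proj_v = nn.Linear(n_tokens, p)       # n -> p projection of values
        self.scale = dim ** -0.5

    def forward(self, x):                          # x: (B, n, C)
        q = self.q(x)                                               # (B, n, C)
        k = self.proj_k(self.k(x).transpose(1, 2)).transpose(1, 2)  # (B, p, C)
        v = self.proj_v(self.v(x).transpose(1, 2)).transpose(1, 2)  # (B, p, C)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, n, p)
        return attn @ v                                             # (B, n, C)
```

Because the softmax matrix is n×p instead of n×n, memory and compute stay linear in the number of voxel tokens.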
Channel Attention: This module captures the inter-dependencies between feature channels by applying the dot-product operation in the channel dimension between the channel value layer and the channel attention maps. Using the same Q_shared and K_shared of the spatial attention module, we compute the value layer for the channels to learn the complementary features using a linear layer, yielding V_channel = W^V X, with dimensions HWD×C, where W^V is the projection weight for V_channel. The channel attention is defined as follows:

X̂_c = V_channel · softmax((Q_shared^T · K_shared) / √d),

where V_channel, Q_shared, and K_shared denote the channel value layer, shared queries, and shared keys, respectively, and d is the size of each vector. Finally, we perform sum fusion and transform the outputs of the two attention modules through convolution blocks to obtain enriched feature representations. The final output X̂ of the EPA block is obtained as:

X̂ = Conv1(Conv3(X̂_s + X̂_c)),

where X̂_s and X̂_c denote the spatial and channel attention maps, and Conv1 and Conv3 are 1×1×1 and 3×3×3 convolution blocks, respectively.
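The channel branch can be sketched as a small function under the same assumptions; the key point is that the similarity matrix is C×C, so the cost stays linear in the number of tokens:

```python
import torch
import torch.nn.functional as F

def channel_attention(q_shared, q_k_shared_dummy_scale=None, *, k_shared=None, v_channel=None):
    raise NotImplementedError  # placeholder removed below

def channel_attention(q_shared, k_shared, v_channel, scale=None):
    """Sketch of the EPA channel branch: a C x C attention map is computed
    between transposed queries and keys, then applied to the value layer.
    Illustrative only; not the released UNETR++ implementation."""
    B, n, C = q_shared.shape
    scale = scale or C ** -0.5
    attn = F.softmax(q_shared.transpose(1, 2) @ k_shared * scale, dim=-1)  # (B, C, C)
    return (attn @ v_channel.transpose(1, 2)).transpose(1, 2)              # (B, n, C)
```

In a full EPA block, the two branch outputs would then be summed and passed through the 3×3×3 and 1×1×1 convolution blocks described above.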

Loss Function
Our loss function is based on a summation of the commonly used soft Dice loss [25] and cross-entropy loss to simultaneously leverage the benefits of both complementary loss functions. It is defined as:

L(Y, P) = 1 − (2/I) Σ_{i=1}^{I} [ Σ_{v=1}^{V} Y_{v,i} · P_{v,i} / (Σ_{v=1}^{V} Y_{v,i}² + Σ_{v=1}^{V} P_{v,i}²) ] − (1/V) Σ_{i=1}^{I} Σ_{v=1}^{V} Y_{v,i} log(P_{v,i}),

where I denotes the number of classes; V denotes the number of voxels; Y_{v,i} and P_{v,i} denote the ground truths and output probabilities at voxel v for class i, respectively.
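A hedged PyTorch sketch of such a combined soft-Dice plus cross-entropy objective is given below; the smoothing constant, the per-class averaging, and the equal weighting of the two terms are our assumptions and may differ from the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Illustrative soft-Dice + cross-entropy loss.
    logits: (B, I, *spatial) raw scores; target: (B, *spatial) integer labels."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).movedim(-1, 1).float()  # (B, I, *spatial)
    dims = tuple(range(2, logits.ndim))                 # sum over all voxels
    inter = (probs * onehot).sum(dims)                  # per-class overlap
    denom = probs.sum(dims) + onehot.sum(dims)
    dice = 1 - (2 * inter + eps) / (denom + eps)        # per-class soft Dice loss
    return dice.mean() + F.cross_entropy(logits, target)
```

The Dice term directly optimizes region overlap, while the cross-entropy term provides smoother per-voxel gradients; summing them leverages both, as described above.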
Consistent with [36], we split the data into 70, 10, and 20 train, validation, and test samples. We report the DSC on the three classes. The BRaTs [24] dataset comprises 484 MRI images, where each image consists of four channels: FLAIR, T1w, T1gd, and T2w. We split the dataset into an 80:5:15 ratio for training, validation, and testing, and report on the test set. The target categories are whole tumor, enhancing tumor, and tumor core. The Decathlon-Lung [30] dataset comprises 63 CT volumes for a two-class problem with the goal of segmenting lung cancer from the background. We split the data into an 80:20 ratio for training and validation.

Evaluation Metrics: We measure the performance of the models based on two metrics: Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (HD95). DSC measures the overlap between the volumetric segmentation predictions and the voxels of the ground truths. It is defined as follows:

DSC(Y, P) = 2 |Y ∩ P| / (|Y| + |P|),

where Y and P denote the ground truths and output predictions for all voxels, respectively. HD95 is commonly used as a boundary-based metric to measure the 95th percentile of the distances between the boundaries of the volumetric segmentation predictions and the voxels of the ground truths. It is defined as follows:

HD95(Y, P) = max(d̂_YP, d̂_PY),

where d̂_YP is the maximum 95th percentile distance between the predicted voxels and the ground truth, and d̂_PY is the maximum 95th percentile distance between the ground truth and the predicted voxels.

Implementation Details: We implement our approach in PyTorch v1.10.1 using the MONAI libraries [27].
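The DSC definition above can be illustrated with a small NumPy example on a binary (single-class) mask; the smoothing term `eps` is an illustrative convention:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-5):
    """Foreground Dice: 2|Y ∩ P| / (|Y| + |P|) over binary voxel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# Toy 4x4x4 volume: prediction covers 2 slabs, ground truth covers 3.
pred = np.zeros((4, 4, 4)); pred[:2] = 1
gt = np.zeros((4, 4, 4)); gt[:3] = 1
print(round(dice_score(pred, gt), 3))  # prints 0.8
```

HD95 is typically computed from surface-to-surface distance distributions (e.g., via a distance transform) and is omitted here for brevity.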
Table 2. Baseline comparison on Synapse. We show the results in terms of segmentation performance (DSC) and model complexity (parameters and FLOPs). For a fair comparison, all results are obtained using the same input size and pre-processing. Integrating the EPA block in the encoders of our hierarchical design improves the segmentation performance to 85.17%. The results are further improved to 87.22% by also introducing the EPA block in the decoders. Our UNETR++ with the novel EPA block in both the encoders and decoders achieves an absolute gain of 8.87% in DSC, while also significantly reducing the model complexity.

Baseline Comparison
Tab. 2 shows the impact of integrating the proposed contributions within the baseline UNETR [13]. We first introduce a hierarchical design within the baseline architecture that downsamples the feature maps of the encoder by a factor of two after each stage. Hence, the model comprises four encoder stages and four decoder stages. This hierarchical design of our UNETR++ enables a significant reduction in model complexity by reducing the parameters from 92.49M to 16.60M and the FLOPs from 75.76G to 30.75G, while maintaining a comparable DSC of 78.29%, compared to the baseline. Introducing the EPA block within our UNETR++ encoders leads to a significant improvement in performance, with an absolute gain of 6.82% in DSC over the baseline. The performance is further improved by integrating the EPA block in the decoder. Our final UNETR++, having a hierarchical design with the novel EPA block in both the encoders and decoders, leads to a significant improvement of 8.87% in DSC, while considerably reducing the model complexity by 54% in parameters and 37% in FLOPs, compared to the baseline. We further conduct an experiment to evaluate our spatial and channel attention within the proposed EPA block. Employing spatial and channel attention individually improves the performance significantly, with DSC of 86.42% and 86.39%, respectively, over the baseline. Combining both spatial and channel attention within our EPA block leads to a further improvement, with a DSC of 87.22%.

Fig. 3 shows a qualitative comparison between the baseline and our UNETR++ on the Synapse dataset. We enlarge different organs (marked with green dashed boxes in the first row) from several cases. In column 1, the baseline struggles to segment the inferior vena cava and aorta. In column 2, it confuses the same two organs when they are adjacent to each other. In the last two columns, the baseline under-segments the left kidney, spleen, and stomach, whereas it over-segments the gallbladder. In contrast, UNETR++ achieves improved performance by accurately segmenting all organs.

State-of-the-Art Comparison
Synapse Dataset: Tab. 1 shows the results on the multi-organ Synapse dataset. We report the segmentation performance using the DSC and HD95 metrics on the abdominal organs. In addition, we report the model complexity in terms of parameters and FLOPs for each method. The segmentation performance is reported with single-model accuracy and without utilizing any pre-training, model ensemble, or additional data. The pure CNN-based U-Net [28] approach achieves a DSC of 76.85%. Among existing hybrid transformer-CNN based methods, UNETR [13] and Swin-UNETR [12] achieve DSC of 78.35% and 83.48%, respectively. On this dataset, nnFormer [36] obtains superior performance compared to other existing works. Our UNETR++ outperforms nnFormer by achieving a DSC of 87.22%. Further, UNETR++ obtains an absolute reduction in error of 3.1% over nnFormer in terms of the HD95 metric. Notably, UNETR++ achieves this improvement in segmentation performance while significantly reducing the model complexity by over 71% in terms of parameters and FLOPs.

Fig. 4 shows a qualitative comparison of UNETR++ with existing approaches on abdominal multi-organ segmentation. Here, the inaccurate segmentations are marked with red dashed boxes. In the first row, we observe that existing approaches struggle to accurately segment the stomach, either under-segmenting it in the case of UNETR and Swin UNETR or confusing it with the spleen in the case of nnFormer. In comparison, our UNETR++ accurately segments the stomach. Further, existing methods fail to fully segment the right kidney in the second row. In contrast, our UNETR++ accurately segments the whole right kidney, likely due to learning the contextual information with the enriched spatial-channel representation. Moreover, UNETR++ smoothly delineates the boundaries between the spleen, stomach, and liver. In the third row, UNETR confuses the stomach with the pancreas. On the other hand, Swin UNETR and nnFormer under-segment the stomach and left adrenal gland, respectively. UNETR++ accurately segments all organs with better delineation of the boundaries in these examples.
BTCV Dataset: Tab. 3 presents the comparison on the BTCV test set. Here, all results are based on single-model accuracy without any ensemble, pre-training, or additional data. We report results on all 13 organs along with the corresponding mean performance over all organs. Among existing works, UNETR and Swin UNETR achieve a mean DSC of 76.0% and 80.44%, respectively. Among existing methods, nnUNet obtains a performance of 83.16% mean DSC, but requires 358G FLOPs. In comparison, UNETR++ performs favorably against nnUNet by achieving a mean DSC of 83.28%, while requiring significantly fewer FLOPs (31G).
ACDC Dataset: Tab. 4 presents the comparison on the ACDC dataset, where all results are based on single-model accuracy without any ensemble, pre-training, or additional data. UNETR and nnFormer achieve mean DSC of 86.61% and 92.06%, respectively. UNETR++ achieves improved performance with a mean DSC of 92.83%.

BRaTs Dataset: Tab. 5 shows the segmentation performance, model complexity, and inference time. For a fair comparison, we use the same input size and pre-processing strategy.
We compare speed on a Quadro RTX 6000 24 GB GPU and a 32-core Intel(R) Xeon(R) 4215 CPU. Here, the inference time is the average forward-pass time using a 1×128×128×128 input size on BRaTs. Compared to recent transformer-based methods, our UNETR++ achieves favorable performance while operating at a faster inference speed as well as requiring significantly less GPU memory.

Conclusion
We propose a hierarchical approach, named UNETR++, for 3D medical segmentation. Our UNETR++ introduces an efficient paired attention (EPA) block to encode enriched inter-dependent spatial and channel features by using spatial and channel attention. Within the EPA block, we share the weights of the query and key mapping functions to better communicate between the spatial and channel branches, providing complementary benefits as well as reducing the parameters.
Our UNETR++ achieves favorable segmentation results on five datasets while significantly reducing the model complexity with better speed, compared to existing methods.

Supplemental Material
In this section, we provide additional details regarding:
• Implementation Details (Appendix A)
• Qualitative Results (Appendix B)
• Ablations (Appendix C)
• Discussion (Appendix D)

A. Additional Implementation Details
Overall Architecture: As presented in Fig. 2 and described in Sec. 3.1, our architecture consists of a hierarchical encoder-decoder structure. The encoder has four stages, in which the number of channels at the stages is [32, 64, 128, 256], and each stage has three EPA blocks with the number of heads set to four. Similarly, the decoder has four stages, each consisting of upsampling using deconvolution followed by three EPA blocks. The deconvolutional layers increase the resolution of the feature maps by a factor of two. However, we use a 3×3×3 convolutional block at the last stage to avoid the heavy self-attention computation, as the spatial size at this stage is significantly larger (i.e., [128, 128, 64, 16] in the case of the Synapse dataset). The output of the last decoder stage is fused with convolutional features to recover the spatial information and enhance the feature representation. The outputs are then fed into 3×3×3 and 1×1×1 convolutional layers to generate the voxel-wise mask predictions.

Training Details: For the Synapse dataset, all the models are trained for 1K epochs with inputs of size 128×128×64. For BTCV, we follow the same training recipe as in [13] and train all the models at 96×96×96 resolution for 5K epochs. For ACDC, Decathlon-Lung, and BRaTs, we train all the models at 160×160×16, 192×192×34, and 128×128×128 resolutions, respectively. All other training hyper-parameters are the same as in [36]. Further, we add a learnable positional encoding to the input of each EPA block. Our code and pretrained models will be made publicly available to reproduce our results.
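The stage-wise encoder feature-map shapes implied by this configuration can be walked through numerically; we assume, as described above, a (4, 4, 2) patch embedding on a 128×128×64 Synapse input and a halving of every spatial dimension at each subsequent stage (an illustrative reading of the text, not the released code):

```python
# Walk through encoder stages: channels follow [32, 64, 128, 256]; the patch
# embedding uses stride (4, 4, 2); each later stage halves every spatial dim.
shape, channels = (128, 128, 64), [32, 64, 128, 256]
h, w, d = shape[0] // 4, shape[1] // 4, shape[2] // 2   # after patch embedding

stages = []
for c in channels:
    stages.append((c, (h, w, d), h * w * d))            # (channels, shape, tokens)
    h, w, d = h // 2, w // 2, d // 2                    # downsample for next stage

for c, s, n in stages:
    print(c, s, n)
```

This makes explicit why the token count per stage (32768 down to 64) motivates linear rather than quadratic attention in the early, high-resolution stages.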

B. Additional Qualitative Results
In this section, we provide additional qualitative comparisons on the Synapse and ACDC datasets between UNETR++ and the state-of-the-art methods. Moreover, we provide a detailed comparison between UNETR++ and the baseline for the Synapse, ACDC, and Decathlon-Lung datasets.

B.1. Synapse Dataset
Fig. 5 shows qualitative comparisons for different cases between UNETR++ and existing approaches on the Synapse dataset. The inaccurate predictions are marked with red dashed boxes. In the first and second rows, UNETR++ successfully differentiates stomach tissues of different sizes as well as the spleen, while nnFormer struggles to differentiate between the spleen and stomach, and UNETR and Swin UNETR struggle to differentiate between the stomach and the background, demonstrating that UNETR++ provides better segmentation predictions at different scales. In the third row, UNETR++ accurately segments all the organs, while the other existing methods under-segment the left adrenal gland or spleen, and UNETR over-segments the stomach. As illustrated in the fourth row, UNETR++ delineates the boundaries of the inferior vena cava well, whereas all existing methods struggle and confuse it with the background. In the last row, the existing methods under-segment the stomach, and UNETR additionally confuses the pancreas with the portal and splenic veins. As illustrated, UNETR++ better delineates the boundaries of the different organs without under- or over-segmenting, thus suggesting that UNETR++ encodes enriched inter-dependent spatial and channel features within the proposed EPA block.

B.2. ACDC Dataset
Fig. 6 shows qualitative comparisons for different cases between UNETR++ and the existing approaches, nnFormer and UNETR, on the ACDC dataset. The inaccurate predictions are marked with red dashed boxes. In the first row, UNETR and nnFormer under-segment the right ventricular (RV) cavity, while our UNETR++ accurately segments all three categories. In the second row, we present a difficult sample where all three heart segments are comparatively small. In this case, both UNETR and nnFormer under-segment and struggle to delineate between the segments, while UNETR++ gives a better segmentation. In the last row, we present a simpler sample. However, the existing methods over-segment the RV cavity and the myocardium in this case, while UNETR++ provides better delineation and a segmentation very close to the ground truth. Similar to the observation from Synapse, these qualitative examples show that UNETR++ delineates the three heart segments without under- or over-segmenting, suggesting the importance of the inter-dependent spatial and channel features encoded in the proposed EPA block.

B.3. Detailed qualitative comparison between UNETR++ and the baseline
Fig. 7 shows a qualitative comparison between UNETR++ and the baseline UNETR on the Synapse dataset. We present enlarged views of different organs (marked with green dashed boxes in the first row) from several cases for better analysis. In the first column, UNETR++ delineates the outline of the pancreas well, while the baseline notably struggles in segmenting the pancreas and under-segments the stomach. In the second column, the baseline under-segments the inferior vena cava and the portal and splenic veins, while UNETR++ segments these organs precisely. In the third column, the baseline struggles to segment the inferior vena cava and the aorta. In the last two columns, the baseline under-segments the stomach and struggles in delineating the boundaries of the spleen and the left kidney. In contrast, UNETR++ achieves improved performance and accurately segments all these organs with better delineation. We further show the 3D rendered segmentation results of UNETR++ in comparison to UNETR in Fig. 9.
In Fig. 8, we show a qualitative comparison between our UNETR++ and the baseline on the ACDC dataset. In all three rows, UNETR suffers from under-segmentation and struggles in delineating the boundaries of the right ventricular (RV) cavity, while UNETR++ segments all three regions more precisely. In addition, we show in Fig. 10 another baseline comparison on the Decathlon-Lung dataset. In the first two rows, UNETR++ has fewer false positives, while in the third row, UNETR under-segments the whole tumor and UNETR++ segments it correctly.

C. Additional Ablations
To investigate the scalability of UNETR++, we design an experiment with feature maps of size [64, 128, 256, 512] instead of [32, 64, 128, 256] on the BTCV dataset. Although this change increases the number of parameters to 94.24M and the FLOPs to 117G, the average Dice similarity coefficient (DSC) improves from 83.28% to 84.27%, demonstrating the scalability of UNETR++ without using any ensemble, pre-training, or additional custom data.
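A back-of-the-envelope calculation shows why doubling every stage width roughly quadruples the parameter count: a k×k×k convolution between adjacent stages costs C_in · C_out · k³ (plus bias) parameters, and both channel counts double. The sketch below is illustrative only; it does not model the actual EPA blocks or deconvolutions.

```python
# Hedged sketch (not the real model): parameter growth when the stage
# widths are doubled from [32, 64, 128, 256] to [64, 128, 256, 512].

def conv3d_params(c_in, c_out, k=3):
    """Parameter count of a single 3D convolution with bias."""
    return c_in * c_out * k ** 3 + c_out

def transition_params(channels, k=3):
    """Sum over convolutions linking consecutive stage widths."""
    return sum(conv3d_params(a, b, k) for a, b in zip(channels, channels[1:]))

small = transition_params([32, 64, 128, 256])   # default widths
large = transition_params([64, 128, 256, 512])  # doubled widths
print(round(large / small, 3))  # close to 4x
```

The bias terms keep the ratio marginally below four; the dominant C_in · C_out · k³ term quadruples exactly, consistent with the roughly 2.2M-to-94.24M jump once all layers are widened.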
To validate the effectiveness of our EPA block, we conduct experiments on Synapse to compare our EPA module with other attention methods. (i) We integrate the gated attention (GA) from the attention-gated U-Net method within nnUNet (col. 3 in Tab. 6). (ii) We replace our EPA module in UNETR++, on top of the proposed hierarchical approach, with GA (col. 4 in Tab. 6) and with squeeze-and-excitation (SE) (col. 5 in Tab. 6). Our UNETR++ achieves superior results compared to the other attention methods.
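For context on the SE baseline in this ablation, the following is a minimal NumPy sketch of the generic squeeze-and-excitation recipe (global pooling, a small bottleneck MLP, and a sigmoid channel gate); it is the textbook SE formulation, not the exact ablation code, and the weight shapes are illustrative.

```python
# Generic squeeze-and-excitation (SE) channel attention on a 3D feature
# map, as a point of comparison for the EPA channel branch.
import numpy as np

def squeeze_excite(x, w1, w2):
    """x: (C, D, H, W) features; w1: (C//r, C), w2: (C, C//r) MLP weights."""
    z = x.mean(axis=(1, 2, 3))              # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)             # excitation bottleneck, ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # sigmoid gate in (0, 1)
    return x * s[:, None, None, None]       # channel-wise reweighting

rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.standard_normal((C, 4, 4, 4))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excite(x, w1, w2)
assert y.shape == x.shape
```

Unlike EPA, SE attends only along the channel dimension and discards all spatial structure during the squeeze, which is one plausible reason the paired spatial-and-channel formulation scores higher in Tab. 6.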

D. Discussion
In this paper, we present a hierarchical approach, named UNETR++, that achieves promising segmentation results on five datasets (Synapse, ACDC, BTCV, BRaTs, and Decathlon-Lung) while significantly reducing the model complexity and memory consumption, and improving the inference speed compared to existing methods. The proposed efficient paired attention (EPA) block encodes enriched inter-dependent spatial and channel features using spatial and channel attention. To observe potential limitations of UNETR++, we analyze different outlier cases of Synapse. Although our predictions are better than those of existing methods and closer to the ground truth, we find a few cases where our model, as well as the existing methods, struggles to segment certain organs. When the geometric shape of the organs in a few slices is abnormal (delineated by thin borders), our model and the existing models struggle to segment them accurately. The reason might be the limited availability of training samples with such abnormal shapes compared to normal samples. We plan to address this problem by applying geometric data augmentation techniques at the pre-processing stage.
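The proposed geometric augmentation can be sketched as follows. This is a minimal illustration, not the authors' pipeline: random axis flips and 90-degree in-plane rotations are applied jointly to the image volume and its label mask, so voxel-wise alignment is preserved.

```python
# Hedged sketch of joint geometric augmentation for a 3D image/label
# pair. The choice of flips and 90-degree rotations is illustrative.
import numpy as np

def random_geometric_augment(image, label, rng):
    """Apply the same random flips and in-plane rotation to both arrays."""
    for axis in range(3):
        if rng.random() < 0.5:
            image = np.flip(image, axis=axis)
            label = np.flip(label, axis=axis)
    k = int(rng.integers(0, 4))  # number of 90-degree in-plane rotations
    image = np.rot90(image, k=k, axes=(0, 1))
    label = np.rot90(label, k=k, axes=(0, 1))
    return np.ascontiguousarray(image), np.ascontiguousarray(label)

rng = np.random.default_rng(0)
img = np.arange(2 * 2 * 2).reshape(2, 2, 2).astype(float)
msk = (img > 3).astype(int)
aug_img, aug_msk = random_geometric_augment(img, msk, rng)
assert aug_img.shape == img.shape and aug_msk.shape == msk.shape
```

Because the identical transform is applied to both arrays, any voxel-wise relation between image and mask (here, mask = image > 3) survives the augmentation; more aggressive elastic or affine deformations would need the label interpolated with nearest-neighbor sampling.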

Table 3.
State-of-the-art comparison on the BTCV test set for multi-organ segmentation. All results are obtained with single-model accuracy and without any ensemble, pre-training, or additional custom data. Our UNETR++ achieves favorable segmentation performance against existing 3D image segmentation methods. Abbreviations are as follows: Spl: spleen, RKid: right kidney, LKid: left kidney, Gal: gallbladder, Eso: esophagus, Liv: liver, Sto: stomach, Aor: aorta, IVC: inferior vena cava, PSV: portal and splenic veins, Pan: pancreas, RAG: right adrenal gland, LAG: left adrenal gland. Results are obtained from the BTCV leaderboard.

Figure 3. Qualitative comparison between UNETR++ and the baseline UNETR on Synapse. For better visualization, we enlarge different areas (marked with green dashed boxes) in the images. The inaccurate segmentations are marked with red dashed boxes. Compared to the baseline, UNETR++ achieves superior segmentation performance. Best viewed in zoom.

Figure 4. Qualitative comparison on the multi-organ segmentation task. Here, we compare our UNETR++ with existing methods: UNETR, Swin UNETR, and nnFormer. Existing methods struggle to correctly segment different organs (marked with red dashed boxes). Our UNETR++ achieves promising segmentation performance by accurately segmenting the organs. Best viewed in zoom.

Figure 5. Additional qualitative comparison on the Synapse dataset. We compare our UNETR++ with existing methods: UNETR, Swin UNETR, and nnFormer. It is noticeable that the existing methods struggle to correctly segment different organs (marked with red dashed boxes). Our UNETR++ achieves promising segmentation performance by accurately segmenting the organs. Best viewed zoomed in.

Figure 6. Qualitative comparison on the ACDC dataset. We compare our UNETR++ with existing methods: UNETR and nnFormer. It is noticeable that the existing methods struggle to correctly segment different organs (marked with red dashed boxes). Our UNETR++ achieves favorable segmentation performance by accurately segmenting the organs. Best viewed zoomed in.

Figure 7. Additional qualitative comparison between UNETR++ and the baseline UNETR. The baseline struggles to correctly segment different organs (marked with red dashed boxes). We enlarge multiple organs (marked with green dashed boxes in the first row) from several cases. Our UNETR++ achieves promising segmentation performance by accurately segmenting the organs. Best viewed zoomed in.

Figure 8. Additional qualitative comparison between UNETR++ and the baseline UNETR. The baseline struggles to correctly segment different heart regions (marked with red dashed boxes). We enlarge the regions from several cases. Our UNETR++ achieves promising segmentation performance by accurately segmenting all regions. Best viewed zoomed in.

Figure 9. Qualitative comparison between the baseline UNETR [13] and our UNETR++ on the Synapse dataset. Each inaccurately segmented region is marked with a white dashed box. UNETR++ better segments the organs compared to the baseline.

Figure 10. Qualitative comparison between the baseline UNETR [13] and our UNETR++ on the Decathlon-Lung dataset. The enlarged area is marked with a green box. UNETR++ achieves better segmentation with fewer false positives for the tumors compared to the baseline. Best viewed zoomed in.

Table 6 data (DSC, %):
Model | nnUNet | Attention nnUNet (w/ GA) | EPA replaced w/ GA | EPA replaced w/ SE | UNETR++
DSC   | 84.2   | 85.0                     | 85.3               | 85.5               | 87.2

The Synapse [19] dataset consists of abdominal CT scans of 30 subjects with 8 organs. Consistent with previous approaches, we follow the splits used in [5] and train our model on 18 samples and evaluate on the remaining 12 cases. We report the model performance using the Dice Similarity Coefficient (DSC). The BTCV dataset covers 13 abdominal organs, including the esophagus, inferior vena cava, portal and splenic veins, and the right and left adrenal glands; we report the DSC on all 13 abdominal organs. The ACDC [1] dataset comprises cardiac MRI images of 100 patients, with segmentation annotations for the right ventricle (RV), left ventricle (LV), and myocardium (MYO).

Table 1. State-of-the-art comparison on the abdominal multi-organ Synapse dataset. We report both the segmentation performance (DSC, HD95) and the model complexity (parameters and FLOPs). Our proposed UNETR++ achieves favorable segmentation performance against existing methods, while considerably reducing the model complexity. Abbreviations stand for: Spl: spleen, RKid: right kidney, LKid: left kidney, Gal: gallbladder, Liv: liver, Sto: stomach, Aor: aorta, Pan: pancreas. Best results are in bold.
Tab. 1 shows the comparison on Synapse. In addition to the Dice Similarity Coefficient (DSC), we report the model complexity in terms of parameters and FLOPs. In all cases, we report performance in terms of single-model accuracy. As discussed earlier, UNETR++ is a hierarchical approach.

Table 4 shows the comparison on ACDC. Here, all results are reported with single-model accuracy and without using any pre-training, model ensemble, or additional data.

Table 4. State-of-the-art comparison on ACDC. We report the performance on the right ventricle (RV), left ventricle (LV), and myocardium (MYO), along with mean results, using the DSC metric.

Table 5. Comparison on BRaTs. UNETR++ achieves favorable segmentation results (DSC), while being efficient (Params in millions and GFLOPs), operating at a faster inference speed (GPU T. and CPU T. in ms), and requiring less GPU memory (Mem in GB).

Table 6. Comparison with other attention methods on Synapse.