SGU-Net: Shape-Guided Ultralight Network for Abdominal Image Segmentation

Convolutional neural networks (CNNs) have achieved significant success in medical image segmentation. However, they require a large number of parameters, which makes them difficult to deploy on low-resource hardware such as embedded systems and mobile devices. Although some compact or less memory-hungry models have been reported, most of them degrade segmentation accuracy. To address this issue, we propose a shape-guided ultralight network (SGU-Net) with extremely low computational costs. The proposed SGU-Net makes two main contributions. First, it presents an ultralight convolution that implements two separable convolutions simultaneously, i.e., asymmetric convolution and depthwise separable convolution. The proposed ultralight convolution not only effectively reduces the number of parameters but also enhances the robustness of SGU-Net. Second, SGU-Net employs an additional adversarial shape constraint that lets the network learn the shape representation of targets, which can significantly improve the segmentation accuracy for abdominal medical images through self-supervision. SGU-Net is extensively tested on four public benchmark datasets: LiTS, CHAOS, NIH-TCIA and 3Dircbdb. Experimental results show that SGU-Net achieves higher segmentation accuracy with lower memory costs and outperforms state-of-the-art networks. Moreover, we apply our ultralight convolution to a 3D volume segmentation network, which obtains comparable performance with fewer parameters and lower memory usage.


I. INTRODUCTION
MEDICAL image segmentation aims to make anatomical or pathological structural changes clearer in images, and it often plays a key role in computer-aided diagnosis and smart medicine because it greatly improves diagnostic efficiency and accuracy. To help clinicians make accurate diagnoses, it is necessary to segment crucial organs and targets in abdominal medical images and extract features from the segmented targets [1]. In particular, it is more difficult to extract discriminative features from medical images than from normal RGB images since the former usually suffer from blur, noise, low contrast, etc. In recent years, deep learning, especially the U-shaped encoder-decoder network [2], has been widely used in medical image segmentation due to its excellent performance. As the encoder of the U-shaped network [2] used for feature learning is insensitive to image noise, blur, low contrast, etc., many improved U-shaped networks such as U-Net++ [3], mU-Net [4], Attention U-Net [5], TransUNet [6], and Swin-Unet [7] can provide excellent segmentation results for medical images. Although these networks achieve high segmentation accuracy, they are complex, with a large number of parameters and high memory usage, which makes them difficult to deploy on mobile devices. How to balance network complexity and segmentation accuracy is a challenge. Fig. 1 shows the Dice value and the number of parameters of different networks on the CHAOS-CT [8] dataset. We can see that some medical image segmentation models have a huge number of parameters, e.g., R2U-Net [9] with 39.09 M, Attention U-Net [5] with 34.88 M and V-Net [10] with 65.17 M. It is clear that most of these high-accuracy networks are unsuitable for deployment on mobile devices. Although some lightweight networks [3] have been reported, they may suffer from serious performance degradation under low computing resources.
To address the above issues, we present a shape-guided ultralight network (SGU-Net) with extremely low computational costs for medical image segmentation.
To improve computational efficiency and reduce the number of parameters, we propose an ultralight convolution (UC), a plug-and-play operation that can be used in arbitrary networks. Compared with the popular asymmetric convolution [11] and depthwise separable convolution [12], the ultralight convolution has obvious advantages in reducing the number of parameters and improving feature representation ability. Specifically, for the input feature maps, the ultralight convolution first performs a depthwise asymmetric convolution that consists of a cascade of 1 × k and k × 1 convolutions. Then, a pointwise convolution is used to obtain the output feature maps.
To improve segmentation accuracy, we present a shape adversarial autoencoder (SAAE), an additional self-supervision strategy that raises the performance of our segmentation network by alternately training SAAE and the segmentation network. The proposed SAAE works in a completely different way from popular autoencoder-based shape constraint methods [13], [14]. Specifically, we use an autoencoder to explore the ability of CNNs to represent the shape of predicted targets in low-dimensional manifold. It is worth mentioning that the proposed SAAE and the segmentation network are trained cooperatively, which not only forces the segmentation network to output targets with more realistic shape information but also serves as a cost-free supervision operation for the segmentation network.
The experimental results show that SGU-Net not only obtains higher segmentation performance but also provides better target shape prediction. Besides, the SGU-Net requires fewer parameters and lower computational costs with 4.99 M and 4.98 GFLOPs, respectively.

II. RELATED WORK
Medical Segmentation Networks: Currently, most medical image segmentation networks are based on the U-shaped architecture. These networks can be roughly grouped into two categories, which are used for 2D images and 3D volumetric data, respectively.
For 2D medical image segmentation, residual and dense connections are popular for improving network performance, as in ResUNet [15], mU-Net [4], and DenseUNet [16]. These improved networks replace each submodule of U-Net with residual and dense connections, respectively. This improvement can accelerate model convergence and improve feature reuse, as in U-Net++ [3] and R2U-Net [9]. It has been demonstrated that the attention mechanism is very useful for improving the feature representation ability of networks. Inspired by this, Attention U-Net [5] with spatial attention, ResUNet++ [17] with channel attention, and Non-local U-Net [18] with a self-attention mechanism have been proposed and used for different segmentation tasks to overcome the drawback of feature utilization in U-Net. Besides the attention mechanism, multiscale feature fusion, for example the atrous spatial pyramid pooling (ASPP) module [19], is also a useful way of improving network performance. By integrating ASPP into U-shaped networks, both CE-Net [20] and DefED-Net [21] achieve better target segmentation in medical images.
For medical volumetric data, 2D CNNs are often limited since they ignore the inter-slice (temporal) information of volumetric data. To overcome this drawback, 3D CNN-based models such as 3D U-Net [22] and V-Net [10] have been proposed. Although these 3D networks can simultaneously explore the inter-slice information and the intra-slice spatial information, they suffer from some new problems such as more parameters, higher memory usage, and much narrower receptive fields than 2D networks [23].
As human organs usually have a fixed shape and position, incorporating prior knowledge about target shape and position is crucial for improving medical image segmentation. Mosinska et al. [24] used a pre-trained model to constrain the shape of segmentation targets. Li et al. [25] proposed a shape perception strategy based on generative adversarial networks (GANs). Lei et al. [26] proposed a network based on adversarial consistency learning and dynamic convolution. Al Arif et al. [27] used signed distance functions (SDFs) generated by a modified U-Net instead of segmentation maps to obtain better topology prediction results. Furthermore, some researchers [28], [29] used an autoencoder to constrain the shape of segmented targets. However, the interpretability of how prior information is used is insufficient in the above methods.
Fig. 2. The overall architecture of SGU-Net. The SGU-Net consists of two parts: the segmentation network on the left and the shape adversarial autoencoder (SAAE) on the right. SAAE encodes both segmentation results and labels into low-dimensional manifold, aiming to constrain their shape representation in low-dimensional manifold. The performance of the segmentation network is improved by adversarial training between SAAE and the segmentation network. It is worth mentioning that SAAE is an additional network that is only used in the training stage; once training is over, SAAE is removed and plays no part in the testing stage.

Lightweight Segmentation Networks: Small medical image segmentation models require a good trade-off between segmentation accuracy and model size for clinical mobile devices. The methods used for lightweight network design can be roughly categorized into two groups: model compression and model compaction. For a given model, the purpose of model compression is to reduce the computational costs as well as the number of parameters. Common model compression methods can be divided into three categories. The first is model pruning [30], [31], [32], which aims to cut off unnecessary connections between neurons for further speedup in practice. In medical image segmentation tasks, for example, U-Net++ [3] uses model pruning to reduce the number of parameters. The second is model quantization [33], [34], which reduces the number of bits required for each weight to compress the original network; for example, binarization methods with only 1-bit values can greatly accelerate model inference through efficient binary operations. The third is knowledge distillation [35], [36], [37], [38], which uses larger models to teach smaller models and thereby improves the performance of the smaller models. The performance of these methods usually depends on the given pre-trained models. As for model compaction, a lot of work has been reported in recent years.
MobileNets [12], [39] proposes the depthwise separable convolution that decomposes a vanilla convolution into a depthwise convolution and a pointwise convolution. Lei et al. [40] and Zhang et al. [41] extended the depthwise separable convolution to 3D networks and applied it to medical image segmentation, effectively reducing the number of model parameters and computational costs. In addition, GhostNet [42] proposes a Ghost module to generate more feature maps from cheap operations. Lo et al. [11] and Szegedy et al. [43] used the strategy of decomposing the standard 3 × 3 convolution into 3 × 1 and 1 × 3 convolutions to reduce the number of parameters and computational costs at the expense of slight performance degradation. ShuffleNet [44] divides the convolutions into multiple groups in a similar way to [12], which leads to a significant reduction in FLOPs with a rather small decrease on accuracy. By combining asymmetric convolution and dilation convolution, researchers [45], [46] further designed a depthwise asymmetric dilation convolution to reduce the number of parameters of models.
Compared to previous work such as asymmetric convolution [12] and depthwise separable convolution [47], our ultralight convolution not only achieves a higher model compression ratio but also provides better feature representation ability. Compared with the current autoencoder-based shape constraint methods [13], [14], our SAAE acts as an additional self-supervision that explores a more accurate representation of shapes in low-dimensional manifold. SAAE provides cost-free accuracy gains since the segmentation network employed by SGU-Net works independently of SAAE during the testing stage.

III. METHOD
The overall architecture of SGU-Net is shown in Fig. 2. It consists of two parts: the segmentation network and the shape adversarial autoencoder (SAAE). Compared to U-Net, on the one hand, the segmentation network uses ultralight convolutional groups instead of vanilla convolutional groups in the encoding stage. On the other hand, since deconvolution may cause the grid effect [48], which is unfavorable to pixel-level segmentation, the deconvolution in the vanilla U-Net is replaced by a combination of upsampling and ultralight convolutional groups. The shape-guided module adds additional shape constraints to the segmentation network, which encourages the predictions of the segmentation network to be consistent with the shape of the organ by encoding shape information into low-dimensional manifold.

A. Ultralight Convolution
Overview: The proposed ultralight convolution integrates the advantages of both asymmetric convolution and depthwise separable convolution. We factorize a vanilla convolution into a depthwise asymmetric convolution and a pointwise convolution. For SGU-Net, the depthwise asymmetric convolution applies asymmetric convolution to each input channel, and then the pointwise convolution is used to merge channel information. The vanilla convolution simultaneously performs filtering over the channel and spatial dimensions and merges inputs to form a new output, while the proposed ultralight convolution divides this into three layers, namely the horizontal and vertical convolution layers for filtering and a separate layer for merging. This decomposition significantly reduces computational costs and model size.

Fig. 3 shows the comparison between ultralight convolution and popular convolution strategies including vanilla convolution, asymmetric convolution and depthwise separable convolution. According to Fig. 3, both asymmetric convolution and depthwise separable convolution reduce the number of parameters and computational costs compared to vanilla convolution. However, depthwise separable convolution is superior to asymmetric convolution since it decouples the spatial convolution from the channel convolution, leading to more lightweight networks. Compared to asymmetric convolution and depthwise separable convolution, our proposed ultralight convolution has the following advantages: a) Compared with asymmetric convolution as shown in Fig. 3(a), our ultralight convolution includes two stages, namely depthwise asymmetric convolution and pointwise convolution, so it can effectively decouple the spatial and channel dimensions of the convolution operation, leading to more efficient model compression. b) Compared with depthwise separable convolution as shown in Fig. 3(b), our ultralight convolution implements depthwise asymmetric convolution whereas the former implements standard depthwise convolution; this achieves the spatial decomposition of convolution kernels and is especially helpful for improving feature extraction of irregular organs in abdominal image segmentation.

Complexity Analysis: The vanilla convolution takes the feature maps $X \in \mathbb{R}^{H_x \times W_x \times M}$ as input and produces the feature maps $Y \in \mathbb{R}^{H_y \times W_y \times N}$, where $H_x$ and $W_x$ are the spatial height and width of $X$, $M$ is the number of channels of $X$, $H_y$ and $W_y$ are the spatial height and width of $Y$, and $N$ is the number of channels of $Y$. A vanilla convolution layer usually employs convolutional kernels of size $D_k \times D_k \times M \times N$, where $D_k$ is the size of the convolution kernel, $M$ is the number of input channels, and $N$ is the number of output channels. Consequently, the computational cost of the vanilla convolution is

$$D_k \times D_k \times M \times N \times H_y \times W_y.$$

In fact, the vanilla convolution directly outputs feature maps by implementing a complex convolution operation. In contrast, the proposed ultralight convolution factorizes a vanilla convolution into two processes. First, the input feature maps are filtered channel by channel using asymmetric convolution. Second, a pointwise convolution (1 × 1 convolution) is used to create a linear combination of the filtered outputs. It is worth mentioning that both normalization and activation are also implemented in these two steps.
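The two processes above can be illustrated with a minimal pure-Python sketch. It assumes stride 1, zero padding, odd kernel length, and cross-correlation (the usual deep learning convention), and it leaves out the normalization and activation mentioned above. The function names (`conv_1xk`, `conv_kx1`, `pointwise`, `ultralight_conv`) and the toy kernels are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import List

Channel = List[List[float]]   # one feature map, H x W
FeatureMaps = List[Channel]   # M feature maps


def conv_1xk(x: Channel, kernel: List[float]) -> Channel:
    """Slide a 1 x k kernel along every row (stride 1, zero padding, odd k)."""
    pad = len(kernel) // 2
    height, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for t, weight in enumerate(kernel):
                jj = j + t - pad
                if 0 <= jj < width:
                    out[i][j] += weight * x[i][jj]
    return out


def transpose(x: Channel) -> Channel:
    return [list(col) for col in zip(*x)]


def conv_kx1(x: Channel, kernel: List[float]) -> Channel:
    """Slide a k x 1 kernel along every column by reusing the row version."""
    return transpose(conv_1xk(transpose(x), kernel))


def pointwise(x: FeatureMaps, weights: List[List[float]]) -> FeatureMaps:
    """1 x 1 convolution: weights[n][m] mixes input channel m into output n."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w_n[m] * x[m][i][j] for m in range(len(x))) for j in range(width)]
         for i in range(height)]
        for w_n in weights
    ]


def ultralight_conv(x: FeatureMaps,
                    row_kernels: List[List[float]],
                    col_kernels: List[List[float]],
                    pw_weights: List[List[float]]) -> FeatureMaps:
    """Depthwise 1 x k, then depthwise k x 1, then pointwise merging."""
    filtered = [conv_kx1(conv_1xk(ch, rk), ck)
                for ch, rk, ck in zip(x, row_kernels, col_kernels)]
    return pointwise(filtered, pw_weights)


# Toy input: one channel holding a unit impulse, mapped to two output channels.
impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
y = ultralight_conv([impulse], [[1.0, 2.0, 1.0]], [[1.0, 0.0, -1.0]], [[1.0], [2.0]])
```

Each input channel has its own pair of one-dimensional kernels, so spatial filtering never mixes channels; only the final `pointwise` step does.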
It is clear that the computational cost of the k × k depthwise convolution is $D_k \times D_k \times M \times H_y \times W_y$. In our ultralight convolution, we factorize the k × k depthwise convolution kernel into a k × 1 kernel and a 1 × k kernel. As a result, the computational cost of the depthwise asymmetric convolution, denoted by $C_{da}$, is

$$C_{da} = 2 \times D_k \times M \times H_y \times W_y.$$

Since $2 D_k < D_k^2$ whenever $D_k > 2$, we can see that the further factorization of the depthwise convolution reduces the computational complexity.
As the computational cost of a pointwise convolution (1 × 1 convolution) is $M \times N \times H_y \times W_y$, the computational cost of the ultralight convolution, denoted by $C_{uc}$, is

$$C_{uc} = 2 \times D_k \times M \times H_y \times W_y + M \times N \times H_y \times W_y.$$

Although the basic ultralight convolution requires a small number of parameters and low computational costs, it may still be too heavy in some specific scenarios, since mobile devices require smaller and faster models. To ensure the flexibility of the proposed ultralight convolution, we introduce a hyperparameter called the thinning exponent δ. The thinning exponent δ further compresses the model along the channel dimension and thus controls the overall size and computational efficiency of the model. When a specific scenario requires a smaller model and faster inference, we can adapt the model by adjusting δ. Therefore, δ is an important hyperparameter for balancing model size and segmentation accuracy. In this case, the number of input channels M changes to δM and the number of output channels N changes to δN. The computational cost of an ultralight convolution with a thinning exponent δ, denoted by $\hat{C}_{uc}$, is

$$\hat{C}_{uc} = 2 \times D_k \times \delta M \times H_y \times W_y + \delta M \times \delta N \times H_y \times W_y,$$

where δ ∈ (0, 1]; δ = 1 gives the original ultralight network and δ < 1 gives the skinny ultralight network. The thinning exponent is a hyperparameter that can be adjusted for any model according to the desired number of parameters and segmentation accuracy.

Asymmetric convolutional validity analysis: Asymmetric convolution is often applied to existing square-kernel convolution layers for compression and acceleration. However, directly factorizing the k × k convolution into k × 1 and 1 × k convolutions may degrade performance. One main reason is the weak ability of asymmetric convolution to extract channel features in the multi-channel case, as the factorization destroys the feature space extracted by the square convolution kernel, leading to the loss of channel information.
Unlike previous work [11], [47], our proposed ultralight convolution applies asymmetric convolution to X channel by channel, thus avoiding this drawback. Also, performing the asymmetric convolution on each channel can enhance the robustness of the model against rotational distortions. Especially in medical images, as the shape of an organ is usually irregular, the asymmetric convolution can better accommodate irregular shapes and extract more effective features for abdominal organ segmentation than a vanilla convolution.
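Returning to the complexity analysis of this subsection, the cost expressions can be compared numerically with a small sketch. The depthwise separable cost follows the standard MobileNets formulation, the ultralight cost follows the description in the text (a 1 × k and a k × 1 depthwise pass plus a pointwise pass), and the layer sizes in the example (3 × 3 kernel, 64 input channels, 128 output channels, 32 × 32 output) are hypothetical values chosen only for illustration. Rounding δM and δN to integers is also an assumption.

```python
def vanilla_cost(dk: int, m: int, n: int, hy: int, wy: int) -> int:
    """Multiply-accumulate count of a vanilla D_k x D_k convolution."""
    return dk * dk * m * n * hy * wy


def depthwise_separable_cost(dk: int, m: int, n: int, hy: int, wy: int) -> int:
    """D_k x D_k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return dk * dk * m * hy * wy + m * n * hy * wy


def ultralight_cost(dk: int, m: int, n: int, hy: int, wy: int,
                    delta: float = 1.0) -> int:
    """1 x D_k and D_k x 1 depthwise passes plus pointwise merging.

    delta is the thinning exponent in (0, 1]; it scales both channel counts.
    """
    m_thin = round(delta * m)
    n_thin = round(delta * n)
    return 2 * dk * m_thin * hy * wy + m_thin * n_thin * hy * wy


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 32 x 32 output maps.
layer = dict(dk=3, m=64, n=128, hy=32, wy=32)
costs = {
    "vanilla": vanilla_cost(**layer),
    "depthwise separable": depthwise_separable_cost(**layer),
    "ultralight": ultralight_cost(**layer),
    "ultralight, delta=0.25": ultralight_cost(**layer, delta=0.25),
}
for name, cost in costs.items():
    print(f"{name}: {cost}")
```

For a 3 × 3 kernel the pointwise term dominates, so most of the additional saving in this example comes from thinning the channels rather than from the asymmetric factorization itself.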

B. Shape-Guided Strategy
Overview: For medical image segmentation, the predicted target contours are very important since these results are often used for 3D organ reconstruction. However, it is difficult to segment targets accurately due to the limitation of imaging quality. Therefore, the strategy of adding higher-level shape constraints to a segmentation network is a solution that can make prediction results more consistent with prior anatomical knowledge. Nevertheless, a ground truth usually involves structural and high-dimensional information, and measuring the shape similarity in a high-dimensional space is extremely difficult. To solve the above problems, we present SAAE helping the segmentation framework to explore the shape representation and constraints, and to guide prediction results of the segmentation network to be close to the ground truth. Specifically, SAAE is a trainable neural network to capture the salient features of the input shape and encodes them into low-dimensional manifold. If SAAE can reconstruct the input shape well, the encoding in low-dimensional manifold can be well approximated as a representation of the shape features.
Training process: The overall training stage can be seen in Fig. 2. The motivation of SAAE consists of two parts. The first is to insert an additional shape guidance strategy into the segmentation model to improve the segmentation accuracy without increasing network parameters. The second is to make full use of the rich prior knowledge of abdominal images to improve the interpretability of model learning. Our SAAE explores the representation of shape in a microscopic way and minimizes the difference between prediction results and labels through the gradient backpropagation algorithm. Because the shape representation of an abdominal organ is high-dimensional information, it is difficult to directly measure the shape difference between the prediction results and the labels. To solve this problem, our SAAE uses an autoencoder to explore the shape representation of abdominal organs in low-dimensional manifold. At the same time, since the organ labels only contain shape and position information, our SAAE can further encode the shape and position information of organs after reconstructing a large number of labels and prediction results. Moreover, the constantly trained SAAE can capture the subtle difference between the segmentation result and the real organ shapes, which can be used to supervise the segmentation network so that it outputs better results.
SAAE is trained by reconstructing the labels and the prediction results from the segmentation network. It contains two loss functions. The first loss function $L_{rl}$ learns shape representation by minimizing the difference between the reconstructed shape and the input shape, and the second loss function $L_{sal}$ tries to distinguish the predicted shape from the real shape by maximizing the difference between their representations in low-dimensional manifold. The two loss functions help SAAE better encode the predicted shape and the real shape of the organ, capture the subtle difference between them, and force the segmentation network to produce results close enough to the real organ shape. The loss function of the segmentation network is expressed as $L_{seg}$. The optimization objective of the overall segmentation network G and the shape adversarial autoencoder D is therefore built from $L_{seg}$, $L_{rl}$ and $L_{sal}$. G and D are trained alternately, as if playing a game against each other. First, G is optimized with D fixed by minimizing its loss, which encourages G to segment the image closer to the real label. Then D is optimized with G fixed. The segmentation network and SAAE thus play a min-max game. In low-dimensional manifold, the segmentation network tries to produce a result consistent with the real shape to minimize the distance between output results and labels, while SAAE tries to learn better encoding and feature extraction to maximize the distance between output results and labels. In other words, our SAAE needs to maximize its ability to find the difference between labels and segmentation results, while the segmentation network tries to cheat SAAE by minimizing the difference. During the whole training process, the organ contours predicted by the segmentation network G continuously approach the real organ contours.
When our SAAE cannot distinguish the difference between these contours, D no longer provides effective supervision. At this time, the segmentation network G can output better segmentation results independently. Our SAAE thus has two advantages. First, it can represent shapes in different ways and uses the gradient backpropagation algorithm to optimize the shape-guided segmentation network. Second, our SAAE can distinguish the subtle difference among different shapes, so that the shape predicted by the segmentation network can be closer to the real organ shape.
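The alternating scheme described above can be sketched in a framework-agnostic way. The sketch only captures the order of updates: one update of G with D fixed, then one update of D with G fixed, for every batch. The callables `g_step` and `d_step` are hypothetical placeholders; in a real setup they would compute the respective loss, backpropagate, and step only the optimizer of the network being updated. The toy stand-ins below exist only so that the sketch runs.

```python
from typing import Callable, Iterable, List, Tuple

Batch = Tuple[object, object]  # (image x, label y)


def alternate_training(
    batches: Iterable[Batch],
    g_step: Callable[[object, object], float],
    d_step: Callable[[object, object], float],
) -> List[Tuple[str, float]]:
    """Alternate one G update (D frozen) and one D update (G frozen) per batch."""
    log: List[Tuple[str, float]] = []
    for x, y in batches:
        # Segmentation network update: D's parameters are held fixed.
        log.append(("G", g_step(x, y)))
        # SAAE update: G's parameters are held fixed.
        log.append(("D", d_step(x, y)))
    return log


# Toy stand-ins (hypothetical): they return a constant "loss" value.
def toy_g_step(x: object, y: object) -> float:
    return 1.0


def toy_d_step(x: object, y: object) -> float:
    return -1.0


log = alternate_training([("x0", "y0"), ("x1", "y1")], toy_g_step, toy_d_step)
```

Because SAAE is touched only inside `d_step` and as a fixed critic inside `g_step`, dropping it after training leaves the segmentation network unchanged, which matches the statement that SAAE is not needed at test time.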
SAAE validity analysis: Previous work [24], [49] on shape constraints typically use a pre-trained model to guide shape constraints or a classification discriminator to distinguish between true and false labels. However, the parameters of the pre-trained model are fixed and the model cannot serve to discriminate when the predicted contours are close to the real shapes. Besides, the classification discriminator cannot potentially encode the contours and does not guide the segmentation network well enough to constrain it by true or false signals. The SAAE is able to represent contour information in a microscopic way and can regularize the estimated target contours by minimizing the difference between the predicted result and the ground truth. In addition, the trained SAAE is able to distinguish subtle difference between contours, and it still gives correct penalties even if the result predicted by the segmentation network is closer to the ground truth shape.

C. Loss Function
In our task, there are three loss functions: the segmentation loss $L_{seg}$, the reconstruction loss $L_{rl}$ and the shape adversarial loss $L_{sal}$. First, the standard cross-entropy loss $L_{cross}$ and the boundary loss $L_{bd}$ [50] are used in $L_{seg}$. However, since the boundary loss is unstable and easily leads to training difficulties, the final loss function of the segmentation network controls the boundary loss with a weight α. The reconstruction loss $L_{rl}$ of SAAE is defined on the reconstructions D(y) and D(G(x)), where x is the input image, y is its corresponding ground truth, G is the segmentation network, G(x) is the segmentation result corresponding to x, and D(y) and D(G(x)) are the reconstruction results of SAAE corresponding to the ground truth y and the predicted result G(x). Since the shape adversarial loss acts on the shape representation in low-dimensional manifold, $L_{sal}$ is defined on the encodings E(G(x)) and E(y), where E(·) denotes the encoding of a shape, applied to both the predicted shape of the segmentation network and the shape of the ground truth.

IV. EXPERIMENTS
A. Datasets and Pre-Processing
In our experiments, the Combined (CT-MR) healthy abdominal organ segmentation (CHAOS) [8] and the Liver Tumor Segmentation Challenge (LiTS) [51] are considered as experimental datasets. The CHAOS from the CHAOS challenge is collected by the Department of Radiology, Dokuz Eylul University Hospital, Izmir, Turkey. It contains a total of 80 cases, in which 40 cases are abdominal CT scans containing ground truth of liver segmentation, and the other 40 cases are T1-DUAL in phase (T1-DUALin). Three radiologists (10, 12 and 28 years of experience) are involved in the manual segmentation. The final masks are obtained by using majority vote, which ensures the accuracy of the ground truth. We divided the CT and MR images into training set, validation set and testing set in a ratio of 6:2:2, respectively. The MR images are 256 × 256 or 288 × 288 in size with axial slice numbers ranging from 26 to 50 and layer thicknesses between 4.4 and 8.0 mm, and the CT images are 512 × 512 in size with axial slice number ranging from 78 to 294 and layer thickness between 2.0 mm and 3.2 mm. Training data are subjected to random scaling, rotation, cropping and shifting operations. In our experiments, the given models are dedicated to a single modality (T2-SPIR, CT) and a single organ (liver, right kidney, left kidney, spleen). Thus, each model performs binary rather than multiclass segmentation to extract robust organ-specific features.
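A 6:2:2 split of the kind described above can be sketched as follows. The text does not say whether the split is made per case or per slice, nor how cases are assigned; the sketch assumes a per-case split with a seeded shuffle, which keeps all slices of one patient in the same subset. The function name `split_cases` and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_cases(case_ids: Sequence[int],
                ratios: Tuple[int, int, int] = (6, 2, 2),
                seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle case IDs reproducibly and split them into train/val/test."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 40 cases of one modality, as in CHAOS.
train, val, test = split_cases(range(40))
```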
The LiTS includes 131 labeled 3D CT scans, where the resolution in-plane ranges from 0.55 mm to 1.0 mm and slice spacing ranges from 0.45 mm to 6.0 mm. We constructed the training set and validation set using 90 patients (total 43,219 images) and 10 patients (total 1,500 images), respectively. Then the other 30 patients (total 15,419 images) are considered as the testing set. It is worth mentioning that the LiTS dataset does not use data augmentation techniques.
Medical CT images differ from natural images: CT intensities span a much wider range of values, from −1000 to 3000, than natural images, whose values range from 0 to 255. To remove interference and enhance liver areas, we truncated the image intensity values of all scans to the range [−200, 250] HU.
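The intensity truncation can be sketched in a few lines of pure Python on a slice stored as nested lists. The window [−200, 250] HU comes from the text; the subsequent rescaling to [0, 1] is a common follow-up step that the text does not mention and is included here only as an assumption. The function names are hypothetical.

```python
from typing import List

Slice = List[List[float]]


def clip_hu(slice_hu: Slice, low: float = -200.0, high: float = 250.0) -> Slice:
    """Truncate every intensity to the window [low, high] (in HU)."""
    return [[min(max(value, low), high) for value in row] for row in slice_hu]


def rescale_unit(clipped: Slice, low: float = -200.0, high: float = 250.0) -> Slice:
    """Assumed extra step (not stated in the text): map [low, high] to [0, 1]."""
    return [[(value - low) / (high - low) for value in row] for row in clipped]


raw = [[-1000.0, 25.0, 3000.0]]
clipped = clip_hu(raw)
scaled = rescale_unit(clipped)
```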

B. Experimental Setup and Evaluation Metrics
All models are implemented in PyTorch 1.3.0 and trained on a desktop PC with two NVIDIA GeForce RTX 2080 Ti GPUs, each with 11 GB of memory. The initial learning rate (lr) is set to 0.001 and then decays according to the poly schedule $lr = lr \times \left(1 - \frac{iterations}{total\ iterations}\right)^{0.9}$. We used the Adam optimizer to train all models.
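The poly schedule can be written as a small function. The sketch assumes the usual reading of the formula, in which the factor is applied to the initial learning rate at every iteration rather than to the current one; the function name `poly_lr` and the total of 100 iterations are hypothetical.

```python
def poly_lr(base_lr: float, iteration: int, total_iterations: int,
            power: float = 0.9) -> float:
    """Poly decay: base_lr * (1 - iteration / total_iterations) ** power."""
    return base_lr * (1.0 - iteration / total_iterations) ** power


# Learning rate at a few points of a hypothetical 100-iteration run.
schedule = [poly_lr(0.001, it, 100) for it in (0, 25, 50, 75, 100)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the last iteration.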
The hyperparameters in SGU-Net are set as follows. The thinning exponent δ, where 0 < δ ≤ 1, is set to 0.25 since we aim to obtain a network that is as light as possible. If the value of δ is too large, the model size will also be large. More details on the choice of δ are presented in the discussion section. $\theta_1$ in (5) is set to 5, $\theta_2$ in (6) is set to 0.01 and α in (7) is set to 0.5.
We denote the segmentation result by S and the ground truth by G. The Dice value is computed as $2|S \cap G| / (|S| + |G|)$ and lies in the interval [0, 1]; a perfect segmentation yields a Dice value of 1. In addition, the average/maximum symmetric surface distances (ASSD/MSSD) [51] correspond to the average/maximum Hausdorff distance between border voxels in S and G. Dice is an overlap measure, while ASSD and MSSD are surface distance measures. The former focuses more on the interior of segmentation targets, while the latter focus more on the shape similarity of segmentation targets. It is worth noting that we calculated the above metrics on the binary slice segmentation results.
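The Dice value for one binary slice can be computed as in the following sketch, where masks are nested lists of 0 and 1. Returning 1.0 when both masks are empty is a convention assumed here, since the text does not cover that case; the function name `dice` is hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = target, 0 = background


def dice(seg: Mask, gt: Mask) -> float:
    """Dice = 2 |S ∩ G| / (|S| + |G|) for two binary masks of equal shape."""
    intersection = sum(s & g for row_s, row_g in zip(seg, gt)
                       for s, g in zip(row_s, row_g))
    size_sum = sum(map(sum, seg)) + sum(map(sum, gt))
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


prediction = [[1, 1], [0, 0]]
label = [[1, 0], [0, 0]]
score = dice(prediction, label)      # one shared pixel, sizes 2 and 1
perfect = dice(label, label)
```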

C. Ablation Studies
In this paper, two contributions are highlighted, one is the replacement of vanilla convolution by ultralight convolution, and the other is the design of a shape adversarial autoencoder to impose additional shape constraints on the segmentation network. To demonstrate the effectiveness of our contributions, we performed ablation experiments on the CHAOS liver dataset and the NIH-TCIA pancreas dataset.
The results in Table I demonstrate the validity of our contributions. For simplicity, AC represents asymmetric convolution [43], and DC denotes depthwise separable convolution [12]. From the experimental results, compared with the vanilla U-Net, U-Net+AC and U-Net+DC, the U-Net+UC can significantly reduce the number of parameters while improving the Dice value. The vanilla U-Net achieves mean Dice value of 94.04%, ASSD value of 1.70 mm, and MSSD value of 29.52 mm. After replacing the vanilla convolution with ultralight convolution, the parameters of U-Net decrease by 29.54 M, the Dice value increases by 0.08%, the value of ASSD and MSSD decrease by 0.03 mm and 1.46 mm, respectively. This shows that our ultralight convolution not only effectively reduces the number of parameters, but also improves the segmentation accuracy due to the fact that the asymmetric convolution can better adapt to irregular organ shapes in abdominal images.
To show the validity of SAAE, we compared U-Net with U-Net+SAAE, and U-Net+UC with U-Net+UC+SAAE. We can see that the Dice value increases by 0.52% and 0.56%, the ASSD value decreases by 0.13 mm and 0.24 mm, and the MSSD value decreases by 2.5 mm and 2.34 mm, respectively. Combined with U-Net, SAAE not only provides better shape guidance but also significantly reduces the ASSD and MSSD values.
Meanwhile, we also compared the results of SAAE with those of the standard generative adversarial network (GAN) and autoencoder (AE) constraints in the ablation experiments of Table I. Compared to the GAN constraint, the results provided by SAAE are 0.04 mm and 1.1 mm lower in ASSD and MSSD, respectively. This shows that using only the binary signal of the discriminator in the GAN does not provide a good constraint. Similarly, compared to the use of AE, the ASSD and MSSD of the results provided by SAAE are reduced by 0.06 mm and 1.18 mm, respectively. This demonstrates that the encoding of labels and segmentation results in SAAE is important: a well-trained autoencoder can encode the differences between labels and segmentation results into the low-dimensional space and use the difference $L_{sal}$ between them (see Fig. 2) to supervise the segmentation network. Fig. 4 shows the comparison of segmentation boundaries, which further illustrates the ablation studies. As can be seen from the results, SAAE supervision effectively provides shape supervision and constraint without extra parameters and computational costs, and the segmentation results are further improved with the help of ultralight convolution.
To further verify the performance of SAAE, we evaluated a lightweight four-stage U-Net on the CHAOS dataset. The experimental results are shown in Table II. We find that SAAE improves the Dice value by 1.24%. Therefore, SAAE yields a clear improvement in segmentation accuracy for segmentation tasks with low Dice values. It is a general module that can be combined with different backbone networks for different segmentation tasks.
In addition, we conducted an experiment on a pancreatic dataset with a low Dice value to further demonstrate our contributions. The NIH-TCIA CT dataset [52] comes from the US National Institutes of Health (NIH) and contains 82 abdominal contrast-enhanced 3D CT scans. In the axial view, the CT slice size is 512 × 512 pixels, and the number of slices varies from 181 to 466 for different patients. The experimental results are shown in Table III. This experiment has two purposes. On the one hand, we aim to show that our ultralight convolution module is effective and universal. On the other hand, we want to demonstrate that our SAAE module is effective for abdominal organ segmentation whether organs are large or small. Table III shows that ultralight convolution generalizes well: it effectively extracts target features, greatly reduces the complexity of the model, and improves the segmentation accuracy. For abdominal organs with relatively fixed shape and position, SAAE can achieve shape guidance and make full use of prior knowledge to guide the segmentation network toward better abdominal organ segmentation results.
In addition to the ablation experiments on our contributions UC and SAAE, we also conducted ablation experiments on three hyperparameters θ₁, θ₂, and α. In our experiments, θ₁ and θ₂ are used to control the shape adversarial loss function. Specifically, θ₁ guides the segmentation network to learn by minimizing the difference between the reconstruction results and the labels, and θ₂ optimizes SAAE by maximizing the difference between the segmentation results and the labels. α is used to control the boundary loss function.
According to Table IV, SGU-Net achieves the best performance on abdominal organ segmentation when θ₁ = 5, θ₂ = 0.01 and α = 0.5. Note that δ = 0.25 in this experiment. The choice of δ is analyzed in the discussion section.
CT liver segmentation: Quantitative metrics on the CHAOS dataset are shown in Table V. One of the main reasons for the limited performance of the 3D networks is that the 3D convolutional architecture is too complex to be used for small datasets. In addition, the lower contrast and resolution of the CT images in this dataset, combined with the larger spacing, make it difficult to extract the temporal domain information and may even have the opposite effect. Among the 2D networks, CE-Net tends to be a better solution due to its pre-trained encoder, and its Dice value improves by 0.52% compared to U-Net. Residual, dense, and recurrent connectivity are also useful, and thus ResU-Net, DenseU-Net, and R2U-Net provide higher segmentation accuracy than U-Net. Attention mechanisms bring effective gains: Attention U-Net and ResU-Net++ use spatial attention and channel attention, and their Dice values improve by 0.19% and 0.06%, respectively, compared to U-Net. The non-local module obtains global attention by using a larger receptive field, so Non-local U-Net shows better performance. We note that SGU-Net obtains the best values of Dice (94.68%), ASSD (1.43 mm), and MSSD (26.26 mm). Similarly, Table VI shows quantitative results for liver segmentation on the LiTS dataset. LiTS is a larger 3D dataset on which a fair comparison with 3D CNN baselines can be made. We can see that SGU-Net provides 1.8% and 1.65% improvement in Dice compared to 3D U-Net and V-Net, respectively. SGU-Net also provides finer segmentation contours, with the lowest ASSD (1.30 mm) and MSSD (24.45 mm) among the compared networks.
The above experimental results indicate that SGU-Net makes the prediction map achieve better contour and shape consistency, and that it mimics expert annotations significantly better.
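For readers who wish to reproduce the kind of metrics quoted in this section, the following is a minimal pure-Python sketch of Dice, ASSD and MSSD on 2D binary masks. It assumes the usual definitions (ASSD as the average symmetric surface distance and MSSD as the maximum symmetric surface distance between mask boundaries), uses a brute-force nearest-boundary search, and assumes both masks are non-empty. The function names and the `spacing` parameter are illustrative and are not taken from the paper's evaluation code.

```python
import math

def dice(pred, label):
    """Dice coefficient of two binary masks given as lists of rows of 0/1."""
    inter = sum(p & l for pr, lr in zip(pred, label) for p, l in zip(pr, lr))
    total = sum(map(sum, pred)) + sum(map(sum, label))
    return 2.0 * inter / total if total else 1.0

def boundary(mask):
    """Foreground pixels with at least one 4-neighbour outside the foreground."""
    h, w = len(mask), len(mask[0])
    points = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                # Image border or background neighbour marks a boundary pixel.
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    points.append((y, x))
                    break
    return points

def surface_distances(pred, label, spacing=(1.0, 1.0)):
    """Distances from each boundary pixel to the nearest one on the other mask."""
    a, b = boundary(pred), boundary(label)

    def dist(p, q):
        return math.hypot((p[0] - q[0]) * spacing[0], (p[1] - q[1]) * spacing[1])

    a_to_b = [min(dist(p, q) for q in b) for p in a]
    b_to_a = [min(dist(q, p) for p in a) for q in b]
    return a_to_b, b_to_a

def assd(pred, label, spacing=(1.0, 1.0)):
    """Average symmetric surface distance (in units of `spacing`)."""
    a_to_b, b_to_a = surface_distances(pred, label, spacing)
    return (sum(a_to_b) + sum(b_to_a)) / (len(a_to_b) + len(b_to_a))

def mssd(pred, label, spacing=(1.0, 1.0)):
    """Maximum symmetric surface distance (in units of `spacing`)."""
    a_to_b, b_to_a = surface_distances(pred, label, spacing)
    return max(max(a_to_b), max(b_to_a))

# Toy example: a 2 x 2 square and the same square shifted right by one pixel.
label = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
print(dice(pred, label), assd(pred, label), mssd(pred, label))  # 0.5 0.5 1.0
```

Passing the physical pixel spacing as `spacing` gives distances in millimetres, the unit in which ASSD and MSSD are reported here. For full-size volumes a distance-transform-based implementation would be far faster than this brute-force search.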
Abdominal multi-organ MR segmentation: Table VII shows quantitative results for multi-organ segmentation on the MR T2 modality. The 3D networks do not provide the required robustness for organ segmentation. The detailed reason is given in the second paragraph of Section IV.D (experimental comparison). For 2D networks, significant improvements from attention can be noticed for the right kidney, left kidney and spleen. Attention U-Net (spatial attention), ResU-Net++ (channel attention), and Non-local U-Net (non-local attention) provide similar effects on liver, left kidney, and right kidney segmentation. Relative to the 89.32% Dice of U-Net, both ResU-Net++ and Non-local U-Net show greater improvement (+2.97%, +1.95%) on spleen segmentation than Attention U-Net (+0.42%). DenseU-Net improves the Dice value by 0.06%, 0.8%, 0.31% and 1.9% compared to ResU-Net on liver, left kidney, right kidney and spleen, respectively.
The contribution of CE-Net is clear. Pre-trained models provide better underlying features, and multi-scale feature extraction blocks extract richer features; thus CE-Net provides better Dice, ASSD and MSSD values than U-Net. Our proposed SGU-Net obtains the best segmentation results, not only in the Dice metric but also in the ASSD and MSSD metrics, for the four types of organs, which indicates that our strategy is able to constrain the organ contour, leading to higher segmentation accuracy.

A. Compacting Model Design
The asymmetric convolution usually performs a direct factorization of the k × k vanilla convolution into k × 1 and 1 × k convolutions. This direct factorization may cause significant information loss, leading to performance degradation of models. There are two possible reasons. First, the kernels learned by deep neural networks usually have distributed eigenvalues and a rank higher than 1 in practical applications, so the direct decomposition may cause information loss. Second, since standard convolution usually uses square convolutional kernels to extract spatial and channel features jointly, asymmetric convolutional kernels may corrupt feature extraction in the channel dimension, resulting in chaotic channel feature extraction and thus information loss. Therefore, the asymmetric convolution in [47] is fused into the square convolution kernel to enrich feature extraction and representation, but the number of parameters remains unchanged.
Unlike the previous work, the proposed ultralight convolution performs feature extraction in space and channel separately, by first performing asymmetric convolution channel-by-channel in the spatial dimension, and then performing feature extraction and combination across channels. To a certain extent, this avoids the problem of confused channel feature extraction that arises when asymmetric convolution operates directly on space and channel together. In addition, the channel-by-channel asymmetric convolution allows better feature extraction for abdominal organs with irregular spatial shapes, which is helpful for improving abdominal image segmentation.
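The parameter savings behind this design can be illustrated with a small counting sketch. It is a rough illustration rather than the paper's exact layer definition: it assumes bias-free layers, that the ultralight convolution consists of one depthwise k × 1 kernel, one depthwise 1 × k kernel and one 1 × 1 pointwise convolution, and that the plain asymmetric baseline keeps the output channel count after its first k × 1 layer. All function names are hypothetical.

```python
def conv_params(c_in, c_out, kh, kw, groups=1):
    """Number of weights in a bias-free 2D convolution."""
    return (c_in // groups) * c_out * kh * kw

def vanilla(c_in, c_out, k=3):
    # One k x k kernel mixing space and channels jointly.
    return conv_params(c_in, c_out, k, k)

def asymmetric(c_in, c_out, k=3):
    # k x 1 followed by 1 x k, both mixing channels (assumed baseline layout).
    return conv_params(c_in, c_out, k, 1) + conv_params(c_out, c_out, 1, k)

def depthwise_separable(c_in, c_out, k=3):
    # Depthwise k x k (one filter per channel) plus 1 x 1 pointwise.
    return conv_params(c_in, c_in, k, k, groups=c_in) + conv_params(c_in, c_out, 1, 1)

def ultralight(c_in, c_out, k=3):
    # Depthwise k x 1 and depthwise 1 x k, then 1 x 1 pointwise (assumed layout).
    spatial = (conv_params(c_in, c_in, k, 1, groups=c_in)
               + conv_params(c_in, c_in, 1, k, groups=c_in))
    return spatial + conv_params(c_in, c_out, 1, 1)

for name, fn in [("vanilla", vanilla), ("asymmetric", asymmetric),
                 ("depthwise separable", depthwise_separable),
                 ("ultralight", ultralight)]:
    print(f"{name:>20}: {fn(64, 64)} weights")
# vanilla 36864, asymmetric 24576, depthwise separable 4672, ultralight 4480
```

For a 64-to-64-channel layer with k = 3, nearly all remaining weights sit in the pointwise convolution, so under these assumptions the spatial part of the ultralight layer costs 2k weights per channel instead of k² for a depthwise square kernel.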

B. Extension on 3D Volume Segmentation
To demonstrate that the proposed UC is useful for different networks, we discuss the extension of UC to 3D networks. Some works [40], [41] have extended the depthwise separable convolution to 3D CNNs and applied it to medical image segmentation to reduce the number of parameters. The ultralight convolution integrates the advantages of both asymmetric convolution and depthwise separable convolution and thus can be extended to 3D CNNs to further reduce the number of parameters. To evaluate the performance of ultralight convolution on liver segmentation tasks, we use the 3D Image Reconstruction for Comparison of Algorithm and DataBase (3Dircadb) dataset as experimental data. The dataset is split into 17 patients for training and 5 patients for testing.
For the experiments on 3D networks, we first pre-trained our network on the LiTS dataset, and then fine-tuned the network on the 3Dircadb dataset. There are two reasons for this. First, different datasets correspond to different collection environments and parameter configurations, and training a network on datasets of different scales can improve the robustness of the network. Second, LiTS is a large dataset while 3Dircadb is a small dataset. Training the model on a large dataset and then fine-tuning it on a small dataset not only speeds up training on the small dataset but also reduces the risk of overfitting.
The 3D network consists of two parts, namely an encoder and a decoder. The volume data size used for 3D network input is fixed to 128 × 128 × 64 voxels. The encoder consists of five stages, each of which corresponds to a different image resolution. The 3D Ultralight V-Net network employs ultralight convolution to achieve feature extraction, which is composed of 3 × 3 × 1, 3 × 1 × 3 depthwise asymmetric convolutions and 1 × 1 × 1 pointwise convolution. Following feature extraction, a 2 × 2 × 2 transposed convolution is used for upsampling, and the output image size of the last stage of the encoder is 8 × 8 × 4. The decoder is an inverse process of the encoder. By using the deconvolution operation, the final output feature map size is the same as the input image size. [40], [41], [53], [54]. This is a direction worth exploring.
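The encoder geometry described above can be checked with a short sketch. It assumes that each of the four transitions between the five encoder stages halves every spatial dimension, which is consistent with the stated 128 × 128 × 64 input and 8 × 8 × 4 final encoder output. The parameter comparison further assumes bias-free layers and two depthwise asymmetric kernels of nine weights each (3 × 3 × 1 and 3 × 1 × 3) followed by a 1 × 1 × 1 pointwise convolution. The function names and the channel count of 32 are illustrative only.

```python
def encoder_shapes(input_shape=(128, 128, 64), stages=5, factor=2):
    """Feature-map size at each encoder stage, assuming downsampling by
    `factor` along every axis between consecutive stages."""
    shapes = [tuple(input_shape)]
    for _ in range(stages - 1):
        shapes.append(tuple(s // factor for s in shapes[-1]))
    return shapes

def vanilla_3d_params(c_in, c_out, k=3):
    """Weights in a bias-free k x k x k 3D convolution."""
    return c_in * c_out * k ** 3

def ultralight_3d_params(c_in, c_out):
    """Weights in the assumed 3D ultralight layout: depthwise 3 x 3 x 1,
    depthwise 3 x 1 x 3, then 1 x 1 x 1 pointwise."""
    depthwise_a = c_in * 3 * 3 * 1
    depthwise_b = c_in * 3 * 1 * 3
    pointwise = c_in * c_out
    return depthwise_a + depthwise_b + pointwise

print(encoder_shapes())
# [(128, 128, 64), (64, 64, 32), (32, 32, 16), (16, 16, 8), (8, 8, 4)]
print(vanilla_3d_params(32, 32), ultralight_3d_params(32, 32))  # 27648 1600
```

The gap between the two counts grows with the channel width, since the vanilla kernel scales with 27 · C_in · C_out while the assumed ultralight layout scales with 18 · C_in + C_in · C_out.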

C. Shape-Guided Exploration
For the design of the shape adversarial autoencoder (SAAE), the inspiration comes mainly from [25], [28]. There are four main types of methods used for shape constraint in medical image segmentation.
The first is to use a discriminator [25] to perform binary classification by determining whether the predicted image is a label or not, which mainly constrains the segmentation network by positive and negative signals. Although this approach is useful for improving network performance, the constraint is insufficient and lacks interpretability.
The second is to apply a pre-trained model [24] to extract features from both the predicted results and the labels, compute the difference between them, and use that difference to constrain the segmentation network. This approach mainly depends on the suitability of the pre-trained model for extracting information from medical images.
The third is to use the autoencoder to learn the reconstruction of the prediction results with labels [28], which forces the autoencoder to learn the features in medical images and use the final reconstruction loss to constrain the segmentation network. This approach is more explanatory compared to the first two methods, but the constraints are limited.
The fourth is to use an autoencoder to improve the segmentation results obtained by the segmentation network. Painchaud et al. [13] used an autoencoder constrained by labels to improve segmentation results, so that the final result can be close to the real segmentation result. This autoencoder-based postprocessing corrects the output of the original segmentation network, but it cannot directly guide the segmentation network to improve its results.
The SAAE designed in this paper is based on the third approach, which has been further explored. As the shape of the organ contains high-dimensional information, it is difficult to measure the shape difference between the prediction result and the label directly. To solve this problem, SAAE uses the autoencoder to explore the representation of the shape in a low-dimensional manifold. Since the organs in the labels only contain shape and location information, the SAAE can better encode the shape and location information of the organs after reconstructing a large number of labels and prediction results. As a result, the segmentation network is constrained through adversarial learning on the difference between prediction results and labels in the low-dimensional manifold. SAAE is therefore more efficient and more explanatory.

D. UC and SAAE Adaptability
In this section, we performed ablation experiments of UC and SAAE based on different backbone networks to show the universality of our contributions. As shown in Table X, ultralight convolution can reduce the number of model parameters by up to 90% with no reduction in segmentation accuracy. SAAE can further improve the segmentation accuracy without increasing the model parameters. In other words, they greatly reduce the model parameters while providing competitive segmentation results. In conclusion, the two contributions presented in this paper can be applied to different backbone networks as shown in Table X to reduce the number of parameters, balancing the model size and segmentation accuracy.

E. The Analysis of the Thinning Exponent
The hyperparameter δ is the thinning exponent in the proposed SGU-Net. We evaluated the results obtained by SGU-Net using different values of δ. As shown in Table XI, as the value of δ increases, the number of model parameters becomes larger and the segmentation accuracy becomes higher. In this paper, SGU-Net with δ = 0.25 attains the most competitive performance and achieves the best balance between model size and segmentation accuracy.
In Table XI, we can see that although the larger value of δ will lead to higher segmentation accuracy, the computational efficiency of the network will be lower. Therefore, in practical applications, we need to adjust the value of δ according to different task requirements. If the network is deployed on low resource devices, then we should choose a smaller value of δ to improve computational efficiency. In contrast, if we do not consider the problem of computational costs, we can choose a larger value of δ to achieve higher segmentation accuracy.

VI. CONCLUSION
In this paper, we have proposed a shape-guided ultralight network for medical image segmentation. First, an ultralight convolution is presented to factorize vanilla convolution into depthwise asymmetric convolution and pointwise convolution, which integrates the advantages of both asymmetric convolution and depthwise separable convolution. Second, a shape-guided module is presented that uses the prior knowledge of fixed organ position and shape to constrain the segmentation network to produce results that are closer to the true organ shape. Extensive experiments on the LiTS and CHAOS datasets have shown that the proposed SGU-Net provides a general and effective solution for achieving high-quality segmentation results under limited memory and computation resources.