Multi-Model Medical Image Segmentation Using Multi-Stage Generative Adversarial Networks

Image segmentation is a challenging problem in medical applications. Medical imaging has become an integral part of machine learning research, as it enables inspecting the interior of the human body without surgical intervention. Much research has been conducted to study brain segmentation. However, prior studies usually employ one-stage models to segment brain tissues, which can lead to significant information loss. In this paper, we propose a multi-stage Generative Adversarial Network (GAN) model to resolve existing issues of one-stage models. To do this, we apply a coarse-to-fine method to improve brain segmentation using a multi-stage GAN. In the first stage, our model generates a coarse outline for both the background and brain tissues. Then, in the second stage, the model generates a refined outline for the white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). We perform a fusion of the coarse and refined outlines to achieve high results. Despite using very limited data, we obtain an improved Dice Coefficient (DC) accuracy of up to 5% compared to one-stage models. We conclude that our model is more efficient and accurate in practice for brain segmentation of both infants and adults. In addition, we observe that our multi-stage model is 2.69–13.93 minutes faster than prior models. Moreover, our multi-stage model achieves higher performance with only few-shot learning, in which only limited labeled data is available.
Therefore, for medical images, our solution is applicable to a wide range of image segmentation applications for which convolutional neural networks and one-stage methods have failed. This helps to advance the process of analyzing brain images, thus providing many advantages to the healthcare system, especially in critical health situations where urgent intervention is needed.


I. INTRODUCTION
Magnetic resonance imaging (MRI) employs a magnetic field to generate detailed images of tissues without using harmful radiation [1], [2]. However, these images tend to be segmented manually, a process that is considered time-consuming and clinically expensive [3]. Hence, automated segmentation of infant and adult brain images has received substantial research attention [4], [5]. However, training deep learning models requires large sets of labeled images [6]. Due to the limited sets of data in medical applications [7], [8], semi-supervised learning techniques have been used to address this issue by means of unlabeled images [9], [10]. Segmentation results can be improved by adopting unlabeled images [11] or images with weak annotation, such as image-level tags [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Cristian A. Linte.
For object detection, a one-stage method is normally used to predict the class probability and position information [13], [14]. Following the recent success of two-stage methods, many models have adopted them for semantic segmentation. Recently, Xiaohao et al. [1] proposed a two-stage image segmentation method using a convex variant of the Mumford-Shah model and thresholding. In computer vision, two-stage methods are used for generating global information in the first stage and local information in the second stage [15], [16]. Good results can be achieved by fusing the global and local information together [17], [18]. However, the adoption of multi-stage Generative Adversarial Networks (GAN) in medical imaging remains unexplored.
In this paper, we propose a coarse-to-fine method to improve brain segmentation using a multi-stage GAN with three generators, each referred to as G, as follows:
• In the first generator, our model generates a coarse outline for both background and brain tissues. The main role of the first G is to generate coarse segmentation information to guide the third G.
• In the second generator, two inputs, an image x and a random vector z, are taken to encourage generating as many different values for each x as those of z.
• In the third generator, an encoder and decoder are used along with a dense skip connection to combine features from different scales. This generator generates an outline for (i) white matter (WM), (ii) gray matter (GM), and (iii) cerebrospinal fluid (CSF). This process is similar to that of human learning in clinical practice. Specifically, the role of the third G is to generate more detailed results using the coarse segmentation from the first G.
We evaluate our proposed multi-stage generative adversarial model on two datasets of brain tissues, covering infant and adult brains. Our model achieves higher results compared to the state-of-the-art models. In particular, despite using very limited data, we obtain an improved Dice Coefficient (DC) accuracy of up to 5% compared to one-stage models. In addition, we observe that our multi-stage model is 2.69–13.93 minutes faster than prior models. Therefore, for medical images, our solution is applicable to a wide range of image segmentation applications for which convolutional neural networks and one-stage methods have failed. This helps to advance the process of analyzing brain images, thus providing many advantages to the healthcare system, especially in critical health situations where urgent intervention is needed.
The rest of this paper is organized as follows. Section II presents the prior studies and techniques related to brain segmentation. Section III presents the design of our multi-stage model. Section IV presents our experimental design and evaluation. Section V presents our results and discussion. Section VI discusses the validity threats to our results. Finally, Section VII concludes the paper and discusses directions for future work.

II. BACKGROUND AND RELATED WORK
This section presents the prior studies and the techniques related to brain segmentation. First, we describe semi-supervised learning in detail. Second, we describe generative adversarial networks (GAN). Finally, we show how loss functions are used to improve the stability of training GAN models.
A. SEMI-SUPERVISED LEARNING
Training a deep model using a small dataset may cause overfitting [11], [19]. To prevent overfitting, large amounts of unlabeled data should be used together with a small amount of labeled data [20], [21]. Training deep models using both labeled and unlabeled data encourages neural networks to have a similar distribution [22], [23]. In particular, semantic segmentation works by taking an image as an input and generating a segmentation map as an output [16], [24], [25]. Figure 1 and Figure 2 show the semantic segmentation labels and semantic segmentation classes, respectively.
Much research has applied semantic segmentation to brain images, in particular, images of brain tissues. Bdair et al. [26] proposed ROAM, a random layer mixup that allows neural networks to be less confident for interpolated data points on any selected space. Gillmann et al. [27] proposed two architectures for brain tumor segmentation. Their results have been evaluated using the BraTS 2017 challenge datasets. Similarly, Majib et al. [28] proposed a rethinking atrous convolution model for semantic images. Unlike the above models, the rethinking atrous convolution model targets long-range contexts, as it does not require convolution layers. Instead, it utilizes an atrous convolution with up-sampled filters to extract dense feature maps. The model was evaluated on the PASCAL VOC 2012 semantic image segmentation benchmark, consisting of 3,475 finely annotated images and an extra 20,000 coarsely annotated images. Their experimental results on the segmentation task show that atrous convolution is necessary when cascading more blocks. The results also show that the more blocks are added, the better the performance.

B. GENERATIVE ADVERSARIAL NETWORK (GAN)
GANs have demonstrated promising results for medical image diagnostics [29] and brain image segmentation [21], [25]. Figure 3 shows an overview of how GANs work.
Many researchers have applied generative adversarial networks to brain segmentation. Cirillo et al. [30] proposed a 3D volume-to-volume GAN for segmenting brain tumors. Their model achieved a good result when the generator loss was weighted five times higher than the discriminator loss. The proposed model was evaluated on the BraTS 2018 datasets. Their model outperformed previous models with an overall accuracy of 0.66%. Delannoy et al. [31] proposed a super-resolution and segmentation framework that applies generative adversarial networks to neonatal brain MRI images. The framework consists of (a) training a generator network that estimates the corresponding HR image for a given input image and (b) a discriminator network D to distinguish real HR and segmentation images. In Table 1, we provide example GAN models applied in medical applications.

C. LOSS FUNCTIONS
Loss functions have been developed to improve the training stability of GAN models [39], [40]. In this section, we describe five loss functions that are used for GANs.

1) MINIMAX GAN LOSS
The minimax GAN loss function involves two components: a generator and a discriminator. The generator attempts to minimize the loss function, whereas the discriminator attempts to maximize it. Their formulas are given below.
Generator loss function [41]:
$$\min_G \; \mathbb{E}_z[\log(1 - D(G(z)))]$$
Discriminator loss function [41]:
$$\max_D \; \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$$
In the discriminator loss function:
• $D(x)$ denotes the discriminator's estimate of the probability that real data x is real.
• $\mathbb{E}_x$ denotes the expected value over all real data.
• $G(z)$ denotes the generator's output for a given noise z.
• $D(G(z))$ denotes the discriminator's estimate of the probability that fake data is real.
• $\mathbb{E}_z$ denotes the expected value over all generated fake data G(z).
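As a numeric illustration of the two loss functions described above, the following minimal sketch assumes the standard minimax form, in which the discriminator maximizes the average of log D(x) + log(1 − D(G(z))) and the generator minimizes the average of log(1 − D(G(z))). The function names and the sample discriminator outputs are hypothetical and are not taken from our implementation.

```python
import math

def minimax_discriminator_loss(d_real, d_fake):
    """Mean log D(x) plus mean log(1 - D(G(z))); the discriminator maximizes this."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def minimax_generator_loss(d_fake):
    """Mean log(1 - D(G(z))); the generator minimizes this."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probabilities that the input is real).
d_real = [0.9, 0.8]  # D(x) on real samples
d_fake = [0.1, 0.2]  # D(G(z)) on generated samples
print(minimax_discriminator_loss(d_real, d_fake))
print(minimax_generator_loss(d_fake))
```

A discriminator that separates real from fake data well drives its objective toward 0 from below, while a generator that fools the discriminator pushes D(G(z)) toward 1 and its own loss toward large negative values.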

3) WASSERSTEIN LOSS (WGAN)
GANs are commonly used in computer vision [42], [43], but their main problem is training instability [28]. Many loss functions have been developed to stabilize the training of GANs [35]. Wasserstein GAN (WGAN) makes good progress toward stable GAN training, but can still produce poor results. It has been argued that these poor results are due to the use of weight clipping. To address that, Adler and Lunz [44] proposed a better approach for clipping weights. The resulting model is a modification of the standard GAN. The discriminator training tries to make the output bigger for real data than for fake data. The output of the discriminator is a number, which does not have to be between 0 and 1. More details can be found in [44].
Generator loss function [44]:
$$\max_G \; D(G(z))$$
Discriminator loss function [44]:
$$\max_D \; D(x) - D(G(z))$$
In these functions:
• $D(x)$ denotes the discriminator's output for real data.
• $G(z)$ denotes the generator's output for a given noise z.
• $D(G(z))$ denotes the discriminator's output for fake data.
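The behavior described above, where the discriminator tries to score real data higher than fake data and its output is an unbounded number, can be sketched as follows. This is a minimal illustration assuming the commonly used Wasserstein form (mean D(x) minus mean D(G(z)) for the discriminator, mean D(G(z)) for the generator). The function names and scores are hypothetical, and weight clipping or any alternative constraint is deliberately omitted.

```python
def wgan_discriminator_objective(d_real, d_fake):
    """Mean D(x) minus mean D(G(z)); the discriminator maximizes this gap."""
    return sum(d_real) / len(d_real) - sum(d_fake) / len(d_fake)

def wgan_generator_objective(d_fake):
    """Mean D(G(z)); the generator maximizes this score."""
    return sum(d_fake) / len(d_fake)

# Hypothetical unbounded discriminator scores (not probabilities).
d_real = [2.0, 4.0]   # D(x) on real samples
d_fake = [-1.0, 1.0]  # D(G(z)) on generated samples
print(wgan_discriminator_objective(d_real, d_fake))
print(wgan_generator_objective(d_fake))
```

Because the scores are not squashed into [0, 1], no logarithm is involved, which is the main structural difference from the minimax loss.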

4) LEAST-SQUARES LOSS (LSGAN)
This model uses an a-b coding scheme for the discriminator, where a and b denote the labels of fake and real data, respectively. The generator and discriminator loss functions are given in [21].
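To make the a-b coding scheme concrete, the sketch below assumes the commonly used least-squares formulation, in which the discriminator penalizes the squared distance of D(x) from b and of D(G(z)) from a. The target value c used by the generator, the default labels (a = 0, b = 1, c = 1), and the function names are assumptions made for illustration and are not stated in the text above.

```python
def lsgan_discriminator_loss(d_real, d_fake, a=0.0, b=1.0):
    """Half the mean squared distance of D(x) from b plus that of D(G(z)) from a."""
    real_term = sum((d - b) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - a) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_generator_loss(d_fake, c=1.0):
    """Half the mean squared distance of D(G(z)) from c, the value G wants D to output."""
    return 0.5 * sum((d - c) ** 2 for d in d_fake) / len(d_fake)

# Hypothetical discriminator outputs.
print(lsgan_discriminator_loss([0.9, 0.8], [0.1, 0.2]))
print(lsgan_generator_loss([0.1, 0.2]))
```

Both losses are minimized, and both reach zero only when the discriminator outputs exactly hit their target labels.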

5) WASSERSTEIN GRADIENT PENALTY LOSS (AC-GAN)
AC-GAN uses a noise z and a sample with class label c ∼ p. This model is a modification of the standard GAN.
In the standard GAN, X fake = G(z), whereas in AC-GAN, X fake = G(c, z). In addition, the output of the standard GAN is a probability distribution P(s, x) = D(x), whereas in AC-GAN, the output is two probability distributions.
Mondal et al. [35] proposed a model that uses a GAN for brain segmentation. The authors used a dataset of 43 subjects, where they generate fake images using a generator and then use labeled, unlabeled, and fake data to train the discriminator to distinguish between generated data and true data. In addition, an encoder was used to compute the predicted noise mean and log-variance. However, their approach only supports one-stage modeling, whereas our model supports multi-stage modeling.
Unlike previous work, we aim in this paper to solve the problem of information loss suffered by one-stage modeling. To do this, our first generator generates a coarse outline to be used by the third generator. Then, the encoder and decoder generate a fine outline. Moreover, we use a dense skip connection to combine the features from different scales. To validate our multi-stage model, we use the Dice coefficient metric.

III. METHODOLOGY
In this section, we present the design of our proposed multi-stage GAN model. We first describe the GAN model that we used in more detail. Then, we describe the loss functions (discriminator and generator) that we used. Table 2 shows a list of the symbols defined in this paper.

A. GENERATIVE ADVERSARIAL NETWORK (GAN)
A generative adversarial network (GAN) is composed of two networks: a generator G, which is used to generate fake images from a noise vector, and a discriminator D, which is used to distinguish between generated data and true data. In particular, G is trained to map a noise vector z ∈ R to a fake image, whereas D is trained to differentiate between true data x and generated data G(z). The core idea behind GANs is to play a two-player min/max game:
$$\min_G \max_D \; \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$$
Figure 4 shows an overview of our proposed GAN network, which mainly consists of the 3-stage generator network and the discriminator network. The discriminator is used to distinguish between true and generated data. The first generator is mainly used to generate an outline for the background and brain tissues from the input images. The second generator takes two inputs: an image x and a random vector z. This encourages the generator to generate as many different values for each x as those of z. Specifically, training a network with a random vector z and an image x encourages the network to give better output. The third generator is used to generate an outline for (i) white matter (WM), (ii) gray matter (GM), and (iii) cerebrospinal fluid (CSF).
The main role of the first G is to generate a coarse segmentation that can be used to guide the third G. The main role of the third G is to generate more detailed results using the coarse segmentation from the first G. The third G consists of an encoder and a decoder. The encoder and decoder use a dense skip connection to combine the features from different scales. Figure 5 shows the network architecture of the encoder, decoder, and dense skip connection.
We used the generator proposed by Dai et al. [41] and changed it as follows:
1. K classes are changed to (K + 1) classes.
2. The segmentation network is changed to be fully convolutional.
We used the discriminator network proposed by Çiçek et al. [45] and changed it as follows:
VOLUME 10, 2022

B. LOSS FUNCTION
1) DISCRIMINATOR LOSS FUNCTION
The discriminator in our model has an unlabeled data loss, labeled data loss, and refined prediction loss. The overall loss function is computed as follows:
$$l_{\text{discriminator}} = \lambda_{\text{labeled}}\, l_{\text{labeled}} + \lambda_{\text{unlabeled}}\, l_{\text{unlabeled}} + \lambda_{\text{fake}}\, l_{\text{fake}} \tag{9}$$
where $\lambda_{\text{labeled}}$, $\lambda_{\text{unlabeled}}$, and $\lambda_{\text{fake}}$ are hyper-parameters. We set the hyper-parameters in Equation 9 to $\lambda_{\text{labeled}} = 1.0$, $\lambda_{\text{unlabeled}} = 1.0$, and $\lambda_{\text{fake}} = 2.0$.
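The weighting in Equation 9 can be sketched in a few lines. The function name and the three per-term loss values below are hypothetical; only the weights 1.0, 1.0, and 2.0 come from the text.

```python
def weighted_discriminator_loss(l_labeled, l_unlabeled, l_fake,
                                w_labeled=1.0, w_unlabeled=1.0, w_fake=2.0):
    """Weighted sum of the labeled, unlabeled, and fake loss terms (Equation 9)."""
    return w_labeled * l_labeled + w_unlabeled * l_unlabeled + w_fake * l_fake

# Hypothetical per-term loss values for one training step.
print(weighted_discriminator_loss(l_labeled=0.40, l_unlabeled=0.25, l_fake=0.10))
```

With these weights, an error on fake data counts twice as much as an equally large error on labeled or unlabeled data.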
For labeled data, we use the same loss function as in the standard segmentation network. Mondal et al. [35] showed how the softmax function changes when $l_{i,K+1}$ is used as a subtracted function. The idea is to introduce an unlabeled loss and a fake loss, which are analogous to the two components of the discriminator loss in the standard GAN, whereas the labeled loss represents the cross-entropy. More details can be found in [35].

2) GENERATOR LOSS FUNCTION
We propose a novel generator loss to encourage G to generate realistic data. Let x and z denote the real data and noise, respectively.
In our paper, f(x) contains the activation of the last layer.
By minimizing this loss, we force the generator to generate data that match the real data and the corresponding K classes of real data, which are defined as Classes = 1, ..., K.

IV. EXPERIMENTS
This section presents our experimental design and evaluation. First, we give a more detailed description of the datasets used in our experiments. Then, we show our experimental setup. Finally, we explain the Dice coefficient of the segmentation evaluation.

A. DATASETS
1) DATASETS
In our experiments, we use two different datasets of brain images: the MICCAI iSEG dataset and the MRBrainS dataset. The MICCAI iSEG-2017 dataset contains data of 6-month-old infants, whereas the MRBrainS-2013 dataset contains adult data. We should note that there are significant differences between the two datasets in terms of image data characteristics, such as voxel spacing and the number of available modalities. However, these two datasets were both used to evaluate the state-of-the-art models in this context [46], [47]. We describe each of these datasets in the following.

2) MICCAI iSEG-2017 DATASET
The aim of the evaluation framework introduced by the MICCAI iSEG organizers is to compare segmentation models of WM, GM, and CSF on T1 and T2 images. The MICCAI iSEG dataset contains 10 subjects, named subject-1 through subject-10, each with a T1-weighted image, a T2-weighted image, and a manual segmentation label; these are used as a training set. The dataset also contains 13 subjects, named subject-11 through subject-23, used as a testing set. An example of the MICCAI iSEG dataset (T1, T2, and manual reference contour) is shown in Figure 6, and Table 3 shows the parameters used to generate T1 and T2. The dataset has two different times, the longitudinal relaxation time and the transverse relaxation time, which are used to generate T1 and T2. The dataset has been interpolated, registered, and skull-removed by the MICCAI iSEG organizers. We present the evaluation equations in subsection IV-B.

3) MRBrainS-2013 DATASET
The MRBrainS dataset contains 20 adult subjects for segmentation of (a) cortical gray matter, (b) basal ganglia, (c) white matter, (d) white matter lesions, (e) peripheral cerebrospinal fluid, (f) lateral ventricles, (g) cerebellum, and (h) brain stem on T1, T2, and FLAIR. Five subjects, 2 male and 3 female, are provided for the training set and 15 subjects are provided for the testing set. To evaluate the segmentation, these structures were merged into gray matter (a-b), white matter (c-d), and cerebrospinal fluid (e-f). The cerebellum and brain stem were excluded from the evaluation.
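The merging step described above can be sketched as a simple lookup. The sketch keys each structure by the letter used in the list above rather than by the numeric label IDs of the dataset files, which are not given here; the names `MERGE`, `EXCLUDED`, and `merge_structures` are hypothetical.

```python
# Merged evaluation classes, keyed by the structure letters (a)-(h) used in the text.
MERGE = {
    "a": "GM", "b": "GM",    # cortical gray matter, basal ganglia
    "c": "WM", "d": "WM",    # white matter, white matter lesions
    "e": "CSF", "f": "CSF",  # peripheral cerebrospinal fluid, lateral ventricles
}
EXCLUDED = {"g", "h"}        # cerebellum and brain stem are not evaluated

def merge_structures(structures):
    """Map structure letters to merged classes; None marks excluded structures."""
    merged = []
    for s in structures:
        if s in EXCLUDED:
            merged.append(None)
        else:
            merged.append(MERGE[s])
    return merged

print(merge_structures(["a", "d", "f", "g"]))
```

Voxels mapped to None would simply be skipped when the evaluation metric is computed.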

4) EXPERIMENTAL SETUP
We implement our proposed model using Python on a computer with an NVIDIA GPU and Ubuntu 16.04. Training our model took 30 hours in total, whereas testing took 5 minutes for.

B. SEGMENTATION EVALUATION
1) DICE COEFFICIENT (DC)
To better demonstrate the significance of our model, we use the Dice Coefficient (DC) metric for evaluation. The DC has been considered a baseline (benchmark) in the literature for comparing brain segmentation models. We use $V_{\text{ref}}$ for the reference segmentation and $V_{\text{auto}}$ for the automated segmentation. The DC is given by the following equation [41]:
$$DC(V_{\text{ref}}, V_{\text{auto}}) = \frac{2\,|V_{\text{ref}} \cap V_{\text{auto}}|}{|V_{\text{ref}}| + |V_{\text{auto}}|}$$
where DC values lie in the range [0, 1], with 1 corresponding to perfect overlap and 0 indicating total mismatch.
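As an illustration of the metric, the sketch below computes the standard Dice overlap, 2|V_ref ∩ V_auto| / (|V_ref| + |V_auto|), for two binary masks flattened to sequences of 0 and 1. The function name is hypothetical, and returning 1.0 for two empty masks is an assumption made here to avoid a division by zero.

```python
def dice_coefficient(v_ref, v_auto):
    """Dice overlap between two binary masks given as equal-length 0/1 sequences."""
    if len(v_ref) != len(v_auto):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for r, a in zip(v_ref, v_auto) if r and a)
    total = sum(1 for r in v_ref if r) + sum(1 for a in v_auto if a)
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / total

# Toy masks: one voxel is shared, and each mask contains two foreground voxels.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

For a multi-class segmentation, the same function would be applied once per tissue (CSF, GM, WM) by binarizing each label map against that class.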

C. EVALUATING THE HYPER-PARAMETERS MULTI-STAGE
To evaluate the effectiveness of our model, we evaluate different hyper-parameters: training epochs, learning rate, and batch size. Table 4, Table 5, and Table 6 show the results for training epochs, learning rate, and batch size, respectively. We observe that, with a batch size of 30, the model achieves 95%, 94%, and 92% for CSF, GM, and WM, respectively. A large number of training epochs can cause overfitting, whereas a small number can cause underfitting. To mitigate these issues, we examine whether the number of training epochs significantly impacts the network performance. To do this, we train for 20, 40, 60, and 80 epochs. We observe that the network performance is best at 80 epochs. We followed a similar approach to select the best learning rate. A large learning rate updates the network parameters quickly, whereas a small learning rate updates them slowly.
To address this, we first start with a learning rate of $1 \times 10^{-1}$. Then, we perform multiple runs while changing the learning rate. Experimental results show that our multi-stage model achieves its best result with a learning rate of $1 \times 10^{-4}$.

V. RESULTS AND DISCUSSION
To better demonstrate the significance of our model, we train and test our multi-stage GAN model on two datasets of different ages, infants and adults, as follows:
• MICCAI iSEG-2017 dataset
  - The 13 unlabeled images, which are actually part of the testing set, are used to train our GAN model.
  - Of the 10 labeled images, we use two for training, one for validation, and seven for testing.
• MRBrainS-2013 dataset
  - The 15 unlabeled images, which are actually part of the testing set, are used to train our multi-stage GAN model.
  - Of the five labeled images, we use one for training, one for validation, and three for testing.
The main goal of our multi-stage GAN model is to improve the performance in the few-shot learning case. Table 7 presents the results of our model for segmenting CSF, GM, and WM using the MICCAI iSEG dataset, in comparison with the state-of-the-art models. Our model achieves a DC value of 95% in CSF segmentation, whereas the DC values obtained by the state-of-the-art models for CSF range between 86% and 91%. In addition, our model achieves DC values of 94% and 92% in segmenting GM and WM, respectively. The state-of-the-art models, on the other hand, obtain DC values in the ranges of 80%-93% for GM segmentation and 81%-90% for WM segmentation. Such results highlight the remarkable efficiency gained by using a multi-stage GAN.
Table 8 presents the results achieved by our model using the MRBrainS dataset, in comparison with the state-of-the-art models. We observe that our model achieves a DC value of 93% on CSF segmentation, 93% on GM segmentation, and 88% on WM segmentation. Such results surpass the results achieved by the state-of-the-art models. Therefore, we argue that our model can perform better in the few-shot learning case.
Table 9 shows the execution time (in minutes) for our multi-stage GAN model, in comparison with the state-of-the-art models.
We observe that the execution of our proposed model is faster than the state-of-the-art models. Such results indicate that our model is more efficient and, hence, more practical to be used in real-time systems. Figure 7 visualizes the results of our model on the images used as a validation set. We observe that the segmentation results achieved by our multi-stage model are fairly close to the manual reference contour, i.e., ground truth, provided by the MICCAI iSEG organizers.

VI. THREATS TO VALIDITY
A. EXTERNAL VALIDITY
Threats to external validity are related to the generalizability of our results. In this paper, we use two datasets that belong to two organizers. The total number of subjects in the two datasets is 43. One could argue that the datasets do not have enough samples. We mitigate this threat by using two datasets that (a) contain both infant and adult brain data and (b) were previously used by prior studies. Our model obtains a higher performance than the state-of-the-art models. We believe that our model performs similarly to human learning in clinical practice. Moreover, while we only targeted three tissues, our proposed model can be easily extended to segment more tissues, as it does not require more labeled data. The intuition behind our multi-stage model is that it improves the performance in the few-shot learning case, where only a few labeled data are available for training.
B. INTERNAL VALIDITY
Threats to internal validity are related to experimental errors and bias. Our model is constructed using data extracted from medical images in which contrast might be low. We use small-size kernels, a deconvolution layer (to upsample the input), PReLU, dropout, and normalization methods to reduce the risk of overfitting. Hence, any potential deficiency in the data should affect all the implemented models. Nevertheless, our model obtains higher performance than previous models.

VII. CONCLUSION
In this paper, we proposed a multi-stage generative adversarial network (GAN) model for brain segmentation that first generates a coarse outline for both background and brain tissues. Then, our model generates an outline for white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). We evaluated our results using both infant and adult datasets, in comparison with three baseline state-of-the-art models. We found that our segmentation results are fairly close to the manual reference. In addition, we observed that our model surpasses the state-of-the-art models by achieving a performance improvement of up to 5%. In particular, we obtained Dice coefficients (DC) ranging between 88% and 95%. Such results indicate that the adoption of our multi-stage GAN model has significantly improved segmentation results. We argue that our model is more efficient and accurate in practice for both infant and adult brain segmentation.
Despite the promising results obtained from our proposed model, we believe that further improvements can be achieved in the future. We aim in the future to consider more datasets in our study. Moreover, we intend to expand the evaluation of our multi-stage model to investigate its performance on segmenting more brain tissues. Finally, we aim to investigate whether our multi-stage model achieves a higher performance for pathological brain images, such as with tumor or edema.

A. COMPETING INTERESTS
The authors declare that they have no known competing financial interests.

B. ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The ethics committee of Huazhong University of Science and Technology approved our study, and the committee's reference number is 430074.

C. CONSENT FOR PUBLICATION
Not applicable.

D. AVAILABILITY OF DATA AND MATERIALS
The data that support the findings of this study are publicly available from the MICCAI grand challenge on 6-month infant brain MRI segmentation [48] and from MRBrainS.