Brain Extraction From Brain MRI Images Based on Wasserstein GAN and O-Net

Brain extraction is an essential pre-processing step for neuroimaging analysis. High-precision extraction is difficult on low-quality brain MRI images with artifacts and gray-level inconsistencies, which often result in irregular holes in the extracted brain tissue. In addition, U-Net based brain extraction methods tend to output over-smoothed brain boundaries. To remove the irregular holes in the extracted mask, we propose a new U-Net based model for brain extraction named O-Net. O-Net replaces the skip-connection path of U-Net with dual shortcut paths, one of which includes an attention module, forming an O-shaped network that uses deep semantic information to highlight the target area while retaining more image details. O-Net effectively reduces the impact on the extraction results of intensity differences caused by artifacts or gray-level inconsistencies in brain MRI images. To identify the brain boundary more accurately, we design a new GAN based brain extraction method that uses O-Net as the segmentation network. The discrimination network of the proposed GAN model adopts a residual structure to enhance the nonlinear expression ability of the network and to balance the adversarial training of the two networks. To speed up the convergence of the proposed model, a segmentation loss is added to the adversarial loss to supervise the feature learning of the segmentation network. The method was compared with other popular brain extraction methods on two public datasets (IBSR18 and LPBA40). The mean Dice similarity coefficients obtained by the proposed method were 97.26% and 98.29% on IBSR18 and LPBA40, respectively, the best results among the compared methods on both datasets. Experimental results show that the proposed model stably outputs high-precision brain tissue extraction images and is only slightly affected by artifacts and gray-level inconsistencies.


I. INTRODUCTION
With the wide application of magnetic resonance imaging (MRI) equipment in clinical medicine, neuroimaging analysis becomes more and more popular in the fields of brain disease diagnosis and brain function analysis. Brain extraction is an important pre-processing step in most neuroimaging analysis, having a crucial impact on subsequent research such as registration of brain MRI, measurement of brain volume, and brain tissue classification. Therefore, an accurate and stable brain extraction method is needed in neuroimaging analysis.
The associate editor coordinating the review of this manuscript and approving it for publication was Ziyan Wu.
Manual extraction has the highest accuracy, but it consumes a lot of time and effort. Automatic methods are less accurate than manual extraction but far more efficient. Automatic extraction is affected by many factors and remains a challenging field. First, brain MRI images usually show low contrast, low resolution, and uneven gray distribution due to the diversity of devices and imaging protocols. Furthermore, the structure of the brain is complex, so it is difficult to determine the boundary between brain tissue and non-brain tissue. Second, artifacts caused by the patient's involuntary movement and by equipment adjustment appear as distorted, overlapped, missing, or blurred regions in brain MRI images. These artifacts may prevent the model from extracting the correct target features, leading to serious deviations in the results. Last, intensity differences between individuals make it difficult for one method to consistently produce good results across different brain scans. Pathological changes in the brain may also cause morphological changes, which increase the difficulty of accurate extraction. With the in-depth study of brain extraction algorithms, the problems of low resolution, low contrast, and uneven gray levels have been largely solved. However, existing models still cannot handle artifacts well, and their adaptability still needs further study.
Many automatic brain extraction methods have been proposed. They can be divided into classic methods [1]-[7], atlas-based methods [8], [9], and learning-based methods [10]-[20]. Some classic automatic methods are still widely used because of their high calculation speed and their support for batch processing of data. Smith [1] proposed BET, which pushes grid points toward the brain boundary by a set of locally adaptive forces. BET has stable performance and handles some complex areas in brain MRI images, such as the eyeballs, well. Therefore, many researchers have proposed improved brain extraction methods based on BET [2]-[4]. However, these classic methods depend heavily on parameter settings, and their extraction accuracy does not meet clinical requirements. Compared with the classic methods, atlas-based brain extraction performs well in terms of accuracy and stability, but it relies on registration. If the registration has a large error, the subsequent operations are affected. ROBEX, proposed by Iglesias et al. [8], is known for its stability. However, if its registration step fails, the extraction cannot continue. Moreover, the performance of atlas-based methods is also limited by the slow speed of registration.
Learning-based methods can be divided into machine learning-based and deep learning-based methods according to how features are extracted. Many machine learning algorithms have been applied to brain extraction, such as hidden Markov algorithms [10], [11], meta-algorithms [12], K-means [13], and Bayesian methods [14]. However, in machine learning-based methods almost all features need to be determined by experts and then coded manually. As large databases have been deliberately collected and curated, deep learning, which relies on large amounts of data, has been studied in depth and applied widely. Deep learning tries to learn features from the data itself. Among deep learning models, the convolutional neural network (CNN) extracts the shallow and deep semantic information of an image through multiple hidden layers. CNN-based methods have low sensitivity to the low contrast, low resolution, and uneven gray levels of brain MRI images.
Kleesiek et al. [15] were the first to use an end-to-end extraction method based on a 3D convolutional network for skull stripping. As the most widely used model in medical image segmentation, U-Net, proposed by Ronneberger et al. [16], is the basic network of many brain extraction models. For example, Salehi et al. [17] proposed Auto Context U-Net, which added a local predictive brain mask to U-Net to obtain higher-precision results than Kleesiek et al. [15]. Hwang et al. [18] extended U-Net from 2D pixels to 3D voxels to make fuller use of spatial information for skull stripping. However, U-Net cannot handle data sets whose samples differ greatly. The brain mask predicted by U-Net on a data set with severe artifacts or varying intensity will have large-scale irregular missing regions. Most brain extraction methods based on U-Net are very likely to have the same problems.
Generative adversarial networks (GAN) are well known for their powerful data fitting capabilities and for an adversarial training scheme that differs from that of other networks. In the field of image segmentation, researchers use adversarial training to help the segmentation network better learn the mapping between samples, thereby improving segmentation accuracy. Models combining GAN and CNN have been widely used in medical image segmentation [19]-[21]. Chen et al. [19] combined a dense expansion network and GAN to segment neuron cells. Moeskops et al. [20] used the same idea as Chen et al. to achieve brain tissue segmentation. Thirumagal and Saruladha [21] combined ResNet [22] and GAN to construct FCSE-GAN for accurate segmentation of brain tumor regions. At present, there is little research on whole brain segmentation based on such combined networks.
Based on the above observations and analysis, this paper proposes a new brain extraction model, WGAN+O-Net, in which WGAN (Wasserstein GAN) [23] performs stable adversarial training to improve the accuracy of our proposed segmentation network, O-Net. O-Net introduces attention modules into U-Net to form a new shortcut that connects the corresponding feature maps on the encoding and decoding paths. This structure not only preserves more detailed image features, but also highlights the target area of each channel by using deep semantic information. O-Net can effectively reduce the influence on the extraction results of intensity differences caused by artifacts or gray-level inconsistencies. At the same time, the discriminator in WGAN uses a residual structure to enhance the nonlinear performance of the network and to balance the adversarial training of the two networks.
In the remainder of this paper, we first review the related work of the proposed brain extraction method, then give a detailed description of the proposed method, and finally verify the performance of the model through experiments on healthy brain MRI and pathological brain MRI.

II. RELATED WORK
A. ATTENTION MECHANISM
The attention mechanism in deep learning is similar to the selective visual attention mechanism of human beings. Its essence is to locate information of interest and suppress useless information. The results are usually expressed as probability maps or probability feature vectors. Attention models can be divided into three categories according to their principle: spatial attention models, channel attention models, and mixed spatial-channel attention models. Spatial attention models include AG by Oktay et al. [24], STN proposed by DeepMind [25], and SAM proposed by Zhu et al. [26]. Channel attention models include SENet proposed by Hu et al. [27] and ECANet, which Wang et al. [28] developed as an improvement on SENet. Mixed spatial-channel attention modules include CBAM proposed by Woo et al. [29] and SANet by Zhang and Yang [30].
VOLUME 9, 2021
The advantage of these attention models is that they can easily be added to a CNN without large-scale changes to the model structure. For example, the Attention U-Net proposed by Oktay et al. [24] adds an attention gate (AG) directly on the skip connections of U-Net to supervise the previous-level feature map with the next-level feature map. AG is a spatial attention model, so the importance of features at different spatial positions can be controlled during U-Net training. This allows AG to suppress the feature responses of irrelevant regions.

B. WGAN-DIV
The emergence of GAN opened up research on unsupervised learning of complex distributions, but its training suffers from mode collapse. To solve this problem, many improved GANs have been proposed. The Wasserstein distance [23] was introduced to address the difficulty of training GANs. WGAN is easier to train than the original GAN and eliminates mode collapse to a certain extent. However, WGAN introduces the Lipschitz (L) constraint.
To satisfy the L-constraint, many methods have been proposed. The original solution, given in the WGAN paper [23], is weight clipping. Gulrajani et al. [31] proposed a gradient penalty on the discriminator (WGAN-GP). Miyato et al. [32] proposed spectral normalization. However, these methods share a problem: they restrict the discriminator to a small family of functions. Wu et al. [33] proved that the optimization objective of the discriminator in WGAN-GP is not always a divergence, and proposed WGAN-DIV with a theoretically complete Wasserstein divergence. The Wasserstein divergence can satisfy the L-constraint while retaining the excellent properties of the Wasserstein distance for stable training. In the definition of the Wasserstein divergence, C^1 denotes the family of first-order continuous functions, r denotes the true distribution, and q denotes the generated distribution. WGAN-DIV places very few requirements on the random sampling distribution, and the effects of various sampling methods are similar. After a series of experiments, the authors of [33] found that the effect is best when K = 2 and P = 6. Based on the above analysis, we choose the Wasserstein divergence in WGAN-DIV as the loss function of our generative adversarial network.
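For reference, a commonly quoted form of the Wasserstein divergence is sketched below. This is a hedged reconstruction rather than the exact expression used in this paper: the sign convention (a maximization with a subtracted gradient term) and the symbol u for the random sampling distribution are assumptions, and the notation in [33] may differ. Here K is the coefficient of the gradient term and P is its exponent.

```latex
W_{K,P}(r, q) \;=\; \max_{f \in C^{1}} \;
  \mathbb{E}_{x \sim r}\bigl[f(x)\bigr]
  \;-\; \mathbb{E}_{x \sim q}\bigl[f(x)\bigr]
  \;-\; K \, \mathbb{E}_{\hat{x} \sim u}\Bigl[\,\lVert \nabla f(\hat{x}) \rVert^{P}\Bigr]
```

Under this convention the choice f = 0 gives the value 0, so the maximum is never negative, which is what one expects of a divergence.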

III. METHOD
The general architecture of the WGAN+O-Net model we proposed is illustrated in Figure 1. The network can be divided into two parts: segmentation network and discrimination network. We introduce the above two networks in turn, and then introduce the adversarial training of the network.

A. SEGMENTATION NETWORK
In the proposed brain extraction model, the purpose of the segmentation network is to output a predictive brain mask that can stand in for the manual labels and deceive the discrimination network. The encoding-decoding architecture of U-Net with skip connections integrates features of different scales and has few parameters, so it is well suited to data with relatively simple semantics and fixed structure, such as medical images. Based on U-Net, we propose O-Net to improve brain extraction results on brain MRI with serious artifacts or inconsistent gray-scale distribution.
In the proposed network, the encoder has 8 convolutional components, each of which consists of, in order: a convolution layer with a kernel size of 4 × 4 and a stride of 2, a BN layer, a Leaky ReLU activation layer, a convolution layer with a kernel size of 3 × 3 and a stride of 1, a BN layer, and a Leaky ReLU activation layer. Throughout the encoding path, the max-pooling layer is replaced by the convolution layer with a stride of 2 to achieve down-sampling. The initial number of feature channels is 64, and it then increases exponentially until it reaches 512. The proposed network is therefore deeper than the baseline U-Net. The deepened network provides a wider receptive field and more feature maps of different scales. However, as the network deepens, more spatial feature information is lost through down-sampling. Therefore, we add an attention path between the encoder and the decoder, alongside the original skip connection, to locate the target region features in feature maps of different scales. Furthermore, the two parallel skip connection paths retain more detailed features when training a deep model. As shown in Figure 1, the skip connection path and the attention path form a closed path that looks like the letter 'O'. Therefore, we named the network O-Net.
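The effect of eight stride-2 components on feature map size can be checked with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the text: a 256 × 256 input, a padding of 1 for both convolutions, and channel counts that double at each component until capped at 512. The function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_schedule(input_size=256, levels=8, base_channels=64, max_channels=512):
    """Return (spatial_size, channels) after each encoder component.

    Assumptions: padding of 1 everywhere and channel doubling capped at
    max_channels; neither is specified in the paper text.
    """
    schedule = []
    size = input_size
    for level in range(levels):
        # 4 x 4 convolution with stride 2 performs the down-sampling.
        size = conv_output_size(size, kernel=4, stride=2, padding=1)
        # 3 x 3 convolution with stride 1 keeps the spatial size.
        size = conv_output_size(size, kernel=3, stride=1, padding=1)
        channels = min(base_channels * 2 ** level, max_channels)
        schedule.append((size, channels))
    return schedule


for level, (size, channels) in enumerate(encoder_schedule(), start=1):
    print(f"component {level}: {size} x {size} x {channels}")
```

Under these assumptions the spatial size halves at every component, so a 256 × 256 input reaches 1 × 1 after the eighth component.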
The key component of the attention path is the attention module, which is marked with a capital letter A inside the red-filled circle in Figure 1. Each attention module has two inputs, x and g. x is the shallow feature map from the encoding path. g is the feature map obtained by transposed convolution of the output of the previous level in the decoding path. The output of the attention module is labeled x'. x and x' are concatenated and then undergo a convolution with a kernel size of 3 × 3. The output of this convolution, after transposed convolution, becomes the input g of the next attention module. The detailed internal structure of the attention module is shown in Figure 2(a). First, the two inputs (x and g) are superimposed pixel-wise, so that the high-level semantic information in g controls the relative importance of the feature information at each position in x. Then, the irrelevant features in the feature map are suppressed by a convolution layer with a kernel size of 3 × 3, followed by a BN layer and a ReLU activation layer. The values of the resulting attention matrix are normalized to the range 0 to 1 by a convolution layer with a kernel size of 3 × 3, followed by a BN layer and a Sigmoid activation layer. Finally, the attention coefficient matrix of each channel is multiplied by the corresponding channel of x to output a new feature map x'. The attention module uses deeper semantic information to guide the feature map of the current layer in adjusting its feature weights. Therefore, the encoding path down-samples until the feature map is 1 × 1, which helps the model obtain a wider receptive field.
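The final gating step of the attention module can be illustrated in a few lines. The sketch below covers only the Sigmoid normalization and the per-channel multiplication; the preceding convolution and BN layers are omitted, and the attention logits are supplied directly as an assumption. The function name and the nested-list layout [C][H][W] are hypothetical.

```python
import math


def gate_features(x, logits):
    """Scale every channel of x by its own Sigmoid attention map.

    x and logits are nested lists shaped [C][H][W]. The logits stand in for
    the output of the module's second convolution and BN layer.
    """
    gated = []
    for x_channel, logit_channel in zip(x, logits):
        gated.append([
            [value / (1.0 + math.exp(-logit))  # value * sigmoid(logit)
             for value, logit in zip(x_row, logit_row)]
            for x_row, logit_row in zip(x_channel, logit_channel)
        ])
    return gated


# Two channels, one row, two columns.
x = [[[1.0, 2.0]], [[3.0, 4.0]]]
logits = [[[0.0, 0.0]], [[50.0, -50.0]]]
x_prime = gate_features(x, logits)
print(x_prime)
```

In this toy input the first channel is halved everywhere, while the second channel keeps its first position and suppresses its second, which shows that each channel is weighted by its own attention map.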
The attention module proposed in this paper draws on the idea of the attention gate (AG) in Attention U-Net, but it differs from AG in its internal structure and in its attention matrix. In the layers that suppress useless features and perform normalization, we use convolutional layers with a kernel size of 3 × 3 to obtain more features, which distinguishes the output from the feature information provided by the other skip-connection path and provides more features for the decoding path. At the same time, this operation extends the receptive field of the attention module to a deeper level, so that it is not limited to the features of the previous layer and the current layer. The number of feature channels in AG changes as follows: C (input) → C / 2 (after the first convolution) → 1 (after the second convolution). The number of channels in the proposed attention module changes as follows: C (input) → C / 2 (after the first convolution) → C (after the second convolution). The attention coefficient matrix obtained by AG has size H × W × 1, which means that the feature map of every channel of x uses the same attention matrix to adjust its spatial weights. Although the attention matrix output by AG integrates the information of all channels, using the same attention for all channels leads to deviations in the extent of the highlighted target area. The attention coefficient matrix obtained by the proposed attention module has size H × W × C, which means that the feature map of each channel has its own attention matrix. The proportion of features is thus adjusted according to the different semantic information represented on different channels. This targeted adjustment helps the model better understand the feature distribution of the image.
The purpose of the decoder is to restore the low-resolution feature map with high-level semantic information to the same resolution as the input image. The network structure is symmetrical, so the decoder needs to up-sample 8 times.
After the two input feature maps of the decoder are concatenated along the channel dimension, they are sent to a convolution with a kernel size of 3 × 3, followed by a BN layer and a Leaky ReLU activation layer, for dimension reduction and feature fusion. After transposed convolution, the output is used as the input of the next attention module. The dual-skip path facilitates the propagation and update of gradients and provides more detailed feature information. Therefore, the proposed decoding architecture makes the edge of the output prediction mask finer. The proposed O-Net has feature maps with a larger receptive field to deal with artifacts, which usually vary between images and are not confined to one region of an image. At the same time, the attention module obtains image context information to increase the weight of brain tissue regions. The purpose of the final model is to weaken the influence of artifacts.
To show how O-Net improves brain extraction, Figure 3 shows the outputs for a typical image under three processing schemes. All three used WGAN-DIV, but with different generative networks: WGAN+U-Net, WGAN+Attention U-Net, and WGAN+O-Net used U-Net, Attention U-Net, and O-Net as the generative network, respectively. In Figure 3(b), because of the artifacts in the image, a lot of brain tissue is missing in the middle of the extraction result obtained by WGAN+U-Net. By adding the attention module to U-Net, the large-scale missing areas in Figure 3(b) are recovered in Figure 3(c), but the accuracy is not significantly improved. By providing more detailed information for the up-sampling path of the network, our proposed O-Net not only avoids holes in the prediction mask but also significantly improves the accuracy.

B. DISCRIMINATION NETWORK
The discrimination network learns the difference between the ground truth and the output of the segmentation network in order to penalize the segmentation network effectively. It provides a learning signal so that the segmentation network can output the predictive brain mask closest to the ground truth. Therefore, the stronger the discriminating ability of the discrimination network, the better the segmentation network. We deepened the discrimination network along with the segmentation network to enhance its discrimination ability and to balance the adversarial training of the two networks. Modified convolutional residual blocks (ResNet v2 [34], which is suited to building deeper networks) were used to form the layers of the discrimination network in Figure 1. The internal structure of the modified res-block is shown in Figure 2(b). The residual block consists of two convolutional paths built from Leaky ReLU activations, convolutional layers with a kernel size of 3 × 3, and 2 × 2 max-pooling components. The output of the left path, which includes two 3 × 3 convolutional layers, has the same receptive field as the output of one 5 × 5 convolutional layer but requires fewer parameters. The outputs of the two convolutional paths are added, so the residual block provides multiple receptive fields. We do not use BN layers in the discriminator, and we removed its last activation layer, to ensure that the model can be trained stably.
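The claim that two stacked 3 × 3 convolutions match the receptive field of one 5 × 5 convolution with fewer parameters can be verified with a short calculation. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the function names and the channel count of 64 are illustrative.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of convolutions applied in sequence."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # strides enlarge later contributions
    return field


def conv_weight_count(kernel, in_channels, out_channels):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


channels = 64  # illustrative value only
two_3x3 = 2 * conv_weight_count(3, channels, channels)
one_5x5 = conv_weight_count(5, channels, channels)
print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))
print(two_3x3, one_5x5)
```

For C input and output channels, the two 3 × 3 layers use 18C² weights against 25C² for the single 5 × 5 layer, a reduction of 28%.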

C. ADVERSARIAL TRAINING
The generative adversarial network is trained alternately to minimize the objective function. Adversarial training is implemented by iterating two steps. First, the discriminator is trained in order to improve its ability to discern authenticity. The discriminator loss is based on the Wasserstein divergence of WGAN-DIV: C^1 refers to the first-order continuous function family, x is a sampling point on the line between the real distribution P_r and the generator distribution P_g, and ∇ is the gradient operation applied to the discriminator.
The generator of the model is trained in the second step. In order to speed up the convergence of the model and better cooperate with the segmentation network, we added the segmentation loss function L_s to the generator's loss function. In L_s, l_mce is the class cross-entropy and x is data from the true distribution.
Cost minimization over 10 epochs was performed using the ADAM optimizer with an initial learning rate of 0.0001 for both the segmentation network and the discrimination network. All model performance evaluation experiments were done on workstations with an NVIDIA GeForce GTX 1080 Ti.
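The two training objectives can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sign convention (the discriminator scores real masks high), the binary form of the cross-entropy, and the weight `seg_weight` on the segmentation loss are all hypothetical, and the gradient norms are passed in as precomputed numbers. K = 2 and P = 6 follow the values quoted for WGAN-DIV.

```python
import math


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake, grad_norms, k=2.0, p=6.0):
    """Wasserstein-divergence style loss, to be minimized by the discriminator.

    d_real / d_fake: discriminator scores for ground-truth and predicted masks.
    grad_norms: norms of the discriminator gradient at the sampling points.
    """
    penalty = k * mean([norm ** p for norm in grad_norms])
    return -mean(d_real) + mean(d_fake) + penalty


def cross_entropy(probabilities, labels, eps=1e-7):
    """Pixel-wise binary cross-entropy (assumed form of the segmentation loss)."""
    total = 0.0
    for prob, label in zip(probabilities, labels):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total += -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return total / len(labels)


def generator_loss(d_fake, probabilities, labels, seg_weight=1.0):
    """Adversarial term plus the segmentation loss L_s (weight is hypothetical)."""
    return -mean(d_fake) + seg_weight * cross_entropy(probabilities, labels)


print(discriminator_loss([1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
print(generator_loss([0.5], [0.5], [1]))
```

The segmentation term gives the generator a direct pixel-level training signal in addition to the discriminator's feedback, which is consistent with the faster convergence reported in the ablation experiments.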

IV. EXPERIMENTS
A. DATASETS
We evaluated and verified the performance of the proposed model on three brain MRI data sets, two of which are public and one of which is private.
IBSR18: comes from the Internet Brain Segmentation Repository (IBSR), which is dedicated to the study of brain segmentation algorithms. The 18 T1 scans of IBSR18 come from healthy humans and have a resolution of 0.94 × 1.5 × 0.94 mm. The 2D image sizes in the three directions of this dataset (cross section, coronal plane, and sagittal plane) are 256 × 128, 256 × 256, and 128 × 256, and the corresponding numbers of images are 4068, 2304, and 4068. The overall quality is poor, with severe artifacts.
LPBA40: comes from the LONI website of the University of Southern California, Los Angeles. The 40 T1 scans of LPBA40 come from healthy humans and have a resolution of 0.86 × 1.5 × 0.86 mm. The 2D image sizes in the three directions of this dataset (cross section, coronal plane, and sagittal plane) are 256 × 124, 256 × 256, and 124 × 256, and the corresponding numbers of images are 10240, 4952, and 10240. The overall quality is clearer and better.
Brain MRI data sets are small, so different divisions of the data can greatly affect the measured performance of the model. To reflect the true level of the model as closely as possible, we used K-fold cross-validation for sampling: 10-fold cross-validation for LPBA40 (training : test = 9 : 1) and 9-fold cross-validation for IBSR18 (training : test = 8 : 1).
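The fold sizes implied by these ratios can be sketched as below. The example assumes that folds are formed at the level of whole scans (so that slices from one subject never appear in both training and test sets) and that scans are assigned to folds in index order without shuffling; the text does not state either detail, and the function name is hypothetical.

```python
def subject_folds(n_subjects, n_folds):
    """Yield (train_ids, test_ids) pairs, holding out one block of scans per fold."""
    if n_subjects % n_folds != 0:
        raise ValueError("n_subjects must be divisible by n_folds")
    fold_size = n_subjects // n_folds
    subjects = list(range(n_subjects))
    for fold in range(n_folds):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_ids = subjects[start:stop]
        train_ids = subjects[:start] + subjects[stop:]
        yield train_ids, test_ids


for name, n_subjects, n_folds in [("LPBA40", 40, 10), ("IBSR18", 18, 9)]:
    train_ids, test_ids = next(subject_folds(n_subjects, n_folds))
    print(name, len(train_ids), "training scans,", len(test_ids), "test scans")
```

With 40 scans and 10 folds each fold trains on 36 scans and tests on 4 (9 : 1); with 18 scans and 9 folds each fold trains on 16 and tests on 2 (8 : 1).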
Dataset with lesions: comes from a hospital and consists of MRI images with meningioma, cerebral ischemic stroke, and pituitary tumor. There are 30 T1 scan sets, 10 for each type of lesion. Each scan set contains only 15 slices, and each slice is 256 × 256. The labels for this data set were annotated under guidance by the researchers of this paper. In order to avoid uncertainty, they were used only in cross-dataset experiments to test the adaptability of the model.

B. EVALUATION METRICS
To quantitatively evaluate the extraction methods, three evaluation indexes were calculated: Dice, sensitivity, and specificity. The Dice coefficient measures the similarity between the prediction mask and the ground truth:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP is true positive, FP is false positive, and FN is false negative. TP is the total number of pixels correctly classified as brain tissue in the prediction label. FP is the total number of pixels incorrectly classified as brain tissue in the prediction label. FN is the total number of pixels incorrectly classified as non-brain tissue. Sensitivity represents the ability of a brain extraction method to correctly recognize brain tissue:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity represents the ability of a brain extraction method to correctly recognize non-brain tissue:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

where TN is true negative, the total number of pixels correctly classified as non-brain tissue.
The values of the Dice coefficient, sensitivity, and specificity range from 0 to 1. The larger the values of these three evaluation coefficients are, the more accurate the brain tissue extraction results are.
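The three metrics can be computed from pixel counts as sketched below, assuming the standard definitions Dice = 2TP / (2TP + FP + FN), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP), with masks given as flat sequences of 0 and 1. The function names are illustrative.

```python
def confusion_counts(prediction, ground_truth):
    """Count TP, FP, FN, TN over two flat binary masks of equal length."""
    tp = fp = fn = tn = 0
    for predicted, actual in zip(prediction, ground_truth):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    # Undefined (division by zero) when both masks are empty.
    return 2 * tp / (2 * tp + fp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


prediction = [1, 1, 0, 0, 1]
ground_truth = [1, 0, 0, 1, 1]
tp, fp, fn, tn = confusion_counts(prediction, ground_truth)
print(dice(tp, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

Because brain masks contain many more background pixels than boundary errors, specificity tends to sit close to 1 for most methods, which is why Dice and sensitivity are usually the more discriminating of the three.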

C. RESULTS
We evaluated the performance of the proposed model (WGAN+O-Net) through several comparative experiments with popular brain extraction algorithms and with similar models. The brain extraction methods in the comparative experiments include the baseline model (U-Net [16]), deep learning-based methods proposed specifically for brain extraction (Kleesiek et al. [15], Auto Context U-Net [17]), and three non-deep learning methods (HFEM-E [35], BET [2], and ROBEX [8]). The similar models include WGAN+Attention U-Net, WGAN+U-Net, and Pix2pix+U-Net. The evaluation results of each method on IBSR18 and LPBA40 are listed in Table 1.
Compared with other algorithms proposed specifically for brain extraction, WGAN+O-Net obtained the highest Dice score on the two public datasets. On IBSR18, the Dice score of WGAN+O-Net is significantly higher than those of the other methods (U-Net: 2.4%, Auto Context U-Net: 4.57%, Kleesiek: 0.94%, HFEM-E: 1.76%, BET: 4.91%, and ROBEX: 5.58%). On LPBA40, the difference between the score of WGAN+O-Net and that of every other method is more than 0.5% (U-Net: 0.61%; Auto Context U-Net: 0.56%; Kleesiek: 1.33%; BET: 4.77%; and ROBEX: 1.73%). We used a paired t-test to obtain p-values showing the statistical difference between the proposed method and the other methods in the three evaluation indexes. On the two public datasets, all p-values in Dice that can be calculated are lower than 0.01, which shows that the accuracy of WGAN+O-Net differs very significantly from that of the other methods. In terms of sensitivity, ROBEX performed best on IBSR18, followed by BET; our method performed best on LPBA40, followed by BET. In terms of specificity, Kleesiek performed best on IBSR18, followed by U-Net; our method performed best on LPBA40, followed by U-Net. Overall, WGAN+O-Net has the best sensitivity and specificity on LPBA40, but it does not perform as well on IBSR18. However, on IBSR18 the p-values for our method and for the second-ranked method are both greater than 0.05, which indicates that there is no significant difference between them.
Figure caption: The overlays of two typical MRI scans from LPBA40 and IBSR18 and their corresponding prediction masks produced by various brain extraction methods. Blue pixels represent the segmentation results of the corresponding brain extraction methods. Green pixels indicate the false negatives and red pixels indicate the false positives. Regions that are difficult to segment are marked with rectangles and enlarged. WGAN+O-Net produces more accurate brain segmentation than the other brain extraction methods.
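The statistic behind a paired t-test on per-scan scores can be sketched as follows. The example computes only the t statistic and the degrees of freedom; turning these into a p-value requires the Student t distribution, which is outside the Python standard library, so that step is left out. The scores shown are made-up illustrative values, not results from this paper, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    # stdev is the sample standard deviation (n - 1 in the denominator).
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical per-scan Dice scores for two methods on the same four scans.
method_a = [0.97, 0.98, 0.96, 0.99]
method_b = [0.95, 0.97, 0.93, 0.98]
t_value, dof = paired_t_statistic(method_a, method_b)
print(t_value, dof)
```

Pairing matters here because both methods are evaluated on the same scans, so the test is applied to the per-scan differences rather than to the two score lists independently.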
Compared with similar models, WGAN+O-Net still obtains the highest Dice value on the two datasets. WGAN+O-Net is at least 1.26% higher than the second-highest model, WGAN+U-Net, on IBSR18, and at least 0.5% higher than the second-highest model, Pix2pix+U-Net, on LPBA40. The values in Table 1 show that all methods perform better on LPBA40 than on IBSR18. Image quality affects the extraction methods and results in unstable output. Compared with other methods, our method is more stable across the two data sets. Figure 5 shows box plots of the evaluation values of the different brain extraction methods in the comparative experiments on the two public datasets. The purpose of the box plots is to show intuitively how the different algorithms perform on brain extraction. Since the evaluation values of Kleesiek and HFEM-E are quoted directly from the papers, box plots of these two methods are not included in Figure 5. In the box plots of WGAN+O-Net on all datasets, every box stays at a high level; the distribution of normal values is concentrated; most boxes are symmetrical and have very few outliers. The boxes in the plots of U-Net are also relatively concentrated and, except for sensitivity, have few outliers, but their values are not high. On LPBA40, Auto Context U-Net has a more scattered distribution than the other two deep learning models. On IBSR18, the median of each indicator in its box plot is not as good as that of BET. On LPBA40 and IBSR18, the box plots clearly show that the traditional brain extraction methods (BET and ROBEX) have many outliers, heavy tails, and large deviations from the median. This indicates that their extraction results are usually not as good as those of methods based on deep learning.
The above analysis of the comparative experiments shows the superiority of our method in accuracy. Next, we verified the adaptability of the model by cross-dataset experiments. This experiment used the healthy sample scans of LPBA40 as the training set and the diseased sample scans as the test set. A training set and a test set with large differences were selected in order to explore the adaptability limit of the model as far as possible. Table 2 shows the evaluation results of the predicted masks obtained by different methods in the cross-dataset experiment. Our method achieves the highest Dice coefficient, which is at least 3.26% higher than the best of the other methods. It also obtains the highest sensitivity, which is at least 3.59% higher than the other methods, while its specificity is 0.78% lower than the highest value. Although Auto Context U-Net obtained a specificity of nearly 1, its sensitivity is too low, which means that its brain extraction results are too conservative.
Figure 6 shows the results of the cross-dataset experiments. There is no prediction mask of ROBEX in the figure, because it failed to register the brain MRI images with lesions, which is the main problem of atlas-based brain extraction methods. The first row in Figure 6 is the ground truth (GT). A round, white area can be clearly observed on the coronal and sagittal planes of the GT. This area is the meningioma lesion. The gray level of this lesion area is clearly whiter than that of brain tissue, so it is easily recognized by a model as non-brain tissue. Only part of the lesion was identified by U-Net. The result of Auto Context U-Net, which relies on prior information, was worse. The extraction results of BET left too many non-brain tissue parts, such as the eyeballs. Although the result of WGAN+O-Net is not as smooth as the ground truth, its segmentation accuracy is the highest.
It can completely segment the tumor region, where failure is most likely, in the sagittal plane. The above two experiments demonstrate the high accuracy and adaptability of the proposed model. To show that these properties result from our improvements to the model, we carried out ablation experiments. The improved modules were removed from the model one at a time, and each reduced model was then trained on the same dataset. The mean evaluation results of the different variants in the ablation experiment are listed in Table 3. First, the accuracy of the model decreased to 41.54% after the segmentation loss function (L_S) was removed. This result indirectly indicates that the segmentation loss is what allows WGAN+O-Net to converge quickly. Figure 7 supplements Table 3 by recording the outputs of the segmentation network under the two loss settings at particular points during training. At training round 2900, WGAN+O-Net with L_S already output the approximate outline of the predicted brain mask, while the model without L_S only produced a grayscale distribution that differed greatly from the label. At training round 42200, the model with L_S could already output a prediction mask with edge details, while the model without L_S still showed obvious adhesion. This verifies that adding the segmentation loss to the adversarial loss greatly improves the convergence speed of the model and reduces the training time. WGAN+O-Net(D) denotes the variant in which the residual blocks in the discrimination network are replaced with ordinary convolutional layers. The extraction accuracy dropped by 4.03% after this replacement. WGAN+O-Net(G) denotes the variant in which the segmentation network O-Net is replaced with U-Net. The accuracy of the model decreased by 2.7% after this replacement. These two results show that the segmentation network O-Net and the residual blocks of the discrimination network both contribute effectively to the model.
The results of ablation experiments show that the improvement of each part of the model has a positive impact on the brain extraction task and makes an effective contribution to the improvement of accuracy.
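The role of the segmentation loss in the ablation can be illustrated with a minimal numerical sketch of a combined objective. This is not the authors' implementation. It assumes that L_S is a pixel-wise binary cross-entropy, that the adversarial terms follow the standard WGAN critic formulation, and that the two terms are combined with a hypothetical weight `seg_weight`. The Lipschitz constraint on the critic (weight clipping or a gradient penalty) is omitted, and all function names are illustrative.

```python
import numpy as np


def segmentation_loss(prob, label, eps=1e-7):
    """Assumed form of L_S: mean pixel-wise binary cross-entropy."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    label = np.asarray(label, dtype=float)
    bce = -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))
    return float(np.mean(bce))


def critic_loss(scores_real, scores_fake):
    """Standard WGAN critic loss: E[D(fake)] - E[D(real)], to be minimized."""
    return float(np.mean(scores_fake) - np.mean(scores_real))


def generator_loss(scores_fake, prob, label, seg_weight=1.0):
    """Adversarial term -E[D(fake)] plus the weighted segmentation loss.

    seg_weight is a hypothetical hyperparameter, not a value from the paper.
    Setting seg_weight=0 corresponds to the ablated model without L_S.
    """
    adversarial = -float(np.mean(scores_fake))
    return adversarial + seg_weight * segmentation_loss(prob, label)


# Toy values (hypothetical): an uninformative prediction of 0.5 everywhere.
label = np.array([[1.0, 0.0], [1.0, 0.0]])
prob = np.full((2, 2), 0.5)
scores_fake = np.array([0.2, 0.4])   # critic scores for predicted masks
scores_real = np.array([1.0, 0.6])   # critic scores for ground-truth masks

print(critic_loss(scores_real, scores_fake))
print(generator_loss(scores_fake, prob, label, seg_weight=0.0))
print(generator_loss(scores_fake, prob, label, seg_weight=2.0))
```

With `seg_weight=0.0` the segmentation network receives feedback only through the critic scores, whereas a positive weight adds direct per-pixel supervision from the label, which is one plausible reading of why training with L_S converges faster.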
Compared with the basic U-Net model, the new brain extraction model not only adds a discrimination network but also increases the depth of the network. WGAN+O-Net therefore has more parameters than U-Net. Although the ablation experiments confirmed that each improvement contributes positively to brain extraction accuracy, they cannot explain whether the gains in accuracy and artifact suppression are due to the improved structure or simply to the increased number of parameters. To further verify the contribution of the model structure to brain extraction, we evaluated and analyzed models of different depths and structures; the results are shown in Table 4.
In Table 4, Deep-U-Net refers to a model in which the 4 down-sampling steps of U-Net are increased to 8. Compared with U-Net, Deep-U-Net has feature maps at more scales and more parameters. In theory, Deep-U-Net should perform better. However, Table 4 shows that the Dice coefficient of Deep-U-Net is 91.26%, which is 3.6% lower than that of U-Net. Simply adding convolutional layers did not improve performance. When U-Net and Deep-U-Net were each combined with WGAN, the extraction accuracy increased significantly. Note that the discrimination network and loss function of WGAN here are the improved ones. The Dice coefficient of WGAN+U-Net is 1.14% higher than that of U-Net, and the Dice coefficient of WGAN+Deep-U-Net is 6.04% higher than that of Deep-U-Net. This result suggests that adversarial training helps the segmentation network reach better extraction accuracy, alleviates network degradation, and lets the deep network deliver the performance expected of it. In fact, WGAN+Deep-U-Net achieves the highest Dice coefficient in this experiment, 0.04% higher than even the proposed WGAN+O-Net model. However, the results of WGAN+Deep-U-Net on samples with severe artifacts or uneven gray levels contain holes, as shown in Figure 8(a). Figure 8(b) shows the extraction result of WGAN+O-Net on the same brain MRI slice. The difference between the WGAN+O-Net model and the WGAN+Deep-U-Net model is whether the dual shortcut path with the channel-corresponding attention module is used. Although its overall extraction accuracy is slightly lower than that of WGAN+Deep-U-Net, the proposed WGAN+O-Net handles low-quality images, such as those with artifacts or uneven gray-scale distribution, better. This indicates that the proposed O-Net can indeed suppress artifacts.
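The Dice values implied by the differences quoted above can be recovered with simple arithmetic. The sketch below assumes that every quoted difference is an absolute difference in percentage points (not a relative change); only the 91.26% value for Deep-U-Net is stated explicitly in the text, and the remaining values are derived, not read from Table 4.

```python
# Dice of Deep-U-Net as stated in the text (percent).
deep_unet = 91.26

# Derived values, assuming the quoted differences are in percentage points.
unet = deep_unet + 3.60             # Deep-U-Net is 3.6 lower than U-Net
wgan_unet = unet + 1.14             # WGAN+U-Net is 1.14 higher than U-Net
wgan_deep_unet = deep_unet + 6.04   # WGAN+Deep-U-Net is 6.04 higher than Deep-U-Net
wgan_onet = wgan_deep_unet - 0.04   # WGAN+O-Net is 0.04 lower than WGAN+Deep-U-Net

implied_dice = {
    "U-Net": round(unet, 2),
    "Deep-U-Net": round(deep_unet, 2),
    "WGAN+U-Net": round(wgan_unet, 2),
    "WGAN+Deep-U-Net": round(wgan_deep_unet, 2),
    "WGAN+O-Net": round(wgan_onet, 2),
}

for name, value in implied_dice.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption the implied WGAN+O-Net value is 97.26%, which agrees with the mean Dice reported for IBSR18 in the abstract, and the implied gap to WGAN+U-Net is 1.26 percentage points, which matches the margin quoted in the comparative experiment.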

V. DISCUSSION
The brain extraction methods evaluated in this article perform differently on the two datasets. ROBEX performed normally on LPBA40, without major recognition errors, but it was affected by artifacts and inconsistent gray-scale distribution: its extraction results on IBSR18 retained large parts of the skull. BET shows strong stability on the two datasets, although its extraction results are not good. The boundary extracted by BET is too smooth, and a small part of the brain can easily be misidentified. The sensitivity and specificity values of the latest method, HEFM-E, rank low among these methods, indicating that its ability to recognize both brain tissue and non-brain tissue is not good enough. Compared with traditional brain extraction methods, CNN-based methods make no serious errors. U-Net achieves good extraction results on brain MRI datasets with good image quality, such as LPBA40, but it can produce unstable results on datasets with uneven quality, such as IBSR18. U-Net depends largely on image quality. When it processes images with artifacts, or images whose local gray distributions are inconsistent with most other images, the extraction results show large irregular missing areas. Like U-Net, the results obtained by Auto context U-Net also depend on the quality of the dataset. This method tends to exclude some brain tissue (see Figure 5). A possible reason is that artifacts degrade the effectiveness of the auto-context architecture.
Although it is based on U-Net, the proposed WGAN+O-Net is more stable than U-Net. WGAN+O-Net achieves a further improvement in accuracy on a good-quality dataset (LPBA40), and it can also process images with artifacts or uneven gray levels in a poor-quality dataset, IBSR18 (see Figure 3 and Figure 8). WGAN+O-Net achieved the highest accuracy on both public datasets, and its results differ significantly from those of the other compared methods. The adaptability of WGAN+O-Net has also been verified by a cross-dataset experiment with very large differences between the training set (LPBA40) and the test set (a brain dataset with lesions). In the test scans, the lesion may squeeze the brain tissue and blur the brain boundary, so a model that has not learned the characteristics of the lesion area can easily make wrong judgments. WGAN+O-Net locates the brain tissue by increasing the focus on the target region, which improves the model's sensitivity to foreground pixels, and the participation of more scales gives it a more comprehensive receptive field. At the same time, the model benefits from the strong fitting ability of the generative adversarial network. Therefore, high-precision extraction (95.51% on average) is still maintained in the cross-dataset experiment with large differences. WGAN+O-Net has a high rate of correct recognition of the lesion area (see Figure 4).
Although WGAN+O-Net has the above advantages, it also has some limitations. One is that the method cannot yet achieve cross-modal extraction; that is, a model trained on T1-weighted MRI images cannot be applied to T2-weighted images. The other is that we studied only 2D brain MRI slices. Future work will therefore focus on GAN-based brain extraction from multi-modal MRI images and on extending the 2D method to 3D extraction.

VI. CONCLUSION
In this study, we presented a new brain extraction model, WGAN+O-Net, which suppresses artifacts and thereby prevents large irregular regions in the predicted brain mask. Compared with existing brain extraction methods, WGAN+O-Net predicts the brain mask more accurately. Moreover, the method still maintains good extraction results on data that differ greatly from the training data.