Fusion of Brain PET and MRI Images Using Tissue-Aware Conditional Generative Adversarial Network With Joint Loss

Positron emission tomography (PET) provides rich pseudo-color information that reflects the functional characteristics of tissue, but it lacks structural information and has low spatial resolution. Magnetic resonance imaging (MRI) offers high spatial resolution and strong structural information of soft tissue, but it lacks the color information that shows the functional characteristics of tissue. To integrate the color information of PET with the anatomical structures of MRI and thereby help doctors diagnose diseases better, a method for fusing brain PET and MRI images using a tissue-aware conditional generative adversarial network (TA-cGAN) is proposed. Specifically, the process of fusing brain PET and MRI images is treated as an adversarial game between retaining the color information of PET and preserving the anatomical information of MRI. More specifically, the fusion of PET and MRI images is regarded as a min-max optimization problem with respect to the generator and the discriminator, where the generator attempts to minimize the objective function by generating a fused image that mainly contains the color information of PET, whereas the discriminator tries to maximize the objective function by urging the fused image to include more structural information of MRI. Both the generator and the discriminator in TA-cGAN are conditioned on the tissue label map generated from the MRI image, and are trained alternately with a joint loss. Extensive experiments demonstrate that the proposed method enhances the anatomical details of the fused image while effectively preserving the color information from the PET image. In addition, compared with other state-of-the-art methods, the proposed method achieves better fusion results both in subjective visual perception and in objective quantitative assessment.


I. INTRODUCTION
Positron emission tomography (PET), a nuclear medicine imaging technology, provides a color image with functional information that reflects the metabolism of different tissues. However, a PET image has low spatial resolution and lacks structural information of tissues [1]. On the other hand, magnetic resonance imaging (MRI), another non-invasive imaging tool, presents strong soft tissue structure information with higher spatial resolution. However, an MRI image lacks the color information that reflects the metabolic function of specific tissues [2], [3]. Therefore, effectively integrating PET with MRI via image fusion can provide more meaningful complementary information. In other words, the fused image not only retains the spatial structure information of MRI, but also preserves the color information of PET. As a result, this kind of complementary information can better assist the clinical diagnosis and treatment of diseases [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.
Over the past few decades, different types of methods for fusing PET and MRI images have been developed. These methods can be roughly categorized into four classes according to their implementation mechanisms. The first is the IHS (Intensity-Hue-Saturation) based method, which relies on transformation and replacement strategies [5]-[7]. This type of method first transforms the PET image from the RGB color space into the IHS color space; then replaces the I component of the transformed PET image with the matched MRI image (the MRI and PET images need to be registered in advance); and finally transforms the PET image with the substituted I component from the IHS color space back to the RGB color space to obtain the fused image. This kind of approach usually generates a fused image that contains rich structural information with high resolution, but it generally distorts the color information of the PET image because the substituted MRI image is rather different from the replaced I component of the PET image. The second type of method for merging PET and MRI images is implemented by the multi-resolution analysis (MRA) strategy [8]-[10]. This kind of method first decomposes the PET and MRI images into multi-scale coefficients and transforming bases; then merges the decomposed coefficients according to a certain fusion rule; and finally inversely transforms the fused coefficients and transforming bases to get the final fused image. This kind of method can effectively preserve the color information of PET, but it has limitations in enhancing the spatial structure information of the fused image. Moreover, one of the key issues confronting the MRA approach is designing the specific fusion rule, which is crucial to the fusion result.
The third type of method for fusing PET and MRI images is the sparse representation (SR)-based method [11], [12]. This type of approach first solves the sparse representation coefficients for the PET and MRI images, respectively; then merges the calculated coefficients via a specific fusion rule; and lastly reconstructs the target image using the fused sparse coefficients and a predefined or learnt over-complete dictionary. Sparse representation has achieved remarkable results in image fusion. However, in most of the proposed SR-based methods for fusing PET and MRI images, the dictionary is learnt or constructed using the entire image, i.e., image patches are extracted from the entire image to learn or construct a global dictionary. Since the structural similarities among the image patches are not considered while learning or constructing the global dictionary, the sparse coefficients solved with this kind of global dictionary are not well suited to accurately reconstructing the target image [13]. Furthermore, similar to the MRA-based fusion method, designing the specific fusion rules is also an unavoidable issue for the SR-based fusion method. Last but not least, the fourth class comprises methods inspired by other ideas, such as the nonparametric density model-based method [14] and the ant colony optimization-based method [15].
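The IHS substitution strategy described above can be illustrated with a minimal sketch. It assumes the simple linear IHS model in which the intensity is the mean of the R, G, and B values; under that assumption, replacing I with the registered MRI intensity and transforming back is equivalent to adding the difference (MRI - I) to each channel. The function names and the clipping to the 0-255 range are illustrative choices, not part of the cited methods.

```python
def ihs_fuse_pixel(rgb, mri):
    """Fuse one PET pixel (r, g, b) with one registered MRI intensity.

    Assumes the linear IHS model I = (R + G + B) / 3. Substituting the
    MRI intensity for I and inverting the transform reduces to shifting
    every channel by (mri - I).
    """
    r, g, b = rgb
    intensity = (r + g + b) / 3.0
    delta = mri - intensity
    # Clip to the 8-bit range (an illustrative choice).
    return tuple(min(255.0, max(0.0, c + delta)) for c in (r, g, b))


def ihs_fuse(pet, mri):
    """Apply the substitution to whole images given as nested lists."""
    return [[ihs_fuse_pixel(p, m) for p, m in zip(pet_row, mri_row)]
            for pet_row, mri_row in zip(pet, mri)]
```

Because every channel is shifted by the same amount, a large gap between the MRI intensity and the PET intensity changes the color balance of the pixel, which is consistent with the color distortion noted for this class of methods.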
Although these recent advanced methods achieve remarkable performance, one of the major problems they share is the design of the fusion rule. Unfortunately, the fusion rules in most existing approaches are manually designed and have become more and more complicated. As a result, fusion schemes with these complex hand-crafted fusion rules inevitably suffer from limitations such as implementation difficulty and time-consuming computation.
In recent years, deep learning (DL) has become one of the most attractive topics in the field of computer vision due to its strong ability to extract image features. Correspondingly, in the field of image fusion, DL has also been successfully applied to various applications, such as remote sensing image fusion [16]-[18], multi-focus image fusion [19]-[21], and medical image fusion [22]. Liu et al. [23] comprehensively summarized DL-based methods for image fusion in detail. Most of the proposed DL-based methods for image fusion are based on convolutional neural networks (CNNs), for which a critical prerequisite must be satisfied, i.e., the ground truth should be available in advance. However, in the task of fusing PET and MRI images, it is nearly impossible to establish the ground truth because defining a standard for the final fused images is unrealistic. Moreover, in order to complete the image fusion task, most of the proposed CNN models require additional post-processing procedures because they are not designed in an end-to-end manner [24].
More recently, the generative adversarial network (GAN) has drawn a tremendous amount of attention, and has been successfully applied to various applications in the field of computer vision and machine learning, especially to image synthesis [25]-[27]. In the particular case of image fusion, Ma et al. [28] first applied the GAN to image fusion, i.e., the fusion of infrared and visual images. Ma et al. [29] further improved the image fusion algorithm for infrared and visual images by adding an edge-enhancement constraint. Guo et al. [30] proposed an algorithm for multi-focus image fusion using a conditional GAN. To the best of our knowledge, there are no reports on the application of GAN and its variants to the fusion of medical images.
According to the above analysis, and inspired by [28], we propose a method for brain PET and MRI image fusion through the generative adversarial mechanism. Specifically, conditioned on the multiple input images together with the tissue label map generated from the input MRI image, a novel end-to-end tissue-aware framework based on the conditional generative adversarial network (TA-cGAN) is proposed. Similar to the original GAN, the training procedure of our proposed TA-cGAN is like a two-player min-max game in which the generator and the discriminator are trained simultaneously with the goal of one beating the other, i.e., the generator attempts to output a fused image that mainly contains the color information of PET, whereas the discriminator tries to urge the fused image to include more anatomical information of MRI. Furthermore, our proposed TA-cGAN is an end-to-end model, in which the fused image is generated automatically from the combination of the source images and the tissue label map without manually designing complicated fusion rules.
VOLUME 8, 2020
The remainder of this paper is organized as follows. In Section II, related works regarding the generative adversarial network and the conditional generative adversarial network are briefly reviewed. Section III details our proposed method. Experiments and analysis are presented in Section IV. The concluding remarks are given in Section V.

II. RELATED WORKS

A. GENERATIVE ADVERSARIAL NETWORK
The generative adversarial network was first proposed by Goodfellow et al. [31] in 2014, and has drawn considerable attention in the field of machine learning and computer vision. The GAN is a generative model that consists of two adversarial networks, namely the generator G and the discriminator D. The generator attempts to generate fake but plausible samples, whereas the discriminator tries to distinguish between the generated samples and the real samples. Specifically, the generator learns to capture the real data distribution and then generates new plausible samples so as to fool the discriminator, while the discriminator learns to distinguish the model-generated distribution from the real data distribution. The two networks are trained against each other until the discriminator is unable to tell whether a sample comes from the generator or not.
Mathematically, in the original GAN, D and G are trained in a competitive fashion by solving the following min-max optimization problem:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
where $x$ is a real sample from the true dataset, and $z$ is the noise; $G(\cdot)$ and $D(\cdot)$ denote the output of the generator G and the discriminator D, respectively; $P_{data}$ denotes the real data distribution, and $P_z$ denotes the prior distribution of the noise. G tries to minimize the objective function in (1), whereas D attempts to maximize it.
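As a small numerical illustration of the objective in (1), the following sketch estimates the value function from finite batches of discriminator outputs. The function name and the list-based inputs are hypothetical; the expectations are simply replaced by sample means.

```python
import math


def gan_value(d_real, d_fake):
    """Sample-mean estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that separates the two sets well (outputs near 1 on real samples and near 0 on generated ones) yields a larger value than one that outputs 0.5 everywhere, which is the quantity D maximizes and G minimizes.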

B. CONDITIONAL GENERATIVE ADVERSARIAL NETWORK
In the standard GAN, there is no control over the modes of the synthesized data. It is possible, however, to guide the sample synthesis by conditioning the GAN on auxiliary information, such as class labels, text information, or data from other modalities. Hence, Mirza and Osindero [32] extended the basic GAN framework to the conditional generative adversarial network (cGAN) by feeding the auxiliary information into both the generator and the discriminator as an extra input layer.
Mathematically, in the cGAN, D and G are trained by solving the following two-player optimization problem:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z \mid y)))] \tag{2}$$
where $y$ is the auxiliary input, which could be any kind of extra information.

III. PROPOSED METHOD

A. PIPELINE OF PROPOSED METHOD
The main goal of this study is to design a method for fusing a pseudo-color PET image with a grayscale MRI image so as to obtain a fused image with as much meaningful complementary information as possible. In particular, the conditional generative adversarial network is employed to carry out the fusion of PET and MRI images. More specifically, we regard the PET and MRI image fusion task as a two-player adversarial game between the generator and the discriminator, where both the generator and the discriminator are conditioned on the tissue label map generated from the MRI image. Fig. 1 illustrates the general pipeline of our proposed method, which consists of training and testing stages, where $I_P$ stands for the PET image, $I_M$ stands for the MRI image, $I_L$ denotes the tissue label map, and $I_F$ denotes the fusion result.
Assume we have a set of "pairs" of PET and MRI images, and all paired PET and MRI images are registered. We further suppose that all MRI images are segmented into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) so as to obtain the tissue label maps.
In the training stage, we first stack the PET image $I_P$ and the MRI image $I_M$ together with the tissue label map $I_L$. We then feed the concatenated "image" into the generator and obtain the fused image $I_F$. Next, we input the fused image $I_F$, the MRI image $I_M$, and the tissue label map $I_L$ into the discriminator, whose goal is to distinguish $I_F$ from $I_M$. It is worth noting that both the generator and the discriminator are conditioned on the tissue label map $I_L$ and trained in a competitive fashion, i.e., the generator tries to retain more and more color information from the PET image $I_P$, while the discriminator urges the fused image $I_F$ to include more and more structural details from the MRI image $I_M$. In this way, the fused image $I_F$ gradually includes more and more anatomical details from the MRI image $I_M$. Once the image produced by the generator (i.e., $I_F$) can no longer be distinguished by the discriminator, the final fused image $I_F$ is obtained. In the testing stage, we first concatenate the PET image $I_P$ and the MRI image $I_M$ together with the corresponding tissue label map $I_L$, then input the concatenated "image" into the trained generator, and finally get the fused image $I_F$.

B. JOINT LOSS FUNCTION
In the framework of the traditional GAN, the generator G aims to generate samples from random noise that follows a prior probability distribution $z \sim P_z(z)$. In our proposed TA-cGAN, instead of using random noise as the input, we condition the model on multiple images from different modalities, i.e., the PET image $I_P$, the MRI image $I_M$, and its corresponding label map $I_L$. Furthermore, different from the conventional GAN, in which the log-likelihood cost is used for the adversarial loss, we adopt the least squares loss, which has been shown to boost training stability and to generate high-quality images [33]. During the training process, to suit the PET and MRI image fusion task, instead of using only the adversarial loss to train the generator G, our proposed TA-cGAN utilizes a joint loss that includes the spectral loss $L_{Spec}$, the structural loss $L_{Str}$, and the adversarial loss $L_{Adv}$. Mathematically, the joint loss used in our work is expressed as follows:
$$L = \lambda_1 L_{Spec} + \lambda_2 L_{Str} + \lambda_3 L_{Adv} \tag{3}$$
where the spectral loss $L_{Spec}$ urges the fused image to contain color information (characterized by the pixel intensities of the PET image) similar to that of the PET image; the structural loss $L_{Str}$ attempts to make the fused image have structural information (characterized by the gradients of the MRI image) similar to that of the MRI image; the adversarial loss $L_{Adv}$ aims to add more detailed information to the fused image; and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the corresponding weights for the spectral loss, the structural loss, and the adversarial loss, respectively.

1) SPECTRAL LOSS
Formally, the spectral loss $L_{Spec}$ is defined based on the mean square error (MSE) as follows:
$$L_{Spec} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( I_F(i, j) - I_P(i, j) \right)^2 \tag{4}$$
where $I_P$ is the original PET image; $I_F$ is the fused image generated by the generator G; and $M$ and $N$ denote the width and height of the image. The spectral loss mainly tries to make the fused image similar to the PET image in terms of pixel intensities, i.e., to make the fused image $I_F$ preserve the color information contained in the PET image $I_P$.
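The MSE-based spectral loss can be sketched in a few lines. This is an illustrative version that operates on single-channel images stored as nested lists; how the three color channels of the PET image are combined is not stated in the text, so applying the function per channel is an assumption.

```python
def spectral_loss(fused, pet):
    """Mean squared error between two equally sized single-channel images.

    Images are nested lists (rows of pixel values). For an RGB image the
    function would be applied per channel; that handling is an assumption.
    """
    n_rows = len(fused)
    n_cols = len(fused[0])
    total = 0.0
    for f_row, p_row in zip(fused, pet):
        for f, p in zip(f_row, p_row):
            total += (f - p) ** 2
    return total / (n_rows * n_cols)
```

The loss is zero only when the fused image reproduces the PET intensities exactly, which is why it is balanced against the structural and adversarial terms rather than minimized alone.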

2) STRUCTURAL LOSS
The structural loss $L_{Str}$ is defined in (5) based on the image gradient difference, where $\nabla_x$ and $\nabla_y$ denote the gradient operation on the image with respect to the horizontal and vertical directions, respectively. The structural loss attempts to minimize the magnitude difference of the gradients between the MRI image and the fused image, i.e., to make the fused image $I_F$ retain the gradient information contained in the MRI image $I_M$.
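One possible form of a gradient-difference loss is sketched below. The exact expression of (5) is not reproduced here, so the details are assumptions: gradients are approximated by forward differences, and the loss is the mean squared difference between the fused-image and MRI gradients, summed over the horizontal and vertical directions. All function names are hypothetical.

```python
def grad_x(img):
    """Horizontal forward difference; output has one fewer column."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]


def grad_y(img):
    """Vertical forward difference; output has one fewer row."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(len(img) - 1)]


def _mean_sq_diff(a, b):
    """Mean squared difference between two equally sized nested lists."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(r) for r in a)
    return total / count


def structural_loss(fused, mri):
    """Assumed gradient-difference loss between the fused and MRI images."""
    return (_mean_sq_diff(grad_x(fused), grad_x(mri))
            + _mean_sq_diff(grad_y(fused), grad_y(mri)))
```

Because only gradients are compared, adding a constant offset to the fused image leaves this loss unchanged, so the term constrains edges and structure rather than absolute intensity.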

3) ADVERSARIAL LOSS
The adversarial loss $L_{Adv}$ is defined in (6) based on the probabilities output by the discriminator D over the concatenated training data $I_{Concat} = \{I_P, I_M, I_L\}$ (the PET image $I_P$, the MRI image $I_M$, and its corresponding label map $I_L$), where $c$ denotes the value that the generator G wants the discriminator D to believe for fake data.
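A minimal sketch of a least-squares adversarial loss and of the weighted joint loss follows. It assumes the usual least-squares GAN form, in which the generator loss is the mean of (D(fake) - c)^2, with constant scale factors omitted. The discriminator targets of 1 for the MRI pair and 0 for the fused pair are assumptions, as are the function names; the default values c = 0.9, λ1 = λ2 = 1, and λ3 = 0.5 are the settings reported in Section IV-A.

```python
def lsgan_generator_loss(d_fake, c=0.9):
    """Least-squares adversarial loss for G: mean of (D(fake) - c)^2."""
    return sum((d - c) ** 2 for d in d_fake) / len(d_fake)


def lsgan_discriminator_loss(d_real, d_fake, real_label=1.0, fake_label=0.0):
    """Least-squares loss for D; the 1/0 target labels are assumptions."""
    real_term = sum((d - real_label) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - fake_label) ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def joint_loss(l_spec, l_str, l_adv, lam1=1.0, lam2=1.0, lam3=0.5):
    """Weighted sum of the three generator losses, as in (3)."""
    return lam1 * l_spec + lam2 * l_str + lam3 * l_adv
```

With these defaults the adversarial term contributes half the weight of the spectral and structural terms, so it acts as a refinement on top of the two reconstruction-style losses.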

C. NETWORK ARCHITECTURE
Similar to the original GAN, the proposed TA-cGAN also consists of two sub-networks, i.e., the generator and the discriminator. However, different from the original GAN which is mainly used for image-to-image translation, our proposed TA-cGAN is designed for images-to-image translation, i.e., it takes multiple images as input (the PET image $I_P$, the MRI image $I_M$, and its corresponding label map $I_L$) and outputs one fused image $I_F$. The network architectures of the TA-cGAN are detailed as follows.

1) NETWORK ARCHITECTURE OF GENERATOR
In our proposed method, the generator is constructed based on the U-Net [34]. The U-Net utilizes the skip connection technique to integrate the low-level features coming from the shallow encoder layers and the high-level features coming from the deep decoder layers. Moreover, skip connections can partially alleviate the vanishing gradient problem. Owing to the idea of skip connection, the U-Net has been successfully applied to many image applications, such as image synthesis [27]. In this work, the network architecture of the generator G consists of two parts, i.e., the encoder and the decoder, as shown in Fig. 2. The inputs of the network are the PET image $I_P$, the MRI image $I_M$, and its corresponding label map $I_L$; the output of the network is the fused image $I_F$. Specifically, as illustrated in Fig. 2, the entire generator network is composed of 12 convolutional layers. The encoder part consists of 6 down-sample layers that perform convolutions using 3 × 3 filters with stride 2 in each direction, batch normalization (BN), and rectified linear unit (ReLU) activation with a negative slope of 0.2 (i.e., a leaky ReLU). Note that we do not use pooling operations, mainly because they reduce the spatial resolution of the feature maps and make the network unable to capture fine details in the MRI images. In addition, zero padding of 1 × 1 is employed in each down-sample layer. The decoder part consists of 6 up-sample layers, where the first five layers perform convolution-BN-ReLU operations, and the last layer only performs a convolution using a 1 × 1 filter. In the decoder part, the feature maps in the encoder layers are concatenated with those in the decoder layers using skip connections (as indicated by the dotted arrows in Fig. 2).
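The effect of the six stride-2 down-sample layers on the feature-map size can be checked with the standard convolution output-size formula. The sketch below assumes a 64 × 64 input, matching the patch size used for training in Section IV-A; the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size=64, num_layers=6):
    """Feature-map side length after each down-sample layer of the encoder."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes
```

Under these assumptions each layer halves the side length, so a 64 × 64 patch is reduced to a 1 × 1 bottleneck after the sixth layer, and the six up-sample layers of the decoder restore the original size.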

2) NETWORK ARCHITECTURE OF DISCRIMINATOR
Different from the generator G, the discriminator D is mainly designed for solving a classification problem. Specifically, in this study, the major goal of the discriminator D is to distinguish the fused image pair (the fused image $I_F$ and the label map $I_L$) from the MRI image pair (the MRI image $I_M$ and the label map $I_L$). Fig. 3 illustrates the network architecture of the discriminator D used in our study. As shown in Fig. 3, the input of the discriminator D is either the fused image pair or the MRI image pair, and the output of the network is the class label, i.e., distinguished (labeled by 1) or not (labeled by 0).
Briefly, as illustrated in Fig. 3, our network architecture of the discriminator D is a simple convolutional neural network consisting of 5 convolutional layers and 1 fully connected layer followed by a sigmoid activation function. The five convolutional layers, similar to the encoder structure of the generator G, perform the convolution-BN-ReLU operations.

D. TRAINING PARADIGM
Similar to the original GAN, the generator G and the discriminator D are trained alternately. Specifically, we first fix G and train D for one step according to the joint loss function in (3), and then fix D and train G for one step. More intuitively, the training process of the generator G and the discriminator D is like playing a two-player min-max game, where the generator G aims to minimize the loss function, whereas the discriminator D attempts to maximize it. The training process continues in this way, and both the generator G and the discriminator D gradually become more and more powerful until the termination condition of the iteration is satisfied. In the training stage, both G and D are optimized using the Adam solver [35] with β = 0.5 and a learning rate of 0.0002. Note that the settings for the other parameters during the training process, as well as the preparation of the training samples, are elaborated in Section IV-A.
In the testing stage, we first concatenate the PET image $I_P$, the MRI image $I_M$, and its corresponding label map $I_L$; then input the concatenated image into the trained generator G; and output the final fused image $I_F$.
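To make the optimizer settings concrete, the following sketch performs a single Adam update on one scalar parameter. It interprets the reported β = 0.5 as the first-moment decay rate β1 and uses the common default values β2 = 0.999 and ε = 1e-8; both the interpretation and the defaults are assumptions, since the text does not state them.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m, v: running first and second moment estimates; t: step count (from 1).
    beta2 and eps are common defaults, not values taken from the paper.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

In alternating training, a step of this kind would be applied to the discriminator parameters with G fixed, and then to the generator parameters with D fixed.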

IV. EXPERIMENTS AND ANALYSIS
In this section, we first introduce the experimental settings, including the experimental data and preprocessing, the parameter settings, the compared methods, and the evaluation metrics. Then we present and analyze the experimental results both visually and quantitatively.

A. EXPERIMENTAL SETTINGS

1) EXPERIMENTAL DATA AND PREPROCESSING
In order to validate the performance of our proposed method, we use the publicly available Whole Brain database (http://www.med.harvard.edu/aanlib/), which was created by the School of Medicine, Harvard University. In this study, 36 pairs of PET and MRI images are collected for the experiments. The collected images include 30 cases of normal control (NC) and 6 cases of mild Alzheimer's disease (AD). For the normal control cases, both the PET and MRI images have the same size of 256 × 256. However, for the mild AD cases, the sizes of the PET images are different from those of the MRI images, i.e., the PET images have the size of 128 × 128, whereas the MRI images have the size of 256 × 256. Therefore, the MRI images of the mild AD cases are first resized to 128 × 128. Furthermore, all the collected MRI images (including the resized MRI images) are segmented into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) by the HMRF-EM algorithm [36].
Normally, a large number of training samples is preferable for training deep neural networks. However, the number of training samples is limited in our study. Hence, in the process of training data preparation, we adopt data augmentation to expand the training samples. Moreover, instead of using the entire images as input, we take large image patches with the size of 64 × 64 as input. The preparation of the training data is detailed as follows:
(1) First, each set of paired images (the PET image, the MRI image, and its corresponding label map) is flipped from left to right, and then from top to bottom. Thus, we expand the samples from 36 pairs to 144 pairs, which include 120 pairs for the normal control cases and 24 pairs for the mild AD cases.
(2) Next, for all image pairs obtained in step (1), we crop the entire image without overlap into large image patches with the size of 64 × 64, and thus increase the training samples from 144 pairs to 2016 pairs, where 1920 pairs are cropped from the normal control images, and 96 pairs are cropped from the mild AD images.
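The two preparation steps and the resulting sample counts can be reproduced with a short sketch. It assumes that the four-fold expansion comes from keeping the original image together with its left-right flip, its top-bottom flip, and the combination of both, which is one reading consistent with the 36 to 144 count; the function names are hypothetical.

```python
def flip_lr(img):
    """Flip an image (nested list of rows) from left to right."""
    return [row[::-1] for row in img]


def flip_ud(img):
    """Flip an image from top to bottom."""
    return img[::-1]


def augment(img):
    """Return the original plus three flipped versions (4 images in total)."""
    return [img, flip_lr(img), flip_ud(img), flip_ud(flip_lr(img))]


def crop_patches(img, patch=64):
    """Split an image into non-overlapping patch x patch tiles."""
    rows, cols = len(img), len(img[0])
    return [[row[c:c + patch] for row in img[r:r + patch]]
            for r in range(0, rows - patch + 1, patch)
            for c in range(0, cols - patch + 1, patch)]


def count_patches(num_images, size, patch=64):
    """Number of training patches after 4-fold flipping and tiling."""
    per_image = (size // patch) ** 2
    return num_images * 4 * per_image
```

A 256 × 256 image yields 16 patches and a 128 × 128 image yields 4, so the 30 normal control cases give 30 × 4 × 16 = 1920 patches and the 6 mild AD cases give 6 × 4 × 4 = 96, for a total of 2016.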

2) PARAMETERS' SETTINGS
In the training stage, the network was trained using the Adam optimizer with an initial learning rate of 0.0002 and a mini-batch size of 10 over 100 epochs; the number of training iterations is set to 100. The loss weights are set to $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.5$, and $c = 0.9$, where $c$ is the label value in (6).

3) COMPARED METHODS
For the purpose of validating the performance of our proposed method, the following five state-of-the-art methods are used to compare with our method: the IHS combined with retina-inspired models (IHS-Retina) method [5], the non-subsampled shearlet transform (NSST) method [10], the low-rank sparse dictionaries learning (LSDL) method [11], the nonparametric density model (NDM) method [14], and the convolutional neural networks (CNNs) method [22].

4) EVALUATION METRICS
It is usually difficult to assess the fusion performance only via subjective visual evaluation. Therefore, it is necessary to choose some quantitative metrics to objectively evaluate the performance of different fusion methods. In this paper, we adopt the following four commonly used metrics: the entropy (EN) [37], the average gradient (AG) [38], the spectral discrepancy (SD) [5], and $Q^{AB/F}$ [39]. The definitions of these four metrics are presented in turn as follows.
EN is mainly used to measure the amount of information contained in the fused image. Mathematically, EN is formulated as follows:
$$EN = -\sum_{i=0}^{L-1} P(i) \log_2 P(i) \tag{7}$$
where $L$ denotes the number of gray levels, which is 256 in our experiments, and $P(i)$ $(i = 0, 1, \ldots, L - 1)$ is the probability of occurrence of pixels with gray level $i$ in the fused image. Normally, the larger the EN, the richer the information contained in the fused image, and the better the performance of the fusion method.
AG usually reflects the clarity of the fused image, and is mainly used to measure the spatial resolution of the fused image. Formally, AG is defined in (8), where $F_k(x, y)$ is the pixel value of the fused image at position $(x, y)$, and R, G, B are the three components of the fused image with the size of $M \times N$. In this paper, $M = N = 256$ for the normal control cases, and $M = N = 128$ for the mild AD cases. Simply put, the larger the AG, the higher the spatial resolution of the fused image.
SD is mainly used to measure the spectral (color) quality of the fused image. Mathematically, SD is expressed in (9), where $F_k(x, y)$ and $O_k(x, y)$ are the pixel values of the fused image and the original PET image at position $(x, y)$, respectively. The meanings of R, G, B and $M$, $N$ are the same as those in (8). A small SD indicates a good fusion result. In other words, a smaller SD indicates that the color of the fused image is closer to that of the original PET image.
$Q^{AB/F}$ is mainly used to measure how well the edges of the source images are preserved during fusion. $Q^{AB/F}$ is mathematically defined as follows:
$$Q^{AB/F} = \frac{\sum_{x=1}^{M} \sum_{y=1}^{N} \left[ Q^{AF}(x, y)\, \omega^{A}(x, y) + Q^{BF}(x, y)\, \omega^{B}(x, y) \right]}{\sum_{x=1}^{M} \sum_{y=1}^{N} \left[ \omega^{A}(x, y) + \omega^{B}(x, y) \right]} \tag{10}$$
where A and B denote the two source images, and F represents the fused image. $Q^{AF}(x, y)$ and $Q^{BF}(x, y)$ are the edge preservation values, which are weighted by $\omega^{A}(x, y)$ and $\omega^{B}(x, y)$, respectively. Usually, a larger $Q^{AB/F}$ means a better fusion result.
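The first three metrics can be sketched for a single channel as follows. The entropy follows the standard Shannon definition over the gray-level histogram. The average gradient and spectral discrepancy use commonly seen forms (root mean square of forward differences, and mean absolute difference from the PET image, respectively); these forms, the per-channel handling, and the function names are assumptions rather than the exact expressions in (8) and (9).

```python
import math


def entropy(gray, levels=256):
    """Shannon entropy (in bits) of an integer-valued gray-level image."""
    hist = [0] * levels
    count = 0
    for row in gray:
        for value in row:
            hist[int(value)] += 1
            count += 1
    return -sum((h / count) * math.log2(h / count) for h in hist if h > 0)


def average_gradient(channel):
    """Assumed AG form: mean of sqrt((dx^2 + dy^2) / 2) over forward differences."""
    rows, cols = len(channel), len(channel[0])
    total = 0.0
    for x in range(rows - 1):
        for y in range(cols - 1):
            dx = channel[x + 1][y] - channel[x][y]
            dy = channel[x][y + 1] - channel[x][y]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((rows - 1) * (cols - 1))


def spectral_discrepancy(fused, pet):
    """Assumed SD form: mean absolute difference between fused and PET channels."""
    total = 0.0
    count = 0
    for f_row, p_row in zip(fused, pet):
        for f, p in zip(f_row, p_row):
            total += abs(f - p)
            count += 1
    return total / count
```

For an RGB image, the last two functions would be applied to each of the R, G, and B components and the results averaged, which is again an assumption about how the channels are combined.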

B. EXPERIMENTAL RESULTS WITH VISUAL AND STATISTICAL ANALYSIS
To demonstrate the advantage of our proposed method in terms of fusion quality, in this section the proposed method is compared with the other five competing methods in two respects: subjective visual evaluation and objective quantitative assessment.

1) SUBJECTIVELY VISUAL EVALUATION
To qualitatively compare the fusion performance of the proposed method with that of the other five state-of-the-art fusion methods mentioned in Section IV-A, we visually present the fusion results for two cases of PET and MRI images, i.e., a case of normal control and a case of mild AD. Subsequently, we analyze the fusion results from two aspects, i.e., the extraction of structural details from the original MRI images and the preservation of color fidelity from the original PET images. Fig. 4 shows the fusion results of the different fusion methods for a case of normal control. Similarly, the fusion results achieved by the different fusion methods for a case of mild AD are displayed in Fig. 5. Note that, to make the differences among the fused images produced by the different methods easier to observe, the regions marked by the red rectangles are enlarged and displayed under their corresponding fused images, as shown in Fig. 4 and Fig. 5.
From Fig. 4, it can be seen that the LSDL method fails to preserve the anatomical details (pointed to by the top white arrow in the close-up region of Fig. 4).
In summary, considering both the extraction of structural details from the original MRI images and the preservation of color fidelity from the original PET images, we can conclude that the proposed method achieves better fusion results in terms of visual quality than the other five fusion methods.

2) OBJECTIVELY QUANTITATIVE ASSESSMENT
To quantitatively compare the fusion performance of the proposed method with that of the competing fusion methods, we investigate the statistical results of the different fusion methods for fusing two types of PET and MRI images, i.e., the images of normal control (NC) and the images of mild AD. In our study, four widely used objective metrics, i.e., EN, AG, SD, and $Q^{AB/F}$, are used to validate the fusion performance of the different methods. The statistical results in terms of EN, AG, SD, and $Q^{AB/F}$ are tabulated in Table 1, Table 2, Table 3, and Table 4, respectively. Note that the best performance is highlighted in bold.
From Table 1, we can see that the proposed method ranks first in terms of EN, i.e., it achieves the largest average value of EN over all fused images, including the normal control cases and the mild AD cases. This implies that the fused images produced by our proposed method contain more information, including spectral colors and anatomical structures. This is mainly due to incorporating the spectral loss as well as the structural loss into the loss function in (3).
Similarly, from Table 2, it can be observed that our proposed method achieves the largest value of AG across both the normal control images and the mild AD images. This indicates that the fused images produced by our proposed method have higher spatial resolution than those generated by the other competing methods.
Again, from Table 3, we can observe that our proposed method achieves the smallest average value in terms of SD. This means that, compared with the other fusion methods, the proposed method preserves more spectral color information from the original PET images. In other words, the spectral colors of the fused images generated by the proposed method are closer to those of the original PET images. This is mainly due to incorporating the spectral loss into the loss function in (3).
Also, from Table 4, it is clear that the proposed method ranks first in terms of $Q^{AB/F}$. In other words, the proposed method obtains the highest average value of $Q^{AB/F}$. This shows that, compared with the other fusion methods, our proposed method has a stronger ability to preserve edge details from the source images, i.e., the PET and MRI images.
Overall, from the four tables above, we can conclude that the proposed method achieves better fusion results in terms of quantitative assessment than the other five fusion methods. Specifically, the images fused by the proposed method are more informative and clearer. Moreover, our proposed method produces fused images with less color distortion and more structural details.

C. EFFECTIVENESS OF LABEL CONDITION
In our proposed method, TA-cGAN, both the generator G and the discriminator D are conditioned on the tissue label map generated by segmenting the MRI image into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). To verify the contribution of the tissue label condition to the performance improvement, we perform comparison experiments on the two cases of data using the label-conditioned model and the model without the label condition, respectively. Fig. 6 illustrates the experimental results of the two models in terms of the four evaluation metrics mentioned previously, i.e., EN, AG, SD, and $Q^{AB/F}$. As shown in Fig. 6, the tissue label-conditioned model performs better on both cases of image fusion than the model trained without the label condition. This indicates that the tissue label extracted from the MRI image is helpful for improving the fusion performance in this study.

V. CONCLUSION
In this paper, we propose a novel tissue-aware conditional generative adversarial network called TA-cGAN for fusing brain PET and MRI images. In our proposed method, both the generator G and the discriminator D are conditioned on the tissue label map generated from the MRI images. In addition, the adversarial loss, the spectral loss, and the structural loss are incorporated to capture both the spectral colors from the original PET image and the anatomical structures from the original MRI image. Extensive experiments demonstrate that our proposed TA-cGAN outperforms the state-of-the-art fusion methods both in visual perception and in quantitative assessment. Specifically, the fused images generated by our proposed method preserve the spectral colors better and include more structural details than the images fused by the other competing methods. In the future, we will mainly focus on improving the performance of the TA-cGAN by including more cases of PET and MRI images, and on extending TA-cGAN to address the general problem of multi-modality medical image fusion.