Assessing the Ability of Generative Adversarial Networks to Learn Canonical Medical Image Statistics

In recent years, generative adversarial networks (GANs) have gained tremendous popularity for potential applications in medical imaging, such as medical image synthesis, restoration, reconstruction, translation, as well as objective image quality assessment. Despite the impressive progress in generating high-resolution, perceptually realistic images, it is not clear if modern GANs reliably learn the statistics that are meaningful to a downstream medical imaging application. In this work, the ability of a state-of-the-art GAN to learn the statistics of canonical stochastic image models (SIMs) that are relevant to objective assessment of image quality is investigated. It is shown that although the employed GAN successfully learned several basic first- and second-order statistics of the specific medical SIMs under consideration and generated images with high perceptual quality, it failed to correctly learn several per-image statistics pertinent to these SIMs, highlighting the urgent need to assess medical image GANs in terms of objective measures of image quality.


I. INTRODUCTION
When developing improved medical imaging technologies, such as methods for image reconstruction, restoration, and analysis, it is crucial to objectively evaluate them via a diagnostic clinical task [1]-[4]. Because a full-fledged clinical trial of rapidly developing imaging technologies often is infeasible [5], [6], computer simulation studies [7] have been proposed as an alternative. In order to refine and assess any medical imaging technology via computer simulation, the nature and variability of the objects to be imaged must be accurately characterized. To this end, a variety of stochastic object models (SOMs) have been developed [5], [7]; these enable simulation of random, and sufficiently realistic, digital medical objects.
A generative model is a statistical model of an unknown data distribution that enables sampling from the data distribution via a learned representation of it. The model is trained directly on a large sample drawn from the data distribution [8]. Modern generative models learn a neural network-based mapping from a tractable distribution, such as a multivariate, independent, and identically distributed (i.i.d.) Gaussian distribution, to the intractable, high-dimensional object distribution of interest. This enables sampling from the unknown distribution, and provides the ability to perform inference. Therefore, generative models such as generative adversarial networks (GANs) are being actively investigated for applications in medical imaging, such as image restoration [9], [10], image reconstruction [11]-[14], image analysis [15], [16], image-to-image translation [17], data sharing [18] and objective image quality assessment [19].
Modern generative models, such as the StyleGAN and its successors [20]-[22], represent a tremendous improvement in terms of the stability, controllability, diversity, and visual quality of generated images. However, state-of-the-art GANs trained on medical image datasets have been shown to produce images that look realistic, but nevertheless contain medically impactful errors [18], [23], [24]. Therefore, in order for GANs to be safely used in medical imaging, they must first be objectively evaluated [25], for instance, with the help of a relevant diagnostic task.
Despite tremendous improvements in the quality of the images generated by a GAN, the question of whether or not a GAN correctly approximates the statistical features important to a medical imaging application remains largely unanswered. Although mathematical summaries, such as the Wasserstein metric [26] and negative log-likelihood [27], are correlated with the fidelity of the trained GAN, there is no guarantee that a favorable value achieved by these measures also translates to usefulness of the trained GAN for medical imaging applications. Although perceptual measures such as the Fréchet Inception distance (FID) have grown to be immensely popular, they are agnostic to the downstream task a medical image GAN may be used for [28]. Furthermore, the above-mentioned measures are ensemble measures. It has been shown that individual samples drawn from the GAN may contain impactful errors despite giving satisfactory ensemble measures [29]. Lastly, medical image distributions typically consist of multiple classes or modes, and it has been shown that GANs may produce critical errors while producing images from a mode that is rarely seen during training [18].
The objective of this study is to assess the ability of a state-of-the-art GAN to learn the statistics of a canonical stochastic image model (SIM) that are relevant to the objective assessment of image quality (IQ), and to study how the performance assessment of GANs by task-agnostic measures such as the FID score compares with the performance assessed by the medically meaningful measures identified for the canonical SIM under consideration. To this end, three canonical SIMs were identified, namely the modified clustered lumpy background model [30], the B-mode ultrasound speckle model [31] and the stylized two-dimensional (2D) VICTRE (S2V) model [32]. A state-of-the-art GAN architecture, namely StyleGAN2, was trained on images generated from these canonical SIMs. Statistical quantities that are meaningful and relevant to the above SIMs were computed from images from the canonical SIM as well as the images generated by the GAN. Summary measures computed from these identified statistical quantities were compared against the FID for the purpose of assessing the fidelity of the trained GAN. This work is an extension of a preliminary study conducted using an angiographic SIM [33].
The remainder of this paper is organized as follows. Section II describes the relevant background on GANs and their evaluation, as well as the background on the SIMs used in this study. Section III describes the setup for the specific numerical studies and the identification of the SIM-pertinent evaluation measures. Section IV presents the results. A summary of the salient findings of this work is presented in Section V.

II. BACKGROUND
A. Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are a popular class of generative models that aim to approximate a data distribution by learning to map a sample $z \in \mathbb{R}^k$ from a lower-dimensional, tractable distribution $p_z$, such as the i.i.d. standard normal distribution, to a sample $f$ from the high-dimensional data distribution $p_f$. In GANs, two networks, namely a generator network $G : \mathbb{R}^k \to \mathbb{R}^n$ with parameters $\Theta_G$ and a discriminator network $D : \mathbb{R}^n \to \mathbb{R}$ with parameters $\Theta_D$, are jointly trained by approximately solving the following min-max optimization problem:
$$\min_{\Theta_G} \max_{\Theta_D} \; \mathbb{E}_{f \sim p_f}\big[\ell(D(f))\big] + \mathbb{E}_{z \sim p_z}\big[\ell(1 - D(G(z)))\big], \tag{1}$$
where $\ell(\cdot)$ is a utility function used to define the objective; a popular choice is $\ell(x) = \log(x)$ [34].
The promise of a generative model such as a GAN comes from the fact that, once trained, samples from the otherwise inaccessible high-dimensional distribution $p_f$ can be obtained by sampling low-dimensional vectors, known as latent vectors $z$, from $p_z$ and computing $G(z)$. Thus, the GAN provides a tractable representation of $p_f$ that may find use in downstream applications in imaging science, such as image reconstruction [11], [14] and image quality assessment [19].

B. Advanced GAN training strategies
Under prescribed theoretical conditions, minimizing the GAN training loss described in Eq. (1) is equivalent to minimizing the empirical Jensen-Shannon (JS) divergence between the true and the estimated probability distribution functions (PDFs) of the data [34]. However, in practice, GAN training is known to be unstable [35], [36] and several strategies have been proposed to improve stability. For example, the use of different learning rates and update frequencies for the generator and discriminator weights aids in avoiding the vanishing gradients problem for the generator and premature overfitting of the discriminator [34], [37]. Novel loss functions, such as in so-called Wasserstein GANs [26], also help in improving the training stability. Karras et al. [38] proposed a strategy for scaling GANs by use of progressive training, where both the generator and discriminator are trained on lower resolution images and are progressively grown to enable training on higher and higher resolution images. StyleGAN and its successor, StyleGAN2, introduce blocks of transformed latent vectors as inputs to different layers of the network at different resolutions, thus controlling features at different scales [20], [21]. Although these improvements to the GAN architecture and training have cumulatively led to state-of-the-art performance in terms of diversity, controllability and realism of images generated, they are largely heuristic, and are not designed specifically to learn task-pertinent statistics of medical image distributions.

C. Evaluation of generative adversarial networks
Modern GANs, such as the StyleGAN2 [21], have shown impressive performance in terms of the perceptual quality of the generated images, invertibility, and meaningful control over image semantics. However, evaluating the quality of the distribution learned by a generative model is an open problem [39]. Some measures directly estimate analytical quantities and distance metrics related to the image probability density function (PDF), such as the negative log-likelihood [27] or the Wasserstein metric [26]. Other measures, such as the perceptual path-length [20], analyze the nature of the manifold learned by the GAN. Motivated by subjective perceptual assessment by humans [40], perceptual evaluation measures such as the Inception score (IS) and, more commonly, the Fréchet Inception distance (FID) score have become immensely popular [28], [40]. In order to compute these scores, image features are first extracted using a pre-trained Inception network [41] and distance metrics on the extracted features are computed. Although the FID score has shown excellent agreement with subjective visual assessments by humans [28], it is agnostic to the downstream task a medical image GAN may be used for. Additionally, it is an ensemble statistic, and hence could be blind to specific errors in high-order statistics of individual images [29].
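For reference, the FID is conventionally written as the Fréchet distance between two Gaussians fitted to the Inception features of the two image ensembles [28]. The symbols $\mu_r, \Sigma_r$ (feature mean and covariance of the reference ensemble) and $\mu_g, \Sigma_g$ (those of the generated ensemble) are introduced here for illustration and do not appear elsewhere in this paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Because only the first two moments of the feature ensemble enter this expression, two ensembles can agree in FID while differing in per-image statistics.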
The studies described below seek to assess the ability of medical image GANs to reproduce image statistics that are meaningful and pertinent to the medical stochastic image model under consideration, and to see how well traditional measures such as the FID correlate with these pertinent statistics. In order to do so, the data distributions used to train the GAN need to be carefully chosen as follows. First, realistic canonical SIMs that are associated with a mathematical procedure for generating images need to be identified, because this allows for direct control over image properties of interest. For these canonical SIMs, statistical quantities that are medically meaningful for the particular canonical SIM need to be identified. These tasks are described next.

D. Canonical stochastic image models
Stochastic models of simulated medical images have been developed in order to approximately capture the variability in medical image distributions [1], [5], [42]. Traditionally, such stochastic image models (SIMs) have been established by developing a mathematical procedure for generating images that possess certain prescribed statistical properties. Examples of such SIMs include the lumpy background model [42], the clustered lumpy background (CLB) model [43], and the B-mode ultrasound speckle model [31], among others. Once a SIM is established, it can be used to model image statistics in virtual imaging trials [7]. Here, the three canonical SIMs identified for the purpose of evaluating a GAN-based SIM are briefly reviewed. These SIMs are the modified clustered lumpy background model [30], the B-mode ultrasound speckle model [31] and the stylized 2D VICTRE breast phantom model [32]. As compared to real medical images, simulated images from these SIMs provide the ability to examine the behavior of the GAN under a controlled setting, with several different parameter configurations of the canonical SIM.
1) Modified clustered lumpy background (CLB) model: The CLB model was developed by Bochud et al. [43] for generating random backgrounds that resemble the image textures seen in mammography. In 2008, Castella et al. proposed variations to the original CLB model so that the images from the model better resemble realistic mammographic textures as judged by human experts [30]. In addition to introducing oriented structures and long-range correlations, the authors proposed to adjust the parameters of the CLB model in order to improve the realism of the images generated. This was done by computing 17 different texture features on both the real mammographic regions of interest (ROIs) and images generated from the CLB model. These were used to formulate a loss function that was minimized by tuning the parameters of the CLB model.
2) B-Mode Ultrasound Speckle (USS): B-mode ultrasound speckle (USS) can be viewed as a random phasor sum of complex signals [31]. The received complex signal $E$ is a radio frequency voltage output from an ultrasound transducer and can be modeled as the sum of $N$ complex signals whose phases are statistically independent and uniformly distributed on $[0, 2\pi]$ [31]. The quantity $N$ is the number of scatterers per resolution cell or, equivalently, the scatterer number density (SND) times the resolution cell size. The resolution cell size is defined as the axial resolution (AR) times the lateral resolution (LR). Expressions for the AR and LR are given in Ref. [44] in terms of the following parameters: the frequency of the carrier $f_c$, the wave velocity $v$, the ratio between the focal distance and the length of the aperture (called the f-number), and the number of cycles $N_c$ within the full width at half maximum (FWHM) in the spatial direction. The USS SIM is modeled using the method proposed in Ref. [31], where the standard deviations of the 2D Gaussian point spread function (PSF) are determined by the AR and LR.
If $N$ is large, the resulting USS follows Gaussian statistics and is called fully developed speckle. In this case, the envelope $|E|$ follows a Rayleigh distribution and thus the intensity $I = |E|^2$ follows an exponential distribution. If $N$ is small, then the resulting USS is called non-Gaussian speckle and its statistical properties are determined by $N$ [31].
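The dependence of the speckle statistics on $N$ can be illustrated with a small Monte Carlo sketch of the random phasor sum. It is not the simulation procedure of Ref. [31]: it assumes a fixed $N$, unit-amplitude scatterers and no point spread function, and the function names are hypothetical. For an exponentially distributed intensity the mean equals the standard deviation, so the empirical ratio of the two should approach 1 as $N$ grows.

```python
import cmath
import math
import random
import statistics


def speckle_intensity(n_scatterers, rng):
    """Intensity I = |E|^2 of a sum of unit-amplitude phasors with i.i.d. uniform phases."""
    e = sum(
        cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(n_scatterers)
    )
    return abs(e) ** 2


def intensity_snr(n_scatterers, n_samples=5000, seed=0):
    """Empirical mean(I) / std(I) over independent resolution cells (assumed setup)."""
    rng = random.Random(seed)
    samples = [speckle_intensity(n_scatterers, rng) for _ in range(n_samples)]
    return statistics.fmean(samples) / statistics.pstdev(samples)


if __name__ == "__main__":
    for n in (1 + 1, 5, 50):
        print(n, round(intensity_snr(n), 3))
```

Under these simplifying assumptions, small values of $N$ give a ratio visibly different from 1, which is the non-Gaussian regime described above.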
3) The stylized 2D VICTRE (S2V) breast phantom model: The US Food and Drug Administration's (FDA) Virtual Imaging Clinical Trials for Regulatory Evaluation (VICTRE) initiative has produced a set of software tools for simulating random anthropomorphic phantoms of the human female breast [32]. These numerical breast phantoms (NBPs) are three-dimensional (3D) voxelized maps, where a voxel value denotes the tissue type from one out of the following 10 tissues: fat, glandular tissue, skin, artery, vein, muscle, ligament, nipple and terminal duct lobular unit. Controlling the patient-specific input parameters such as breast type, size, shape, granularity and density, and setting the random seed number enables the generation of large ensembles of stochastic NBPs with realistic variation in breast anatomy, shape and fat-to-glandular tissue ratio. The VICTRE model is thus a general stochastic object model (SOM) that can be specialized to different imaging modalities by assigning the appropriate physical coefficients. In particular, by assigning X-ray linear attenuation coefficients to the various tissues in the NBPs and extracting 2D slices from the 3D phantom, a SIM can be obtained. The VICTRE software creates NBPs that correspond to four breast types identified by the American College of Radiology's (ACR) Breast Imaging Reporting and Data System (BI-RADS) [45] and are distinguished by the amounts of fat and glandular tissue.

III. NUMERICAL STUDIES
A. SIM training data and GAN training
1) The CLB model: The following four parameter configurations of the modified CLB model that were shown to produce realistic simulated mammographic images under radiologists' assessment [30] were used in this study: (1) doubiso, a double-layered CLB model with isotropically oriented clusters, (2) simpiso, a single-layered CLB model with isotropically oriented clusters, (3) doubori, a double-layered CLB model with anisotropically oriented clusters, and (4) simpori, a single-layered CLB model with anisotropically oriented clusters. Additionally, images from the original CLB model opex99, proposed by Bochud et al. [43], were employed. The gray levels and pixel value range were set in accordance with Castella et al. [30]. For each of the five canonical SIMs, a GAN was trained on a dataset of 100,000 256×256 images from the SIM.
As discussed in the Introduction, medical image distributions are typically mixed distributions consisting of multiple classes or modes. In order to illustrate the effect of mixing distributions on the identified SIM-pertinent measures, a stylized emulation of data coming from two different imaging sites or clinical systems having different resolution properties was constructed. Accordingly, one of the classes consisted of doubiso images as described above. The other class consisted of doubiso images that were first degraded by use of a Gaussian blur followed by a low-pass filter $H_{\mathrm{LPF}}(\cdot)$ with cutoff at half the image bandwidth. Two such multi-class datasets were constructed, one having a 50%/50% split and the other having a 95%/5% split between the regular and degraded image classes. These two datasets will henceforth be referred to as the doubiso 50-50 and doubiso 95-5 datasets, respectively.
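The construction of the degraded class and the mixed datasets can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the blur width `sigma_px` is hypothetical (it is not given in the text), the low-pass filter is taken to be an ideal separable filter with cutoff at half the Nyquist frequency, and both filters are applied in the Fourier domain with circular boundary conditions.

```python
import numpy as np


def degrade(image, sigma_px=1.0):
    """Gaussian blur followed by an ideal low-pass filter at half the bandwidth.

    sigma_px is a hypothetical blur width; the cutoff convention is an assumption.
    """
    fy = np.fft.fftfreq(image.shape[0])[:, None]  # cycles/pixel in [-0.5, 0.5)
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    # Fourier transform of a unit-area Gaussian kernel with std sigma_px.
    gauss = np.exp(-2.0 * (np.pi * sigma_px) ** 2 * (fx**2 + fy**2))
    # Keep only frequencies up to half of the Nyquist frequency (0.5).
    lowpass = (np.abs(fx) <= 0.25) & (np.abs(fy) <= 0.25)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * gauss * lowpass))


def make_mixed_dataset(images, degraded_fraction, seed=0):
    """Degrade a random subset of `images` (N x H x W); return images and labels."""
    rng = np.random.default_rng(seed)
    n = images.shape[0]
    n_degraded = int(round(degraded_fraction * n))
    labels = np.zeros(n, dtype=int)
    labels[rng.choice(n, size=n_degraded, replace=False)] = 1  # 1 = degraded class
    mixed = np.array(
        [degrade(im) if lab else im for im, lab in zip(images, labels)],
        dtype=float,
    )
    return mixed, labels
```

With `degraded_fraction=0.5` or `degraded_fraction=0.05`, this mimics the 50%/50% and 95%/5% splits described above.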
2) B-mode Ultrasound Speckle Model: The parameter configurations chosen for the USS SIMs are as follows. All images were 256 × 256 pixels in size, with each pixel corresponding to a 100 µm × 100 µm square. The velocity of the wave was set to $v = 1556$ m/s, the frequency $f_c$ was set to 3.5 MHz, the number of cycles within the FWHM was set to $N_c = 2$, the f-number for the $y$ direction was set to 2 and the f-number in the $z$ direction was set to 3. The ultrasound wave was assumed to be propagating in the $x$ direction. The SND parameter was varied to create four canonical USS SIM datasets, corresponding to SND values of 1, 2, 3 and 30 mm$^{-3}$, respectively. The first three values were chosen because they fall in the range of SND values that can be accurately estimated from the image [31], which is not the case for the SND-30 SIM that represents a fully developed speckle [44]. These four SIMs will henceforth be called (1) SND-1, (2) SND-2, (3) SND-3 and (4) SND-30, respectively.
For each of the above-described SIMs, a GAN was trained using 100,000 images from the SIM. Before training, each ensemble of training images was converted to an unsigned, 8-bit grayscale where 255 corresponds to the top 1% pixel value in the ensemble.
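One plausible reading of this conversion is sketched below. It assumes that "the top 1% pixel value" means the 99th percentile of all pixel values in the ensemble, that pixel values are non-negative, and that values above the ceiling are clipped; the function name `to_uint8` is hypothetical.

```python
import numpy as np


def to_uint8(ensemble, top_percent=1.0):
    """Rescale so the (100 - top_percent)th percentile maps to 255, then clip."""
    ensemble = np.asarray(ensemble, dtype=float)
    # Assumed meaning of "top 1% pixel value": the 99th percentile of the ensemble.
    ceiling = np.percentile(ensemble, 100.0 - top_percent)
    scaled = 255.0 * ensemble / ceiling
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
```

Computing the percentile over the whole ensemble, rather than per image, keeps the relative brightness between images intact.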
3) The S2V model: The S2V was obtained from the 3D VICTRE NBP SOM described in Section II as follows. First, a dataset of 1000 3D NBPs was generated using the VICTRE tool [32]. Next, linear attenuation coefficients in cm$^{-1}$ for X-rays of energy 30 keV were assigned to the pixels corresponding to each of the tissue types. These values were either directly obtained from literature, or calculated using the mass attenuation coefficient and material density values obtained from literature [46]-[48]. Coronal slices were extracted from a central region of an NBP that ranges from 40% through 70% of the distance from the outermost coronal plane to the innermost coronal plane. This was done to avoid extracting slices too close to the chest wall or the nipple. A spacing of 50 pixels was maintained between two slices consecutively extracted from the same NBP. The extracted slices were then downsampled to an image dimension of 512×512, which corresponds to the length scale of 0.4 µm per pixel. The described procedure generated a 2D dataset of 130,000 slices, which was used for training a GAN.
StyleGAN2, proposed by Karras et al. [21], was employed as the GAN in all the studies described in this work. All the default parameters and configurations of the StyleGAN2 architecture, including the latent space dimensionality, were kept the same as in the original code base, except for the number of channels in the output image, which was set to 1. The networks were trained using TensorFlow 1.14/Python [49] on an Intel Xeon Gold 5218 CPU and two Nvidia Quadro RTX 8000 GPUs.

B. Identification and computation of SIM-pertinent evaluation measures
A GAN may learn different types of image statistics to different levels of correctness. Hence, it is important to evaluate GANs using measures based on those statistics that are meaningful and pertinent to the SIM considered. In this study, such SIM-pertinent evaluation measures are based on statistics that either have been deemed important for assessing the realism of the canonical SIM images by human experts, or are known to be related to biomarkers important for a particular diagnostic task. These statistics are computed from both the "direct-simulated" images, i.e., images directly simulated from the canonical SIM, and the GAN-generated images.
1) The CLB model: The 17 different texture features identified by Castella et al., mentioned in Section II, have been demonstrated to be useful for improving the medical realism of CLB images under objective and psychophysical experiments involving the judgement of radiologists [30]. Therefore, these statistics were chosen as the statistics meaningful for assessing a GAN trained on the CLB SIMs. These texture features include those derived from the per-image gray-level intensity distribution, gray-level co-occurrence matrices (GLCMs) [50], primitives matrices (GLRMs), and the neighborhood gray tone difference matrix (NGTDM) [51].
For each of the five CLB model types in Section IIIA, as well as the two multi-class CLB SIMs, the following 17 texture features described by Castella et al. [30] were computed from each image of the evaluation datasets. Mean, standard deviation, skewness and kurtosis were derived from the per-image gray-level intensity distribution. The texture features energy, entropy, maximum, contrast and homogeneity were computed from the GLCMs. Four features were derived from the primitives matrices (GLRMs), namely the short primitive emphasis (SPE), long primitive emphasis (LPE), gray level uniformity (GLU), and primitive length uniformity (PLU). The four features derived from the NGTDM [51] were coarseness, contrast, complexity and strength. Various parameter values required for the computation of the texture features, such as the number of gray levels, two-point distances and angles, were fixed to the values used in Castella et al. [30]. The resulting feature data were then used for further analysis in order to summarize trends. Two types of analyses were conducted on the feature data. The first computed an empirical estimate of the JS divergence between the joint texture-feature distributions by utilizing the feature data [52]. The second plotted the joint empirical PDF over the first two principal components of texture features. The texture features used for this computation were selected as follows. First, principal component analysis (PCA) was conducted for each of the three spatial texture feature families, namely the GLCM, GLRM and NGTDM feature families. Next, the first two principal components were selected, and an empirical PDF over these two components was computed. The empirical PDFs that give the highest discrepancy between the direct-simulated and GAN-generated distributions in terms of the total variation (TV) distance were plotted.
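The second analysis can be sketched as below for a single feature family. Several details are assumptions, since the text does not specify them: the principal axes are fitted on the standardized direct-simulated features only, the empirical PDFs are 2D histograms on a shared grid, and the number of bins is arbitrary. The function names are hypothetical, and features with zero variance would need to be removed before standardization.

```python
import numpy as np


def first_two_pcs(features_ref, features_test):
    """Project both feature sets (N x d) onto the first two principal axes of the reference."""
    mean = features_ref.mean(axis=0)
    scale = features_ref.std(axis=0)
    z_ref = (features_ref - mean) / scale
    z_test = (features_test - mean) / scale
    # Rows of vt are the principal axes of the standardized reference features.
    _, _, vt = np.linalg.svd(z_ref, full_matrices=False)
    return z_ref @ vt[:2].T, z_test @ vt[:2].T


def tv_distance_2d(a, b, bins=32):
    """Total variation distance between 2D histograms of two point sets."""
    both = np.vstack([a, b])
    edges = [
        np.linspace(both[:, k].min(), both[:, k].max(), bins + 1) for k in range(2)
    ]
    p, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=edges)
    q, _, _ = np.histogram2d(b[:, 0], b[:, 1], bins=edges)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()
```

Applying `tv_distance_2d` to the projections from each of the three feature families and keeping the family with the largest value mirrors the selection rule described above.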
2) B-mode Ultrasound Speckle Model: Previous studies have shown that the intensity signal-to-noise ratio (SNR) of USS images is associated with the envelope statistics [53]. In regions of the body such as the liver and the breast, the envelope statistics have previously been successfully used for tissue characterization [53]. Therefore, the SNR was chosen as the SIM-pertinent statistic, though this preliminary study does not associate a given speckle model with a tissue type.
The PDF of the $\mathrm{SNR}^2$ estimate of USS speckle can be modeled as a Gaussian distribution centered around the true $\mathrm{SNR}^2$. If the number of scatterers per resolution cell $N$ follows a Poisson distribution, then one can estimate $N$ using $\mathrm{SNR}^2$. The SNR and the estimate of $N$, denoted $\hat{N}$, are defined in terms of $\mu_I$ and $\sigma_I$, the mean and standard deviation of the intensity, with $\mathrm{SNR} = \mu_I / \sigma_I$. The SNR and $\hat{N}$ were computed on a per-image basis for both the direct-simulated and GAN-generated images using empirically estimated $\mu_I$ and $\sigma_I$ from each image in the test dataset. The JS divergence was used as a measure to summarize the discrepancy between the $\mathrm{SNR}^2$ PDFs of the direct-simulated and GAN-generated images.
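A minimal sketch of the per-image statistic and the summary measure follows. It assumes that the SNR is the ratio of the per-image intensity mean to the intensity standard deviation, that the input array already holds intensity values, and that the JS divergence is estimated from histograms on a shared grid; the estimator and bin count actually used are not stated in the text, and the function names are hypothetical.

```python
import numpy as np


def snr_squared(intensity_image):
    """Per-image SNR^2 = (mu_I / sigma_I)^2 from the empirical mean and std."""
    mu = intensity_image.mean()
    sigma = intensity_image.std()
    return (mu / sigma) ** 2


def js_divergence(samples_p, samples_q, bins=50):
    """Histogram-based Jensen-Shannon divergence in bits (value lies in [0, 1])."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute zero; b > 0 wherever a > 0 since b is the mixture.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, `snr_squared` would be evaluated on every image in the direct-simulated and GAN-generated test sets, and the two resulting arrays passed to `js_divergence`.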
3) The S2V model: Human female breasts can be categorized into four different types based on the relative amount of fat and glandular tissue [45]. It is known that the amount of fat compared to the glandular tissue is an important factor impacting the risk of developing breast cancer, and the effectiveness of screening tests such as mammography in detecting breast masses [45], [54], [55]. Fat and glandular tissue have different linear attenuation coefficients [46]-[48]. Therefore, the ratio of fat-to-glandular tissue was chosen as the SIM-pertinent measure for evaluating the GAN trained on the S2V SIM. For the idealized S2V SIM described in Section IIIA, the ratio $\rho_{F:G}$ of the amount of fat-to-glandular tissue in a thin coronal slice of an NBP can be computed by first calculating the fractions $F$ and $G$ of the total image pixels having linear attenuation coefficient values close to that of fat and glandular tissue, respectively, and then computing their ratio $\rho_{F:G} = F/G$. Because the linear attenuation coefficient values of fat and glandular tissue are sufficiently far apart that a simple thresholding-based segmentation scheme is not confounded, the values of $F$, $G$ and $\rho_{F:G}$ can be estimated accurately both for the direct-simulated and GAN-generated images. Using this procedure, $\rho_{F:G}$ was estimated on a per-image basis for both the direct-simulated and GAN-generated images. The empirical PDFs of $\log \rho_{F:G}$ computed from both the direct-simulated and GAN-generated images were plotted, and the JS divergence between the two PDFs was computed.
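The thresholding step can be sketched as follows. The tolerance `tol` and the function name are hypothetical, and the attenuation coefficients are passed in as arguments because the specific values used are not listed in the text.

```python
import numpy as np


def fat_to_glandular_ratio(image, mu_fat, mu_glandular, tol):
    """Estimate rho_{F:G} = F / G by thresholding around each tissue's coefficient.

    mu_fat, mu_glandular: linear attenuation coefficients assigned to each tissue.
    tol: hypothetical half-width of the acceptance band around each coefficient.
    """
    n_pixels = image.size
    # F and G: fractions of all pixels whose value is close to each coefficient.
    fat_fraction = np.count_nonzero(np.abs(image - mu_fat) < tol) / n_pixels
    glandular_fraction = (
        np.count_nonzero(np.abs(image - mu_glandular) < tol) / n_pixels
    )
    return fat_fraction / glandular_fraction
```

A slice containing no glandular pixels gives a division by zero, so such slices would need separate handling before taking the logarithm of the ratio.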
Apart from the above-described SIM-pertinent measures, basic ensemble statistics, such as the histogram of gray level values and the empirical image autocorrelation, were computed from direct-simulated and GAN-generated images from all the SIMs in order to assess the ability of the GAN to learn these statistics accurately. As described in Bochud et al. [43], a Papoulis window was used in order to overcome boundary artifacts in the computation of the autocorrelation. The FID score between a direct-simulated and a GAN-generated test dataset, as well as between two i.i.d. direct-simulated datasets, was computed. The latter serves as a heuristic noise floor for the FID score for the particular SIM. A pre-trained InceptionV3 network [41] was employed for this purpose. All the evaluation measures were computed using 10,000 direct-simulated and GAN-generated images. Other test dataset sizes were examined, and the computed metrics were found to be qualitatively no different.

IV. RESULTS
This section is organized as follows. Section IVA qualitatively describes the images generated by the GAN. Section IVB describes the basic ensemble statistics learned by the GAN, such as the intensity histogram and the image autocorrelation. Section IVC describes and compares the FID score and the identified meaningful measures based on their ability to assess the fidelity of the trained GAN. Finally, Section IVD compares the ability of the FID score and the identified measures to assess multi-modal SIMs.
A. Qualitative assessment of images generated by the GAN
Figures 1 and 2 show the images generated by the trained GANs alongside direct-simulated images from the training dataset for the single-class CLB, USS and S2V models. It can be seen that there is obvious visual similarity between the direct-simulated and the GAN-generated images. Note that this is even true for the zoomed-in images of the S2V model shown in Fig. 2. One important thing to note, however, is that some of the ligaments in the GAN-generated images appear broken at certain locations, which is not the case for the direct-simulated images.

Fig. 5: FID and empirical feature JS divergence measures between the real and GAN-generated distributions for opex99, simpiso, and doubiso models. The dotted lines represent the value of the measures between two direct-simulated datasets.

Fig. 6: FID and empirical feature JS divergence between the real and GAN-generated distributions for opex99, doubiso, and the two doubiso-mixed models. The dotted lines represent the value of the measures between two direct-simulated datasets.

B. Basic ensemble statistics learned by GANs
Figure 3 shows the ensemble empirical PDF of pixel gray levels for the CLB doubiso SIM, the USS SND-1 and the SND-30 SIMs, and the S2V SIM, computed from both the direct-simulated and GAN-generated images. A close match between these empirical PDFs indicates that the GAN is able to reproduce first-order statistics. The GAN performs similarly for the other CLB SIMs, which have gray-level distributions similar to the ones shown in Fig. 3a. It can be seen that for the USS SND-30 SIM, which represents a fully developed speckle, the GAN reliably reproduces the expected Rayleigh distribution of grayscale values. For the USS SND-1 SIM, this distribution is far from Rayleigh both for the direct-simulated and GAN-generated images, yet the GAN recovers this distribution successfully. The pixel-value distributions corresponding to the USS SND-2 and SND-3 SIMs appear intermediate between the ones shown in Fig. 3b and c. Fig. 4 shows the radial profile of the image autocorrelation computed using the direct-simulated and GAN-generated images for the CLB doubiso, USS SND-1 and S2V SIMs. It can be seen that the GAN was successful in recovering this particular second-order statistic. Similar results were obtained for the other CLB and USS SIMs considered.
C. SIM-pertinent measures learned by GANs
1) CLB Model: Figure 5 shows the FID as well as the texture feature JS divergence between the direct-simulated and GAN-generated distributions as a function of training iteration. In Fig. 6, the FID scores and the feature JS divergences for the doubiso mixed 50-50 and doubiso mixed 95-5 datasets are shown along with those for the single-class doubiso and opex99 models. It can be seen that as the training progressed, both the FID and the empirical feature JS divergence converged for most of the SIMs considered. However, in some cases, these measures either diverged or varied erratically as the training progressed. Furthermore, the high value of the feature JS divergence for the GAN trained on the doubiso mixed 50-50 model suggests that the GAN was not able to reproduce the meaningful feature statistics as well as the GAN trained on the single-class dataset. On the other hand, the FID plot in Fig. 6 shows comparable FID scores for the various SIMs and does not predict the same trend as the feature JS divergence plots. This suggests that, for this specific example, the FID score could be blind to whether multiple modes in the distribution are learned correctly.
These findings were further investigated using the principal components of the data from the texture feature family that was learnt the least accurately by the GAN. The procedure for computing these components was described earlier in Section IIIB. Figure 7 plots this joint empirical PDF for the direct-simulated and GAN-generated images. Note that these texture features are computed on a per-image basis. For most of the CLB SIMs, obvious dissimilarities between the original and learned distributions can be seen. These dissimilarities correlate well with the feature JS divergence values shown in Figures 5 and 6, but not with the corresponding FID values. For the doubiso mixed 50-50 SIM, it can be seen that the GAN failed to correctly learn the distribution of principal NGTDM and GLRM texture components for one of the classes. On further investigation and comparison with the individual texture distributions for the doubiso SIM, it was revealed that the GAN failed to learn the per-image NGTDM coarseness and complexity distributions of the images from the degraded class, as shown in Fig. 8. This was despite the GAN being able to learn ensemble measures such as the FID and basic first- and second-order statistics well.

TABLE I: [...] between the Gaussian fit and their respective $\mathrm{SNR}^2$ distributions and the mean scatterers per resolution cell estimate $\hat{N}$ of both direct-simulated (D.S.) and GAN-generated (G.G.) images.
2) B-Mode Ultrasound Speckle Model: The empirical JS divergence between the estimated SNR² PDFs computed from the direct-simulated and GAN-generated USS images (henceforth referred to as the SNR²-JS divergence) is shown in Fig. 9 alongside the FID score computed between the direct-simulated and the GAN-generated images. Although the SNR²-JS divergence approaches the noise floor and converges for most SIMs, it behaves erratically for a few SIMs, even as the FID score for the corresponding SIM converges. In Fig. 10, the estimated SNR² PDFs are plotted for both direct-simulated and GAN-generated USS images. As can be seen, the GAN-generated images tend to give SNR² distributions that somewhat match those of the direct-simulated images for the SND-1, SND-2 and SND-3 SIMs. Since the SNR² is theoretically expected to be distributed as a Gaussian for these SIMs [56], each distribution of direct-simulated and GAN-generated images was fit to a Gaussian. In Table I, the mean and standard deviation of the best-fit Gaussian distribution are shown in the first two rows, while the third row shows the mean squared error (MSE) between a given SNR² distribution and its Gaussian fit. The results for the mean and standard deviation of the Gaussian fits confirm our visual inspection. The mean values were near-perfect matches, as were the standard deviations with the exception of SND-30. However, the MSE between the GAN-generated empirical SNR² PDFs and their Gaussian fits was larger than the MSE between the direct-simulated empirical SNR² PDFs and their Gaussian fits. Finally, the mean estimate of scatterers per resolution cell N computed from GAN-generated images was close to that computed from the direct-simulated images for all USS SIMs except for SND-30. This is expected since the SNR² distributions do not match well for the SND-30 SIM.
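The Gaussian-fit comparison summarized in Table I can be illustrated with the sketch below. It assumes that the per-image SNR² values are already available, that the Gaussian is fitted by matching the sample mean and standard deviation, and that the MSE is evaluated at histogram bin centers; the fitting procedure and binning actually used are not specified in this section, so these choices and the function name are assumptions.

```python
import numpy as np

def gaussian_fit_mse(snr2, bins=50):
    """Fit a Gaussian to per-image SNR^2 values and measure the misfit.

    Returns (mu, sigma, mse), where mse is the mean squared error between
    the histogram density estimate and the fitted Gaussian PDF.
    """
    snr2 = np.asarray(snr2, dtype=float)
    mu = snr2.mean()
    sigma = snr2.std(ddof=1)
    density, edges = np.histogram(snr2, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    fit = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    mse = float(np.mean((density - fit) ** 2))
    return float(mu), float(sigma), mse

# Synthetic stand-ins (not data from the study): one Gaussian-like sample and
# one sample whose shape departs from a Gaussian.
rng = np.random.default_rng(0)
gaussian_like = rng.normal(1.9, 0.1, 20000)
two_modes = np.concatenate([rng.normal(1.5, 0.05, 10000),
                            rng.normal(2.3, 0.05, 10000)])

mu_g, sigma_g, mse_g = gaussian_fit_mse(gaussian_like)
mu_t, sigma_t, mse_t = gaussian_fit_mse(two_modes)
```

The fitted mean and standard deviation alone cannot reveal a departure from the Gaussian shape, which is why the misfit is reported as a separate quantity.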
In Fig. 9, the FID scores and the SNR²-JS divergences are also shown for the USS Mixed 50-50 and USS Mixed 95-5 SIMs. As the training progresses, both measures appear to converge in a fashion similar to that of most of the single-class SIMs. Interestingly, the USS Mixed 95-5 SIM has one of the higher FID scores while also having the lowest SNR²-JS divergence over training. This could be because even if the SNR² distribution over the class having 5% prevalence was not learnt well, it may not significantly impact the JS divergence [57]. Finally, it can be seen that the GAN struggles to properly reproduce the direct-simulated SNR² distributions of these mixed SIMs. In the case of USS Mixed 50-50, the SNR² distributions of the two classes have greater variance for the GAN-generated images. This results in the GAN producing more images having a value of SNR² intermediate between the two classes. For the USS Mixed 95-5 SIM, the GAN was not able to reproduce the mode corresponding to the class having 5% prevalence in the dataset, as seen in Fig. 10.

Fig. 11: FID and empirical ratio-JS divergence between real and GAN-generated distributions for the S2V dataset. The dotted line represents the value of the measures between two direct-simulated datasets.
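The remark that a poorly learned 5% class may barely change the JS divergence can be checked with a small idealized calculation. The sketch below assumes two classes whose SNR² values occupy non-overlapping bins, so that each distribution reduces to two class probabilities, and compares the case where the generator drops the minority class entirely for a 95-5 and a 50-50 mixture. This idealization is an assumption made for illustration and is not the analysis performed in the study.

```python
import numpy as np

def js_divergence(p, q):
    """JS divergence (in nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# True class prevalences versus a generator that produces only the first class.
dropped_5_percent = js_divergence([0.95, 0.05], [1.0, 0.0])
dropped_50_percent = js_divergence([0.50, 0.50], [1.0, 0.0])
upper_bound = np.log(2.0)  # maximum possible JS divergence in nats
```

Under these assumptions, dropping the 5% class gives roughly 0.018 nats, compared with roughly 0.216 nats for dropping one class of a 50-50 mixture and an upper bound of about 0.693 nats. A complete loss of the minority mode can therefore leave the divergence small.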
3) The S2V SIM: Figure 11 shows the empirical JS divergence between the empirical PDFs of log ρ_{F:G} computed from the direct-simulated and GAN-generated images (henceforth referred to as the ratio-JS divergence) as a function of the training iteration. This is displayed alongside the plot of FID as a function of the training iteration. It can be seen that although the FID predictably converged as the training progressed, the ratio-JS divergence was erratic and did not converge in the same way as the FID. Figure 12 shows the empirical PDFs of log ρ_{F:G} computed on a per-image basis from the direct-simulated and GAN-generated images. The direct-simulated distribution clearly shows the four different breast types based on the F:G ratio in their correct clinical prevalence. However, the GAN-generated distribution completely ignored or incorrectly represented many of the breast type modes. This was despite the GAN producing visually appealing images and performing well in terms of the FID and other basic ensemble metrics.
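A per-image fat-to-glandular log ratio of the kind described above could be computed as in the following sketch. It assumes that each image has already been reduced to a per-pixel tissue label map; how tissue labels are obtained from the images is not specified in this section, and the label values and function name below are hypothetical.

```python
import numpy as np

FAT_LABEL = 1        # hypothetical label value for fat pixels
GLANDULAR_LABEL = 2  # hypothetical label value for glandular pixels

def log_fat_glandular_ratio(label_map):
    """Return log(F/G), where F and G are the fat and glandular pixel fractions."""
    label_map = np.asarray(label_map)
    fat = np.count_nonzero(label_map == FAT_LABEL) / label_map.size
    gland = np.count_nonzero(label_map == GLANDULAR_LABEL) / label_map.size
    if fat == 0 or gland == 0:
        return float("nan")  # ratio undefined when either tissue is absent
    return float(np.log(fat / gland))

# Toy label map: 6 fat pixels, 2 glandular pixels, 4 background pixels.
toy_map = np.array([[1, 1, 1, 0],
                    [1, 1, 1, 0],
                    [2, 2, 0, 0]])
toy_ratio = log_fat_glandular_ratio(toy_map)
```

Collecting this scalar over all images in a dataset gives the per-image sample from which an empirical PDF can be estimated and compared between direct-simulated and GAN-generated data.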
V. SUMMARY
Generative adversarial networks (GANs) could potentially be employed as stochastic image models for use in several tasks in medical imaging. However, GANs have traditionally been evaluated using mathematical or perceptual measures that may not correlate with those statistics that are important with respect to a downstream task. The objective of this work was to study the ability of GANs to reproduce medical image statistics that are meaningful and pertinent to the SIM under consideration, and to see how well traditional measures such as FID correlate with these pertinent statistics.
The GANs employed consistently produced images that visually appeared realistic, and were able to accurately and consistently reproduce basic statistics such as the intensity histograms and image autocorrelation. It was also observed that although most of the evaluation measures used in this paper converged, they did not necessarily converge at the same rate, and some of them diverged as the training progressed. This indicates that the convergence of a commonly used measure such as the FID score to a low value does not guarantee the correct convergence of those statistics that are meaningful to the particular medical SIM under consideration. Since the FID score measures the Fréchet distance in the feature space of an Inception network trained on the ImageNet dataset, it is not tailored to the specific medical image distribution considered. Additionally, the GAN may learn the distribution of different features to different degrees of fidelity, resulting in different performance rankings when examined by different measures.
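For reference, the Fréchet distance underlying the FID compares Gaussian approximations of two feature sets through their means and covariances only. The sketch below computes it for arbitrary feature arrays; extracting features with an Inception network is outside the scope of the sketch, and the function name and the synthetic features are assumptions. The trace of the matrix square root is obtained from the eigenvalues of the product of the two covariance matrices.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features).
    """
    mu_a = feats_a.mean(axis=0)
    mu_b = feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative up to round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Synthetic stand-ins for network features (not data from the study).
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 4))
same = frechet_distance(features, features)
shifted = frechet_distance(features, features + 1.0)
```

Since only first and second moments of the features enter this expression, two feature sets with matching means and covariances yield a value near zero even if their distributions differ in other respects.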
We note that for all the SIMs considered in this paper, the GAN-generated images retained potentially impactful per-realization errors in some of the meaningful features identified. These errors manifested themselves in the empirical distributions of these meaningful features learned by the GAN, where, among others, critical inaccuracies such as mode-dropping and merging of multiple classes or modes were observed. This was despite the GAN producing excellent agreement with the direct-simulated distribution in terms of ensemble measures, such as the FID and basic first- and second-order statistics.
These observations point to the need for choosing evaluation measures that are meaningful and pertinent to the SIM considered, are motivated by a downstream task, and are sensitive to the important aspects of a medical image distribution, such as multiple modes. While formulating such evaluation measures requires significant effort, it opens up the possibility of evaluating GANs in terms of those statistics that influence task performance.
This study employed the StyleGAN2 architecture, since it has been shown to consistently produce realistic images when trained on a wide variety of datasets. However, the proposed analysis does not depend upon the GAN architecture employed, and could easily be performed on other GAN architectures. Canonical SIMs that produce simulated medical images provided the ability to examine the behavior of the GAN under a controlled setting with different parameter configurations. Nevertheless, evaluating GANs trained on real medical images remains a topic of future investigation. Lastly, although careful identification of meaningful evaluation measures is a key aspect of this study, it falls short of performing a task-based assessment of GANs. This will be the topic of a follow-up study.

Fig. 1: Images simulated from the canonical CLB and USS SIMs and images generated by the GANs trained on images from the SIMs.

Fig. 3: Sample empirical gray level PDFs of direct-simulated and GAN-generated images for the three types of SIMs.

Fig. 7: Empirical PDF over the first two principal components of the CLB feature data. The selected texture feature family for each of the models is shown below each plot. The blue and the orange contour plots denote the direct-simulated and GAN-generated distributions, respectively.

Fig. 10: Estimated SNR² PDFs of both direct-simulated and GAN-generated images for SND-1, SND-2, SND-3, SND-30, USS Mixed 50-50 and USS Mixed 95-5. Although the direct-simulated and GAN-generated distributions tend to match well, occasionally this is not the case, as can be seen for SND-30 and USS Mixed 50-50. Note that the USS Mixed 95-5 SNR² PDF has the density in log scale, with the red arrow pointing to the distribution of the SND-3 class having 5% prevalence.

Fig. 12: (a-b) The estimated PDF over the per-image number of pixels corresponding to fat and glandular tissue, respectively, as a fraction of the total image pixels (denoted by F and G, respectively). (c) The estimated PDF over log(F/G).

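TABLE I: The mean µ and standard deviation σ of the Gaussian fit curve for both direct-simulated and GAN-generated SNR² distributions, the mean squared error (MSE) between the Gaussian fit and their respective SNR² distributions, and the mean scatterers per resolution cell estimate N of both direct-simulated (D.S.) and GAN-generated (G.G.) images.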