Hierarchical Knee Image Synthesis Framework for Generative Adversarial Network: Data from the Osteoarthritis Initiative

Medical image synthesis is useful for addressing persistent issues faced by medical image analysis researchers when developing deep learning models, such as the lack of training data diversity and the inflexibility of traditional data augmentation. Generative adversarial networks (GANs) can generate realistic images to overcome these problems. We propose a GAN model with a hierarchical framework (HieGAN) to generate high-quality synthetic knee images as a prerequisite for effective training data augmentation in deep learning applications. During training, the proposed framework embraces an attention mechanism before the 256 × 256 scale in the generator and discriminator to capture salient information in knee images. Then, a novel pixelwise-spectral normalization configuration is implemented to stabilize the training performance of HieGAN. We evaluated the proposed HieGAN on a large-scale knee image dataset using the AM Score and Mode Score. The results showed that HieGAN outperformed all relevant state-of-the-art models. Hence, HieGAN can potentially serve as an important milestone promoting the future development of more robust deep learning models for knee image segmentation. Future work should extend the image synthesis evaluation to a clinically oriented Visual Turing Test and to synthetic data augmentation for the deep learning segmentation task.


I. INTRODUCTION
Existing supervised learning methods in medical image segmentation depend heavily on a large quantity of high-quality training data. The problem becomes apparent with the resurgence of deep learning, whose training requires huge volumes of labelled data. Building up large-scale training datasets represents a daunting task for most medical image analysis researchers given the enormous financial costs and expert labelling time involved. Meanwhile, traditional augmentation methods such as scaling, rotating, flipping and elastic deformation fail to consider the variations in size, shape, location and appearance of specific pathologies [1].
The Generative Adversarial Network (GAN) [2] is a powerful unsupervised training approach. It learns the pattern of input samples and generates new outputs based on the underlying structural information in the training data. As a result, GANs are very useful for medical image synthesis. Prospectively, synthetic medical image-based augmentation offers a solution to the lack of manually annotated data and the inflexibility of traditional augmentation. Moreover, synthetic medical images are not associated with individual patient image information, so there is no concern over data privacy regulations when sharing the data for reproducibility purposes.
Generation of high-quality synthetic medical images is a prerequisite to numerous image processing applications, including segmentation. In recent years, numerous works on synthetic image generation using GANs have appeared. Unconditional synthesis produces an image from the latent space of real images without any conditional information [3] and is often trained with more than 10,000 images [4]. For example, the Sketching-rendering Unconditional GAN (SkrGAN) has been proposed to generate several types of synthetic medical images, including retinal color fundus, chest X-ray, lung computed tomography (CT) and brain magnetic resonance (MR) images [5]. Nevertheless, GANs are notorious for unstable training performance. The Deep Convolutional GAN (DCGAN) [6] and Progressive Growing GAN (PGGAN) [7] are two architectures widely adopted in medical image synthesis owing to their better training stability. Chuquicusma et al. (2018) used DCGAN to generate benign and malignant lung nodule samples of size 56 × 56. Their Visual Turing Test results concluded that DCGAN produced highly realistic class-specific nodules, but the interobserver error was relatively high [8]. Frid-Adar et al. (2018) applied three DCGANs to generate 64 × 64 CT images for three classes of liver lesion, i.e. cysts, metastases and haemangiomas. Synthetic lesion images were found to be beneficial to the classification task, with improved sensitivity and specificity when combined with real training data [9]. Both studies concentrated on augmenting the training data size of different pathology classes, and the capability of DCGAN is limited to low-resolution medical images.
PGGAN is capable of generating synthetic medical images at resolutions up to 1024 × 1024 [7]. Beers et al. (2018) generated synthetic retinal fundus and brain MR images up to 512 × 512 to bring out subtle pathological features, at the expense of heavy computational cost and slow training speed [10]. To date, PGGAN has been generalized to various image types such as gastritis images [11], cardiac MR images [12], body CT images [13], mammograms [14] and chest X-ray images [15]. In most works, the anatomical variation of the organ is either small or cropped to the region of interest (ROI) before training; otherwise, pertinent features of the medical image fail to be synthesized.
Besides, it has been reported that PGGAN's handling of the mode collapse problem is still below expectation. Another line of work on improving the training stability of GANs is the Wasserstein GAN (WGAN) [16], in which the Wasserstein distance replaces the conventional Jensen-Shannon divergence in the training objective. While the model training becomes more stable, it is challenging to enforce the Lipschitz constraint, and the model finds it hard to recognize complex image landscapes. Some GANs have been developed to tackle salient feature recognition during the image synthesis process. For example, the Auto-Embedding GAN (AEGAN) [17] has been proposed to generate high-resolution images. The model learns a latent embedding extracted from an autoencoder and has reported good results for the image synthesis task using a single TITAN X Pascal GPU.
Human knee structure comprises multiple types of musculoskeletal tissues [18]. As illustrated in Fig. 1, three different cartilages and their corresponding bone components form the overall knee structure. Their anatomical geometries change significantly across the image slices [19]. An effective medical image synthesis framework that can maintain training stability while capturing the fine details of the irregular knee structure poses a great challenge that has never been addressed before. Besides, existing GANs [7,20,21] are mainly applied to CIFAR-10 and/or CIFAR-100 datasets, and natural image-based evaluation metrics such as Inception Score and Frechet Inception Distance are used in those works. Hence, a novel knee image synthesis via a hierarchical framework is proposed. Specifically, the main contributions of this paper include:
1. The proposed knee image synthesis model adopts a hierarchical architecture composed of a layer-by-layer training structure with attention added between the layers, designed to enhance its salient feature recognition capability.
2. The training stability of the proposed framework is improved by adopting a hybrid pixelwise-spectral normalization configuration in the generator and discriminator, without incurring additional computational cost in the model training process.

3. Instead of the Inception Score and Frechet Inception Distance applied in most natural image evaluations, distance-based Wasserstein Distance and Mean Absolute Error, as well as probability-based Mode Score and AM Score, are adopted to better evaluate the proposed HieGAN framework in the context of medical image synthesis.
4. Findings suggest that the proposed HieGAN framework can serve as a promising medical image data augmentation option to tackle the scarcity of training data and the class imbalance problem faced by existing deep learning segmentation models.

Image datasets
The study comprised 75 normal knee image datasets. MR image data was acquired using a 3.0 Tesla (T) MRI scanner (Siemens Magnetom Trio, Erlangen, Germany) with a quadrature transmit-receive knee coil (USA Instruments, Aurora, OH). A dual echo steady state (DESS) with water excitation (WE) imaging sequence was selected [22]. All knee image datasets were chosen randomly from the Osteoarthritis Initiative (OAI) database. The images have a section thickness of 0.7 mm and an in-plane resolution of 0.365 × 0.365 mm² (field of view = 140 × 140 mm, flip angle = 25°, TR/TE = 16.3/4.7 msec, matrix size = 384 × 384, bandwidth = 185 Hz/pixel). More details about the dataset can be found at http://oai.epiucsf.org/datarelease/About.asp.

Architecture of HieGAN
A GAN architecture comprises a generator (G) and a discriminator (D). The generator is responsible for producing synthetic images with a distribution indistinguishable from the training distribution, while the discriminator is trained to distinguish between true samples and fake samples produced by the generator. Both G and D continuously engage in a min-max game given as

\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))] \quad (1)

where \mathbb{P}_r is the data distribution and \mathbb{P}_g is the model distribution implicitly defined by \tilde{x} = G(z), z \sim p(z). The input z to the generator is sampled from the noise distribution p(z). Simultaneous training of these two competing components contributes to difficult convergence, or to failure modes when equilibrium cannot be achieved.
Intuitively, the HieGAN model trains progressively from the low-resolution (8 × 8) to the high-resolution (256 × 256) scale, learning the diverse spatial features of knee images. As HieGAN progresses, local spatial features are acquired in the higher-resolution layers. During each transition, nearest-neighbour upsampling is used to fade the new scale in; this approach avoids sudden shocks to the already trained, lower-resolution scales. We also found that minibatch discrimination is computationally complex and sensitive to hyperparameter selection, so we selected a pixelwise-spectral normalization configuration instead.
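The fade-in transition described above can be sketched numerically. This is a minimal illustration, not the authors' implementation: `nearest_upsample` and `fade_in` are hypothetical helper names, and the blending coefficient `alpha` is assumed to ramp from 0 to 1 while a new resolution scale is introduced.

```python
import numpy as np

def nearest_upsample(x):
    """Nearest-neighbour 2x upsampling of an (H, W) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fade_in(low_res, high_res, alpha):
    """Blend the upsampled lower-resolution output with the new
    higher-resolution branch; alpha ramps 0 -> 1 during the transition."""
    return (1.0 - alpha) * nearest_upsample(low_res) + alpha * high_res

low = np.ones((8, 8))      # output of the already trained 8x8 scale
high = np.zeros((16, 16))  # output of the freshly added 16x16 branch
out = fade_in(low, high, alpha=0.25)  # mostly the stable low-res signal
```

Early in the transition (small `alpha`) the network output is dominated by the trusted lower-resolution scale, which is what shields the trained layers from sudden shocks.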
Knee structure exhibits constant change of shape and size within the dataset. Conventional convolution operations might fail to capture this diversity of variation during model training, which can compromise image quality. An attention layer is therefore proposed and implemented before the 256 × 256 scale in the generator and discriminator. The architecture of HieGAN is illustrated in Fig. 2.

Training of HieGAN
As shown in Fig. 3, training of HieGAN started at 8 × 8 and proceeded progressively to 256 × 256. A total of 14,920 training images were used to train HieGAN. The model was implemented using TensorFlow in Python. During training, we set the learning rate to 0.001, the maximum number of iterations to 228,000, the input noise dimension to 256 and the noise standard deviation to 0.01 in both the generator and discriminator. Leaky ReLU was used in the model. The batch size was 128 at 8 × 8, 64 at 16 × 16, 32 at 32 × 32, 16 at 64 × 64, 8 at 128 × 128 and 4 at 256 × 256. For optimization, Adam was used with β₁ = 0, β₂ = 0.999 and ε = 1 × 10⁻⁸. All training was performed on a desktop with an NVIDIA GeForce RTX 3070 GPU and took 3 weeks.
The original WGAN [16] utilizes weight clipping to attain 1-Lipschitz functions, but weight clipping is susceptible to optimization difficulties, capacity underuse and exploding/vanishing gradients without careful tuning of the weight clipping parameter c. We adopted the Wasserstein loss function plus a gradient norm penalty to achieve Lipschitz continuity in the discriminator. This loss function was first proposed in WGAN with Gradient Penalty (WGAN-GP) [23]. The gradient penalty is a soft version of the Lipschitz constraint, which follows from the fact that functions are 1-Lipschitz if and only if their gradients have norm at most 1 everywhere. The squared difference from norm 1 is used as the gradient penalty, as shown in

L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big] \quad (2)

The generator loss function remains unchanged, as shown in

L_G = -\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] \quad (3)

Pixelwise normalization was initially implemented in [7] to normalize every per-pixel feature vector to unit length after each convolutional layer and to prevent magnitudes in the generator from spiraling out of control as a result of competition with the discriminator. The formulation is illustrated in

b_{x,y} = a_{x,y} \Big/ \sqrt{\tfrac{1}{N} \textstyle\sum_{j=0}^{N-1} \big(a^{j}_{x,y}\big)^2 + \epsilon} \quad (4)

where \epsilon = 10^{-8}, N is the number of feature maps, and a_{x,y} and b_{x,y} are the original and normalized feature vectors at pixel (x, y), respectively.
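As a toy numerical sketch of the WGAN-GP penalty (not the paper's TensorFlow code): samples are interpolated between real and fake batches, the critic's input gradient is evaluated there, and deviations of the gradient norm from 1 are penalized. The linear critic and `critic_grad` callback below are hypothetical stand-ins for a network and its autodiff gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(real, fake, critic_grad, lam=10.0):
    """WGAN-GP style penalty on points interpolated between real and fake.
    critic_grad(x) must return the critic's gradient w.r.t. its input."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake    # random interpolates
    grads = critic_grad(x_hat)                 # shape (batch, dim)
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)   # squared deviation from 1

# Toy linear critic f(x) = w . x, whose input gradient is simply w.
w = np.array([0.6, 0.8])                       # ||w|| = 1, i.e. 1-Lipschitz
critic_grad = lambda x: np.tile(w, (x.shape[0], 1))

real = rng.normal(size=(4, 2))
fake = rng.normal(size=(4, 2))
gp = gradient_penalty(real, fake, critic_grad)  # zero: constraint satisfied
```

Because the toy critic already has unit gradient norm everywhere, its penalty vanishes; a critic violating the Lipschitz constraint would be pushed back toward it.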
In this work, pixelwise normalization is initiated by taking the pixel value of each channel of the image at position (x, y). A feature vector a_{x,y} is constructed for each (x, y) from the values across the feature maps. Then, each vector is normalized to unit average magnitude using Eqn. 4, and the normalized feature vectors are forwarded to the next layer.
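A minimal NumPy sketch of Eqn. 4, assuming a (channels, height, width) feature tensor (the function name `pixelwise_norm` is ours, not from the paper):

```python
import numpy as np

def pixelwise_norm(a, eps=1e-8):
    """Normalize the feature vector at every pixel to unit average magnitude:
    b[:, x, y] = a[:, x, y] / sqrt(mean over channels of a[:, x, y]^2 + eps).
    `a` has shape (C, H, W)."""
    return a / np.sqrt(np.mean(a ** 2, axis=0, keepdims=True) + eps)

a = np.random.default_rng(1).normal(size=(16, 8, 8))
b = pixelwise_norm(a)
```

After normalization, the mean squared activation across channels is 1 at every pixel, which keeps generator magnitudes from escalating during adversarial competition.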
Spectral normalization was proposed in [24] to stabilize the training of the discriminator without intensive hyperparameter tuning. It controls the Lipschitz constant of the discriminator by constraining the spectral norm of each layer g: h_in → h_out. The Lipschitz norm \lVert g \rVert_{Lip} is equal to \sup_h \sigma(\nabla g(h)), where \sigma(A) is the spectral norm of matrix A (the ℓ2 matrix norm of A), which is equivalent to the largest singular value of A. Therefore, for a linear layer g(h) = Wh, the norm is given by \lVert g \rVert_{Lip} = \sup_h \sigma(\nabla g(h)) = \sigma(W). Spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint \sigma(W / \sigma(W)) = 1.
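In practice σ(W) is usually estimated with power iteration rather than a full SVD. The sketch below is an illustrative NumPy version under that assumption (deep learning frameworks implement this as a layer wrapper; `spectral_normalize` is our own name):

```python
import numpy as np

def spectral_normalize(W, n_iter=200):
    """Estimate the largest singular value of W by power iteration and
    rescale W so its spectral norm is (approximately) 1."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimated sigma(W)
    return W / sigma

W = np.random.default_rng(2).normal(size=(6, 4))
W_sn = spectral_normalize(W)   # now sigma(W_sn) ~= 1
```

Rescaling each weight matrix this way bounds the Lipschitz constant of the whole discriminator by the product of the per-layer norms, each of which is now 1.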

Fine Feature Learning via Attention
The convolution operator in GANs uses a local receptive field to learn local neighbourhood representations. This curtails effective model learning when specific details are located at different positions, since long-range dependencies can only be processed after passing through several convolutional layers. As a result, PGGAN lacks the power to specify the features of synthetic knee images. An attention layer computes the response at a position as a weighted sum of the features at all positions, where the weights are calculated at only a small computational cost. The mechanism leverages complementary features in distant portions of the image, rather than local regions of fixed shape, to generate consistent objects. Thus, it filters the feature response to retain only the relevant activations.
An attention layer based on the non-local model [25] was implemented in the HieGAN architecture. The attention mechanism is exhibited in Fig. 4. Three feature spaces, i.e. f, g and h, are obtained by using 1 × 1 convolutions. The generator can extract fine details at every location that are carefully coordinated with fine details in distant portions of the knee image, while the discriminator can enforce complex geometric constraints on the global image structure. Knee image features from the previous hidden layer, x ∈ ℝ^{C×N}, are first convolved with 1 × 1 convolution filters into two feature spaces f and g, where f(x) = W_f x and g(x) = W_g x, C is the number of channels and N is the number of feature locations of the features from the previous hidden layer.
W_f ∈ ℝ^{C̄×C} and W_g ∈ ℝ^{C̄×C} are the learned weight matrices. The feature spaces f and g are combined by matrix multiplication into s_{ij} = f(x_i)^T g(x_j). The result is then fed into a softmax layer to compute the attention, indicating the extent to which the model attends to the i-th location when synthesizing the j-th region. The resultant attention map is given in

\beta_{j,i} = \exp(s_{ij}) \Big/ \textstyle\sum_{i=1}^{N} \exp(s_{ij})

where N is the number of feature locations. The feature vectors of f and g have different dimensions than the feature vector h. We multiply the resultant attention map with the third feature space h(x_i) = W_h x_i and then convolve with a 1 × 1 convolution filter v(x) = W_v x to obtain the output of the attention layer given in

o_j = W_v \textstyle\sum_{i=1}^{N} \beta_{j,i} \, h(x_i)

Here, W_f, W_g, W_h and W_v are the weight matrices of the 1 × 1 convolutional layers. To enable the generator to learn both the local dependence and the long-range global dependency of the knee image, we multiply the output of the attention layer by a weight coefficient γ and add it to the input feature map to obtain the final output of the attention mechanism given as

y_i = \gamma o_i + x_i \quad (9)
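The attention computation above can be sketched with plain matrix algebra. This is an illustrative re-implementation under our own naming (flattened feature map of shape (C, N), random weight matrices standing in for the learned 1 × 1 convolutions), not the paper's code:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wf, Wg, Wh, Wv, gamma):
    """Non-local attention over a feature map x of shape (C, N), where
    N = H*W flattened locations. Wf/Wg project to the f/g spaces, Wh/Wv
    form the output path, gamma is the learned residual weight."""
    f = Wf @ x                       # (C_bar, N)
    g = Wg @ x                       # (C_bar, N)
    s = f.T @ g                      # s[i, j] = f(x_i) . g(x_j)
    beta = softmax(s, axis=0)        # beta[i, j]: attention to location i
                                     # when synthesizing location j
    o = Wv @ ((Wh @ x) @ beta)       # weighted sum of h features -> (C, N)
    return gamma * o + x             # residual connection (Eqn. 9)

rng = np.random.default_rng(3)
C, N, Cb = 8, 16, 4
x = rng.normal(size=(C, N))
y = self_attention(x, rng.normal(size=(Cb, C)), rng.normal(size=(Cb, C)),
                   rng.normal(size=(Cb, C)), rng.normal(size=(C, Cb)),
                   gamma=0.0)
```

With γ = 0 the layer passes the input through unchanged, which is why γ is typically initialized at 0 and learned: the network first relies on local cues, then gradually adds long-range dependencies.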

Experimental Settings
In this experiment, the HieGAN framework was compared with the relevant state-of-the-art WGAN-GP [23], PGGAN and AEGAN. A total of 5,100 test images were used. The Inception Score (IS) [26] was the benchmark evaluation metric to gauge the realism of synthetic image samples produced by a GAN, based on the Kullback-Leibler (KL) divergence between the conditional label distribution p(y|x) and the marginal label distribution p(y). Intuitively, generated samples are expected to yield a marginal distribution p(y) with high entropy if all classes are equally represented in the set of samples (high diversity), and a conditional distribution p(y|x) with low entropy for easily classifiable samples (better sample quality).
Although IS has reported good correlation with human evaluation, it relies on a classifier pretrained on ImageNet (GoogLeNet). The metric is therefore limited to evaluating data resembling the ImageNet classes rather than medical images. Furthermore, IS fails to recognize images that are generated under mode collapse. Similar problems are also reported for the Frechet Inception Distance (FID) [27]. The Mode Score (MS) [28] overcomes the limitation of IS by including the prior distribution of labels over real data, while the AM Score (AM) [29] takes the training data into account by replacing the entropy term H(y) with the KL divergence between p(y*) and p(y) to address the uneven data distribution problem.
The formulas of MS and AM are expressed in

MS = \exp\big(\mathbb{E}_x[\mathrm{KL}(p(y|x) \,\|\, p(y^*))] - \mathrm{KL}(p(y) \,\|\, p(y^*))\big)

where x denotes an image data sample and p(y^*) denotes the empirical distribution of labels from the training data y^*. MS ranges between 0 and infinity; a higher MS reflects better image quality.

AM = \mathrm{KL}(p(y^*) \,\|\, p(y)) + \mathbb{E}_x[H(p(y|x))]

AM is minimized when p(y^*) is close to p(y) and the entropy of the predicted class label for sample x, H(p(y|x)), is low. A smaller AM score indicates better image quality. We applied both MS and AM to assess the quality and variation of the synthetic knee images. VGG-16 was used as the classifier.
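Given classifier outputs, both scores reduce to a few lines of NumPy. The sketch below assumes `pyx` is an (n_samples, n_classes) array of predicted class probabilities for synthetic samples (from VGG-16 in this paper) and `py_star` is the empirical label distribution of the real training data; the function names are ours.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (no zero entries)."""
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

def mode_score(pyx, py_star):
    """exp( E_x[KL(p(y|x) || p(y*))] - KL(p(y) || p(y*)) ); higher is better."""
    py = pyx.mean(axis=0)                      # marginal p(y) over samples
    return np.exp(np.mean([kl(p, py_star) for p in pyx]) - kl(py, py_star))

def am_score(pyx, py_star):
    """KL(p(y*) || p(y)) + E_x[H(p(y|x))]; lower is better."""
    py = pyx.mean(axis=0)
    return kl(py_star, py) + np.mean([entropy(p) for p in pyx])

py_star = np.array([0.5, 0.3, 0.2])
pyx = np.tile(py_star, (4, 1))  # degenerate case: predictions = prior
ms = mode_score(pyx, py_star)
am = am_score(pyx, py_star)
```

In the degenerate case where every prediction equals the prior, both KL terms vanish, so MS collapses to 1 and AM to the prior's entropy; confident, class-balanced predictions push MS up and AM down.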
Mean Absolute Error (MAE) and Wasserstein Distance (WD) use distance formulas to compare the probability distributions of real and synthetic data. MAE measures the average magnitude of error between the original data distribution y and the synthetic data distribution \hat{y} over n image samples. The formula is expressed in

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

When the synthetic knee image is highly similar to the original knee image, the value of MAE is small.
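A direct NumPy rendering of the MAE formula, on a toy pair of 2 × 2 "images" (illustrative only):

```python
import numpy as np

def mean_absolute_error(real, synthetic):
    """Average absolute pixel difference between matched real and
    synthetic images of the same shape."""
    return np.mean(np.abs(real - synthetic))

real = np.array([[0.2, 0.4], [0.6, 0.8]])
fake = np.array([[0.3, 0.4], [0.5, 0.8]])
mae = mean_absolute_error(real, fake)  # (0.1 + 0 + 0.1 + 0) / 4 = 0.05
```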
WD is a measure of distance between the probability distributions of synthetic (\mathbb{P}_g) and real (\mathbb{P}_r) knee images in the feature space of a trained classifier, the Inception-v3 network. The formula is expressed in

W_p(\mathbb{P}_r, \mathbb{P}_g) = \Big(\inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, \tilde{x}) \sim \gamma}\big[\lVert x - \tilde{x} \rVert^{p}\big]\Big)^{1/p}

where \Pi(\mathbb{P}_r, \mathbb{P}_g) is the set of joint distributions with marginals \mathbb{P}_r and \mathbb{P}_g, and we set the parameter p = 1 in this work.
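For two equal-size one-dimensional samples, the p = 1 Wasserstein distance has a closed form: the mean absolute difference of the sorted values. The sketch below illustrates this simplified 1-D case (the paper's metric is computed over Inception-v3 features, not raw scalars):

```python
import numpy as np

def wasserstein_1d(u, v):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    sorting both realizes the optimal coupling, so the distance is the
    mean absolute difference of the sorted values."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

u = np.array([0.0, 1.0, 3.0])
v = np.array([1.0, 2.0, 4.0])
wd = wasserstein_1d(u, v)  # every sorted pair differs by 1 -> 1.0
```

In higher dimensions no such closed form exists, which is why feature-space estimates over a trained classifier are used instead.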

Evaluation of Model Training Performance
Due to its adversarial nature, the training behaviour of a GAN is always volatile. We tested different configurations of normalization techniques applied in the generators and discriminators of HieGAN. Fig. 5 depicts the effect of the different normalization configurations on the synthesized knee images. To illustrate the stability in progressive training afforded by the hybrid normalization configuration, the training performance of HieGAN (Fig. 7) was compared to that of PGGAN (Fig. 6) at different scales. In each plot, the training losses of the generator and discriminator are computed. Although progressive training brought stability to the training performance, mode collapse was detected on many occasions in PGGAN. HieGAN also partially suffered from mode collapse, but the occurrences were fewer and the training was more stable. Eventually, HieGAN was able to converge at the 256 × 256 scale, proving it more reliable than the state-of-the-art PGGAN.

Evaluation of Synthetic Image Quality
Assessments of the realism of real versus synthetic knee images are shown in Table 1. Figure 8 compares real MR images of the knee, (a) to (f), against synthetic knee images at the 128 × 128 scale, (g) to (l), and the 256 × 256 scale, (m) to (r). Failure cases are indicated by red arrows in PGGAN, (s) to (w), and by a red box in HieGAN, (w).
In Tables 2 and 3, quality assessments of real and synthetic images were computed at the 128 × 128 and 256 × 256 scales, respectively. The quality of synthetic knee images improved continuously from 128 × 128 to 256 × 256. The best AM score at 128 × 128 was 3.039, improving to 2.419 at 256 × 256. A similar trend was observed in the Mode score: the best score was 1.019 at 128 × 128, improving to 1.383 at 256 × 256. HieGAN demonstrated better capability in producing good-quality synthetic images than the other GANs at both scales.

Discussion
This is a novel work on a knee image synthesis framework for GANs. The model performance of HieGAN was compared with other state-of-the-art models, i.e. WGAN-GP, PGGAN and AEGAN, at different scales. We first assessed the data distribution difference between real and synthetic knee images using MAE and WD. Then, we validated the quality and variation of synthetic knee images at the 128 × 128 and 256 × 256 scales at different iterations using the AM and Mode scores. Knee image synthesis is a challenging task because of the knee's complex structure and varying anatomical geometry [30,31]. Equipped with a novel normalization technique and attention configuration, the proposed framework successfully produced realistic synthetic knee images. In the following, we describe the key lessons obtained from this work.
First, synthetic knee images are useful in segmentation tasks. One potential application is the diversification of real training data with synthetic data to improve the robustness of deep learning segmentation models. In Russ et al. (2019), three training configurations, i.e. real data only, synthetic data only, and a real-synthetic combination, were used to train a U-Net segmentation model; the real-synthetic configuration reported performance superior to the other two [32]. Given the rising number of research works on knee segmentation using deep learning models [33], further investigations based on these findings will benefit future knee segmentation models.
Second, optimization of GAN training remains an active research topic despite its appealing potential. In fact, conventional GANs are infamous for exhibiting mode collapse and vanishing gradients during training. The selection of normalization techniques has a profound effect on the quality of knee image synthesis. For instance, standalone spectral and pixelwise normalization produced low-quality knee images with blurred backgrounds or inferior contrast, which cannot be adopted by subsequent deep learning segmentation models. Although PGGAN has reported more stable training performance attributed to its progressive training nature, the model still suffers from training instability and does not converge smoothly at several scales. Meanwhile, the original WGAN [16] was proposed to bring stability to the GAN training process, but the outcome is highly dependent on hyperparameter tuning. As a result, WGAN might generate synthetic images of inferior quality while the training still fails to converge.
In recognition of this limitation, WGAN-GP uses a gradient penalty to enforce the Lipschitz constraint. Nevertheless, WGAN-GP still lacks the capability to produce good-quality synthetic knee images. On the other hand, HieGAN managed to converge successfully at the 256 × 256 scale after the improvements to model training stability were deployed. In addition, the huge computational cost incurred by GANs is another topic of research. StyleGAN [34], an extension of PGGAN, has generated high-resolution attributes in natural images and has recently been extended to synthesize CT and MR images [35]. Unfortunately, the implementation of StyleGAN requires extremely large computational resources [36], making it infeasible for common medical image synthesis applications. AEGAN aims to tackle the problem of low-quality synthetic images by encoding global structure features and extracting salient image details. Based on the results, the model shows promise in knee image synthesis; however, it is noteworthy that AEGAN incurs a high computational cost to capture fine details. In HieGAN, attention was employed as an alternative to capture salient features. The attention layer successfully guided the discriminator to pay more attention to different features of the knee images, compelling the generator to produce realistic images without imposing an extra computational burden on the training. Our quantitative findings suggest that the images attained a high degree of realism, especially at the 256 × 256 scale. Specifically, the overall image brightness is preserved, the boundary of the cartilage-bone interface is well preserved, the contrast between bone, cartilage and background is distinct, and the anatomical shape and size of cartilage and bone are conserved.
We detected failure cases among the samples generated by PGGAN. In particular, PGGAN generated seriously deformed knee structures wherein the structures of the femur and tibia were altered. Moreover, the boundary between the femur and the surrounding musculoskeletal tissues is overly diffused in several samples. Failure samples with serious deformation could potentially mislead the learning of deep learning models. Nonetheless, we also observed a minor irregularity in one sample produced by HieGAN: the proposed model failed to distinguish the shrinking femur and tibia from the background musculoskeletal tissues, and the boundary between the knee bones and the background is blurred. These failure cases provide valuable insights for improving the model in the future.
At the current stage, the study has some limitations. We did not assess the pathological features of synthetic knee images: radiological features of knee osteoarthritis (OA) such as osteophytes, bone marrow lesions and subchondral bone cysts were not taken into consideration. Besides, we limited image generation to the 256 × 256 scale to balance existing GPU capacity, salient feature recognition and acceptable medical image resolution. These considerations are common among researchers in deep learning for medical image analysis when building a sustainable medical image synthesis framework for GANs.

Conclusion
Research interest in GAN model development is growing in the medical image analysis community along with advances in GPU technology. Both quantitative and qualitative results showed that HieGAN outperformed the state-of-the-art PGGAN. In future, generation of synthetic knee images at higher resolutions of 512 × 512 and 1024 × 1024 will be attempted with more powerful GPUs. A Visual Turing Test will also be conducted, in collaboration with medical imaging experts, to investigate the capability of HieGAN in generating pathological features of knee images. Then, the synthetic data will be tested along with real data in deep learning segmentation, in an attempt to mitigate the lack of training data faced by supervised deep learning models.