TilGAN: GAN for Facilitating Tumor-Infiltrating Lymphocyte Pathology Image Synthesis With Improved Image Classification

Tumor-infiltrating lymphocytes (TILs) act as immune cells against cancer tissues. The manual assessment of TILs is usually error-prone, tedious, costly, and subject to inter- and intraobserver variability. Machine learning approaches can solve these issues, but they require a large amount of labeled data for model training, which is expensive and not readily available. In this study, we present an efficient generative adversarial network, TilGAN, to generate high-quality synthetic pathology images followed by classification of TIL and non-TIL regions. Our proposed architecture is constructed with a generator network and a discriminator network. The novelty lies in the TilGAN architecture, loss functions, and evaluation techniques. Our TilGAN-generated images achieved a higher Inception score than the real images (2.90 vs. 2.32, respectively). They also achieved a lower kernel Inception distance (1.44) and a lower Fréchet Inception distance (0.312). The generated images also passed the Turing test performed by experienced pathologists and clinicians. We further extended our evaluation studies and used almost one million synthetic images, generated by TilGAN, to train a classification model. Our proposed classification model achieved a 97.83% accuracy, a 97.37% F1-score, and a 97% area under the curve. Our extensive experiments and superior outcomes show the efficiency and effectiveness of our proposed TilGAN architecture. This architecture can also be used to synthesize other types of images.


I. INTRODUCTION
Tumor-infiltrating lymphocytes (TILs) play a significant role in cancer diagnosis and prognosis [1]. The presence of TILs in different cancer types (such as lung, colon, and breast cancer) signifies improved clinical outcomes and faster response to chemotherapy [2]. Recent evidence has emerged that the infiltration of antitumor type I lymphocytes can improve cancer prognosis [3]. TILs are a special type of white blood cell that tends to migrate from the bloodstream towards tumor cells [4]. TILs comprise mainly T cells, B cells, mononuclear cells, and polymorphonuclear immune cells (such as neutrophils, eosinophils, and basophils) [5]. TILs normally float around tumor cells.
As per the World Health Organization and American Cancer Society, lung cancer is one of the most devastating cancers globally, accounting for almost 14% of new cancers in men and 13% of new cancers in women in the United States [6]. It has also been reported that in 2019, lung cancer accounted for approximately 228,150 new cases (116,440 men and 111,710 women) and 142,670 deaths (76,650 men and 66,020 women) in the United States [7]. For lung cancer prognosis, pathological image analysis is considered the primary and gold standard screening method. For this purpose, pathologists collect a small part of the tissue from the suspected tumor region. Next, the tissues are further processed and stained using different stains, including hematoxylin and eosin (H&E) [8], [9]. Lung cancer pathology images typically contain TILs, tumor cells, mitotic cells, stroma, etc. Under a microscope, TILs appear with round, deep bluish nuclei [10]. The details of the TIL and non-TIL regions are shown in figure 1. Pathologists follow manual image analysis procedures to analyze the tissue regions. This procedure depends fully on the knowledge of the pathologists. Moreover, it is costly and time-consuming.
Deep learning has shown promising results in image analysis, signal analysis, video analysis, and many more fields [9]. Currently, this method is one of the most popular machine learning approaches and is used to solve many complicated tasks, such as object classification, image segmentation, and risk prediction. However, it has a few disadvantages. The most significant one is that deep learning requires a large amount of data to achieve satisfactory performance [11]. Biomedical data, however, are expensive and not readily available, as approval from the patients and the institutional review board is required to use them. Biomedical data may also contain artifacts, noise, etc., which further reduce the total number of usable data points.
To solve the data availability problem, the authors of [12] proposed a generative adversarial network (GAN) for natural image synthesis. The GAN comprises mainly a generator network/model and a discriminator network/model. The generator network generates synthetic/fake data, which looks like real data, while the discriminator checks the quality of the synthetic data. Categorically, the generator network learns from latent space, and the discriminator differentiates the real and synthetic data distributions [13]. The generator attempts to fool the discriminator by minimizing the generator loss [14]. In this study, we present an efficient generative adversarial network, TilGAN, to generate high-quality synthetic pathology images followed by classification of TIL and non-TIL regions. The main novelty lies in the TilGAN architecture, loss functions, and evaluation techniques. This manuscript has five sections. Section I introduces the work, and Section II discusses related work. In Section III, the materials and methods are discussed. Sections IV and V discuss the results and present the discussion and conclusion, respectively.

II. RELATED WORK
A GAN, an unsupervised method, is used to generate millions of synthetic samples that resemble the real dataset [15]. Traditional generative models follow the rules of explicit approximation inference and Markov fields, but GANs do not follow this rule. The generative network of a GAN produces high-quality fake data to mislead the discriminator. The training process of the GAN ends when a Nash equilibrium from game theory is reached [16]. Hence, the GAN learning process is considered a minimax (min-max) optimization problem.
Initially, a GAN was developed for natural image synthesis [12], but gradually, the default architecture was changed to improve the synthetic image quality and to solve other data processing issues, including color enhancement [34], image translation [35], [36], nuclei segmentation [37], [38], cell-level visual representation [39], and image classification [40]. Various researchers have proposed different cost functions for the generator and discriminator networks to improve the quality of synthetic images, such as relativistic GAN, hinge GAN [41], relativistic average GAN [42], and Wasserstein GAN [43]. The main difference between the standard GAN and the modified GANs is that the standard GAN tries to prove that the input data are real, whereas modified GANs measure the probability that generated data are less realistic than the real data (or vice versa). With a standard GAN, the discriminator squeezes the output into two ends, i.e., 0 or 1. Modified GANs measure the distances or differences between fake and real images [44]. When the discriminator reaches an optimum level, gradients vanish. Many new GAN architectures have been proposed for natural and biomedical image synthesis. Cycle-consistency GANs are among the most common GAN architectures and were designed for biomedical image synthesis [45], image-to-image translation [46], etc. [55]. In this manuscript, we perform image synthesis with TilGAN, which is constructed using different baseline architectures, such as Pathology GAN [49], BigGAN [50], a cycle-consistency GAN [56], and a relativistic average GAN [42].
Pathology images show important information, and small changes in the tissue characteristics may result in a wrong diagnosis and patient death. Therefore, it is a very challenging task to maintain the real image characteristics of synthetic images. Existing GAN architectures generate TIL and non-TIL patches, but our proposed network shows improved results. We targeted preserving real image features such as image appearance, chromatin information, stain colors, and tissue contents.
In summary, the novel technical contributions of this study are as follows:

• The most important contribution of this study lies in the architecture of TilGAN. Due to its novel architecture, TilGAN generates millions of high-quality, clinically significant TIL patches.
• Second, we propose a modified version of the relativistic average cost function to preserve important pathological signatures.
• Third, to our knowledge, this is the first report to propose a GAN that specifically aims to generate TIL and non-TIL patches.
• Fourth, the generated synthetic images are used for classification model training.
The detailed method, along with the results, will be discussed in the subsequent sections.

III. MATERIALS AND METHODS
A. DATASET
In total, 712 H&E-stained whole-slide images (WSIs) of lung cancer (356 adenocarcinomas and 356 squamous cell carcinomas) were collected from The Cancer Genome Atlas data repository (https://tcga-data.nci.nih.gov/tcga/). This is a public repository, and the data are freely available for research. For our study, the collected data were split equally into two sets with no overlap. One half of the data, i.e., 356 WSIs (178 adenocarcinomas and 178 squamous cell carcinomas), was used for TilGAN, and the other half was used for classification purposes.
Out of the 356 WSIs, we used 75% (267 WSIs) for training and 25% (89 WSIs) for testing of the TilGAN architecture. The ground truths were generated by experts using HistomicsTK (https://digitalslidearchive.github.io/HistomicsTK/). To train our classification model, we used one million high-quality synthetic images of size 224 × 224 pixels generated by TilGAN, together with 10% of the remaining 356 WSIs (i.e., 36 WSIs) as real labeled data. The other 320 WSIs were split into testing (50%) and validation (50%) sets to evaluate the classification model. The distribution of the WSIs is shown in figure 2.
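The split sizes quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `split_counts` and the use of rounding to the nearest slide are assumptions made here, not part of the original pipeline.

```python
from typing import Tuple

# Illustrative check of the WSI split sizes reported in the text.
# The helper name and the rounding rule are assumptions of this sketch.

def split_counts(total: int, fraction: float) -> Tuple[int, int]:
    """Return (selected, remaining) when `fraction` of `total` is selected."""
    selected = round(total * fraction)
    return selected, total - selected

total_wsis = 712
gan_wsis, clf_wsis = split_counts(total_wsis, 0.5)        # TilGAN half / classification half
gan_train, gan_test = split_counts(gan_wsis, 0.75)        # TilGAN training / testing
clf_train_real, clf_rest = split_counts(clf_wsis, 0.10)   # real WSIs added to classifier training
clf_test, clf_val = split_counts(clf_rest, 0.5)           # classifier testing / validation

print(gan_train, gan_test, clf_train_real, clf_test, clf_val)
```

Under these assumptions the counts agree with the figures given in the text (267 and 89 for TilGAN, 36 real WSIs for classifier training, and 320 held out).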

B. TilGAN METHODOLOGY
We developed the TilGAN architecture for the synthesis of TIL and non-TIL pathology images of size 224 × 224 pixels. This network was trained from scratch. We adopted a supervised learning strategy that uses hand-labeled images. There are many GAN architectures available for natural image synthesis, but few of them have been used for pathology image generation. Pathology images carry essential clinical features about cell nuclei, stroma, mitosis, lymphocytes, etc. Hence, image synthesis using pathology data requires special skills. Small changes in the visual appearance of nuclei, lymphocytes, etc., may change the clinical meaning. The workflow diagram of our proposed TilGAN architecture is shown in figure 3.

1) TilGAN ARCHITECTURE DETAILS
The TilGAN architecture comprised a generator network $G_{NET}$ and a discriminator network $D_{NET}$. The input of the generator was randomly chosen from the annotated real TIL and non-TIL patches of size 224 × 224 pixels. The output of the generator was synthetic TIL and non-TIL patches. The generator was defined as a mapping function of the input z, used to learn the generator's distribution over the data y. The discriminator, $D_{NET}$, estimated the probability that y was more realistic than the generator's distribution. The generator network of TilGAN comprised six convolutional layers, five up-convolution layers, and two dense layers. The up-convolution operation, obtained by a transposed convolution, increased the height and width of the feature maps by a factor of two. The discriminator of TilGAN was formed using six convolutional layers, six down-convolution layers, and two dense layers. The down-convolution operation, a general convolution, decreased the height and width of the feature maps by a factor of two. This design helped the model learn the features from real images efficiently and effectively. Different learning rates were used for the generator and discriminator networks. Overfitting was mitigated by incorporating more data and varying the dropout layers. During training, we set the dropout value to 0.5. The rectified linear unit (ReLU) activation function was used after each convolution layer.
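To make the effect of the up- and down-convolution layers concrete, the following sketch tracks only the spatial size of the feature maps. It makes two assumptions that the text does not state: the generator's up-convolution stack starts from a 7 × 7 map (so that five doublings reach 224), and each down-convolution rounds odd sizes up, as a stride-2 convolution with "same" padding would.

```python
import math

# Sketch of feature-map sizes only; the layer contents are not modeled.
# Assumptions (not from the paper): a 7x7 starting map in the generator
# and ceiling division for stride-2 down-convolutions.

def upsample_sizes(start, n_layers, factor=2):
    """Spatial size after each of `n_layers` up-convolutions."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(sizes[-1] * factor)
    return sizes

def downsample_sizes(start, n_layers, factor=2):
    """Spatial size after each of `n_layers` down-convolutions."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(math.ceil(sizes[-1] / factor))
    return sizes

gen_sizes = upsample_sizes(7, 5)       # five up-convolution layers
disc_sizes = downsample_sizes(224, 6)  # six down-convolution layers

print(gen_sizes)
print(disc_sizes)
```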
The detailed architecture of our proposed TilGAN model is shown in table 2.

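2) MODIFIED LOSS FUNCTION FOR THE TilGAN ARCHITECTURE
Pathology images possess distinct types of textural, color, and morphological features, which are linked with the patient's diagnosis and prognosis. Hence, it is essential to handle these types of data separately, unlike other nonclinical data. The existing loss functions of GAN architectures generated TIL and non-TIL patches, but the clinical features of those images were not consistent. Hence, we developed a modified version of the relativistic average loss function to solve these issues. We used the modified relativistic average cost function for both networks (generator and discriminator). The fundamental theory of the modified relativistic average loss function originates from the binary cross-entropy loss function [12] as follows:

$$\mathrm{Loss}(\hat{O}_d, O_d) = O_d \cdot \log \hat{O}_d + (1 - O_d) \cdot \log(1 - \hat{O}_d) \tag{1}$$

Here, $O_d$ is the label and $\hat{O}_d$ is the discriminator output. When the data are real, the values of $O_d$ and $\hat{O}_d$ will be 1 and $D_{NET}(y)$, respectively. Substituting these values into equation 1, we obtain

$$\mathrm{Loss}(D_{NET}(y), 1) = 1 \cdot \log D_{NET}(y) + (1 - 1) \cdot \log(1 - D_{NET}(y))$$

$$\mathrm{Loss}(D_{NET}(y), 1) = \log D_{NET}(y) \tag{2}$$

When the data are fake, the values of $O_d$ and $\hat{O}_d$ will be 0 and $D_{NET}(G_{NET}(z))$, respectively. Substituting these values into equation 1, we obtain

$$\mathrm{Loss}(D_{NET}(G_{NET}(z)), 0) = \log(1 - D_{NET}(G_{NET}(z))) \tag{3}$$

The main purpose of the discriminator $D_{NET}$ is to distinguish real and fake images. Hence, equations 2 and 3 should be maximized. Next, the discriminator loss will be as follows:

$$\mathrm{Loss}_{D_{NET}} = \max\left[\log D_{NET}(y) + \log(1 - D_{NET}(G_{NET}(z)))\right]$$

The final generator $G_{NET}$ loss will be

$$\mathrm{Loss}_{G_{NET}} = \min\left[\log D_{NET}(y) + \log(1 - D_{NET}(G_{NET}(z)))\right]$$

The loss functions of a standard GAN can be classified into saturating and non-saturating loss functions [42]. Equation 7 is an example of a non-saturating loss function. In the case of saturating loss, the equation for the discriminator will be

$$\mathrm{Loss}_{D_{NET}} = -\mathbb{E}_{y \sim P_{image}(y)}\left[\log D_{NET}(y)\right] - \mathbb{E}_{z \sim P_z(z)}\left[\log(1 - D_{NET}(G_{NET}(z)))\right] \tag{8}$$

In the standard GAN, $D_{NET}(y)$ is represented as $D_{NET}(y) = \mathrm{sigmoid}(C(y))$ [60], [61]. Here, $C(y)$ determines the possibility of having real or fake data. Hence, it is also known as a critic or non-transformed discriminator output. If the value of $C(y)$ is negative, then the input data are fake, and vice versa. After substituting the value of $D_{NET}(y)$ into equation 8, we obtain

$$\mathrm{Loss}_{D_{NET}} = -\mathbb{E}_{y \sim P_{image}(y)}\left[\log \mathrm{sigmoid}(C(y))\right] - \mathbb{E}_{z \sim P_z(z)}\left[\log(1 - D_{NET}(G_{NET}(z)))\right] \tag{9}$$

With a relativistic standard GAN, we compute the distance, which depends on the real and fake data distribution. Hence, $D_{NET}(y)$ will change to $D_{NET}(y_r, y_f) = \mathrm{sigmoid}(D_{NET}(y_r) - D_{NET}(y_f))$. Here, r and f indicate real and fake data, respectively. Now, the discriminator loss will be:

$$\mathrm{Loss}_{D_{NET}} = -\mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[\log \mathrm{sigmoid}(D_{NET}(y_r) - D_{NET}(y_f))\right] \tag{10}$$

and the generator loss will be:

$$\mathrm{Loss}_{G_{NET}} = -\mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[\log \mathrm{sigmoid}(D_{NET}(y_f) - D_{NET}(y_r))\right] \tag{11}$$

Here, $D_{NET}(y_r) = D_{NET}(y_f) = 0.5$ has been set as an optimal point [61]. Equation 7 can also be generalized as

$$\mathrm{Loss}_{D_{NET}} = \mathbb{E}_{y_r \sim \mathbb{R}}\left[f_1(D_{NET}(y_r))\right] + \mathbb{E}_{z \sim \mathbb{R}_z}\left[f_2(D_{NET}(G_{NET}(z)))\right] \tag{12}$$

$$\mathrm{Loss}_{G_{NET}} = \mathbb{E}_{y_r \sim \mathbb{R}}\left[g_1(D_{NET}(y_r))\right] + \mathbb{E}_{z \sim \mathbb{R}_z}\left[g_2(D_{NET}(G_{NET}(z)))\right] \tag{13}$$

Here, functions f and g map a scalar input to another scalar.

The corresponding relativistic cost function will be as follows:

$$\mathrm{Loss}_{D_{NET}} = \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[f_1(D_{NET}(y_r) - D_{NET}(y_f))\right] + \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[f_2(D_{NET}(y_f) - D_{NET}(y_r))\right] \tag{14}$$

$$\mathrm{Loss}_{G_{NET}} = \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[g_1(D_{NET}(y_r) - D_{NET}(y_f))\right] + \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[g_2(D_{NET}(y_f) - D_{NET}(y_r))\right] \tag{15}$$

From equations 14 and 15 above, we can say that $f_1(D_{NET}(y_r) - D_{NET}(y_f)) = f_2(-(D_{NET}(y_r) - D_{NET}(y_f)))$, so the two terms of equation 14 are equal. Moreover, in the case of non-saturating loss, $g_1(D_{NET}(y_r) - D_{NET}(y_f)) = f_2(D_{NET}(y_r) - D_{NET}(y_f))$, and $g_2(D_{NET}(y_f) - D_{NET}(y_r)) = f_1(D_{NET}(y_f) - D_{NET}(y_r))$.

Based on the above properties, we can further simplify equations 14 and 15 as:

$$\mathrm{Loss}_{D_{NET}} = \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[f_1(D_{NET}(y_r) - D_{NET}(y_f))\right] \tag{16}$$

$$\mathrm{Loss}_{G_{NET}} = \mathbb{E}_{(y_r, y_f) \sim (\mathbb{R}, \mathbb{N})}\left[f_1(D_{NET}(y_f) - D_{NET}(y_r))\right] \tag{17}$$

The generic cost functions of the relativistic average GAN for a generator and discriminator can be computed as:

$$\mathrm{Loss}_{D_{NET}} = \mathbb{E}_{y_r \sim \mathbb{R}}\left[f_1\left(D_{NET}(y_r) - \mathbb{E}_{y_f \sim \mathbb{N}} D_{NET}(y_f)\right)\right] + \mathbb{E}_{y_f \sim \mathbb{N}}\left[f_2\left(D_{NET}(y_f) - \mathbb{E}_{y_r \sim \mathbb{R}} D_{NET}(y_r)\right)\right] \tag{18}$$

$$\mathrm{Loss}_{G_{NET}} = \mathbb{E}_{y_r \sim \mathbb{R}}\left[g_1\left(D_{NET}(y_r) - \mathbb{E}_{y_f \sim \mathbb{N}} D_{NET}(y_f)\right)\right] + \mathbb{E}_{y_f \sim \mathbb{N}}\left[g_2\left(D_{NET}(y_f) - \mathbb{E}_{y_r \sim \mathbb{R}} D_{NET}(y_r)\right)\right] \tag{19}$$

In our data, TILs appear round and dark purple with deep bluish nuclei. We aimed to maintain the stain color, tissue contents, morphology, and textural details in our generated images. The relativistic average GAN cost functions did not generate output as expected. Hence, we tweaked the relativistic average GAN as follows:

$$\mathrm{Loss}^{modified}_{D_{NET}} = \mathbb{E}_{y_r \sim \mathbb{R}}\left[\mathrm{sum}\left(g_1\left(D_{NET}(y_r) - \mathbb{E}_{y_f \sim \mathbb{N}} D_{NET}(y_f)\right)\right)\right] + \mathbb{E}_{y_f \sim \mathbb{N}}\left[\mathrm{sum}\left(g_2\left(D_{NET}(y_f) - \mathbb{E}_{y_r \sim \mathbb{R}} D_{NET}(y_r)\right)\right)\right]$$

Here, $V(r, f)$ is computed using $r \cdot (1 - \epsilon) + f \cdot \epsilon$, where $\epsilon = e^{-4}$. Figure 4 shows the results of the relativistic average GAN loss and our proposed loss functions. The results in figure 4(b) are much smoother than those in figure 4(a). Moreover, in figure 4(b), nuclei are easily identifiable. These results indicate that our loss function generates better images. The training and validation loss graph of TilGAN is shown in figure 11.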
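As a concrete illustration of the generic relativistic average cost functions discussed above, the sketch below evaluates them on small lists of critic outputs. It covers only the unmodified relativistic average formulation, not the authors' modified loss. The choices $f_1(x) = g_2(x) = -\log \mathrm{sigmoid}(x)$ and $f_2(x) = g_1(x) = -\log(1 - \mathrm{sigmoid}(x))$ are the usual non-saturating ones and are an assumption here, as are all function names.

```python
import math

# Sketch of the generic relativistic average GAN losses on plain Python lists.
# Assumed (not taken from the paper): f1 = g2 = -log(sigmoid(x)) and
# f2 = g1 = -log(1 - sigmoid(x)); all names below are illustrative.

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def f1(x):
    return neg_log_sigmoid(x)

def f2(x):
    # -log(1 - sigmoid(x)) equals -log(sigmoid(-x)), i.e. f2(x) = f1(-x).
    return neg_log_sigmoid(-x)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def ra_discriminator_loss(c_real, c_fake):
    """Each real output is compared with the mean fake output and vice versa."""
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    return (mean(f1(c - mean_fake) for c in c_real)
            + mean(f2(c - mean_real) for c in c_fake))

def ra_generator_loss(c_real, c_fake):
    """Same structure with g1 = f2 and g2 = f1 (roles of real and fake swapped)."""
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    return (mean(f2(c - mean_fake) for c in c_real)
            + mean(f1(c - mean_real) for c in c_fake))

# Hypothetical critic outputs: the discriminator currently separates the sets well.
critic_real = [2.0, 3.0]
critic_fake = [-2.0, -1.0]
print(ra_discriminator_loss(critic_real, critic_fake))
print(ra_generator_loss(critic_real, critic_fake))
```

With these assumed choices, the discriminator loss is small when real outputs exceed the average fake output, while the generator loss is large in the same situation, which is the relativistic behavior described in the text.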

3) TilGAN TRAINING AND TESTING PROCEDURE
The TilGAN architecture was trained on 267 WSIs and tested on 89 WSIs. For the training of the TilGAN model, we did not perform data augmentation because it would generate additional noise with poor-quality images. Hence, our suggestion is to use as many real, high-quality, hand-labeled images as possible as the input of the generator. We set the TilGAN model batch size to 100 and the learning rates of $G_{NET}$ and $D_{NET}$ to 1e-4 and 1e-5, respectively. The initial weights were standardized to a mean of zero with a standard deviation of 0.02. We used the Adam optimizer with adaptive momentum. The values of β1 and β2 were set to 0.5 and 0.99, respectively. We used the TensorFlow framework for the development of TilGAN.
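The optimizer settings above can be illustrated with a textbook Adam update for a single scalar parameter. This is a sketch rather than the TensorFlow implementation used by the authors: the `AdamConfig` class, the `adam_step` function, and the `epsilon` value of 1e-8 are assumptions introduced here, while the learning rates and β values come from the text.

```python
import math
from dataclasses import dataclass

# Textbook Adam update on one scalar parameter, using the hyperparameters
# quoted in the text. `epsilon` is an assumed value; the paper does not give it.

@dataclass
class AdamConfig:
    learning_rate: float
    beta1: float = 0.5
    beta2: float = 0.99
    epsilon: float = 1e-8

def adam_step(param, grad, m, v, t, cfg):
    """Apply one Adam update at step t (t starts at 1); returns (param, m, v)."""
    m = cfg.beta1 * m + (1 - cfg.beta1) * grad          # first-moment estimate
    v = cfg.beta2 * v + (1 - cfg.beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - cfg.beta1 ** t)                    # bias correction
    v_hat = v / (1 - cfg.beta2 ** t)
    param = param - cfg.learning_rate * m_hat / (math.sqrt(v_hat) + cfg.epsilon)
    return param, m, v

generator_cfg = AdamConfig(learning_rate=1e-4)
discriminator_cfg = AdamConfig(learning_rate=1e-5)

# One step from zero with the same gradient for both networks.
g_param, _, _ = adam_step(0.0, 2.0, 0.0, 0.0, 1, generator_cfg)
d_param, _, _ = adam_step(0.0, 2.0, 0.0, 0.0, 1, discriminator_cfg)
print(g_param, d_param)
```

On the first step the bias-corrected update has magnitude close to the learning rate, so the generator parameter moves roughly ten times further than the discriminator parameter for the same gradient.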

C. CLASSIFICATION ARCHITECTURE DETAILS
Classification was performed to verify whether our synthetic images are effective for discriminating real TIL and non-TIL patches. Our classification architecture, developed using Keras with the TensorFlow backend, was designed using six convolution layers, ReLU, two max-pooling layers, four dense layers, one flatten layer, one dropout layer, and one batch normalization layer. The details of our classification architecture are depicted in table 3. Figure 5 shows the classification model workflow. For the classification, we used one million high-quality synthetic image patches of size 224 × 224 pixels, which were generated by TilGAN. We added only 36 of the 356 WSIs to the TilGAN-generated images for better classification performance. The rest of the WSIs were split into testing and validation sets to evaluate the classification model. For our classification algorithm, we used a sigmoid classifier and an Adam optimizer. The training parameters were as follows: a learning rate of 0.0001, 50 epochs, a dropout ratio of 0.5, and binary cross-entropy as the loss function. We used a rectified linear unit (ReLU) after each convolution layer.

1) EVALUATION METRICS FOR TilGAN-GENERATED IMAGES
For the quantitative evaluation of the images generated by our proposed TilGAN model, we used the Inception score (IS) [15], kernel Inception distance (KID) [62], and Fréchet Inception distance (FID) [63]. All the scores were calculated using a pretrained Inception-v3 network [59], [64]. We calculated the IS as follows [59], [65]:

$$IS = \exp\left(\mathbb{E}_{y \sim p_{image}}\left[KL\left(p(x \mid y) \,\|\, p(x)\right)\right]\right) \tag{22}$$

The marginal class distribution can be evaluated as:

$$p(x) = \int_{y} p(x \mid y)\, p_{image}(y) \tag{23}$$

Here, $y \sim p_{image}$ means that y is an image drawn from $p_{image}$, $p(x \mid y)$ represents the conditional class distribution, and KL denotes the Kullback-Leibler divergence.
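The Inception score formula can be evaluated directly once the class probabilities $p(x \mid y)$ are available for each image. The sketch below assumes those probabilities are already given as rows of a list (in practice they would come from the pretrained Inception-v3 network); the function name and the toy inputs are illustrative only.

```python
import math

# Inception score from per-image class probabilities p(x|y).
# Each row is assumed to be a probability vector that sums to 1.

def inception_score(class_probs):
    n_images = len(class_probs)
    n_classes = len(class_probs[0])
    # Marginal class distribution p(x): average of p(x|y) over the image set.
    marginal = [sum(row[j] for row in class_probs) / n_images
                for j in range(n_classes)]
    # Mean KL divergence between each p(x|y) and the marginal p(x).
    total_kl = 0.0
    for row in class_probs:
        total_kl += sum(p * math.log(p / q)
                        for p, q in zip(row, marginal) if p > 0)
    return math.exp(total_kl / n_images)

# Toy inputs: uninformative predictions versus confident, balanced predictions.
uniform_preds = [[0.5, 0.5], [0.5, 0.5]]
confident_preds = [[1.0, 0.0], [0.0, 1.0]]
print(inception_score(uniform_preds), inception_score(confident_preds))
```

The score is lowest (1) when every image yields the same class distribution and grows towards the number of classes when predictions are both confident and spread evenly across classes.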

2) EVALUATION METRICS FOR THE CLASSIFICATION MODEL
The classification performance was measured by classification accuracy, precision, recall, and F1-score [8], [66], [67]. We also computed the confusion matrix and the area under the receiver operating characteristic curve.
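These metrics follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and chosen only to exercise the function, and treating TIL as the positive class is an assumption.

```python
# Standard binary classification metrics from confusion-matrix counts.
# tp/fp/fn/tn are hypothetical example counts, not results from the paper.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```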

IV. RESULTS
A. RESULTS OF TilGAN
We evaluated the quality of our proposed TilGAN-generated fake images through a clinical evaluation by our experts. They independently classified each image as real or fake from sets of almost 1000 images. Subsets of the real and TilGAN-generated fake images are shown in figures 6 and 7, respectively. Over 96% of the TilGAN-generated fake images were classified as real images, and all the real images were classified as real. Less than 4% of the TilGAN-generated fake images were classified as fake. This experiment shows that even for an expert, it is difficult to distinguish TilGAN-generated fake images within a mixture of fake and real data. This finding indicates that the TilGAN architecture generates high-quality images and maintains the proper stain color with a significant amount of tissue content based on the tumor stage.
Moreover, we also evaluated the quality and diversity of the TilGAN-generated fake images by the most popular quantitative evaluation metrics for GANs, i.e., the Inception score, Fréchet Inception distance, and kernel Inception distance. The Inception score was used to evaluate the quality and diversity of the fake images. A high Inception score indicates that the generated fake image contains high-density and clear objects for all classes. However, this scoring technique has a few disadvantages. One of the main disadvantages is that it does not use the statistics of real data. To overcome this issue, we used the Fréchet Inception distance and kernel Inception distance. The Fréchet Inception distance has been used to calculate the distance between Inception feature vectors for fake and real images. The value of the Fréchet Inception distance changes with the image diversity, as it is robust to noise. If a dataset contains many diverse images, the Fréchet Inception distance will be low or closer to zero. On the other hand, if the image diversity between the real and synthetic images decreases, the Fréchet Inception distance will be high. The kernel Inception distance has been used to evaluate the similarity between real and fake images [62]. If the kernel Inception distance is low, the real and fake images are very similar to each other, or it is very hard to distinguish them from a mixture of real and fake images.
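The two distance metrics described above can be sketched in plain Python on small feature vectors. Two simplifications are assumed here and are not taken from the paper: the Fréchet distance is computed for Gaussians with diagonal covariance (the full metric uses a matrix square root of the covariance product), and the kernel Inception distance uses the cubic polynomial kernel that is commonly paired with it. The feature vectors are toy values standing in for Inception-v3 pooling features.

```python
import math

# Simplified sketches of FID and KID on toy feature vectors.
# Assumptions: diagonal covariances for the Frechet distance and the
# cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 for KID.

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariance."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

def poly_kernel(x, y):
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3

def kid_unbiased(real, fake):
    """Unbiased estimate of the squared maximum mean discrepancy."""
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j)
    k_rf = sum(poly_kernel(x, y) for x in real for y in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)

# Toy features: the "fake" set is deliberately shifted away from the "real" set.
real_feats = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
fake_feats = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
print(kid_unbiased(real_feats, fake_feats))
```

Both quantities shrink towards zero as the two feature sets become statistically similar; the unbiased KID estimate can even be slightly negative for samples drawn from the same distribution.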
The results of the Inception score, Fréchet Inception distance, and kernel Inception distance are shown in table 4. We calculated the Inception score on both real and TilGAN-generated fake images because it only uses one kind of image at a time. We achieved an Inception score of 2.32±0.02 (mean ± standard deviation) for the real images and an Inception score of 2.90±0.04 (mean ± standard deviation) for the fake images. This finding indicates that the TilGAN-generated fake images contain high-density tissues and clear objects and are more diverse. For the Fréchet Inception distance and kernel Inception distance measurements, we used the outputs of the last hidden layer, i.e., the pooling layer, of the same pretrained Inception-v3 network [64]. The Fréchet Inception distance is 0.312, which is very close to zero, and the kernel Inception distance is 1.44±0.025. These two values are lower than the Inception scores of the real and fake data. These results indicate that TilGAN generates a more diverse and high-quality dataset that is very similar to the real images. The real and fake data distribution is shown using the t-distributed stochastic neighbor embedding (t-SNE) plot in figure 8. This plot gives a good understanding of the visual and color similarities of the generated synthetic images with the real images. TilGAN generates a wide variety of fake non-TIL patches, which also include some white patches. Hence, some green dots lie away from the dense population.

B. TIL AND NON-TIL IMAGE CLASSIFICATION RESULTS
In the previous sections, we have shown the results of the qualitative and quantitative analysis of TilGAN. In this section, we show the performance of our classification model, where 90% (i.e., one million) TilGAN-generated images and 10% (i.e., 36 WSIs) real hand-labeled images were used for model training to distinguish real TIL and non-TIL patches. Testing and validation of our trained model were performed using only real images with zero overlap. The classification model was run for up to 50 epochs, and the model started converging after 41 epochs. The training and validation losses of our classification model are shown in figure 12. Subsets of the TilGAN-generated synthetic TIL images and non-TIL images are shown in figures 9 and 10, respectively. Figure 15 shows an accuracy plot of the real, fake, and combined outcomes. In this plot, when only fake images were used, we observed unstable behavior in the model's performance from epochs 16 to 21. However, this behavior is normal for any classification model. This finding indicates that our proposed model is learning and that the learning performance varies with the quality of the batch images. We also noticed that after 36 epochs, the accuracy plots of the real, fake, and combined outcomes converged. However, their accuracy levels are different.
The accuracy for the real images is comparatively lower than that of the proposed model (where 90% fake and 10% real images were used). The main reason for this behavior is that we only used a minimal number of real hand-labeled data (i.e., 10%) for model training.
Significant changes in the accuracies were not observed when only fake and combined (90% fake and 10% real) images were used for model training.
The training scheme was repeated ten times with different data as per the Monte Carlo cross-validation criteria. Each time, the training and testing dataset was split randomly, but the same principle applies. We achieved an average classification accuracy of 97.83%, an F1-score of 97.37%, a precision of 98.34%, and a recall of 96.49%. Table 5 shows the 10-fold cross-validation results. We also computed the confusion matrix on 18,400 image patches (8750 TIL and 9650 non-TIL patches) of size 224 × 224 pixels. The confusion matrix is shown in figure 16. Figure 17 shows a receiver operating characteristic curve, which has an area under the curve of 97%. From the above results, we can say that our classification model accurately classifies the real TIL and non-TIL patches.
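A repeated random-split (Monte Carlo) evaluation of the kind described above can be sketched as follows. The split fraction, the seed, and the `evaluate` callback are hypothetical placeholders; the text specifies only that the split was redrawn randomly for each of the ten repetitions and that the scores were averaged.

```python
import random
import statistics

# Monte Carlo cross-validation sketch: redraw a random train/test split on
# every repetition and average the resulting scores. The test fraction, seed,
# and evaluate() callback are assumptions of this sketch.

def monte_carlo_splits(items, n_repeats=10, test_fraction=0.5, seed=0):
    rng = random.Random(seed)
    n_test = round(len(items) * test_fraction)
    for _ in range(n_repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]   # (train, test)

def repeated_evaluation(items, evaluate, **split_kwargs):
    """Average evaluate(train, test) over all random splits."""
    scores = [evaluate(train, test)
              for train, test in monte_carlo_splits(items, **split_kwargs)]
    return statistics.mean(scores)

# Toy usage with slide indices and a dummy scoring function.
slide_ids = list(range(20))
splits = list(monte_carlo_splits(slide_ids, n_repeats=10, test_fraction=0.25, seed=1))
dummy_score = repeated_evaluation(
    slide_ids, lambda train, test: len(test) / (len(train) + len(test)),
    n_repeats=10, test_fraction=0.25, seed=1)
print(len(splits), dummy_score)
```

Unlike k-fold cross-validation, the test sets drawn this way may overlap between repetitions, since each split is sampled independently.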
The classification results on the whole-slide pathology images are shown in figures 13 and 14 using a heat map. For heat map generation, the whole-slide images were first tiled into tiles of 224 × 224 pixels. Next, the probability score was calculated for each tile using our trained classification model. The probability score determines the probability of having TILs or non-TILs in a specific tile. In our experiment, 0 indicates a TIL tile, and 1 indicates a non-TIL tile. When the probability score was close to 0, the tile was considered a TIL tile, and when the probability score was close to 1, the tile was considered a non-TIL tile. In figure 13, red and blue regions from the heat map were separately highlighted and matched with the original WSI. The red and blue regions represent the non-TIL and TIL regions, respectively, of the original WSI. In figure 14, we show the heat map representation of the classification scores for two other whole-slide images.
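The tiling and scoring steps behind the heat map can be sketched as below. The 0.5 decision threshold, the handling of partial tiles at the image border (they are skipped), and the `score_fn` stand-in for the trained classifier are all assumptions of this sketch; the text states only that scores near 0 indicate TIL tiles and scores near 1 indicate non-TIL tiles.

```python
# Heat-map sketch: tile an image grid into 224x224 tiles, score each tile,
# and label it. score_fn stands in for the trained classifier; the 0.5
# threshold and the skipping of partial border tiles are assumptions.

def tile_origins(width, height, tile=224):
    """Top-left corners of non-overlapping tiles that fit inside the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]

def label_tile(score, threshold=0.5):
    """Scores near 0 mean TIL, scores near 1 mean non-TIL."""
    return "TIL" if score < threshold else "non-TIL"

def heat_map(width, height, score_fn, tile=224):
    """Map each tile origin to its probability score."""
    return {origin: score_fn(origin) for origin in tile_origins(width, height, tile)}

# Toy usage: a dummy scorer that marks the left column of tiles as TIL.
scores = heat_map(448, 448, lambda origin: 0.1 if origin[0] == 0 else 0.9)
labels = {origin: label_tile(s) for origin, s in scores.items()}
print(labels)
```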

V. CONCLUSION
In this study, we proposed the TilGAN model for improving the quality of synthetic pathology images. Our proposed architecture differs from existing GANs mainly in its architecture and loss functions. TilGAN does not contain an attention layer, similar to the Pathology GAN architecture [49]. In the TilGAN model, the numbers of layers in the generator and discriminator are different. Because of these properties, we can maintain the quality and quantity of specific types of target objects; in our case, these are the TILs in each tile. The generated TIL patches are mostly covered by lymphocytes rather than stroma or other artifacts. Similarly, non-TIL patches are mostly covered by stroma and other artifacts rather than lymphocytes. This behavior is a good sign for our architecture, as it shows that we are not using up our resources on generating low-quality images. Another interesting point is our loss functions, which maintain the TIL morphology, texture, and color features of real images in the synthetic images. Hence, image normalization, enrichment, and translation are not essential, unlike in the approach proposed by [46]. We verified the quality of our synthetic images by the Inception score, kernel Inception distance, and Fréchet Inception distance metrics, which showed promising results. We plotted the t-SNE graph using real and synthetic images, which showed a strong correlation between the real and synthetic images. Next, the generated synthetic images were manually verified by experts. They faced difficulties in distinguishing the real and synthetic images within the mixture of real and synthetic images. The use of one million synthetic images for training the classification model was an additional evaluation measure for the TilGAN model. Here, we showed that a model trained on the TilGAN-generated images can efficiently classify real TIL and non-TIL patches with improved accuracy.
Through the various image verification methods, we demonstrated the usefulness and effectiveness of our proposed TilGAN architecture.
Therefore, we can say that our approach performs better in generating TIL and non-TIL images than other methods. In the future, this architecture can be used to generate radiology and other non-clinical data.