Improving Generative Adversarial Networks for Patch-Based Unpaired Image-to-Image Translation

Deep learning models for image segmentation achieve high-quality results but need large amounts of training data. Training data is primarily annotated manually, which is time-consuming and often not feasible for large-scale 2D and 3D images. Manual annotation can be reduced using synthetic training data generated by generative adversarial networks that perform unpaired image-to-image translation. As of now, large images need to be processed patch-wise during inference, resulting in local artifacts in border regions after merging the individual patches. To reduce these artifacts, we propose a new method that integrates overlapping patches into the training process. We incorporated our method into CycleGAN and tested it on our new 2D tiling strategy benchmark dataset. The results show that the artifacts are reduced by 85% compared to state-of-the-art weighted tiling. While our method increases training time, inference time decreases. Additionally, we demonstrate transferability to real-world 3D biological image data, yielding a high-quality synthetic dataset. Increasing the quality of synthetic training datasets can reduce manual annotation, increase the quality of model output, and help develop and evaluate deep learning models.

In contrast to these methods, synthetic training data can be used. In the past, synthetic training data was created mainly by physical simulation [15], [16]. A framework to create realistic synthetic bright-field microscopy images has been developed to omit manual labeling [16]. The framework was part of the development of a new generation of a cervical cancer screening system. An easy-to-use, modern, and modular web interface was developed to simulate various fluorescence microscopy systems in [15]. It reduces the installation and configuration barrier of existing tools. The downside of physical simulation is the expert domain knowledge needed to create high-quality results. Additionally, quality is reduced by unknown physical processes and approximations. On the other hand, simulation provides explainability and interpretability if needed.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.
Synthetic training data can also be created with a small amount of domain knowledge and neural networks (NNs) that perform unpaired (unsupervised) image-to-image translation. The neural networks learn to transform images x from source domain X to images y in the target domain Y. The transformation is often learned in both directions. When synthetic label images are used for one domain and the real-world images are used for the other domain, the NN learns to transfer between both domains. After training, paired synthetic training data can be synthesized from the synthetic label images.
[Figure caption, partial] This is shown in (d) with four patches used for the prediction. The real-world image is a crop from the BBBC039v1 dataset [33].
Unpaired image-to-image translation can be performed with Energy-Based Models (EBMs). An EBM parametrized by neural networks and trained by Markov Chain Monte Carlo (MCMC) sampling-based maximum likelihood estimation has been developed [17]. The problems of instability and the lack of diversity have been solved with coarse-to-fine image generation, which increases the image resolution by expanding the energy function throughout training. Later, an EBM with a multidimensional latent space and a pretrained autoencoder was introduced to further increase the quality of image translation [18].
In addition to EBMs, generative adversarial networks (GANs) like CycleGAN, UNIT, DRIT++ or others can perform unpaired image-to-image translation [19], [20], [21], [22], [23], [24], [25], [26]. Many GANs for unpaired image-to-image translation consist of one or more generators, transforming data between the domains, and one or more discriminators, evaluating the authenticity of the generated images. Furthermore, the cycle-consistency constraint introduced in CycleGAN enforces that an image translated from one domain to another and then back should closely resemble the original image, guiding the generators to learn meaningful mappings while reducing the need for paired training data. Cycle-consistency can also be enforced implicitly by a shared latent space, as used in UNIT. For a thorough description of the different architectures or an introduction to GANs, refer to [27].
While EBMs have mostly been applied to unpaired image-to-image tasks like the translation between cats and dogs or oranges and apples, GANs have already been used to create synthetic 2D and 3D training data from unpaired synthetic label images and real-world images [8], [28], [29], [30], [31], [32].
When researchers decide to use GANs to synthesize training data, they must deal with the large amount of VRAM required. When the available VRAM is too small for a large-scale 2D or 3D image, and the resolution cannot be reduced, training and inference must be performed patch-wise. For inference, different tiling strategies can be applied. A naive tiling strategy creates patches without overlap, and each patch is processed individually by the GAN. While the mapping for individual patches is correct, errors occur at patch boundaries in the final image. Objects present in multiple patches often exhibit a sharp transition in texture, lighting condition, and color pattern. These errors especially appear when there is no direct one-to-one mapping between images in the source domain and the target domain, but a one-to-many mapping. The one-to-many mapping exists due to low entropy in the input image domain and high entropy in the output image domain [34]. An example of the errors introduced when predicting microscopy images of cell nuclei is shown in Fig. 1. While the prediction without tiling yields virtually no errors, the patch-based prediction with tiling yields errors at the patch borders. Another well-known example is the edges-to-shoes setting, where a GAN is trained to create pictures of shoes solely from the edges of the shoe [19]. Low entropy edges of a shoe can match multiple drawings of a shoe, e.g., different colors, laces, or soles. When processing the edge image patch-wise, there is no guarantee that the GAN infers the same color, laces, and sole from the low entropy input domain to the high entropy output domain for all patches.
Advanced tiling strategies have been developed to reduce these errors without adding more domain knowledge to the synthetic label image domain. Bel et al. [35] use a CycleGAN to adapt histopathological image staining between centers. They introduced a tiling strategy to reduce the tiling artifacts of simple tiling. They made several adaptations to simple tiling: (i) Large overlapping tiles are processed. This increases the similarity of adjacent patches' mean and standard deviation during inference. Therefore, when using standard instance normalization, the output is also more similar. (ii) Overlapping patches are cropped after being processed by the GAN to reduce border effects introduced by padding and the difference in the receptive field for border pixels compared to pixels in the middle of the patch. (iii) The cropped patches still overlap, and the overlapping patches are stitched together with a weight map to ensure a smooth transition from one patch to the next. While the advanced tiling strategy has been shown to produce high-quality outputs and no changes in GAN training are needed, two drawbacks occur: (i) The large overlap used increases single-dimensional execution time nearly by a factor of four, while inference time scales exponentially with the number of input dimensions. For 3D data, this results in an increase of inference time by a factor of 64 compared to naive tiling. (ii) A one-to-one mapping and a reasonable output for an object present in two adjacent patches can still not be ensured. This can still lead to errors in the final image.
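The scaling in drawback (i) can be checked with a short calculation. The sketch below assumes that the per-axis cost factor (about four, as stated above) applies independently along every image axis; the function name is illustrative and not part of any referenced implementation.

```python
def inference_cost_factor(per_axis_factor: float, num_dims: int) -> float:
    """Relative inference cost versus naive tiling.

    Assumes the same overlap-induced cost factor along every axis,
    so the total factor is the per-axis factor to the power of the
    number of image dimensions.
    """
    return per_axis_factor ** num_dims


for dims in (1, 2, 3):
    print(f"{dims}D: x{inference_cost_factor(4, dims):g}")
```

For three dimensions this gives 4 * 4 * 4 = 64, the factor quoted for 3D data.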
On a single patch level, methods able to adjust the output for one-to-many mappings exist [23], [25]. These multimodal GANs map an image x to many different correct versions of an output image ŷ. This is, e.g., done by injecting random noise into the generator or drawing a random style code from a style encoding feature space. Although one can adjust the single patch output for multimodal GANs, consistency across patches cannot be guaranteed when processing an image with a tiling strategy. A straightforward strategy to use multimodal GANs to process large-scale images patch-wise would be to create multiple candidate patches until the next patch matches the previously created patches. For real-world images, this is not feasible because there are often no automated measures to decide whether the next patch matches the previous patches or not.
Neither existing multimodal GANs nor existing tiling strategies are able to completely avoid the errors introduced during patch-based inference. This shows the need for improvement. Because GANs are complex architectures and do not work out-of-the-box for different problem settings, the adoption of new architectures is slow. Therefore, instead of developing a new GAN architecture, we developed a new tiling strategy, which can be directly incorporated into GAN training to further decrease the errors introduced during patch-based inference.
Our tiling strategy enables the GAN to incorporate adjacent patch information into the prediction of the next patch. The tiling strategy allows the GAN to produce arbitrary-sized high-quality images while inference time is reduced compared to existing tiling strategies. Our contributions are as follows: (i) We introduce a tiling strategy benchmark dataset to quantitatively compare tiling strategies for GANs, (ii) we show and quantify errors of advanced tiling strategies, (iii) we introduce our new Stitching Aware Training and Inference (SATI) to reduce tiling errors and give quantitative results, and (iv) we apply our method to a real-world 3D biological dataset.
The tiling strategy benchmark dataset created to compare tiling strategies is introduced in Section II. Afterwards, we present our method and show how we incorporated it into the CycleGAN architecture in Section III. In Section IV, quantitative results on the benchmark dataset are shown. Furthermore, we applied our method to a real-world 3D microscopy dataset and present qualitative results. Finally, we discuss our work in Section V and summarize our findings together with an outlook for future work in Section VI.

II. TILING STRATEGY BENCHMARK DATASET
For real-world images, errors occurring during patch-based inference are manifold and vary depending on the images in both domains. Therefore, visual assessment and error quantification are often not possible. To enable both, the tiling strategy benchmark dataset is introduced.
We used a coloring task for the tiling strategy benchmark dataset. An example of the task is shown in Fig. 2a. Each white circle in domain X is colored in red, blue, or green in domain Y. Since no color information is present in X, the transformation X → Y is a one-to-many mapping. Different mappings are shown in Fig. 2b. There is no dependency between different circles in domain Y. This aligns with many image-to-image translation tasks. For example, the styles of two cars are not dependent when transferring labels to photos of street scenes.
[Figure caption] Stitching aware inference workflow. An image x too large to be processed by GAN_XY without tiling is predicted patch-wise. The image z is created to provide the GAN with context of the previous prediction ŷ_x1. This enables the GAN to predict the correct color (red) for the circle on the top left of x_2. Finally, ŷ_x1 and ŷ_z are merged to ŷ.
To create a more diverse dataset and ensure that many circles are at patch borders regardless of patch size, we scaled the problem to images with size 2048 px × 2048 px and placed 1000 circles with diameter 40 px on each image. Afterwards, grayscale Gaussian noise is added to domain X, and channel-wise Gaussian noise is added to domain Y. Images in domain X are then converted to RGB color space to match the generator input dimensions. All images are encoded with 8 bits per color channel. A total of 512 unpaired images are created for each domain. Exemplary images for each domain are shown in Fig. 2c.
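The image generation described above could be sketched as follows. This is an illustrative sketch, not the authors' generator: the noise standard deviation, the circle intensity of 255, the black background, and the fact that circles may overlap are all assumptions, and `make_benchmark_image` is a hypothetical name.

```python
import numpy as np


def make_benchmark_image(domain, size=2048, n_circles=1000, diameter=40,
                         noise_std=5.0, rng=None):
    """Create one benchmark image for domain "X" (white) or "Y" (colored)."""
    rng = np.random.default_rng() if rng is None else rng
    img = np.zeros((size, size, 3), dtype=np.float64)

    # Boolean stamp of a filled circle inside a diameter x diameter square.
    radius = diameter / 2
    yy, xx = np.mgrid[:diameter, :diameter]
    stamp = (yy - radius + 0.5) ** 2 + (xx - radius + 0.5) ** 2 <= radius ** 2

    for _ in range(n_circles):
        top, left = rng.integers(0, size - diameter + 1, size=2)
        if domain == "X":
            color = np.full(3, 255.0)            # white circle, no color info
        else:
            color = np.zeros(3)
            color[rng.integers(3)] = 255.0       # pure red, green, or blue
        region = img[top:top + diameter, left:left + diameter]
        region[stamp] = color                    # writes through to img

    if domain == "X":
        # Grayscale noise: one noise value shared by all three channels.
        noise = rng.normal(0.0, noise_std, size=(size, size, 1))
    else:
        # Channel-wise noise: independent values per color channel.
        noise = rng.normal(0.0, noise_std, size=(size, size, 3))

    return np.clip(np.rint(img + noise), 0, 255).astype(np.uint8)


# Small example; the paper's settings are the function defaults.
x_img = make_benchmark_image("X", size=128, n_circles=5,
                             rng=np.random.default_rng(0))
y_img = make_benchmark_image("Y", size=128, n_circles=5,
                             rng=np.random.default_rng(1))
```

Because the two domains are generated independently, the resulting image sets are unpaired.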
The simple output domain allows easy visual analysis and computational quantification of the errors introduced during patch-based inference. A circle is predicted correctly if it consists exclusively of one of the colors red, green, or blue. An erroneous circle is present when multiple colors are in the circle. Because Gaussian noise is present in the images and the images are encoded with 8 bits per color channel, we cannot use simple thresholding to detect the presence of a color. Instead, we check whether a connected area of more than 30 px² with color values above the threshold of 60 exists. This is done for each color channel individually. We chose these values to ensure no areas are selected due to the Gaussian noise while still being able to identify small mistakes.
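A minimal sketch of this check for a single circle crop is given below. The threshold of 60 and the area limit of 30 px² come from the text; the use of 4-connectivity and the function names are assumptions made for illustration.

```python
from collections import deque

import numpy as np


def has_large_component(mask, min_area=30):
    """True if mask has a 4-connected region of more than min_area pixels."""
    visited = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    for sy, sx in zip(*np.nonzero(mask)):
        if visited[sy, sx]:
            continue
        visited[sy, sx] = True
        queue, area = deque([(sy, sx)]), 0
        while queue:                              # breadth-first flood fill
            y, x = queue.popleft()
            area += 1
            if area > min_area:
                return True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < height and 0 <= nx < width
                        and mask[ny, nx] and not visited[ny, nx]):
                    visited[ny, nx] = True
                    queue.append((ny, nx))
    return False


def colors_present(circle_rgb, threshold=60, min_area=30):
    """Per-channel flags (R, G, B) for a sufficiently large colored area."""
    return [has_large_component(circle_rgb[..., c] > threshold, min_area)
            for c in range(3)]


def is_erroneous(circle_rgb, threshold=60, min_area=30):
    """A circle counts as erroneous when more than one color is detected."""
    return sum(colors_present(circle_rgb, threshold, min_area)) > 1


# Example: a crop that is red on the left half and blue on the right half.
crop = np.zeros((40, 40, 3), dtype=np.uint8)
crop[:, :20, 0] = 255
crop[:, 20:, 2] = 255
print(is_erroneous(crop))
```

Applying the test per connected area rather than per pixel keeps isolated noisy pixels from being counted as a second color.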

III. METHOD
Our new method integrates information from previous predictions into the training and inference process for data with a one-to-many mapping. With this information, GANs are able to infer accurate results for consecutive patches during patch-based prediction. We call this method Stitching Aware Training and Inference (SATI). In this section, we present SATI together with the adaptations we made. First, we introduce overlap sampling, domain encoding, and loss ramping. Finally, the inference stitching strategy optimized for our approach and the pixel overlap weighting are introduced. An implementation to create the benchmark dataset and conduct the experiments is available at https://github.com/MoritzBoe/patch_based_image_translation.git.

A. STITCHING AWARE TRAINING AND INFERENCE
When training a standard unpaired image-to-image translation GAN on a one-to-many dataset for X → Y, the GAN will reduce the problem to a one-to-one mapping.
After training, an image x will be matched to an image ŷ. A different image ŷ can only be acquired when retraining with modified network initialization or hyperparameters. Based on the premise that the GAN can learn the mapping, the output for a single image will always be a correct prediction from the target domain. When an image is processed patch-wise with a tiling strategy, each patch is still a correct prediction from the target domain. However, errors arise when an object is visible in two or more patches (see Fig. 2d).
We solve this problem by adding information about adjacent patches when single patches are processed during inference (see Fig. 3). Adding all adjacent patches to the input vastly increases input size and is not feasible. Instead, we process overlapping patches with areas already predicted from domain Y and new areas from domain X. In the example in Fig. 3, two patches are needed to process the entire input image x. Patches are created by tiling. The patch x_1 is processed by the GAN and ŷ_x1 is synthesized. Subsequently, the bottom part of ŷ_x1 is merged into the top of x_2. A new image z consisting of both domains is created. We call this new domain Z. The same network that transforms images from X to Y is used to transform Z to Y, and ŷ_z is created from z. Adding the already predicted areas from ŷ_x1 to x_2 enables the GAN to continue the prediction of the circle on the bottom left in ŷ_x1, which is on the top left of x_2, in the correct color (red). Finally, ŷ_z and ŷ_x1 are merged into ŷ.
In contrast to standard GAN inference, our inference workflow adds images consisting of both domains X and Y to the process. Therefore, standard GAN training has to be adapted to handle the domain transfer Z → Y. This transfer has two constraints: (i) Areas in Z which are already from Y need to stay constant, and (ii) areas in Z which are from X need to be transferred to Y with respect to the areas from Y present in Z.
To meet both constraints, we added the procedure depicted in Fig. 4 to the training. The GAN transfers an image x to ŷ_x. Afterwards, a merged image z is created, where border regions of x are replaced with parts of ŷ_x. The merged image z is processed again by the GAN to create ŷ_z. A GAN able to perform the translation of an image x to ŷ_x and an image z to ŷ_z can perform the inference workflow depicted in Fig. 3.
Two new loss functions, L_stitch and L^ZY_adv, are introduced to enable the transfer from z to ŷ_z and to meet the required constraints. The first loss function, L_stitch, ensures that the pixels from ŷ_x present in z stay constant after transfer to ŷ_z. We use the mean squared error as a loss function. To enable the transfer of pixels from domain X present in z to the image ŷ_z and therefore to domain Y, these areas are excluded from L_stitch. Subsequently, the stitching loss L_stitch is defined by (1), where M corresponds to the indices of all pixels from ŷ_x in z, and GAN_XY(z) corresponds to ŷ_z.
The adversarial loss L^ZY_adv is used to ensure the overall quality of ŷ_z. A discriminator D_Y trained to differentiate between real images from domain Y and synthetic images (ŷ_x and ŷ_z) is needed. Most GANs like CycleGAN, UNIT or DRIT++ have a discriminator D_Y. Otherwise, D_Y can be added to the architecture. L^ZY_adv is defined analogously to the standard adversarial loss, with ŷ_z as the synthetic input to D_Y.
Errors like the ones shown in Fig. 2c will be detected by the discriminator, and therefore, the generator is trained to avoid these errors.
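The two losses described above can be written out as follows. These expressions are a reconstruction from the verbal description only and should be read as assumptions: the normalization by |M| and the least-squares form of the adversarial term (chosen because the mean squared error is used for the discriminator loss in the experiments) are not confirmed by the text.

```latex
\begin{align}
\mathcal{L}_{\text{stitch}}
  &= \frac{1}{|M|} \sum_{i \in M}
     \bigl( \mathrm{GAN}_{XY}(z)_i - z_i \bigr)^2 ,
     \qquad z_i = \hat{y}_{x,i} \ \text{for } i \in M , \\
\mathcal{L}^{ZY}_{\text{adv}}
  &= \mathbb{E}_{z}\Bigl[ \bigl( D_Y\bigl(\mathrm{GAN}_{XY}(z)\bigr) - 1 \bigr)^2 \Bigr] .
\end{align}
```

Under this reading, the first term penalizes any change to pixels that were already predicted, and the second term rewards merged predictions that the discriminator accepts as real images from domain Y.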
In addition to Fig. 4, the pseudocode for a training step using SATI is depicted in Alg.
We made several adaptations to the stitching aware training and inference to increase performance.The adaptations are shown in the following paragraphs.
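The data flow of one stitching aware training step (x to ŷ_x, merge to z, z to ŷ_z, stitching loss) can be sketched with NumPy. This is an illustrative sketch under several assumptions: the generator is replaced by a hypothetical toy function, the border regions of x are replaced by the pixels of ŷ_x at the same locations, only the top and left borders are used, and the adversarial and cycle losses are omitted.

```python
import numpy as np


def border_mask(height, width, overlap):
    """Boolean mask M: True where z takes pixels from the prediction."""
    mask = np.zeros((height, width), dtype=bool)
    mask[:overlap, :] = True      # top border
    mask[:, :overlap] = True      # left border
    return mask


def merge(x, y_hat_x, mask):
    """Build z: copy x and overwrite the masked border with y_hat_x."""
    z = x.copy()
    z[mask] = y_hat_x[mask]
    return z


def stitch_loss(y_hat_z, z, mask):
    """Mean squared error restricted to the pixels in M."""
    return float(np.mean((y_hat_z[mask] - z[mask]) ** 2))


def toy_generator(img):
    """Hypothetical stand-in for GAN_XY; a real generator is a network."""
    return 0.5 * img + 0.25


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 0.0, size=(8, 8, 3))   # patch from domain X
y_hat_x = toy_generator(x)                   # first pass: x -> y_hat_x
mask = border_mask(8, 8, overlap=2)
z = merge(x, y_hat_x, mask)                  # mixed-domain image z
y_hat_z = toy_generator(z)                   # second pass: z -> y_hat_z
loss = stitch_loss(y_hat_z, z, mask)         # penalizes changes inside M
print(loss)
```

In an actual training loop, the loss would be computed on framework tensors so that gradients flow back into the generator.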

B. OVERLAP SAMPLING
For 2D images, inference starts with a patch from domain X (see Fig. 3). All remaining patches are from domain Z. When processing an image row by row, patches in the first row have only one adjacent patch already predicted, shown in Fig. 5 z_2. The first patch in each following row is depicted in Fig. 5 z_3. All other patches have two adjacent patches already predicted (Fig. 5 z_1). Image statistics for the mean and variance differ severely for the three cases. We use instance normalization without running mean and variance for our experiments. This combination will result in erroneous predictions when evaluating overlap combinations not used during training. Therefore, all overlap combinations are added to the training workflow to enable high-quality output for all three cases. While z_1 is utilized the most during inference, high-quality outputs for z_2 and z_3 are desired to avoid propagation of errors from the image borders to the center of the image. Therefore, the merger (see Fig. 4) creates each of these cases with the same probability.
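The three overlap cases and their uniform sampling could look like the sketch below. The case labels follow the reading used here (z_1: top and left already predicted, z_2: only left, z_3: only top), which assumes row-by-row, left-to-right processing; the function names are hypothetical.

```python
import numpy as np

CASES = ("z1", "z2", "z3")


def overlap_mask(case, height, width, overlap):
    """Mask of pixels taken from domain Y for one of the three cases."""
    if case not in CASES:
        raise ValueError(f"unknown case: {case}")
    mask = np.zeros((height, width), dtype=bool)
    if case in ("z1", "z3"):      # patch above is already predicted
        mask[:overlap, :] = True
    if case in ("z1", "z2"):      # patch to the left is already predicted
        mask[:, :overlap] = True
    return mask


def sample_case(rng):
    """Draw each overlap case with the same probability."""
    return CASES[rng.integers(len(CASES))]


rng = np.random.default_rng(0)
case = sample_case(rng)
print(case, int(overlap_mask(case, 8, 8, overlap=2).sum()))
```

Sampling uniformly over the cases means the rarely used border cases are seen as often during training as the common interior case.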

C. DOMAIN ENCODING
Convolutional neural networks work locally, and the receptive field is limited, especially in the early layers. When SATI is used, the GAN has to learn which parts of an image z are from domain X and which parts are from domain Y during training. This can be a challenging task for datasets where domain X and Y share significant local similarities. For our tiling strategy benchmark dataset, large local similarities are present in background areas, where no circles are placed.
Instead of adding an additional layer to the input patch which encodes the position in the image, we encode the origin domain directly into the image. We transfer the images from domain X into the range [−1, 0] and the images from domain Y into the range [0, 1]. As a result, the GAN can identify the domain according to the range of values and change or keep pixel values accordingly.
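For 8-bit images, the domain encoding could be implemented as in the sketch below. The linear scaling by 255 is an assumption; the text only specifies the target ranges.

```python
import numpy as np


def encode_x(img_u8):
    """Map an 8-bit image from domain X linearly into [-1, 0]."""
    return img_u8.astype(np.float32) / 255.0 - 1.0


def encode_y(img_u8):
    """Map an 8-bit image from domain Y linearly into [0, 1]."""
    return img_u8.astype(np.float32) / 255.0


pixels = np.array([0, 128, 255], dtype=np.uint8)
print(encode_x(pixels), encode_y(pixels))
```

With this scaling the two ranges touch only at the value 0, which corresponds to the brightest value in X and the darkest value in Y.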

D. LOSS RAMPING
Unpaired image-to-image transfer is a challenging task, and it is common for synthetic images to be of low quality during the first epochs. It is not useful to force the GAN to keep these low-quality parts of an image z in ŷ_z. Therefore, we increase the scaling of L_stitch from zero to the final scaling factor λ_stitch throughout the training.
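One possible ramping schedule is sketched below. The linear shape and the `ramp_epochs` parameter are assumptions; the text only states that the scaling grows from zero to λ_stitch during training.

```python
def stitch_weight(epoch, ramp_epochs, lambda_stitch=10.0):
    """Scaling of the stitching loss at a given epoch (linear ramp)."""
    if ramp_epochs <= 0:
        return lambda_stitch
    progress = min(1.0, max(0.0, epoch / ramp_epochs))
    return lambda_stitch * progress


for epoch in (0, 50, 100, 200):
    print(epoch, stitch_weight(epoch, ramp_epochs=100))
```

Early in training the stitching term then contributes little, so low-quality first-pass predictions are not locked in.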

E. STITCHING STRATEGY
As shown in Fig. 3, the final image consists of overlapping patches. The overlapping area can be selected from one of the patches. In Fig. 3, the overlapping area from the bottom patch is selected, while the overlapping area from the top patch is dismissed. Preliminary tests showed that using the complete overlapping area from one of the patches introduces errors for objects barely starting or ending in the adjacent patch. We prevent these errors by using the middle of the overlapping areas as the transition between patches in the final image.
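Along one axis, cutting in the middle of each overlap amounts to the index bookkeeping sketched below. The function name is hypothetical, and the sketch assumes the patches tile the axis exactly with a fixed stride.

```python
def kept_ranges(image_size, patch_size, overlap):
    """Output range [lo, hi) contributed by each patch along one axis.

    Neighboring patches share `overlap` pixels; the cut between them is
    placed in the middle of that shared region.
    """
    stride = patch_size - overlap
    if (image_size - patch_size) % stride != 0:
        raise ValueError("patches must tile the axis exactly in this sketch")
    starts = list(range(0, image_size - patch_size + 1, stride))
    ranges = []
    for i, start in enumerate(starts):
        lo = 0 if i == 0 else start + overlap // 2
        last = i == len(starts) - 1
        hi = image_size if last else starts[i + 1] + overlap // 2
        ranges.append((lo, hi))
    return ranges


print(kept_ranges(160, 64, 16))
```

For 2D or 3D images the same ranges can be computed per axis and combined.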

F. PIXEL OVERLAP WEIGHTING
With the stitching strategy, we cut patches in the middle of the overlapping area to create the final image. By doing so, we can allow the GAN to slightly change pixels at the border between domain X and Y when transforming an image z to ŷ_z. This can be beneficial if an object just starts at the end of a patch and the majority of the object is in the next patch. Having more information about an object in the next patch will allow the GAN to change the complete object accordingly. We enable this by weighting the pixels for L_stitch according to their location.
FIGURE 6. Pixels for L_stitch are weighted with respect to two superpixels. The pixel with the biggest distance to areas from domain X is weighted with one (top left, white). The pixel with the biggest distance to areas from domain Y is weighted with zero (bottom right, black). The two superpixels are used for linear weighting with the Euclidean distance for all other pixels (middle image). Subsequently, pixels from domain X are set to zero (right image).
An example is shown in Fig. 6. The final image on the right shows the utilized weight map. All pixels from domain X are weighted with zero. The farther away a pixel in the area from domain Y is from a pixel in domain X, the more the weight is increased. The weights are scaled linearly between zero and one. Therefore, the relative size of the overlap is included in the weighting. For bigger overlapping regions, the GAN is given more freedom to change pixels near the transition between both domains.
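A simplified weight map with the stated properties (zero for pixels from domain X, growing with the distance to domain X, scaled linearly up to one) is sketched below. It is not the two-superpixel interpolation of Fig. 6: here each weight is the Euclidean distance to the nearest pixel from domain X divided by the largest such distance, which is an assumption made to keep the example short.

```python
import numpy as np


def overlap_weight_map(mask_y):
    """Weights for the stitching loss from a boolean domain-Y mask.

    Brute-force distance computation, intended for small demo patches.
    """
    weights = np.zeros(mask_y.shape, dtype=np.float64)
    y_pixels = np.argwhere(mask_y)       # coordinates of domain-Y pixels
    x_pixels = np.argwhere(~mask_y)      # coordinates of domain-X pixels
    if len(y_pixels) == 0:
        return weights                   # nothing to constrain
    if len(x_pixels) == 0:
        weights[:] = 1.0                 # whole patch is already predicted
        return weights
    diff = y_pixels[:, None, :] - x_pixels[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    weights[mask_y] = dist / dist.max()  # linear scaling, maximum is one
    return weights


mask = np.zeros((6, 6), dtype=bool)
mask[:2, :] = True                       # top two rows come from domain Y
print(overlap_weight_map(mask))
```

For realistic patch sizes, a distance transform such as `scipy.ndimage.distance_transform_edt` would avoid the quadratic memory cost of the brute-force version.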

IV. EXPERIMENTS
[Figure caption, partial] The error rate is displayed in percent on a logarithmic scale. The black lines indicate the median for each method. Standard + no tiling is the performance achievable when the whole image can be transferred to the GPU during inference. While this is possible for the tiling strategy benchmark dataset, this is not possible for large-scale 2D and 3D data.
We incorporated the stitching aware training into CycleGAN, since CycleGAN and its variations are often used for biomedical data synthesis and in material science [8], [28], [29], [31], [32], [36]. For the generator architecture, we used the ResNet-Generator with instance normalization, 96 initial generator feature maps, and nine ResNet blocks in the feature space. For the discriminator architecture, we used the PatchGAN-Discriminator with instance normalization. We used the mean squared error (MSE) for the cycle-consistency loss (L_cycle), the identity loss (L_idt), and the discriminator loss (L_disc), which is also used to optimize the generators (L_adv). For the stitching loss L_stitch, we also used the MSE and applied our pixel overlap weighting afterwards. We set the scaling for the stitching loss to λ_stitch = 10. The other scaling factors are set according to the original implementation of CycleGAN [19] with λ_cycle = 10 and λ_idt = 5, and L_adv is not scaled. The overall loss is defined by a two-row sum, where the first row represents the standard CycleGAN loss and the second row is the additional loss added with SATI.
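A form of the overall loss that is consistent with this description is given below. It is a reconstruction and rests on assumptions: in particular, the text does not state whether the additional adversarial term is scaled, so it is written unscaled like the standard adversarial loss.

```latex
\begin{align}
\mathcal{L} ={}& \mathcal{L}_{\text{adv}}
   + \lambda_{\text{cycle}}\,\mathcal{L}_{\text{cycle}}
   + \lambda_{\text{idt}}\,\mathcal{L}_{\text{idt}} \notag \\
 &+ \mathcal{L}^{ZY}_{\text{adv}}
   + \lambda_{\text{stitch}}\,\mathcal{L}_{\text{stitch}}
\end{align}
```

With the values given above, this corresponds to λ_cycle = 10, λ_idt = 5, and λ_stitch = 10 on the benchmark dataset.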

A. TILING STRATEGY BENCHMARK DATASET
In our experiments on the tiling strategy benchmark dataset, we compared SATI to a simple tiling strategy (simple tiling), to the advanced tiling strategy (weighted tiling) used in [35] and [37], and to the results of processing the whole image at once (no tiling). Furthermore, we conducted an ablation study to quantify the benefit of the different adaptations we made to the core of SATI.

B. BENCHMARK METHOD COMPARISON
The results for the comparison to the benchmark methods are shown in Fig. 7 and Tab. [...] median error to 3.5%. Using SATI results in a median error of 0.5%. This is a reduction of 92% compared to Standard + simple tiling and a reduction of 85% compared to Standard + weighted tiling.
Exemplary images of errors occurring in the final images are shown in Fig. 8. For Standard + simple tiling, the patch borders are clearly visible. An example circle present in four patches is predicted in all three colors, depending on the patch. Using Standard + weighted tiling greatly improves the results, and due to the weighted overlap, it is no longer possible to identify individual patches. Instead of a sharp transition between colors, colors are blended. The blending results, e.g., in a purple circle being a mixture of red and blue. The only errors left using Standard + no tiling are errors not related to tiling. These errors are present in all inference strategies. They could possibly be reduced by longer training, more training data, adaptation of GAN parameters, or usage of a different GAN architecture. For our method, the errors also look similar to the remaining errors in Standard + no tiling.

1) ABLATION STUDY
For the ablation study, we deactivated the single adaptations we made to our base method and evaluated the performance.The results are shown in the bottom part of Fig. 7. Disabling any of the adaptations reduces the performance.
SATI w/o loss ramping has the least impact on performance; the median error is only increased by 2.7%. However, one run collapsed, and the mapping was not learned correctly, which is a common problem for unpaired image-to-image translation. Because of the limited sample size, we can only assume that disabling loss ramping increases the chance of the GAN collapsing.
Disabling pixel overlap weighting increases the median error by 34%. Pixel overlap weighting allows the GAN to adapt already predicted pixels in border areas to new parts of the images not predicted yet. For our tiling strategy benchmark dataset, this enables the GAN to change the color of objects which just started at the border between both domains.
The median error increases by 84.5% when using SATI w/o stitching strategy. Example images show that the GAN changes the color of circles with a tiny part in the overlapping area. This could be because a small part results in a small weight compared to the overall loss. Color changes can result in errors in the final image. Using the stitching strategy avoids this problem without needing a specialized loss.
Disabling domain encoding decreases performance. The median error increases by 59.0%. Therefore, domain encoding eases the learning task even when there is a clear difference between circles in domains X and Y and the background is not evaluated.
Finally, disabling overlap sampling and only learning to transfer images where the top and left regions are from domain Y results in the biggest drop in performance. Four models collapsed, resulting in an increase of the median error by 151.9%. An example image is shown in Fig. 9. The figure shows that the GAN cannot produce good results when only the top or left region of a patch is from domain Y. The input distribution differs considerably depending on whether the top and left are from domain Y or only the top or the left. The GAN fails to achieve high-quality output in combination with instance normalization layers. It is interesting to see that, nevertheless, the GAN can recover for regions where the top and left are from domain Y. Recovering from erroneous previous predictions is a key requirement for patch-wise inference of large-scale images. It can be concluded that a bad prediction will not reduce the quality of future patches.
The contours of some circles are distorted on all images in Fig. 8. This occurs due to the GAN making imperfect translations between both domains. Therefore, the contours of the circles can also be distorted when no tiling is applied (Fig. 8 (c), bottom left). In contrast, the contour information is available in both domains, and therefore, the GAN is able to infer contours throughout different patches. This can be seen in Fig. 8 (a). To reduce contour distortions, a spatial constraint can be added to CycleGAN [38].
VOLUME 11, 2023

C. REAL-WORLD DATASET
In addition to the fully synthetic tiling strategy benchmark dataset, we expanded SATI to 3D data and applied it to a real-world dataset of KP-4 cells, where nuclei are stained with Draq5. The goal of this evaluation is to verify the following hypotheses: 1) SATI can be expanded to 3D, and 2) SATI can be applied to complex real-world data and the GAN is still able to learn the mapping between both domains. The dataset consists of four images recorded with a Leica SP8 confocal microscope (Leica Microsystems, Wetzlar, Germany), with a voxel size of 568 nm × 568 nm × 1000 nm and a resolution of 8 bits. The images are cropped to remove areas without cells. The crops range from 380 px to 550 px in the XY-plane and 140 px to 190 px in the Z-direction. Afterwards, the crops are downscaled by a factor of 2 in the XY-plane. Thus, the Z-resolution is matched, and more objects are present in a single volume during training, easing the learning task. A crop of an XY-slice can be seen in Fig. 10 (a). Elaborate methods to create 3D nuclei for the synthetic label images exist [32], [37]. Both need a set of available annotations, which are hard to acquire for 3D data. On the other hand, it has been shown that ellipsoids are a good estimate for 3D cell nuclei [29]. Therefore, we created four synthetic label images by randomly placing ellipsoids. Each image has a size of 256 px × 256 px × 256 px. The background is set to a value of 10 and the foreground is set to 130. Afterwards, Gaussian noise with µ = 0 and σ = 3.33 is applied. Finally, we rounded the result to integers and clipped the result to the range [0, 255]. An exemplary crop of an XY-slice can be seen in Fig. 10 (b).
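The creation of a synthetic label volume could be sketched as follows. The intensity values, noise parameters, rounding, and clipping follow the text; the number of ellipsoids, their radii, and the uniform placement are assumptions, and `synthetic_label_volume` is a hypothetical name.

```python
import numpy as np


def synthetic_label_volume(shape=(256, 256, 256), n_ellipsoids=50,
                           radius_range=(8.0, 20.0), rng=None):
    """Random ellipsoids (130) on background (10) with Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    vol = np.full(shape, 10.0)
    zz, yy, xx = np.ogrid[:shape[0], :shape[1], :shape[2]]
    for _ in range(n_ellipsoids):
        # Assumed: centers uniform in the volume, axis-aligned radii.
        cz, cy, cx = (rng.uniform(0, s) for s in shape)
        rz, ry, rx = rng.uniform(*radius_range, size=3)
        inside = (((zz - cz) / rz) ** 2 + ((yy - cy) / ry) ** 2
                  + ((xx - cx) / rx) ** 2) <= 1.0
        vol[inside] = 130.0
    vol += rng.normal(0.0, 3.33, size=shape)     # mu = 0, sigma = 3.33
    return np.clip(np.rint(vol), 0, 255).astype(np.uint8)


# Small example volume; the paper uses 256 px per axis.
volume = synthetic_label_volume(shape=(32, 32, 32), n_ellipsoids=3,
                                radius_range=(4.0, 8.0),
                                rng=np.random.default_rng(0))
print(volume.shape, volume.dtype)
```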
We trained the network with SATI for a total of 1120 epochs with a batch size of 12 and 256 random crops of size 64 px × 64 px × 64 px per epoch. We set the scaling for the stitching loss to λ_stitch = 20 and the overlap to 16 px. The other scaling factors are the same as for the tiling strategy benchmark dataset. Furthermore, we use the MSE for all loss functions and start with a learning rate of 0.0002 for the Adam optimizer. The training took 31 hours on an NVIDIA A100.
A crop of an XY slice of the final generated image can be seen in Fig. 10 (e). For inference, we used patches of size 64 px × 64 px × 64 px with an overlap of 16 px. Therefore, 25 individual patches are shown in the crop. The generated crop shows that the GAN can match patches to their predecessors and no sharp transition inside a nucleus is visible at patch borders. In contrast, the crops shown in (c, d) were created with the same trained network, but SATI was not applied during inference. The borders of different patches are clearly visible in (c). The visual appearance of Standard + weighted tiling in (d) is comparable to (e). However, it must be noted that a pixel in (d) is the weighted sum of up to eight individual predictions for 3D data. The results on the real-world dataset show that SATI can be expanded to 3D and the GAN is still able to learn the mapping between both domains.
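The patch counts mentioned above can be reproduced with a short calculation. The extent of 256 px used in the example is a hypothetical value chosen for illustration, not a size stated in the text, and the bound on overlapping predictions assumes that the overlap is at most half the patch size.

```python
def patches_per_axis(extent, patch_size, overlap):
    """Number of overlapping patches needed to cover one axis."""
    if extent <= patch_size:
        return 1
    stride = patch_size - overlap
    # Ceiling division without floating point arithmetic.
    return -(-(extent - patch_size) // stride) + 1


def max_predictions_per_pixel(num_dims):
    """Upper bound on overlapping predictions at a single location.

    Assumes at most two patches overlap along each axis.
    """
    return 2 ** num_dims


n = patches_per_axis(256, patch_size=64, overlap=16)
print(n, n * n, max_predictions_per_pixel(3))
```

With five patches per axis, a 2D slice contains 25 patches, and in 3D a voxel in a corner overlap can receive up to eight predictions under weighted tiling.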
Although the GAN is trained on the real-world data, a domain gap regarding the brightness between the real-world data and the generated crops (c), (d) and (e) still exists. This is due to the spatial differences in the real-world images. The brightness of confocal microscopy images is lower towards the edges and for deep Z-slices. Standard GANs do not have information about spatial location during training and inference. If spatial consistency is needed, spatial information can be added to training and inference while still using SATI [32], [37].

V. DISCUSSION
Applying existing tiling strategies to the tiling strategy benchmark dataset shows a need for improvement. Weighted tiling improves the quality, and errors are visually less prominent. The transition between patches is not learned but improved in post-processing. The advantage of this is that the training process remains unchanged. However, due to the weighting, individual features are suppressed, and erroneous objects could be smoothed or result in a mixture of object types. In contrast, SATI allows the GAN to learn what a meaningful transition between adjacent patches looks like. This allows our method to prevent errors that occur directly in the prediction of adjacent patches and cannot be corrected by the other tiling strategies.
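The mixing effect of weighted tiling can be made concrete with a small example. The sketch below blends two conflicting predictions of the same overlap region with a linear weight ramp. The ramp and the function name `blend_overlap` are assumptions for illustration; the weighting scheme of the benchmarked weighted tiling may differ.

```python
def blend_overlap(left, right):
    """Linearly blend two predictions of one overlap region (lists of RGB tuples)."""
    n = len(left)
    blended = []
    for i, (a, b) in enumerate(zip(left, right)):
        w = (i + 0.5) / n  # weight of the right patch rises across the overlap
        blended.append(tuple(round((1 - w) * ca + w * cb) for ca, cb in zip(a, b)))
    return blended

red, blue = (255, 0, 0), (0, 0, 255)
# One patch predicts a red object, the adjacent patch predicts it in blue.
print(blend_overlap([red] * 4, [blue] * 4))
# Agreeing predictions pass through unchanged.
print(blend_overlap([red] * 4, [red] * 4))
```

When the two patches disagree, every blended pixel is a mixture of red and blue, which corresponds to a mixture of object types that neither patch predicted.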
SATI produces high-quality results on the tiling strategy benchmark dataset, comparable to the best-case results without tiling. The GAN learns the desired behavior with the training and inference strategies introduced. The additional complexity of the learning task is significantly reduced by our domain encoding adaptation, which is shown by the decrease in performance when domain encoding is disabled. We do not think that the remaining increase in complexity of the learning task is a problem for real-world datasets. Still, it must be evaluated individually for different learning tasks.
For our 3D real-world dataset of KP-4 cells, SATI was able to synthesize large-scale images without a visually notable transition between patches. Standard + weighted tiling also yielded visually appealing images. Therefore, Standard + weighted tiling can still be a valid option for grayscale data, although weighting up to eight individual patches for 3D data can potentially change the noise in the image.
SATI is designed to work for objects with a limited spatial extent and no relations between distant objects. This is the case for many biological, medical, or material science datasets. However, there are datasets where SATI is of limited use. For example, creating a high-quality image of a blue car with red or green exterior mirrors can still result in a red left mirror and a green right mirror. The spatial distance between the mirrors is too large for both to be present in the overlapping areas. Therefore, the GAN has no information on whether the first mirror was red or green when predicting the second one.
The inference time of Standard + weighted tiling increases by a factor of 4 for each dimension compared to Standard + no tiling. This is due to the large overlap [35]. The overlap is needed to guarantee a small change in image statistics between patches. SATI does not need a large overlap because it explicitly learns the transfer between patches. The overlap is only related to the spatial extent of information needed to predict the next patch correctly. In our experiments, the inference time is increased by a factor of 1.25 in each dimension compared to simple tiling. The complexity of the training task for the GAN is increased when SATI is used. This could potentially result in the need for increased network size. However, that was not the case for our experiments. Furthermore, using SATI during training results in increased VRAM utilization. The memory utilization on the benchmark dataset increased to 22.8 GB compared to 19.9 GB when training without SATI.
SATI results in high-quality images when used with the CycleGAN architecture. We aimed to design SATI to be integrable into different GAN architectures. This is necessary to ease usage and enable researchers to stick to their preferred architectures. A possible routine for working on new projects could be as follows: (i) Adapt a standard GAN architecture to a new problem setting. (ii) Evaluate whether the patch quality is sufficient. (iii) Incorporate SATI to bridge the gap between patch-wise prediction and large-scale image prediction. This will result in a minimal additional workload.

VI. CONCLUSION AND OUTLOOK
Deep learning models for image segmentation require labeled training data. Labeling large-scale 2D and 3D data is challenging and time-consuming, and the interobserver variability is high. Researchers try to reduce manual labeling by using GANs performing unpaired image-to-image translation to create synthetic training data. Using GANs trained for unpaired image-to-image translation to predict large-scale 2D and 3D images requires patch-wise inference due to VRAM limitations. As of now, the final images are typically created using a simple tiling strategy or weighted tiling.
Our experiments show that GANs suffer from tiling-related errors for one-to-many transformation tasks. These errors are most prominent when using Standard + simple tiling. Advanced methods like Standard + weighted tiling cannot completely remove these errors. With SATI, GANs produce high-quality output when inference is performed patch-wise. We achieved an error of 0.5% compared to 3.5% using Standard + weighted tiling on the tiling strategy benchmark dataset. Therefore, we reduced the error by 85% compared to the state of the art.
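The stated reduction follows directly from the two reported error values. In the worked calculation below, the symbols e_weighted and e_SATI are introduced here for the errors of Standard + weighted tiling and SATI; they are not notation from the original text.

```latex
\[
\frac{e_{\text{weighted}} - e_{\text{SATI}}}{e_{\text{weighted}}}
  = \frac{3.5\,\% - 0.5\,\%}{3.5\,\%}
  \approx 0.857
\]
```

This relative reduction of roughly 0.857 is consistent with the reported 85%.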
The ablation study shows that the individual adaptations made to SATI further increase the final output quality, and the GAN can recover from single erroneous predictions throughout the patch-wise inference. This allows the prediction of arbitrarily large images.
The results using SATI on a real-world dataset prove that our method can create high-quality synthetic 3D images with complex content.
SATI can be incorporated into different GAN architectures to create large-scale 2D and 3D images. We hope that this will lead to better synthetic datasets for real-world problems. While better synthetic data is always desirable, the implications for downstream tasks that use the synthesized data, e.g., for training segmentation networks, are left to future research. Whether large-scale 2D and 3D images are needed depends strongly on the data, the learning task, and the downstream method used.
Possible applications of SATI range from 3D microscopy to large-scale 2D data like whole slide images or aerial hyperspectral images. In the future, we want to apply SATI to a variety of real-world datasets and examine the influence of different tiling strategies not only on foreground objects, but also on image properties such as background noise.
The performance of SATI incorporated in different GAN architectures, especially multimodal architectures like DRIT or MUNIT that use content and style or attribute feature spaces, needs to be evaluated in future work. The main limitation of SATI is the increased complexity of the learning task, which could lead to longer training or the need for larger networks. Future research needs to focus on reducing this increase. In the future, we will adapt SATI to handle 3D+time data.

FIGURE 1. GANs trained with real-world images (a) and synthetic label images (b) are able to predict high-quality images (c) from (b). If the prediction is performed patch-based, errors occur at the patch borders. This is shown in (d) with four patches used for the prediction. The real-world image is a crop from the BBBC039v1 dataset [33].

FIGURE 2. a: Tiling strategy benchmark dataset with the problem setting of transferring images between domains X and Y. b: Visualization of the one-to-many translation. The image from domain X in (a) can lead to all images from domain Y in (b). c: An example image for each domain of the final tiling strategy benchmark dataset. The dataset consists of 512 unpaired images in each domain. Each image of size 2048 px × 2048 px contains 1000 circles with a diameter of 40 px. d: Simple tiling into four patches, piecewise inference, and stitching. The individual patch predictions are correct, while errors occur at the patch borders in the final image. These errors can be used to evaluate different tiling strategies.

FIGURE 3. Stitching-aware inference workflow. An image x too large to be processed by GAN_XY without tiling is predicted patch-wise. The image z is created to provide the GAN with context from the previous prediction ŷ_{x_1}. This enables the GAN to predict the correct color (red) for the circle on the top left of x_2. Finally, ŷ_{x_1} and ŷ_z are merged to ŷ.
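The workflow in Fig. 3 can be illustrated with a one-dimensional toy example. This is only a sketch of the idea, not the GAN itself: `toy_generator` is a hypothetical rule-based stand-in for the trained generator, the characters `.`, `o`, `R`, and `B` are an invented encoding for background, untranslated objects, and red or blue objects, and all function names are assumptions. The point shown is that a patch whose overlap already contains translated content can continue an object in the colour chosen by the previous patch.

```python
import random

def toy_generator(patch, rng):
    """Stand-in for the generator: colours each run of object cells.

    Cells that are already 'R' or 'B' act as context from domain Y.
    """
    out = list(patch)
    i = 0
    while i < len(out):
        if out[i] == ".":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] != ".":
            j += 1
        known = [c for c in out[i:j] if c in "RB"]
        colour = known[0] if known else rng.choice("RB")  # one-to-many choice
        out[i:j] = [colour] * (j - i)
        i = j
    return out

def simple_tiling(x, patch_size, rng):
    """Translate non-overlapping patches independently and concatenate them."""
    y = []
    for start in range(0, len(x), patch_size):
        y += toy_generator(x[start:start + patch_size], rng)
    return "".join(y)

def sati_inference(x, patch_size, overlap, rng):
    """Translate overlapping patches; the overlap carries the earlier prediction."""
    y = list(x)
    for start in range(0, len(x) - overlap, patch_size - overlap):
        z = y[start:start + patch_size]  # overlap from domain Y, rest from domain X
        y[start:start + patch_size] = toy_generator(z, rng)
    return "".join(y)

def runs_consistent(y):
    """True if every object consists of a single colour."""
    return all(len(set(run)) == 1 for run in y.split(".") if run)

x = "..ooo..ooooo...."  # the second object crosses the border at index 8
rng = random.Random(0)
print(simple_tiling(x, 8, rng))
print(sati_inference(x, 8, 2, rng))
```

In this toy setting, `sati_inference` always yields single-coloured objects, whereas `simple_tiling` can split the second object into two colours depending on the random choices.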

FIGURE 4. Training procedure added to the standard training. After transferring an image x to ŷ_x, both are merged to z and transferred again to ŷ_z. Afterwards, the adversarial loss and the stitching loss are calculated. The red and blue lines illustrate the information flow through the GAN. The patch size used for x during training can be adapted to the available VRAM.

Algorithm 1
A random image from each domain is required for a training step. The steps to calculate the standard CycleGAN losses for the generators, L_G^CycleGAN, and the discriminators, L_D^CycleGAN, are not shown for simplicity. The CycleGAN generator that transfers images from X to Y is denoted by Gen_XY, and the discriminator trained to distinguish between real and generated images from Y by D_Y. Finally, λ_stitch is a scaling factor to vary the influence of L_stitch on the training of the generator.
Pseudocode for a training step using SATI integrated into CycleGAN. The steps needed to calculate L_stitch and L_adv^ZY are shown. Calculations of standard CycleGAN losses are not included for simplicity. Comments are marked with #.
Require: x, y
ŷ_x ← Gen_XY(x)
# Generator training:

3 .
All patches not present in the first row or column have two adjacent patches and are shown in Fig. 5 (z_1). Starting with patches from the bottom right results in overlaps on the bottom side and the right side and therefore an equally complex training task. Starting in the middle results in more overlap combinations and should be avoided if not needed.

FIGURE 5. Different overlaps from domain Z created by the merger to train the GAN on all overlaps needed during inference. For 2D images, overlaps 2 (b) and 3 (c) are used for the first row and column of patches during inference, while 1 (a) is used for all other patches.

FIGURE 7. Our method (SATI) is benchmarked against Standard + no tiling, Standard + simple tiling, and Standard + weighted tiling. For benchmarking, we used models trained with standard CycleGAN (Standard + [tiling strategy]). Furthermore, the plot shows an ablation study, where we deactivated different adaptations (SATI w/o [adaptation]). The error rate is displayed in percent on a logarithmic scale. The black lines indicate the median for each method. Standard + no tiling is the performance achievable when the whole image can be transferred to the GPU during inference. While this is possible for the tiling strategy benchmark dataset, it is not possible for large-scale 2D and 3D data.

FIGURE 8. Exemplary errors present after application of different inference strategies compared to errors using our method SATI. Erroneous circles are marked with an arrow. For (c) and (d), few or no tiling-related errors are present. Therefore, we show examples of the remaining errors. In contrast to the tiling-related errors of (a) and (b), the errors in (c) and (d) can be further reduced with GANs better adapted to the task. The bottom left marked circle in (b) shows an error that is present because the GAN made a faulty translation from X to Y. This error is comparable to the errors in (c) and (d). The top right marked circle in (b) shows an error where a circle was predicted in blue in one tile and in red in the other, resulting in a purple circle. Due to the overlap, this type of error can only occur in Standard + weighted tiling. For each method, a complete image synthesized from the same image in domain X can be seen in the supplementary material.

FIGURE 9. SATI with disabled overlap sampling. The GAN only learned to predict images with the top and left regions from domain Y. It cannot produce good results for images with either the top or the left region from domain Y.

FIGURE 10. Each image shows an XY-slice of the corresponding 3D volume cropped to 256 px × 256 px. A crop of the real-world image is shown in (a). The images (c-e) are generated from the image shown in (b). The generation process for SATI is performed with 3D volumes of 64 px in each dimension and an overlap of 16 px. The final image in (e) consists of 25 individual patches. The brightness of (c-e) is higher than that of (a). This is because the brightness of the real-world images decreases at the borders of an XY-slice and for deep Z-slices. A standard CycleGAN cannot reproduce this behavior. Several approaches, compatible with SATI, can be used to remove the decrease from real-world images or enable CycleGAN to reproduce the behavior [32], [37], [39]. An example for Standard + simple tiling is shown in (c). The patch borders are clearly visible compared to (d) and (e).
For 3D images, the per-dimension factors of 4 for Standard + weighted tiling and 1.25 for SATI result in an increase in inference time by a factor of 64 for Standard + weighted tiling and 1.95 for SATI compared to Standard + simple tiling. Memory usage during inference is the same for all methods. Using SATI adds additional predictions and loss functions to the training procedure. This increases training time. Standard training finished after approximately 24 h. A training run for SATI took around 27:30 h. Although this is an increase of 15%, we did not evaluate the influence of adding SATI to the training procedure on the training needed for convergence, as it is still an open question how to determine when to stop training GANs. Therefore, we advise users incorporating SATI to use the same number of epochs they used without SATI. This led to good results for all our experiments.
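The reported factors can be checked with a short calculation. The sketch below assumes that the per-dimension overhead compounds multiplicatively over the image dimensions; the function name `inference_factor` and the variable `training_increase` are hypothetical.

```python
def inference_factor(per_dimension_factor, n_dims):
    """Total inference-time factor if the overhead compounds per dimension."""
    return per_dimension_factor ** n_dims

print(inference_factor(4, 3))     # 64 for Standard + weighted tiling in 3D
print(inference_factor(1.25, 3))  # 1.953125, reported as 1.95 for SATI in 3D

# Relative increase in training time: 27:30 h with SATI versus 24 h without.
training_increase = (27.5 / 24 - 1) * 100
print(round(training_increase, 1))  # 14.6, i.e. roughly 15%
```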

TABLE 1. Results of the tiling strategies and the ablation study on the tiling strategy benchmark dataset. The best case, Standard + no tiling, and our method (SATI) are highlighted. The values correspond to the error in percent.
The training of GANs is unstable and sometimes they do not converge, or mode collapse can occur [27]. To reduce the influence of these training runs, we opted to evaluate our experiments using the median instead of the mean. The benchmark methods Standard + simple tiling, Standard + weighted tiling, and Standard + no tiling only differ in the procedure used for inference, while the same trained networks are used. Processing complete images at once using Standard + no tiling reduces the median error to 0.4%. This represents the best-case scenario, which is not applicable to large-scale 2D and 3D images. Using Standard + simple tiling results in poor overall performance with a median error of 6.6%. Using Standard + weighted tiling reduces the