Joint Optimization of Class-Specific Training- and Test-Time Data Augmentation in Segmentation

This paper presents an effective and general data augmentation framework for medical image segmentation. We adopt a computationally efficient and data-efficient gradient-based meta-learning scheme to explicitly align the distribution of training and validation data which is used as a proxy for unseen test data. We improve the current data augmentation strategies with two core designs. First, we learn class-specific training-time data augmentation (TRA) effectively increasing the heterogeneity within the training subsets and tackling the class imbalance common in segmentation. Second, we jointly optimize TRA and test-time data augmentation (TEA), which are closely connected as both aim to align the training and test data distribution but were so far considered separately in previous works. We demonstrate the effectiveness of our method on four medical image segmentation tasks across different scenarios with two state-of-the-art segmentation models, DeepMedic and nnU-Net. Extensive experimentation shows that the proposed data augmentation framework can significantly and consistently improve the segmentation performance when compared to existing solutions. Code is publicly available at https://github.com/ZerojumpLine/JCSAugment.


I. INTRODUCTION
Data augmentation is a de facto technique in training neural networks and has been shown to improve model generalization [41]. It is essential for medical image segmentation algorithms to perform well on unseen test data. Depending on when it is performed, data augmentation can be divided into training-time data augmentation (TRA) and test-time data augmentation (TEA). TRA aims to increase the variation captured by the training dataset by adding perturbed samples, with the goal of capturing the unseen test data distribution. TEA robustifies the final prediction by averaging predictions over predefined, assumed non-causal variations of the test data, to which the model should be robust [42]. An alternative approach for TEA is to modify the test data to achieve higher accuracy with the pretrained model by transforming the test samples to match the distribution of the training data, i.e., the opposite direction of TRA. Here, we consider making TRA and TEA complement each other towards the goal of more accurate and robust predictions. As in [8], [9], [18], [20], [22], [29], [32], [40], current methods optimize the data distributions by using a validation set as a proxy for unseen test data. Our framework brings improvements by integrating two conceptually simple and intuitive ideas.
Common data augmentation strategies are usually designed based on heuristics and manually tuned configurations with respect to reducing validation error [18], [20]. However, strategies designed for one task may not be optimal for another task or dataset. Consequently, data augmentation without considering data and task characteristics may not always improve model performance. In particular, different medical image segmentation tasks may require different data augmentation settings, due to changes in image acquisition protocols, modalities, and anatomical structures of interest [18]. It is tedious to hand-engineer suitable augmentation strategies for each individual task. Therefore, methods have been proposed to automatically learn effective augmentations directly from the available training data [8], [22], [29], [32], [40], [43], [44].
However, we argue that two major limitations constrain the performance of current data augmentation strategies. First, previous studies [8], [22], [32], [40] mostly focus on either TRA or TEA separately, without considering their connections, despite the two being closely linked. This can lead to suboptimal results, since when the two are considered in isolation, TRA does not take into account that the test condition can be adapted through TEA. Second, most TRAs adopt the same transformations for all samples without considering the different properties of different classes. Specifically, the foreground samples in segmentation are more prone to overfitting than background samples because they are underrepresented due to class imbalance [30]. Current data augmentation strategies fail to model the heterogeneity of samples from different classes, and the resulting model performance may suffer from overfitting under class imbalance.
In this study, we aim to bridge the gap between TRA and TEA by presenting a gradient-based meta-learning framework to automatically discover optimal TRA and TEA strategies simultaneously. As illustrated in Fig. 1(a,b,c,d), data augmentation improves model generalization by aligning the training and underlying test data distributions. Our data augmentation framework (c.f. Fig. 1(e,f)) further takes class properties and the test condition into account, fundamentally restructuring the data distributions with the aim of increasing their overlap. We validate our method on medical image segmentation because of its imbalanced nature and clinical importance.
The contributions of this study can be summarized as follows: 1) We build a bridge between TRA and TEA through joint optimization of the data augmentation policies during training, which improves the alignment of the training and test sample distributions and yields better generalization. 2) We introduce a method that automatically finds different TRA policies for training samples from different classes, implicitly addressing the class imbalance problem. 3) We design a transformation set for TRA with 15 cascaded transformations and 47 operations in total, as well as a transformation set with 83 operations for TEA. These transformation sets cover most transformations used in medical image segmentation and can easily be extended and applied to other applications. 4) Extensive experiments on four datasets with two state-of-the-art segmentation models show that our method consistently improves segmentation performance in various applications and has the potential to replace the heuristically chosen augmentation policies used in most previous works.

II. RELATED WORK

A. Data augmentation model
The majority of data augmentation strategies consist of a set of transformations defined based on domain knowledge to represent the heterogeneity of the test data; examples include rotations, flipping, and intensity shifts [7], [24]. On the other hand, there are also heuristic perturbation techniques such as cutout [10] and mixup [47] that, even though they lead to unrealistic synthetic samples, have been empirically found to improve model generalization. More realistic transformations can be generated based on property matching [49] or generative adversarial networks [15]. Although these techniques have shown promising performance, designing data augmentation is difficult because it requires prior knowledge about the task at hand. Optimal strategies may differ significantly between tasks, datasets, and types of input modalities, and are thus difficult to hand-engineer [8], [18]. In this study, we aim to automate the process of designing data augmentation.
Currently, most TRA methods apply the same transformations to all training samples, with the exception of [30], which proposed to increase the variance of foreground samples by heuristically reducing the number of transformed samples for the background classes in order to alleviate class imbalance. However, they found that different hyper-parameters are optimal for different datasets, and the chosen transformations and hyper-parameters were based on heuristics. In contrast, our method automatically learns different transformations for different classes and discovers the rules from the training data by itself.

B. Learning based training-time data augmentation
There have been many attempts to optimize TRA along with the training process to obtain task-specific TRA policies. Most of these studies build on the idea of adversarial training [14]. Basic adversarial augmentation might not improve generalization on real data, as the constructed samples are not realistic. Recent methods attempt to improve real-data heterogeneity by adopting an advanced augmentation model [5] or restricting the search space [38], which requires strong prior knowledge. Differently, some methods generate artificial samples with task constraints [4], encouraging a generative model to produce additional well-classified images with class properties to enlarge the training data distribution. However, well-classified samples might not be very useful when the training data is sufficient, as they would not significantly change the learning of the decision boundary.
Our method is closely related to the line of research that optimizes TRA based on validation performance such that the learned model generalizes best. Those methods find, out of a pool of possible transformations, the sets of augmentation policies that are optimal for a specific training database, based on reinforcement learning [8], [44], meta-learning [29], [43], or density matching [32]. In our study, we learn the parameters of a probability distribution over TRA with a meta-learning scheme. The meta-learner parameters are optimized with the aim of enabling the task segmentation network to perform better on a validation set; in this way, the meta-learner is explicitly trained to select augmentations that improve generalization. Our method improves on existing solutions through the joint optimization of TRA and TEA as well as by learning a separate augmentation per class. In addition, the transformation pool defined in our work is more comprehensive than in previous studies on medical image segmentation, making it more practical to improve upon current heuristic solutions.

C. Learning based test-time data augmentation
In TEA, class-posterior probabilities from multiple predictions are averaged after applying predefined transformations to the test sample, which has been found effective in improving accuracy. Recently, methods have been proposed to learn TEA by choosing the transformations that obtain low loss values on the validation set with a pre-trained model [22], [40]. Test-time adaptation is another kind of learning-based TEA, where the pre-trained model is adapted to fit a single test sample based on a denoising autoencoder [21] or self-supervision [16]. These learning-based TEA strategies can be seen as post-processing of the segmentation and do not contribute to the learning of the model. In contrast, our method combines the optimization of TRA and TEA during training, which leads not only to learning optimal TRA and TEA transformations that complement each other, but also to learning optimal model parameters given the specific set of data transformations.

III. METHOD

A. Preliminaries
We consider the image segmentation problem with c classes in total. A training dataset D_T = {(x_i, y_i)}_{i=1}^N with N samples is given, where x_i is a training image and y_i is the corresponding segmentation label map with individual labels y_ip ∈ {1, ..., c} for each image pixel p. Assuming a segmenter f_θ parameterized by θ, our aim is to learn optimal parameters θ* such that f_θ*(·) minimizes the empirical risk over the training data. For any training loss function L_train, the empirical risk of the segmentation model f_θ is defined as

(1/N) Σ_{i=1}^N L_train(f_θ(x_i), y_i).

We additionally assume a validation dataset D_V = {(x̃_i, ỹ_i)}_{i=1}^M with M samples, along with a validation loss L_val, which is taken as a proxy for unseen test data and used to tune hyper-parameters including learning rates [28], the network architecture [52], and data augmentation policies [8]. Note that D_V could come from a different distribution than D_T, depending on the assumptions made about the unseen test data.

B. Sampling transformations
For a sample x_i (or x̃_i), we apply a transformation T_i(·) specific to the i-th sample. T_i is obtained by sampling from a set of K operations {O_1, ..., O_K} according to the corresponding probability distribution p = [p_1, ..., p_K]^⊺. In this study, a different operation O_j represents not only transformations of different types (for example, rotations, contrast enhancement, etc.) but also transformations of the same type with different magnitudes (for example, rotations of different degrees). We do not further optimize the predefined magnitudes of transformations during training. Our method optimizes the sampling distribution p of the different transformations during training, so that we learn which transformations are most appropriate for the given dataset and task.
In order to include the distribution in the gradient-based optimization despite the non-differentiable sampling process, we reparameterize the categorical distribution using the Gumbel-Softmax trick [19]. We calculate the probability of assigning sample x_i (or x̃_i) the operation O_j as

s_ij = exp((log p_j + g_ij)/τ) / Σ_{k=1}^K exp((log p_k + g_ik)/τ),   (1)

where τ is a temperature parameter and g_ij is a sample drawn from the Gumbel distribution, i.e., g_ij = -log(-log(ε)) with ε ~ Uniform(0, 1). It then holds that Σ_{j=1}^K s_ij = 1 and 0 ≤ s_ij ≤ 1, ∀j. In this way, the stochasticity of the sampling process is removed from the computational graph of the network's training, and choosing the augmentation T_i based on the probability distribution p becomes differentiable. Specifically, T_i is chosen as O_{j*} with j* = argmax_j(s_ij). The sampling probability p, which we would like to optimize, still cannot be updated via backpropagation, both because of the non-differentiable argmax and because the transformations themselves are non-differentiable in the general case. To work around this, we also calculate a weight w_i corresponding to sample x_i (or x̃_i):

w_i = s_{ij*} / s̄_{ij*},   (2)

where s̄_{ij*} denotes a copy of s_{ij*} that is detached from the computational graph, so that w_i is a function of the sampling probability s_ij. We then incorporate the weight into the empirical risk as (1/n) Σ_{i=1}^n w_i L_train(f_θ(T_i(x_i)), y_i). In this manner, w_i and s_ij are part of the total loss and can be straightforwardly optimized. During forward propagation, we use w_i to evaluate the chosen transformation T_i without affecting the training procedure, as w_i always equals 1. During backpropagation, a w_i associated with a relatively effective T_i tends to be increased. As we enforce the denominator in Eq. 2 to never require gradients, we can use w_i as a means to optimize s_ij, and thus p, with gradient descent.
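To make the trick concrete, here is a minimal PyTorch sketch of sampling an operation and forming the straight-through weight (our own illustration, not the authors' code; the function name `sample_augmentation` and the temperature `tau` are our choices):

```python
import torch

def sample_augmentation(log_p, tau=1.0):
    """Sample one of K operations and return (index j*, weight w_i)."""
    # Gumbel noise g = -log(-log(eps)), eps ~ Uniform(0, 1)
    eps = torch.rand_like(log_p)
    g = -torch.log(-torch.log(eps))
    # Relaxed scores s_ij: a softmax over the perturbed log-probabilities
    s = torch.softmax((log_p + g) / tau, dim=-1)
    j_star = int(torch.argmax(s))
    # Straight-through weight: the forward value is exactly 1, while the
    # gradient flows into s (and hence p) through the numerator only,
    # because the denominator is detached from the graph.
    w = s[j_star] / s[j_star].detach()
    return j_star, w

# Uniform initial distribution over K = 5 operations
log_p = torch.log(torch.full((5,), 0.2, dtype=torch.float64)).requires_grad_()
j, w = sample_augmentation(log_p)
print(j, float(w))  # the weight is always 1.0 in the forward pass
```

Multiplying the training loss of sample i by `w` leaves the loss value unchanged but makes the loss differentiable with respect to the sampling distribution, which is what the weighted empirical risk above exploits.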
To distinguish between TRA and TEA, in the following paragraphs we denote the probability distribution and transformation of TRA by p_T and T̃_i, and those of TEA by p_V and Ṽ_i, unless otherwise noted. Note that we consider the problem of image segmentation, so spatial transformations are always applied to y_i simultaneously; we omit this for simplicity.

C. Overview of the training process
We aim to reduce the generalization gap explicitly by optimizing a probability distribution over data augmentations, p_T (or p_V), based on the gradient from the validation data, ∇_{p_T} L_val (or ∇_{p_V} L_val). Thus, p_T (or p_V) is automatically adapted to the underlying task-specific characteristics.
We develop a training framework based on meta-learning via second-order optimization to accomplish this. The optimization process of the proposed method is illustrated in Fig. 2. In each iteration, we first obtain the transformed training data (①) and optimize the model to f_θ* (④) with a single optimization step; we then pass the transformed validation data (②) through f_θ* to compute the second-order gradients (⑤) and backpropagate to ①, learning a TRA that yields the θ* which generalizes best on validation data transformed with TEA; meanwhile, we apply varied TEA transformations to a single validation sample (③) and learn the TEA distribution p_V that transforms a validation sample x̃_i to have the lowest validation error (⑥).

D. Learning of class-specific training-time data augmentation
1) The design of predefined transformations: Following the design of data augmentation in many medical image segmentation frameworks [18], [20], we design the transformation set with L=15 cascaded operations, including rotation, mirroring, gamma correction, histogram transformations, blurring, sharpening, adding noise, and simulating low resolution. The operation magnitudes are decided by uniform sampling from predefined ranges. We summarize the detailed information about the operations in the supplementary material. Specifically, the probability distribution and transformations of TRA are extended as p_T = (p^1, ..., p^L) and T̃_i = {T̃_i^1, ..., T̃_i^L}. We ensure that our design of TRA covers the same functionality as the built-in data augmentation of prevailing frameworks such as DeepMedic [20] and nnU-Net [18], so that our method can act as a replacement for heuristic TRA. We initialize p_T with the heuristic policies provided by these frameworks, as shown in Fig. 3.

[Algorithm 1, whose listing is interleaved here in the original layout: for a number of steps (one step is sufficient in our experiments), calculate θ* with an optimization step via Eq. 6, optimize p_T based on the normalized meta-gradients, and optimize p_V based on the normalized gradient via Eq. 13 (learning of TEA).]

2) Class-specific policies: In practice, we determine the class of a training patch by the central pixel of the patch. Note that in this study we only regard the training samples as coming from 2 classes, foreground (tumor, lesion, and organs) and background.
3) Policy optimization with meta-gradients: Similar to previous works on learning TRA [8], [29], we aim to learn TRA based on the performance on validation data and formulate the optimization of TRA as a bi-level optimization problem:

min_{p_T} L_val(θ*),   (3)
s.t. θ* = argmin_θ L_train(θ, p_T).   (4)

We propose to solve this with gradient descent, following [12], [39]. We train the model with a training batch containing n samples and a validation batch containing m samples. For simplicity, in the following paragraphs we shorten (1/n) Σ_{i=1}^n w_i L_train(f_θ(T̃_i(x_i)), y_i) to L_train(θ, p_T) and (1/m) Σ_{i=1}^m L_val(f_θ*(x̃_i), ỹ_i) to L_val(θ*). Based on the chain rule, the gradient of the validation loss w.r.t. p_T is derived as

∇_{p_T} L_val(θ*) = (∂θ*/∂p_T)^⊺ ∇_{θ*} L_val(θ*),   (5)

where ∂θ*/∂p_T can be derived from the implicit function theorem [1]. However, this calculation would introduce a Hessian, which is impractical to compute for the parameters of a deep neural network, as the number of parameters is too large. Among the many methods that approximate the gradient without the Hessian calculation [12], [34], [39], in this study we choose to approximate θ* by a single training step [11], [34]. Specifically, we approximate the optimal θ* via a standard training step:

θ* ≈ θ - α ∇_θ L_train(θ, p_T).   (6)

Here, α is the step length, which we set equal to the learning rate of the task model. Eq. 6 defines the approximately optimal θ* when trained on the training data with sampled data augmentation T̃_i. In this manner, we can evaluate the effectiveness of the data augmentation policy p_T based on the performance of the updated model f_θ* on a held-out validation dataset. Differentiating this equation w.r.t. p_T on both sides yields

∂θ*/∂p_T ≈ -α ∇²_{p_T,θ} L_train(θ, p_T).   (7)

By substituting Eq. 7 into Eq. 5, we can now update p_T with

p_T ← p_T + αβ (∇²_{p_T,θ} L_train(θ, p_T))^⊺ ∇_{θ*} L_val(θ*),   (8)

which can be interpreted as a gradient of the gradient from the task-driven training. In the above, β is the learning rate for the probability distribution. In this way, we can optimize the distribution p_T explicitly, using the validation data, with the aim of improving the generalization of the segmentation model. Eq. 8 includes a second-order gradient. As the distribution p_T is represented with only a few parameters (K, which is on the order of 10-100), we find the complexity of the gradient computation to be O(|p_T||θ|), which is feasible and can be handled by prevailing toolboxes such as PyTorch and TensorFlow.
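On a toy model, the one-step second-order scheme of Eqs. 6-8 can be sketched as follows (our own illustration; the linear model, the quadratic loss, the fixed sampled operation, and all variable names are ours, and a real segmenter would take the model's place):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(3, requires_grad=True)   # task-model parameters
log_p = torch.zeros(2, requires_grad=True)   # augmentation policy logits (p_T)
x_train, y_train = torch.randn(8, 3), torch.randn(8)
x_val, y_val = torch.randn(4, 3), torch.randn(4)
alpha, beta = 0.1, 0.05                      # inner / outer step sizes

# Straight-through weight of Eq. 2; for brevity we fix the sampled
# operation to index 0 instead of Gumbel sampling.
s = torch.softmax(log_p, dim=0)
w = s[0] / s[0].detach()
train_loss = (w * (x_train @ theta - y_train) ** 2).mean()

# Eq. 6: one inner step, kept differentiable w.r.t. log_p (create_graph=True)
grad_theta = torch.autograd.grad(train_loss, theta, create_graph=True)[0]
theta_star = theta - alpha * grad_theta

# Eqs. 5, 7, 8: validation loss of the updated model, differentiated
# all the way back to the policy logits (a gradient of a gradient)
val_loss = ((x_val @ theta_star - y_val) ** 2).mean()
grad_p = torch.autograd.grad(val_loss, log_p)[0]
with torch.no_grad():
    log_p -= beta * grad_p                   # policy update
print(grad_p)
```

The only second-order object that is ever materialized has the size of `log_p` times the model, matching the O(|p_T||θ|) cost stated above.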
After updating p_T, we update f_θ to f_θ* to fit the updated TRA policy for the next iteration. We optimize p_T along with the training of the task model, so that at the end of training we may obtain higher performance than any model learned with a random or manually configured augmentation policy.
4) Gradient normalization: We normalize the gradients from the different classes of training samples, as we notice that the contributions of training samples from different classes to the reduction of the validation loss vary considerably. For example, the foreground samples are more effective at reducing the validation loss, resulting in increased probabilities of the policies associated with the foreground samples. Specifically, if we rewrite the gradient in Eq. 5 by the chain rule as

∇_{p_T} L_val(θ*) = Σ_{i=1}^n (∂w_i/∂p_T)^⊺ ∇_{w_i} L_val(θ*),   (9)

we find that the magnitude of ∇_{w_i} L_val(θ*) is significantly larger for the foreground samples than for the background samples. To resolve this, we replace ∇_{w_i} L_val(θ*) in Eq. 9 with the normalized gradient h_i:

h_i = ∇_{w_i} L_val(θ*) / Σ_{j=1}^n 1_{y_jc=y_ic} |∇_{w_j} L_val(θ*)|,   (10)

where y_ic is the central-pixel label of the segmentation label map y_i and 1_{y_jc=y_ic} ∈ {0, 1} is an indicator function which equals 1 if and only if y_jc = y_ic. Thus, the gradients are normalized per class. Another benefit of gradient normalization is the guarantee that the probability of a transformation that is not sampled in an iteration remains unchanged.
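The per-class normalization of Eq. 10 can be sketched in a few lines of numpy (our own illustration; `normalize_per_class` is our naming): each sample's gradient is divided by the total gradient magnitude of the samples that share its central-pixel class, so that foreground and background contribute comparably to the policy update.

```python
import numpy as np

def normalize_per_class(grads, classes):
    """grads: per-sample gradients dL_val/dw_i; classes: central-pixel classes y_ic."""
    grads, classes = np.asarray(grads, float), np.asarray(classes)
    h = np.empty_like(grads)
    for c in np.unique(classes):
        mask = classes == c
        # Eq. 10: divide by the summed |gradient| of the samples of class c
        h[mask] = grads[mask] / np.abs(grads[mask]).sum()
    return h

# Foreground gradients (class 1) are much larger than background (class 0)...
grads = [8.0, -4.0, 0.2, 0.1, 0.1]
classes = [1, 1, 0, 0, 0]
h = normalize_per_class(grads, classes)
# ...but after normalization each class has unit total magnitude.
print(h)
```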
5) Sampling normalization: We notice that the optimization process can also be biased towards transformations with high probability: the more frequently a transformation is sampled, the more its probability is increased, as long as it is more effective than the majority of the transformations in the same batch. As a consequence, the optimization is likely to get trapped in a local minimum, and the probability of a preferable transformation that happens to be sampled frequently by chance can be increased a lot. Therefore, we also normalize h_i by the sampling frequency and obtain ĥ_i as

ĥ_i = h_i / Σ_{j=1}^n 1_{T̃_j = T̃_i},   (11)

where the denominator counts how often the transformation T̃_i was sampled in the batch.

E. Learning of test-time data augmentation

1) The design of predefined transformations: We design the transformation set for TEA with K=84 deterministic operations, comprising the identity along with 41 spatial transformations, 30 intensity transformations, and 12 noise transformations. We summarize the detailed information about these transformations in the supplementary material. We initialize p_V following the heuristic policies used in nnU-Net [18], which comprise mirroring and 180° rotation in three directions, as shown in Fig. 3.
2) Policy optimization based on reverted predictions: The optimization of TEA is straightforward. We aim to optimize, with standard gradient descent,

min_{p_V} (1/Z) Σ_{k=1}^Z w_k L̃_val(f_θ(Ṽ_k(x̃_i)), Ṽ_k(ỹ_i)),   (12)

where Z is the number of sampled TEA transformations in a batch. We update p_V by favoring the transformations which obtain the lowest validation loss on the same validation sample:

p_V ← p_V - γ ∇_{p_V} L̃_val(θ),   (13)

where γ is the learning rate for updating the probability and L̃_val is the validation loss function for TEA optimization, which can differ from L_val.
3) Sampling normalization: Similar to TRA, we notice that the optimization of p_V can be biased by the sampling results. Abbreviating the objective of the TEA optimization as L̃_val(θ), we derive its gradient based on the chain rule:

∇_{p_V} L̃_val(θ) = Σ_{k=1}^Z (∂w_k/∂p_V)^⊺ ∇_{w_k} L̃_val(θ).

Similarly, we normalize the gradients based on the sampling frequency and calculate h̃_k as

h̃_k = ∇_{w_k} L̃_val(θ) / Σ_{l=1}^Z 1_{Ṽ_l = Ṽ_k},

where the denominator counts how often the transformation Ṽ_k was sampled.
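A loose, self-contained numpy interpretation of the TEA update of Eq. 13 with frequency normalization follows (our own simplification, not the paper's code: each sampled operation's surrogate gradient is taken as its loss relative to the batch mean, and `update_tea_logits` and `gamma` are our names):

```python
import numpy as np

def update_tea_logits(logits, sampled_ops, losses, gamma=0.5):
    """Nudge operation logits toward the sampled ops with low validation loss."""
    losses = np.asarray(losses, float)
    counts = np.bincount(sampled_ops, minlength=len(logits))
    grad = np.zeros_like(logits)
    for op, loss in zip(sampled_ops, losses):
        # lower loss than the batch mean -> increase that op's probability;
        # dividing by the op's sampling count is the frequency normalization
        grad[op] += (loss - losses.mean()) / counts[op]
    return logits - gamma * grad

logits = np.zeros(4)
# operation 0 sampled three times with low loss, operation 2 once with high loss
logits = update_tea_logits(logits, sampled_ops=[0, 0, 0, 2], losses=[0.1, 0.1, 0.1, 0.9])
p = np.exp(logits) / np.exp(logits).sum()
print(p)
```

Without the division by `counts`, operation 0 would be rewarded three times as strongly simply for being sampled more often, which is exactly the bias the normalization removes.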

F. Inference of test-time data augmentation
Given unseen test data, we adopt the learned TEA policy to transform the image. Specifically, in order to simplify the inference process, we perform TEA at test time with a weighted sum of the z operations that have the highest probability, where z is a hyper-parameter indicating the number of operations selected for aggregation. We choose z=8 for 3D U-Net and z=4 for DeepMedic. The weight of operation O_j is set to the corresponding sampling probability, calculated as e^{p_j} / Σ_{v=1}^K e^{p_v}.
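This inference rule can be sketched as follows (our own code; for simplicity we renormalize the softmax weights over the z kept operations so that they sum to one, and `tea_ensemble` is our naming):

```python
import numpy as np

def tea_ensemble(probs_per_op, p, z):
    """probs_per_op: list of K class-posterior maps, each already mapped
    back (reverted) to the original image space; p: learned TEA probabilities."""
    p = np.asarray(p, float)
    top = np.argsort(p)[-z:]                   # z most probable operations
    w = np.exp(p[top]) / np.exp(p[top]).sum()  # softmax weights over the kept ops
    return sum(wi * probs_per_op[i] for wi, i in zip(w, top))

preds = [np.full((2, 2), v) for v in (0.2, 0.4, 0.6, 0.8)]  # toy posteriors
out = tea_ensemble(preds, p=[0.1, 0.5, 0.3, 0.9], z=2)
print(out)
```

Here the two kept operations are those with learned probabilities 0.5 and 0.9, so the output is a softmax-weighted blend of the corresponding posteriors.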

G. Joint learning of training- and test-time data augmentation
We propose to jointly optimize TRA and TEA. Specifically, we optimize p_T with validation data transformed according to the current TEA distribution p_V, rewriting Eq. 3 and 4 as

min_{p_T} (1/Z) Σ_{k=1}^Z L_val(f_θ*(Ṽ_k(x̃_i)), Ṽ_k(ỹ_i)),
s.t. θ* = argmin_θ L_train(θ, p_T).

Note that we optimize both p_T and p_V in one training iteration. Bridging the optimization processes of p_T and p_V has two advantages: first, we reduce the risk of overfitting the validation data, as it is extended with augmented samples; second, the model generalizes well to the transformations we adopt at test time. The full procedure is summarized in Algorithm 1.
Additional implementation details are provided in the supplementary material. We find that the policies do not need to be updated in every iteration, which makes training more efficient. In practice, we observe that the training time only increases by about 20% compared to standard training. Typically, when a model takes 4 days to train on an NVIDIA 1080TI GPU for the segmentation task, our method costs 20 hours to find the optimal data augmentation strategies. This is more computationally efficient than AutoAugment [8], which can take thousands of hours.

IV. EXPERIMENTS, RESULTS, AND DISCUSSION
A. Experimental setup

1) Data pre-processing: We normalize all datasets using the pipeline of nnU-Net. Specifically, we adopt case-wise Z-score normalization for magnetic resonance (MR) images, and we normalize computed tomography (CT) images with dataset-wise Z-score normalization based on foreground samples, after clipping the Hounsfield unit (HU) values to the 0.5th-99.5th percentile range.
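A rough sketch of the two normalization schemes (our own code, not nnU-Net's implementation; the function names and the small epsilon are ours):

```python
import numpy as np

def normalize_mr(image):
    """Case-wise Z-score: statistics come from the image itself."""
    return (image - image.mean()) / (image.std() + 1e-8)

def normalize_ct(image, fg_values):
    """Dataset-wise Z-score for CT: clip to the foreground HU percentiles,
    then normalize with dataset-level foreground statistics.
    fg_values: HU values collected from foreground voxels of the whole dataset."""
    lo, hi = np.percentile(fg_values, [0.5, 99.5])
    clipped = np.clip(image, lo, hi)
    return (clipped - fg_values.mean()) / (fg_values.std() + 1e-8)

mr = np.random.RandomState(0).randn(4, 4, 4) * 50 + 100
out = normalize_mr(mr)
print(out.mean(), out.std())  # approximately zero mean, unit variance
```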
2) Network configurations: Our experiments are performed with DeepMedic [20] and a well-configured 3D U-Net [18]. We choose cross-entropy (CE) as L_train for DeepMedic and an equal combination of CE and the soft Dice similarity coefficient (DSC) for 3D U-Net. We find that the optimal choice of L_val varies between datasets and summarize this information in the supplementary material. We adopt the soft DSC as L̃_val for all experiments. We choose batch sizes n and m of 10. We set the primary patch size to 37×37×37 for all experiments with DeepMedic and to 64×64×64 for all applications with 3D U-Net except prostate segmentation, for which we choose a patch size of 64×64×32 because the images in this dataset have fewer slices. We train the networks for 1,000 epochs, except for kidney and kidney tumor segmentation, where we train for 2,000 epochs, as we observed that the networks need more iterations to converge on this task. All reported results are the average of two runs with different random seeds.
3) Brain stroke lesion segmentation: We first evaluate the proposed method on binary brain stroke lesion segmentation using the dataset of Anatomical Tracings of Lesions After Stroke (ATLAS) [31]. The images have a voxel spacing of 1.0×1.0×1.0 mm. From a total of 220 T1-weighted MR images, we randomly select 73 (50%) or 145 (100%) for training, 31 for validation, and 44 for testing.
4) Kidney and kidney tumor segmentation: Second, we evaluate the proposed method on kidney and kidney tumor segmentation using the training dataset of the Kidney Tumor Segmentation Challenge (KiTS) [17], which contains 210 CT images. We resample all images to a voxel spacing of 1.6×1.6×3.2 mm. We randomly select 70 (50%) or 140 (100%) for training, 28 for validation, and 42 for testing. We omit the segmentation results for the kidney, as we find that most methods perform well (DSC > 95.0) on the task of kidney segmentation.
6) Cross-site prostate segmentation: Additionally, we use our method to align training and validation data of prostate segmentation from different domains [35]. Specifically, we use 30 T2-weighted MR images from site A [2], collected with a 1.5T Philips MRI machine with an endorectal coil, and 19 T2-weighted MR images from site B [26], collected with 3T Siemens MRI machines without an endorectal coil. We resample all images to a voxel spacing of 0.8×0.8×1.5 mm. We investigate the scenario where the target domain (site B) has limited labeled data. We select 20 cases from site A for training and 6 cases for testing, and we select 1 case from site B for validation and 18 cases for testing. Note that for cross-site prostate segmentation, we report results of models trained with both the training and validation data, as this serves as a fairer baseline compared to using the training data only.

B. Compared methods

1) Heuristic: We compare with the heuristic TRA and TEA set as the default configurations in DeepMedic [20] and nnU-Net [18]. We also report a few results based on models trained with heuristic TRA using both the training and validation data.
2) Learned TRA: We compare with methods that learn data augmentation policies based on validation performance without considering class dependency [8], [29], [32].
3) TRA with different transformation magnitudes: We also compare with RandAugment [9], which only changes the transformation magnitudes based on grid search. Specifically, we keep the data augmentation probability and replace the operations of the same type with different magnitudes, yielding RandAugment-S, RandAugment-M, and RandAugment-L. We summarize the results with RandAugment in the supplementary material.
4) Learned TEA: We compare with methods that optimize TEA based on a pretrained segmentation model [22], [40]. Specifically, after training the model with the proposed TRA, we refine TEA as described in Section III-E.

C. Quantitative results
Taking the manual segmentation as the ground truth, we calculate evaluation metrics including DSC, sensitivity (SEN), precision (PRC), and the 95% Hausdorff distance (HD, in mm). We report the mean DSC results of the different models under different settings for the different datasets in Table I, and summarize more detailed results in Tables II, III, IV, and V, respectively. In order to assess the overall segmentation performance of the different methods, we rank the methods according to the different metrics under the same experimental setting and report the average rank (AVG rank) over the four metrics. The learned probability distributions over augmentations for brain lesion segmentation with 100% of the ATLAS training data are summarized in Fig. 3 for the different models. For the TRA policies, the darkness of the pie chart segments indicates the magnitudes of the operations: the lightest grey segments refer to the operation without any transformation, while the darkest (black) segments represent the operations with the largest transformations. We summarize all learned policies under the different settings in the supplementary material.

1) The effectiveness of class-specific TRA: Heuristic TRAs, which were tuned on varied segmentation tasks [18], [20], significantly improve the segmentation performance in all cases compared with models trained without TRA. This indicates that TRA is vital for medical image segmentation, as limited training data and class imbalance can easily lead to model overfitting [30].
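For reference, the three overlap metrics reported in this section can be computed as follows (our own sketch; the 95% Hausdorff distance requires surface-distance computations and is omitted here):

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Return (DSC, sensitivity, precision) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    dsc = 2 * tp / (pred.sum() + gt.sum())  # Dice similarity coefficient
    sen = tp / gt.sum()                     # sensitivity (recall)
    prc = tp / pred.sum()                   # precision
    return dsc, sen, prc

gt = np.zeros((4, 4), int); gt[1:3, 1:3] = 1      # 4 foreground pixels
pred = np.zeros((4, 4), int); pred[1:3, 1:4] = 1  # 6 predicted, 4 true positives
dsc, sen, prc = overlap_metrics(pred, gt)
print(dsc, sen, prc)
```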
Learned TRA, which is optimized with validation data, can provide application-specific policies and is more effective than heuristic TRA in most cases. We find that models trained with learned TRA can even outperform models trained with heuristic TRA on both the training and validation data, as shown in Table II. This may indicate that increasing the heterogeneity within the training data by adopting application-specific TRA is more effective than adding a small amount of training data. We find that RandAugment with a specific magnitude can be more effective than the learned policies in some cases. Specifically, RandAugment-L is better than the learned policies for kidney tumor segmentation under specific settings, as shown in Table III. This might indicate that learned TRA is prone to overfitting the validation data and that the optimized policies are not guaranteed to be optimal for unseen test data, as also found in [9].
In contrast, class-specific TRA can better model the heterogeneity of the real data by taking class imbalance into account, and thus overfits less and performs better on unseen test data than the alternative methods. We argue that class-specific TRA is important as it addresses the imbalanced nature of segmentation datasets and regularizes the training data in an implicit way. As shown in Fig. 3, compared with heuristic TRA, the learned policies tend to generate larger transformations for foreground samples while applying smaller transformations to background samples. In segmentation, foreground classes are typically underrepresented and a learned baseline model would be biased towards the majority class. As a result, the model would map foreground samples near the decision boundary and produce false negatives, as shown in [30]. Class-specific TRA can mitigate the class imbalance problem by inducing larger variance within the foreground samples, making the model learn a better decision boundary and consistently leading to better segmentation results with higher sensitivity. In particular, we find that class-specific TRA improves the segmentation performance of rare classes more significantly (c.f. Table IV), as it enhances the rare-class representation by increasing the heterogeneity of the foreground samples. We also find that the probabilities of spatial transformations change more significantly than those of intensity transformations. This might indicate that spatial transformations are more effective at increasing the heterogeneity within the training data. We further validate our method for prostate segmentation under domain shift, where the training and test data are collected under different conditions. We find that directly fine-tuning the segmentation models with limited target data gives worse results than training with data from both domains. We report the segmentation results on both site B and site A for cross-site prostate segmentation in Table V.
Although the learned data augmentation is optimized based on the validation data from site B (target domain), the models can still generalize well on site A. In addition, as we show in supplementary material, we find that our method can help the models generalize better on unseen test domains which are different from either site A or site B. This indicates that our method is robust to domain shifts and can be a safe choice to calibrate the segmentation performance of different domains within multi-domain learning.
2) The effectiveness of joint optimization: We find that heuristic TEA can help pretrained models produce better overall segmentation results with higher precision. This is because the ensemble of multiple predictions can reduce false positives, as the models are unlikely to produce the same kind of false positives for all the transformed images. However, when TRA is optimized based on validation data without TEA, heuristic TEA might not work well, as the model may overfit to the original data distribution and thus fail to generalize to the transformed data. Specifically, we observe that heuristic TEA decreases the model performance for 3D U-Net trained with 50% ATLAS training data (-0.3 in terms of DSC, c.f. Table II) and DeepMedic trained for prostate segmentation (-0.1 in terms of DSC, c.f. Table V). In contrast, learned TEA can select transformations that fit the pretrained models and improve the results in most cases. However, learned TEA alone does not affect model training and cannot change the results significantly compared to heuristic TEA. Our method instead optimizes TRA based on TEA along the training process, jointly aligning the data distributions and resulting in larger overlaps. For example, as illustrated in Fig. 3(a), the learned TEA policy increases the probability of flipping in sagittal planes for DeepMedic trained with 100% ATLAS training data. This might be because the left and right hemispheres of the human brain are generally symmetric. Correspondingly, TRA tunes the training data distribution with more samples flipped in the sagittal planes. In this way, the segmentation models not only have lower risks of making the same false positives but also generalize better on varied transformed samples. As a result, we find that the joint optimization further boosts the segmentation performance by achieving higher precision and sensitivity.
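The TEA ensembling step described above can be sketched as follows (a simplified NumPy version with hypothetical transform/inverse pairs; real spatial TEA maps the prediction maps back through the inverse transform before averaging):

```python
import numpy as np

def tea_predict(model, x, transforms, inverses):
    """Average predictions over transformed copies of the input,
    mapping each prediction back with the inverse transform."""
    preds = [inv(model(t(x))) for t, inv in zip(transforms, inverses)]
    return np.mean(preds, axis=0)
```

For a transform-equivariant model the averaged prediction is unchanged, while uncorrelated false positives in individual transformed predictions are attenuated by the averaging, which is the source of the precision gain discussed above.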
We argue that the joint optimization is crucial for data augmentation as it explicitly aligns the training- and test-time conditions. Otherwise, the model may get stuck in a local minimum where we cannot find effective test-time transformations to fit the training data distribution. For example, we find that, compared with segmentation without TEA, learned TEA brings limited improvements for 3D U-Net trained with 50% KiTS training data (0.4 in terms of DSC, c.f. Table III) and DeepMedic trained for prostate segmentation (0.2 for site B in terms of DSC, c.f. Table V). This indicates that the predictions on most chosen transformations cannot contribute much on top of the predictions of the original test images. In contrast, the joint optimization leverages the varied test-time transformations and improves the segmentation (0.9 and 2.8 in terms of DSC, respectively).
We notice that the learned TEA policies would generally prefer the original images (identity). In addition, the transformations which are not included in heuristic TEA are hardly useful. These findings indicate that we may not need to apply large transformations to the test data to improve generalization.
We visualize some segmentation results in Fig. 4. Similar to the findings of a previous study [30], a model trained with an imbalanced dataset is prone to undersegmenting the foreground samples as a result of overfitting under class imbalance. Our class-specific TRA can significantly reduce false negatives and improve the sensitivity of the segmentation results. We observe that heuristic TEA can cause undersegmentation, while the joint optimization can further help the model improve segmentation performance by identifying more foreground samples. We further validate our methods with cardiac segmentation in MR images in the supplementary material to show that they work well with anisotropic images under domain shifts.

D. Limitations
Our data augmentation algorithm aims to optimize the sampling distributions for TRA and TEA, and thus automatically adapts the data augmentation policies to a given task. However, it might not be very effective when the predefined policies are already nearly optimal. For example, we observe that our methods do not bring much improvement for kidney tumor segmentation based on 3D U-Net when trained with 100% training data. This is possibly because the predefined policies were already well optimized, given that this configuration is the winning solution of the challenge.
We notice that class-specific TRA can be less effective with 3D U-Net on prostate segmentation. This may be because the sampled patches always contain foreground, as the image size of this dataset is relatively small and the structures-of-interest are relatively large. In this case, the optimization could be misled by the class-specific constraints. In practice, this could be alleviated by adopting a smaller patch size; some investigations can be found in the supplementary material. Moreover, we could consider restricting the regions of loss calculation to make our algorithms compatible with similar cases where the patch size is as large as the image size. This would need to be explored in future work.
The joint optimization of TRA and TEA will not be effective when TEA decreases the segmentation performance. For example, we find that the joint optimization cannot bring much improvement for DeepMedic with abdominal organ segmentation, where most transformations for TEA do not seem to help much and the augmented validation data would improperly influence the TRA optimization. Therefore, we suggest validating the effectiveness of TEA before adopting the joint optimization.
Although the proposed method can consistently improve the segmentation performance under varied scenarios, we observe that not all the results show statistical significance when compared to heuristic baselines. This might be due to the small size of the test set. We show that our methods yield significant improvements when more test data are available (c.f. Table I). We observe that distance-based metrics such as HD are unstable for the evaluation of imbalanced regions-of-interest (ROIs), because small false positive predictions can largely increase those metrics. After eliminating the false positive predictions with component-based post-processing, our method always performs better in terms of both DSC and HD, as we demonstrate in the supplementary material.
We present and validate our method in the context of medical image segmentation. We think that it has the potential to be extended to long-tailed image classification tasks, where different classes have different properties and TEA is also important for better generalization. We show some initial experiments in the supplementary material and leave the in-depth investigation for future work.

V. CONCLUSION
We presented a general data augmentation framework for medical image segmentation. Compared with current solutions, our method aims to bridge the gap between training and test data distributions by class-specific TRA and the joint optimization of TRA and TEA. We observe promising improvements in various tasks and models, making the proposed framework an attractive alternative to heuristic data augmentation strategies. We believe that the learned policies can provide valuable insights for practitioners to inform dynamic data collection and future designs of image transformations for data augmentation.

A. Learning Scheme
We describe the high-level learning scheme of the proposed method in Fig. 5. During an optimization process (commonly stochastic gradient descent for neural networks), we minimize L_train over all training data and yield the learned parameters θ_T*. In practice, the learned θ_T* is sub-optimal for the validation data because the training data cannot cover all the underlying data properties. Data augmentation is widely utilized to implicitly reduce the generalization gap θ_T* → θ_V* by using additional artificial training samples T(D_T). However, it is not guaranteed that T(·) is always effective during this process.
We propose to explicitly close the generalization gap by aligning the training and test data distributions. On the one hand, we shift the learned parameters from θ_T* to θ_T̃* by learning a TRA model with a meta-learning scheme. On the other hand, we transform the validation data such that it is easier to recognize, changing the target optimal parameters of the validation data from θ_V* to θ_Ṽ*. By the joint optimization of class-specific TRA T_j(·) and TEA V(·), we are able to close the generalization gap from θ_T* → θ_V* to θ_T̃* → θ_Ṽ*.

B. Detailed Optimization Process
The training process of one iteration is illustrated in detail in Fig. 6.

C. Implementation Details
1) Derivatives calculation based on the implicit function theorem:
To compute ∂θ*/∂p_T, one can take the total derivative of ∇_θ L_train(θ*, p_T) = 0 with respect to p_T on both sides, assuming that ∇_θ L_train(θ*, p_T) is continuously differentiable [1], [46]:

∇²_{θ,p_T} L_train(θ*, p_T) + ∇²_θ L_train(θ*, p_T) · ∂θ*/∂p_T = 0.

Then, with the assumption that the Hessian ∇²_θ L_train(θ*, p_T) is invertible, we obtain:

∂θ*/∂p_T = −[∇²_θ L_train(θ*, p_T)]⁻¹ ∇²_{θ,p_T} L_train(θ*, p_T).

The result contains the mixed second derivative ∇²_{θ,p_T} L_train(θ*, p_T) and is not practical to compute. Therefore, we follow the heuristics used in [11] to compute the derivatives.
2) Meta-gradient calculation: The gradient calculation in Eq. 8 can be further simplified based on a finite difference approximation following [34]. With some small ε = 0.01 / ∥∇_θ L_val(θ*)∥₂, we calculate two new parameters:

θ± = θ* ± ε ∇_θ L_val(θ*).

With this notion, the second-order gradient can be written as:

∇²_{θ,p_T} L_train(θ*, p_T) ∇_θ L_val(θ*) ≈ [∇_{p_T} L_train(θ+, p_T) − ∇_{p_T} L_train(θ−, p_T)] / (2ε).

In this way, we reduce the calculation complexity from O(|p_T||θ|) to O(|p_T| + |θ|) and can approximate Eq. 8 with two forward processes of f_θ*.
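A minimal NumPy sketch of this finite-difference trick (function names are ours; the gradient callables stand in for backpropagation through the segmentation model):

```python
import numpy as np

def meta_grad_fd(theta, grad_train_wrt_p, grad_val_wrt_theta, eps_scale=0.01):
    """Approximate the second-order meta-gradient term
    grad^2_{theta,p} L_train(theta*) @ grad_theta L_val(theta*)
    via central finite differences around theta*."""
    g_val = grad_val_wrt_theta(theta)
    eps = eps_scale / (np.linalg.norm(g_val) + 1e-12)
    theta_plus = theta + eps * g_val    # theta+
    theta_minus = theta - eps * g_val   # theta-
    return (grad_train_wrt_p(theta_plus) - grad_train_wrt_p(theta_minus)) / (2 * eps)
```

Only two extra gradient evaluations of L_train are needed, which is where the O(|p_T| + |θ|) complexity comes from.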
3) Efficient data augmentation sampling: An attentive reader may notice that sampling and applying different transformations to x (or x_i) during each iteration is time-consuming. In practice, we bypass this bottleneck by fetching a number of transformed samples in advance. Then, when we input the transformed samples into the model, we sample from the probability distribution again and obtain the corresponding s_i as if it had been sampled from the current distribution. In this way, the sampling and data augmentation process can be done in parallel with network training and does not need additional time.
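The idea can be sketched as below (a simplified single-process version; in practice the prefetching runs in data-loading workers in parallel with training):

```python
import numpy as np

rng = np.random.default_rng(0)

def prefetch(x, transforms):
    # Apply every candidate transformation ahead of time, so the
    # expensive augmentation work is off the training critical path.
    return [t(x) for t in transforms]

def draw(cache, probs):
    # At training time, only an index is sampled from the *current*
    # policy; the cached sample behaves as if freshly transformed.
    k = int(rng.choice(len(cache), p=probs))
    return k, cache[k]
```

Because only the index is drawn at training time, the policy can keep changing while the cached transformed samples are reused.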
4) Efficient optimization: The proposed method described in Algorithm 1 would triple the training time compared with the vanilla training process. However, the training time can be significantly reduced by updating the data augmentation policies (steps 8 and 9 in Algorithm 1) only once every several iterations. We find that we can achieve similar results when the policies are updated once every 10 iterations, which increases the training time by only 20%.
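The reported numbers follow from simple amortization. Measuring cost in units of one vanilla iteration, a policy update adds roughly two extra units (it triples an iteration), so updating every N iterations adds about 2/N overhead (our reading of the figures above, expressed as a sketch):

```python
def overhead(extra_cost_per_update=2.0, update_every=1):
    # Amortized extra cost per iteration, relative to vanilla training:
    # the policy-update cost is spread over update_every iterations.
    return extra_cost_per_update / update_every
```

Updating every iteration gives 2.0 (training time tripled), while updating once every 10 iterations gives 0.2, i.e. the 20% overhead quoted above.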

D. Proof of Concept on CIFAR-10
In order to demonstrate that our method can effectively select useful augmentations, we first show results with a toy example on CIFAR-10 [23]. In this experiment, we optimize TRA and TEA separately. We use Wide-ResNet-40-2 [45] as the network backbone and pick 5120 images as the training set and another 5120 images as the validation set. We test on the official test split of 10000 images.

1) Predefined transformations:
We use the same transformation set for both TRA and TEA in this toy example. We initialize the augmentation distribution uniformly over 45 good transformations and 15 bad transformations. We adopt the good transformations from AutoAugment [8], including shearing, translation, rotation, color enhancement, posterization, solarization, contrast changing, sharpening, brightness changing, histogram equalization and inverting. We design the bad transformations as extremely low contrast and large intensity shifts. Given the original image x, the bad transformations apply T(x) = x/2, T(x) = x/4, T(x) = x × 0.01, T(x) = −x × 0.01 and T(x) = x + 300. We show an example of all 60 transformed images in Fig. 7.
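Reading the magnitudes above literally, the five bad transformations can be written as follows (a sketch; pixel values are assumed to lie in a standard 8-bit range, so these either destroy contrast or push intensities far out of range):

```python
import numpy as np

BAD_TRANSFORMS = [
    lambda x: x / 2,      # halve intensities: very low contrast
    lambda x: x / 4,      # quarter intensities: even lower contrast
    lambda x: x * 0.01,   # near-black image
    lambda x: -x * 0.01,  # inverted near-black image
    lambda x: x + 300,    # large shift beyond the 8-bit range
]
```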
2) Learning TRA: The resulting probability distribution after training with the proposed scheme is visualized in Fig. 7; the probabilities of all the bad transformations are low, such that these are suppressed during training. The learned augmentations improve the final accuracy from 83.8% to 85.1% on the validation set, compared to a policy with uniform probabilities over all 60 transformations. A model trained without any augmentation achieves 81.8%.
We then apply the learned policy and train the same model with all 10240 images from scratch. We find that we can improve the performance from 88.2% to 89.0% on the test set, compared to a policy with uniform probabilities over all 60 transformations. A model trained without any augmentation achieves 87.1%.
3) Learning TEA: The resulting TEA probabilities during the training process are shown in Fig. 9. Similarly, the probabilities of all the bad transformations decrease during training.

E. List of Operations for Training-Time Data Augmentation
We list all the operations O we use for TRA in Table VI. Note that we design most operations with stochastic magnitudes in a symmetric way. That is to say, with a probability of 50%, the transformed image could be transformed back to the original image using the same operation. In this way, we can increase the variance of the training dataset with more realistic and potentially useful samples. Different from the operation set in AutoAugment, which includes operations with deterministic magnitudes [8], we design each operation with stochastic magnitudes sampled from a uniform distribution. In this way, we can cover transformations with larger variance and realize similar functionality to the TRA used in prevailing segmentation methods.
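For instance, a symmetric stochastic scaling magnitude can be drawn so that a factor and its reciprocal are equally likely (a sketch with an assumed maximum factor; Table VI lists the actual ranges):

```python
import random

def symmetric_scale(max_factor=1.25, rng=random.Random(0)):
    # Draw s in [1, max_factor]; with probability 0.5 return 1/s,
    # so applying the same operation again can undo a previous
    # transformation, keeping the augmented distribution symmetric.
    s = rng.uniform(1.0, max_factor)
    return s if rng.random() < 0.5 else 1.0 / s
```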
A straightforward question is whether we can further improve TRA by 1) increasing the number of predefined transformations, 2) optimizing the magnitudes of the transformations as well, or 3) including more complicated transformations such as generative models (i.e., utilizing a network to generate the augmented samples). In fact, we find that if we extend the transformation set with more choices (i.e., larger K), the performance is not further improved. This is because enlarging K causes each transformation to be sampled less often and makes TRA harder to optimize. We find 2-10 is a reasonable range for K. It would be feasible to additionally optimize the magnitudes of the operations. However, we find that the optimization process becomes unstable when we also optimize the operation magnitudes. This might be because controlling the operation magnitude is hard and the transformed images could become unrealistic. In addition, we are not sure if it is appropriate to optimize operation magnitudes for the reduction of validation loss, as this might converge to a local minimum. We also tried to utilize a neural network to transform the training samples. Nevertheless, our initial experimental results show that the model easily overfits the validation data and cannot perform well on unseen test data. The transformation model can align the training and validation data very well, as it has a large amount of parameters. However, the learned relationship is too specific and does not generalize well to the underlying test data. We observe that reducing the parameters of the transformation model or adding regularization can alleviate the overfitting issue, but it is still hard to achieve results similar to heuristic TRA strategies.

F. List of Transformations for Test-Time Data Augmentation
We list all the operations O we use for TEA in Table VII. Note that these operations always have deterministic magnitudes. In this way, we can make sure that we apply the same set of transformations to every test sample.

G. Choices of Validation Loss for Different Settings
We find that L_val should be chosen differently for different settings to achieve optimal results. L_val can be the cross entropy (CE), the soft Dice similarity coefficient (DSC), or a combination of the two loss functions. We summarize the optimal choices for different settings in Table VIII.

H. The Policies and Results of RandAugment
We implement RandAugment [9] based on the heuristic TRA policies of nnU-Net [18]. Specifically, we keep the probability of adopting operations and substitute all the heuristic operations within the same type with ones having a certain magnitude. We create three RandAugment policies based on varied magnitudes, denoted as RandAugment-S, RandAugment-M and RandAugment-L. The sampling distributions of these TRA policies are shown in Fig. 11. We summarize the segmentation results when trained with RandAugment for kidney and kidney tumor segmentation based on 3D U-Net with 50% training data in Table IX. We find that RandAugment with large transformation magnitudes performs better on kidney tumor segmentation. The proposed class-specific TRA performs better than all the RandAugment variants.

I. Learned Policies with Different Settings
We summarize all the learned policies for the different tasks and models in this section. We expect them to serve as a reference for practitioners collecting datasets and designing transformations for data augmentation. The data augmentation policies for the different network architectures are initialized with the heuristic policies provided by these frameworks. These default policies are designed for general purposes and specifically to fit the properties of the different segmentation models. Therefore, the learned policies also differ considerably between models.
1) ATLAS: We summarize the learned data augmentation policies for brain stroke lesion segmentation with 50% ATLAS training data in Fig. 12 and Fig. 13. As brain stroke lesions are relatively small and often under-represented, we find that the learned TRA policies tend to apply transformations with larger magnitudes to foreground samples and transformations with smaller magnitudes to background samples. Specifically, when compared with the heuristic policies, the learned policies increase the probabilities of adopting transformations for foreground samples while decreasing these probabilities for background samples. We observe consistent changes across all kinds of transformations. This indicates that the segmentation models benefit from foreground samples with more variance and background samples with limited transformations.
We find that the probabilities of spatial transformations such as scaling and rotation are largely increased. This indicates that spatial transformations make more of a difference to the training data distributions. We also find that the policies learned with 100% training data often adopt larger transformations than the ones learned with 50% training data. This indicates that the optimal TRA policies vary for training datasets with different amounts of samples. Specifically, we should utilize larger transformations to effectively extend the data distribution when sufficient training data is available. This may be because small transformations can hardly add more information on top of sufficient training data.
We find that the learned TEA policies for DeepMedic increase the probabilities of flipping in sagittal planes. This might be because the initialized TRA policy for DeepMedic has large probabilities of flipping in the sagittal planes (designed by taking the symmetrical brain structure into account). We also notice that the learned TEA policies always largely increase the probability of identity for both DeepMedic and 3D U-Net. This indicates that the predictions on images from the original data distribution are fairly accurate.
2) KiTS: We summarize the learned data augmentation policies for kidney and kidney tumor segmentation with 50% KiTS training data in Fig. 14 and Fig. 13. We also summarize the policies learned with 100% KiTS training data in Fig. 15. We find that when the segmentation models are trained with 50% training data, the learned TRA policies tend to generate larger transformations for foreground samples, similar to the case of brain stroke lesion segmentation. This might be because the kidney and kidney tumor are underrepresented with less training data. In contrast, the policies learned with 100% training data do not have a consistent bias towards the foreground samples. This may be because the class imbalance problem does not affect the learning process much when the training dataset contains a sufficient amount of foreground samples. Under such conditions, the policies are learned to generate class-specific transformations. Specifically, the learned policies increase the probabilities of scaling for foreground samples while increasing the probabilities of noise transformations such as sharpening and simulating low resolution for background samples.
When compared with brain stroke lesion segmentation in T1-weighted MR images, we find that for kidney and kidney tumor segmentation in CT images the learned TRA policies are prone to adopting intensity transformations such as gamma correction and intensity shifting, and noise transformations such as adding Gaussian noise and simulating low resolution. This may be because the objects in CT images have low contrast and blurred boundaries. In this case, the simulated images with varied imaging quality can help the segmentation model generalize better to unseen image conditions.
3) Abdominal organ: We summarize the learned data augmentation policies for abdominal organ segmentation in Fig. 16. The learned policies for abdominal organ segmentation are quite different from the cases of brain stroke lesion and kidney tumor segmentation. This may be due to the complexity of the foreground class, which contains many different classes of abdominal organs. We find that the learned TRA policies are prone to applying large scaling transformations to the background samples. This may be because some background objects which are similar to the foreground objects vary in size. The segmentation models can perform better when learning from simulated background objects with varied scales. Similar to the case of kidney tumor segmentation in CT images, the learned TRA policies tend to adopt many intensity and noise transformations. We think this is also related to the low imaging quality of CT.
We find that the default TEA policy decreases the performance of the segmentation model; therefore, we initialize with a TEA policy with an increased probability of identity, as shown in the upper part of Fig. 16. The learned TEA policy for DeepMedic further increases the probability of identity. This indicates that DeepMedic cannot perform well on transformed images. This is because the initialized TRA policies for DeepMedic do not contain large transformations. It could also be related to the network architecture of DeepMedic, which contains more convolutional layers at the original resolution. This design drives DeepMedic to make predictions relying more on local features, making it potentially more sensitive to noise.
4) Cross-site prostate: We summarize the learned data augmentation policies for cross-site prostate segmentation in Fig. 17. We find that the learned TRA policies are generally very close to the initialized ones. This indicates that the default TRA policies fit this task well. We notice that the learned TRA policies for the background adopt more intensity transformations such as intensity shifting and noise transformations such as simulating low resolution. Those transformations might help align the MRI datasets acquired with different settings.
5) Cross-sequence and cross-site cardiac: We summarize the learned data augmentation policies for cross-sequence and cross-site cardiac segmentation in Fig. 18. Experimental details can be found in Section M and Section N. We find that in both cases the learned TRA policies select large intensity transformations for FG samples while choosing large spatial transformations for BG samples. This might indicate that style transformation of FG samples could help the model generalize better across domains for cardiac MR images. This finding is consistent with previous studies based on 2D networks [37].

J. Logit Map Distributions
In order to illustrate the effectiveness of the data augmentation methods, we visualize the activations of the classification layer when models are trained and deployed under different conditions. We summarize the histograms of the logit distributions when processing training and test samples of ATLAS with DeepMedic trained with 100% training data in Fig. 19. Specifically, we calculate the distance of the logits to the decision boundary as (z_1 − z_2)/√2, where z_1 is the logit for background and z_2 is the logit for lesion. To simplify the observations, we only monitor the logit distributions of lesion samples.
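This quantity is the signed perpendicular distance of the logit pair (z_1, z_2) to the two-class decision boundary z_1 = z_2; a one-line sketch:

```python
import numpy as np

def boundary_distance(z1, z2):
    # Signed distance of the logit pair (z1, z2) to the line z1 == z2;
    # positive when the background logit z1 dominates.
    return (z1 - z2) / np.sqrt(2.0)
```

For lesion samples, a negative distance means the sample is correctly classified, and values near zero indicate samples sitting close to the decision boundary.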
We calculate the intersection of the training and test distributions and report it at the top of each figure. TRA and TEA both improve the model performance by aligning the training and test data distributions, when compared with the model trained without data augmentation (c.f. Fig. 19(a)). However, they are not always optimal for the given datasets, as shown in Fig. 19(b, c). Our proposed method learns application-specific and class-specific TRA and therefore fits the given tasks better than heuristic TRA, as shown in Fig. 19(d). The joint optimization of TRA and TEA (c.f. Fig. 19(e)) drives the data distributions to overlap more and thus generalizes better.
To obtain a better understanding of the network behaviour, we also investigate the network activations when processing test samples of different classes. The logit distributions of DeepMedic trained with 100% ATLAS training data and 50% KiTS training data are summarized in Fig. 20. We find that data augmentation always helps the model build better representations, especially for minority classes such as lesion and kidney tumor. Without data augmentation, the model maps samples from the minority class across the decision boundary, causing false negatives. We observe that heuristic data augmentation can help the model map the logits of samples from the minority classes away from the decision boundary. The proposed data augmentation strategy helps the model further reduce the logit shifts of the minority classes and build a better decision boundary. Moreover, it makes the model map the logits of samples from the same class to a more compact cluster. This indicates that a model trained with our method implicitly encourages the features to have better inter-class separability and intra-class compactness, and thus generalizes well.

K. The Decrease of Validation Loss During the Training Process
We visualize the validation loss curves of four settings in Fig. 21 and Fig. 22. We find that when the model is trained without TRA, it is very likely to overfit the training dataset, as we observe that the CE computed on the validation data even increases during the training process, though not the DSC. This might indicate that the model becomes overconfident in its predictions on the hard cases.
Heuristic and learned TRA help the model decrease the validation loss, while the proposed learned class-specific TRA is the most effective at decreasing it. This indicates that our class-specific transformation model is more capable of mimicking the underlying data distribution and thus helps the segmentation model generalize better. The validation loss curves can be utilized as a good indicator of the test performance of the trained models. In practice, we suggest that practitioners utilize the validation loss curves to choose the best TRA hyper-parameters for their settings.

L. Segmentation based on a Vision Transformer
In this study, we evaluate our data augmentation algorithms with convolutional neural network (CNN) based segmentation models, including DeepMedic and 3D U-Net. Here we extend our experiments with a transformer-based segmentation model, nnFormer [27], [50]. Similar to our previous experiments, we train nnFormer using 50% of the training data from ATLAS and 50% of the training data from KiTS. We choose a patch size of 64×64×64 for both applications. We initialize TRA for nnFormer using transformations with small magnitudes, as we find that large transformations can easily decrease the segmentation performance. We choose the same initialized TEA for nnFormer as for 3D U-Net.
We summarize the quantitative results in Table X and Table XI, respectively. We find that nnFormer performs worse than DeepMedic and 3D U-Net in both tasks. As with the CNN models, our proposed data augmentation methods consistently improve the segmentation results with higher DSC.

M. Cross-Sequence Cardiac Segmentation
Here, we further evaluate the proposed algorithm for cross-sequence cardiac segmentation in MR images. In this experiment, we train segmentation models with short-axis cardiac MR images which are collected with different MRI sequences. We utilize 45 balanced steady-state free precession (bSSFP) MR images and 45 late gadolinium enhanced (LGE) MR images from [51]. We resample all the MR images to an in-plane spacing of 1.25×1.25 mm following [37]. We report the segmentation performance for three cardiac structures: the left ventricle (LV), the myocardium (MYO) and the right ventricle (RV). Similar to the setting of cross-site prostate segmentation, we investigate the application scenario where only a small portion of labelled data is available for the target domain (LGE MRI). We select 30 cases from bSSFP for training and 10 for testing. We randomly select 1 case from LGE for validation and utilize the remaining 44 for testing. As the cardiac MR images are highly anisotropic, we train a segmentation model based on 3D U-Net using a patch size of 128×128×8.
We summarize the segmentation results in Table XII. The proposed data augmentation methods improve the segmentation results on the target domain (LGE) when compared with heuristic policies. As the training patches always contain foreground samples, the additional advantage of class-specific TRA is not significant.

N. Cross-Site Cardiac Segmentation
Here, we further validate our method with cross-site cardiac segmentation, where cardiac MR images are collected at 5 different sites using different scanners [3]. This dataset contains 345 cardiac short-axis MR images in total. Each image is annotated at the end-diastolic (ED) and end-systolic (ES) phases, including LV, MYO and RV. We resample all the images to 1.25×1.25×10 mm. Following the setting of the challenge [3], we utilize 175 cases for training, 34 cases for validation and 136 for testing. In order to evaluate the generalization ability of the segmentation model when deployed on data with domain shifts, we include data collected at sites different from the training data in the validation and test datasets. Specifically, there are 10 cases and 40 cases collected from unseen sites in the validation and test datasets, respectively. We encourage the readers to refer to the challenge paper for detailed experimental settings [3]. Similar to the network settings in Section M, we train a segmentation model based on 3D U-Net using a patch size of 128×128×8. We summarize the results in Table XIII. We also compare our methods with the top-ranking methods in this challenge [13], [36], [48]. We take their results directly from the challenge report (results of vendor D in [3]). We observe that the proposed data augmentation strategies improve the segmentation performance in different settings, outperforming other competitive solutions in the challenge on the unseen site. We should note that all the top-ranking methods are based on the same network architecture as ours (nnU-Net) but utilize different hand-engineered TRA policies or normalization techniques. Therefore, the results further demonstrate that our method is superior to current heuristic data augmentation strategies.

O. Sensitivity Analysis of the Size of Validation Dataset
We optimize the data augmentation policies based on the model performance on a held-out dataset. In other words, we choose the TRA and TEA policies which help the model perform well on this validation dataset. Here, we investigate the effect of validation data size and optimize the policies with varied amounts of validation data. Specifically, we optimize the joint learning of class-specific TRA and TEA based on DeepMedic with 50% training data using different amounts of validation samples.
We summarize the quantitative results in Table XIV. The results show that the proposed data augmentation framework is capable of improving the segmentation accuracy with different amounts of validation samples.
Initially, when we simply reduce the size of the validation data, the probabilities of specific transformations for TRA are largely increased. As a result, the performance of the segmentation model becomes unstable. Specifically, with fewer validation samples, the segmentation model achieves higher sensitivity and is prone to over-segmentation when trained on training data with large variance. This is probably because the learned policies become biased towards specific kinds of transformations which benefit the segmentation of the small portion of validation data. Therefore, we choose a smaller learning rate β for optimizing TRA when less validation data is available, in order to reduce the risk of overfitting to specific transformations. In this way, we observe that the segmentation model achieves stable improvements. We suggest that practitioners also reduce β when the validation set is small.
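As a rough illustration of this guideline, one could scale the TRA policy learning rate β with the validation set size. The function name and reference size below are our own hypothetical choices, not values from the paper:

```python
def tra_policy_lr(base_beta, n_val, n_val_ref=100):
    """Hypothetical heuristic: shrink the TRA policy learning rate beta
    proportionally when the validation set is smaller than a reference
    size, to reduce the risk of overfitting the learned augmentation
    policy to a few validation samples. Capped at base_beta."""
    return base_beta * min(1.0, n_val / n_val_ref)

# With few validation samples, use a smaller beta.
beta_small = tra_policy_lr(1e-3, n_val=25)   # reduced to a quarter of base_beta
beta_full = tra_policy_lr(1e-3, n_val=200)   # unchanged
```

Any monotone schedule would serve the same purpose; the point is only that β should shrink as the validation set does.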

P. Cross-Site Prostate Segmentation When Trained with Small Patches
We found that when the patches always contain foreground samples, the class-specific constraints could decrease the segmentation performance. This is what we observe in the experiments on cross-site prostate segmentation when training with a patch size of 64×64×32. Here, we conduct experiments on cross-site prostate segmentation with smaller patches. Specifically, we train the segmentation models with a patch size of 48×48×24 and keep the remaining settings the same.
We summarize the segmentation results in Table XV. The results show that the proposed class-specific TRA achieves better results than heuristic policies and class-agnostic TRA. This demonstrates that class-specific TRA can work well with small patches when the regions of interest (ROIs) are relatively large.
However, we notice that the segmentation models trained with smaller patches generally perform worse than models trained with large patches. This is because the models trained with smaller patches have less contextual information, and the local features do not generalize well. Therefore, we remind readers to strike a balance between large context and class-specific constraints to achieve better performance when dealing with similar cases.

Q. Cross-Validated Segmentation Results
Most of our experiments use a fixed data split and report the model performance on a separate test set. Here, we extend the experiments with three-fold cross-validation and report the model performance on the whole dataset, to further validate our algorithms. Specifically, we train DeepMedic for kidney and kidney tumor segmentation with three different folds of 70 cases and report the segmentation results on the remaining data. We summarize the results in Table XVI, which are consistent with our previous experiments. Our proposed methods bring significant improvements to the segmentation of kidney tumor in terms of DSC.

R. Segmentation Results with Post-Processing
In this study, we investigate the problem of class imbalance in medical image segmentation and conduct experiments with datasets containing small objects. Specifically, the tasks of brain lesion and kidney tumor segmentation are challenging because the positions of these objects are quite random. The segmentation model can make false positive predictions which are far from the ground truth locations. As a result, distance-based evaluation metrics, such as HD, are unstable indicators of the quality of segmentation results. For example, HD can be large due to small positive predictions which are distant from the ROIs. In addition, the HD penalty for failing to make any prediction in a volume is large. This is why our proposed methods do not always lead to the best HD in the experiments.
In practice, the false positive predictions can easily be eliminated with simple post-processing techniques. We adopt a component-based post-processing approach where we keep only the largest connected component within the segmentation results and suppress the other predictions. We apply this post-processing approach to the segmentation results of ATLAS and KiTS based on 3D U-Net with 50% training data and summarize the quantitative results in Table XVII and Table XVIII, respectively. The results show that our methods achieve both the best DSC and HD in all settings.
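The component-based post-processing described above can be sketched as follows. This is a minimal illustration using scipy (the function name is ours), not the exact implementation used in the experiments:

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask):
    """Keep only the largest connected component of a binary mask and
    suppress all other (likely false positive) predictions."""
    labeled, n_components = ndimage.label(mask)
    if n_components == 0:
        return mask  # no foreground predicted at all
    # Size of each component, indexed by labels 1..n_components.
    sizes = ndimage.sum(mask > 0, labeled, range(1, n_components + 1))
    largest_label = int(np.argmax(sizes)) + 1
    return (labeled == largest_label).astype(mask.dtype)
```

Note that when several true lesions exist, this step removes all but one of them, which is why it should not be applied indiscriminately.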
We notice that this component-based post-processing approach decreases the performance of brain lesion segmentation in terms of DSC, when compared to the results in Table II. This is because the post-processing introduces false negative predictions when multiple brain lesions exist in an image. This is why we do not utilize post-processing for all the experiments; more advanced post-processing, which is not the focus of this study, could be effective in improving the segmentation results in such cases.

Fig. 1. Data augmentation improves segmentation model performance by aligning the training and validation/test data distributions. As illustrated in (b) [9], (c) [8], [18], [20], [29], [32] and (d) [22], [40], current methods optimize the data distributions by using a validation set as a proxy for unseen test data. Our framework brings improvements by integrating two conceptually simple and intuitive ideas: (e) We adopt different kinds of training-time data augmentation (TRA) for training samples from different classes, effectively extending the training data distribution and alleviating the class imbalance issue. (f) We jointly optimize TRA and test-time data augmentation (TEA) during every training iteration, making the data distributions overlap more.

Fig. 2. The optimization process of the proposed method. In this study, data augmentation is formulated as the probability distribution over multiple predefined transformations, as demonstrated in ①, ② and ③. During the same iteration, class-specific TRA is optimized based on meta-gradients in ⑤ while TEA is optimized based on the validation losses of Z transformed samples in ⑥.

Algorithm 1: Joint Optimization of Class-Specific Training- and Test-Time Data Augmentation in Segmentation
Require: DT = {(xi, yi)}, i = 1..N: training data; DV = {(x̃i, ỹi)}, i = 1..M: validation data; fθ(·): the segmentation model; Ti(·): TRA, determined by drawing from the class-specific probability distribution pT; Vi(·): TEA, determined by drawing from the probability distribution pV; α, β, γ: learning rates to update θ, pT and pV.
1: Initialize pT, pV with heuristic policies, referring to the ones in DeepMedic [20] or nnU-Net [18].
2: for each iteration do
3: Sample a batch of training data BT = {(xi, yi)}, i = 1..n, from DT and a batch of validation data BV = {(x̃i, ỹi)}, i = 1..m, from DV.
5: Sample a set of {Ti(·)}, i = 1..n, from the Gumbel-Softmax distribution parameterized by pTj based on the sample class.
6: Sample {Vi(·)}, i = 1..m, and {Vk(·)}, k = 1..Z, from the Gumbel-Softmax distribution parameterized by pV.
7: Update θ to θ*. ▷ Training the segmentation model.
12: end for

2) Class-specific data augmentation: We adopt different TRAs for training samples from different classes. Specifically, we extend the probability distribution to pT = (pT1, ..., pTc), which contains different probability distributions for c classes. In this way, TRA becomes more flexible and powerful, as it gains the ability to draw Ti from different distributions for different classes.
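The class-specific sampling steps of Algorithm 1 can be sketched as below. This is a minimal NumPy illustration of drawing a transformation via the Gumbel-Softmax distribution; the function name, probabilities, and temperature are our own assumptions, not the authors' implementation:

```python
import numpy as np

def gumbel_softmax_sample(probs, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample over K transformation choices from
    the Gumbel-Softmax distribution parameterized by `probs`."""
    rng = rng or np.random.default_rng(0)
    logits = np.log(np.asarray(probs, dtype=float))
    # Standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())       # numerically stable softmax
    return y / y.sum()

# Class-specific TRA: a separate distribution pTj per class (values hypothetical).
pT = {"background": [0.7, 0.2, 0.1], "foreground": [0.2, 0.3, 0.5]}
soft = gumbel_softmax_sample(pT["foreground"])
chosen = int(np.argmax(soft))     # index of the sampled transformation
```

The relaxation keeps the sample differentiable with respect to pT, which is what allows the policy to be updated with meta-gradients.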

Fig. 3. The heuristic data augmentation policy and the learned probability distributions over augmentations based on different segmentation models for brain stroke lesion segmentation with 100% ATLAS training data. We also visualize an example of the transformed foreground (FG) training sample with different sampling distributions for TRA. Our framework provides application-specific and class-specific data augmentation policies. We find that the learned policies apply larger transformations to the FG than to the background (BG) samples, implicitly alleviating the class imbalance issue.

Fig. 4. Visualization of different datasets and segmentation results with different data augmentation methods. The proposed data augmentation framework helps the model produce overall better segmentation results with higher sensitivity. Best viewed in color.

Fig. 5. The learning scheme of the proposed method. When we train the model with the training data DT and the training criterion Ltrain, the model does not always generalize well on the validation data DV. We aim to close this generalization gap explicitly by the joint optimization of class-specific TRA Tj(·) and TEA V(·).

Fig. 6. Illustration of one iteration during the training process. It is consistent with Algorithm 1.

Fig. 9. Proof of concept on TEA with CIFAR-10. We manually add 15 bad transformations to the transformation set, and our method learns to decrease their probabilities during training.

Fig. 10. Visualization of the K=84 choices of transformations for TEA used in this study. We apply different transformations to a test sample of KiTS.

Fig. 11. The sampling distributions for three heuristic TRA policies, which are created by only changing the transformation magnitudes, referring to [9].

Fig. 12. The learned probability distributions over augmentations based on different segmentation models for brain stroke lesion segmentation with 50% ATLAS training data. We also visualize an example of the transformed foreground training sample with different sampling distributions for TRA.

Fig. 19. The histograms of activations of the classification layer when processing training (blue) and test (orange) data of ATLAS with DeepMedic using different data augmentation methods. We visualize the data distributions as the distance of lesion logits to the decision boundary, which is calculated as (z1 − z2)/√2 (logit z1 for background and logit z2 for lesion). TRA and TEA increase the intersection of the training and test data distributions. Our proposed methods make the distributions overlap more.

Fig. 20.
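The statistic plotted in Fig. 19 follows directly from the formula in the caption; a small sketch (the function name is ours):

```python
import numpy as np

def logit_margin(z1, z2):
    """Signed distance of the logits (z1: background, z2: lesion) to the
    two-class decision boundary z1 = z2, computed as (z1 - z2)/sqrt(2).
    Positive values lie on the background side of the boundary."""
    return (np.asarray(z1, dtype=float) - np.asarray(z2, dtype=float)) / np.sqrt(2)
```

Histogramming this scalar over all voxels of the training and test sets yields the distributions whose overlap the figure compares.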

Fig. 21. Validation loss curves of different models trained for kidney and kidney tumor segmentation with KiTS. Compared with heuristic and learned class-agnostic data augmentation, the learned class-specific data augmentation decreases the loss more. This may indicate that class-specific data augmentation is more effective in aligning the training and validation data distributions.

TABLE I
AVERAGE DSC RESULTS OF BOTH DEEPMEDIC AND 3D U-NET FOR DIFFERENT SEGMENTATION TASKS UNDER VARIED SETTINGS USING DIFFERENT DATA AUGMENTATION METHODS. ORGANr IS THE AVERAGE PERFORMANCE OF ALL RARE ORGAN CLASSES.

TABLE II
EVALUATION OF BRAIN STROKE LESION SEGMENTATION ON ATLAS BASED ON DIFFERENT NETWORK ARCHITECTURES WITH DIFFERENT AMOUNTS OF TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.

TABLE III
EVALUATION OF KIDNEY TUMOR SEGMENTATION BASED ON DIFFERENT NETWORK ARCHITECTURES WITH DIFFERENT AMOUNTS OF TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.

TABLE IV
EVALUATION OF ABDOMINAL ORGAN SEGMENTATION BASED ON DIFFERENT NETWORK ARCHITECTURES USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED. AVGr IS THE AVERAGE PERFORMANCE OF ALL RARE CLASSES INCLUDING GB, E, AO, IVC, V, PA, RA AND LA.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA w/o TEA or Heuristic TRA w/ Heuristic TEA).
† We adopt a heuristic TEA policy with a larger probability of identity transformation here because typical ones would decrease the performance.

TABLE V
EVALUATION OF CROSS-SITE PROSTATE SEGMENTATION BASED ON DIFFERENT NETWORK ARCHITECTURES USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
‡ We train these models with both training and validation data.

TABLE VI
LIST OF ALL OPERATIONS THAT OUR METHOD CAN CHOOSE FOR TRA. WE FORMULATE TRA AS A COMPOSITION OF L=15 TYPES OF OPERATIONS, WITH EACH OPERATION HAVING K CHOICES OF MAGNITUDES. K VARIES FROM 2 TO 7.

TABLE VIII
CHOICES OF THE VALIDATION LOSS FUNCTION AND INITIAL LEARNING RATE FOR OPTIMIZING TRA WITH DIFFERENT SETTINGS.

TABLE IX
EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON 3D U-NET WITH 50% TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA w/o TEA).

TABLE X
EVALUATION OF BRAIN STROKE LESION SEGMENTATION ON ATLAS BASED ON NNFORMER WITH 50% TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.

TABLE XII
EVALUATION OF CROSS-SEQUENCE CARDIAC SEGMENTATION BASED ON 3D U-NET USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA‡ w/o TEA or Heuristic TRA‡ w/ Heuristic TEA).
‡ We train these models with both training data collected with bSSFP and validation data collected with LGE.

TABLE XIII
EVALUATION OF CROSS-SITE CARDIAC SEGMENTATION BASED ON 3D U-NET USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA w/o TEA or Heuristic TRA w/ Heuristic TEA).

TABLE XIV
EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON DEEPMEDIC WITH 50% TRAINING DATA OPTIMIZED USING DIFFERENT AMOUNTS OF VALIDATION SAMPLES. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared with Heuristic TRA w/ Heuristic TEA).

TABLE XV
EVALUATION OF CROSS-SITE PROSTATE SEGMENTATION BASED ON 3D U-NET USING DIFFERENT DATA AUGMENTATION METHODS. THE MODELS ARE TRAINED AND DEPLOYED USING SAMPLES CROPPED WITH A SMALL PATCH SIZE (48×48×24). BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.

TABLE XVI
THREE-FOLD EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON DEEPMEDIC WITH 50% TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA w/o TEA or Heuristic TRA w/ Heuristic TEA).

TABLE XVII
EVALUATION OF BRAIN STROKE LESION SEGMENTATION ON ATLAS BASED ON 3D U-NET WITH 50% TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. THE RESULTS ARE CALCULATED WITH POST-PROCESSING. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.
* p-value < 0.05; ** p-value < 0.01; ∼ p-value ≥ 0.05 (compared to Heuristic TRA w/o TEA or Heuristic TRA w/ Heuristic TEA).

TABLE XVIII
EVALUATION OF KIDNEY AND KIDNEY TUMOR SEGMENTATION BASED ON 3D U-NET WITH 50% TRAINING DATA USING DIFFERENT DATA AUGMENTATION METHODS. THE RESULTS ARE CALCULATED WITH POST-PROCESSING. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.

TABLE XIX
EVALUATION OF PROSTATE SEGMENTATION WITH UNSEEN DATA BASED ON 3D U-NET USING DIFFERENT DATA AUGMENTATION METHODS. BEST AND SECOND BEST RESULTS ARE IN BOLD, WITH THE BEST ALSO UNDERLINED.