Evaluating the Predictive Value of Glioma Growth Models for Low-Grade Glioma After Tumor Resection

Tumor growth models have the potential to model and predict the spatiotemporal evolution of glioma in individual patients. Infiltration of glioma cells is known to be faster along the white matter tracts, and therefore structural magnetic resonance imaging (MRI) and diffusion tensor imaging (DTI) can be used to inform the model. However, applying and evaluating growth models in real patient data is challenging. In this work, we propose to formulate the problem of tumor growth as a ranking problem, as opposed to a segmentation problem, and use the average precision (AP) as a performance metric. This enables an evaluation of the spatial pattern that does not require a volume cut-off value. Using the AP metric, we evaluate diffusion-proliferation models informed by structural MRI and DTI, after tumor resection. We applied the models to a unique longitudinal dataset of 14 patients with low-grade glioma (LGG), who received no treatment after surgical resection, to predict the recurrent tumor shape after tumor resection. The diffusion models informed by structural MRI and DTI showed a small but significant increase in predictive performance with respect to homogeneous isotropic diffusion, and the DTI-informed model reached the best predictive performance. We conclude there is a significant improvement in the prediction of the recurrent tumor shape when using a DTI-informed anisotropic diffusion model with respect to istropic diffusion, and that the AP is a suitable metric to evaluate these models. All code and data used in this publication are made publicly available.


I. INTRODUCTION
A S THE automated image-based diagnosis and delineation of glioma has improved, partially due to the emergence of machine learning techniques and the availability of large public datasets [1], the next major step is that of predicting the disease trajectory.A long history of modelling tumor growth as a biophysical process of diffusion and proliferation has shown promise in predicting the development of glioma in real patient data [2].However, there is currently no consensus in how to precisely formulate and evaluate the tumor growth problem.The large variety in approaches, from image processing to model fitting, and the lack of public data make it difficult to compare models and estimate their predictive value.Furthermore, the predictive value of tumor growth models in treated low-grade glioma (LGG) is not well studied.
Predicting the spatial patterns of tumor growth can aid diagnosis and treatment in several ways [2].Local treatment such as radiotherapy can be informed by the most likely pattern of recurrence and the location and extent of tumor infiltration beyond the visible boundaries [3], [4], [5].Additionally, a growth model can be used to aid automated image analysis methods such as tumor segmentation and image registration [6], [7].Furthermore, especially in the case of models that are rooted in the biophysical understanding of tumor growth, a better prediction indicates a better understanding of the disease or a better fit to specific patient cases.Knowing the infiltrative behavior of glioma cells, and especially being able to differentiate that behavior between patients, could have prognostic value [8], [9].Furthermore, an accurate model of tumor growth could aid in the differentiation between true progression and pseudoprogression by identifying changes that do not adhere to the expected growth pattern [10].
The modelling of LGG especially presents a large potential for clinical application.Patients presenting with LGG have a better prognosis compared to high-grade glioma (HGG), but there is no consensus on the optimal treatment [11].A pro-active treatment regimen might be effective at increasing overall survival, but there is also a risk of unnecessary treatment burden.With their potential to predict and increase the efficacy of treatment, predictive modelling may be of value in the application to LGG [9].
The problem of tumor growth is challenging in many ways, not in the least due to limited observations used to calibrate and evaluate tumor growth models.Magnetic resonance (MR) imaging is commonly used as both the input for calibration of model parameters and evaluation of results, as it provides a non-invasive visualization of both the tumor and surrounding tissue.However, the relation between imaging characteristics and actual tumor cell density is difficult to characterize exactly [12].While more precise observations exist, such as tissue samples [4] or PET imaging [3], for most clinical cases the best available approximation of glioma infiltration is the delineation of abnormalities on MR imaging.It is well-recognized that LGG show differences in growth behavior from HGG.In LGG, often the entire lesion is non-enhancing, but tumor cells are known to infiltrate even beyond the visible boundaries on MR imaging [13].A number of studies have evaluated growth models on patients with LGG [14], [15], [16], but the lower incidence and slower growth compared to HGG make it difficult to accumulate large datasets and to observe growth patterns accurately.
Due to the nature of the available data, growth predictions are often evaluated as a segmentation problem, for example by using an overlap metric such as the Dice similarity coefficient (DSC) based on a single sample in time [17], [18].Although this metric comes natural to the ground-truth data, it may be less representative of the underlying problem.The main disadvantage of overlap-based metrics is that they treat all voxels equally, independent of their location.Intuitively, we would want to assign more significance to false predictions at a large distance to the predicted tumor boundary, as changing their prediction would require a larger change to the model.This intuition is represented in metrics based on the segmentation boundary, such as the symmetric surface distance [15], but even a distance metric considers only a single binary prediction.Using a boundary metric also becomes less appropriate when the ground truth contains new, disconnected lesions.
The personalized fitting of model parameters provides an additional challenge in the comparison of different models.A good prediction is the result of both a good fit to the initial situation and a prediction of future behavior.Although the ideal model would fit both perfectly, more complex models are at an advantage with more degrees of freedom to fit the initial tumor shape.Given the ill-posedness of that initial problem, there is no guarantee that the improved fit always translates to a better estimate of future behavior.At the same time, the nature of the problem makes it difficult to distinguish the two, especially when there is limited growth.
In this work, we present a novel approach to the evaluation of tumor growth models and aim to evaluate whether tumor growth models have predictive value in LGG, compared to a trivial baseline of isotropic expansion.Our main contributions are: 1) We propose a novel evaluation metric for tumor growth predictions that elimates the need for a specific volume threshold.As the problem of spatial infiltration is in its essence a ranking problem, we propose to use the average precision as an evaluation metric.2) Using this metric, we compare models with different assumptions on the spatial diffusion characteristics.Instead of applying patient-wise tuning of the model parameters, which may cause a bias towards models with more degrees of freedom, we selected parameters that represent common assumptions in diffusion-proliferation models and compare their predictive performance in a clinically representative dataset.3) All code and data used in our experiments are made publicly available, to aid future innovations in this field.This work builds upon a preliminary version presented at the 2021 MICCAI Brainlesion workshop [19].Although the approach to the evaluation as a ranking problem is repeated in this work, we have refined the model definition, evaluation, image processing and patient selection.

A. Tumor Growth Model
The tumor growth models in this work are all based on a diffusion-proliferation model with parameters to include both an isotropic and anisotropic diffusion component.The model is defined by a partial differential equation for the cell density c, which is updated with each timestep dt according to: where ρ is the growth factor, n δ is the normal vector at the boundary between the brain and cerebrospinal fluid (CSF), and D is a tensor comprising an isotropic and anisotropic component: where κ and τ are parameters to weigh the two components, I is the identity matrix, F(x) is the local fractional anisotropy (FA) and T is the normalized diffusion tensor obtained by dividing the diffusion tensor by the mean diffusivity (as described in [20]).The isotropic diffusion depends on the local tissue type through separate diffusion factors κ w and κ g for white matter and grey matter respectively (as described in [21]).Because the tissue segmentation may contain partial volume effects, the two diffusion parameters are weighted by the local tissue probabilities p w and p g : The brain boundary is also derived from the local tissue probabilities by setting a threshold p b on the combined tissue probabilities p w (x) + p g (x) < p b = 0.8.This threshold was chosen so that the sulci are optimally visible in the CSF segmentation, preventing the model from growing across the cortical folds.
From the prediction c(t, x) we can derive a segmentation S(t) by applying a visibility threshold c(t, x) > c v , which is set at c v = 0.5 in this work.The initial condition of the model is provided by an initial cell density c(t = 0) which is defined as a gaussian distribution centered at a seed location x s with a standard deviation of 1mm.When applying to patient data, the seed location is set to the center of gravity of the initial tumor segmentation.
The model was implemented in FEniCS [22] in a cubic mesh of 1mm isotropic cells, using a finite element approach and Crank-Nicolson approximations for the time stepping.The model has four parameters (ρ, τ , κ w , κ g ) and an additional implicit parameter in the form of the seed location x s .

B. Model Selection
Generally, each patient will present with a different rate of diffusion and proliferation, and the variation in relative diffusivities in white and gray matter is not known.However, to fit both the initial location and model parameters on a single observation of the tumor is an ill-posed inverse problem.Rather than optimizing the parameters for each individual patient, we opted to design three different models: • BASE As a baseline.The diffusion tensor is isotropic (τ = 0) and the same in both gray and white matter (κ w = κ g ), and only limited by the boundaries of the brain.
• TISSUE This model also has an isotropic diffusion tensor (τ = 0), but the rate of diffusion depends on the tissue type (κ w > κ g ).
• DTI The DTI model has an anisotropic diffusion tensor (τ > 0), informed by the local DTI tensor and FA measurement.The weight of the isotropic element of the diffusion tensor (κ(x)) is the same for both gray and white matter (κ g = κ w ), assuming that the difference in diffusion is captured in the anisotropic element.The parameters (ρ, τ , κ g , κ w ) were selected to achieve a similar relative diffusivity (ρ/ Tr(D)) in white matter and overall growth speed, while showing a clear difference in tumor shape.The tissue type and local DTI measurements were inferred from a healthy brain atlas (see Section II-F).Table I shows the tuned parameter settings for the three growth models.

C. Evaluation Metric
In this section, we propose that tumor growth prediction could be framed as a ranking problem, aimed at predicting the relative time-to-invasion of each voxel in the brain.

TABLE I GROWTH PARAMETERS SELECTED FOR THE DIFFERENT MODELS
We assume that a growth model could produce a segmentation of the tumor S(t) at any time t > 0. It may therefore assign to every location x in the brain a time T (x), which is the first time t when the tumor reaches that location.We only require that the estimated T (x) is a ranking of voxels in the brain, such that: Based on this perspective, we propose to use the average precision (AP) as an evaluation metric to assess the spatial accuracy of growth prediction.This metric separates the spatial accuracy from the temporal axis, such that an accurate estimate of the growth speed is not required.This is deliberate, as an estimate of both spatial and temporal growth, from a single initial scan, is beyond the possibilities of most growth models.A separate metric could be used to evaluate the temporal accuracy.The ranking can be evaluated with the ground-truth segmentation S ′ using the AP, which is defined as the area under the precision-recall (PR) curve: where R and P are the recall and precision, and t is a threshold on the time-to-invasion ranking T , leading to the predicted segmentation S(t) = {x : T (x) ≤ t}, and comparing to the reference segmentation S ′ : Note that although S ′ is time-dependent, as it depends on the time at which the patient was scanned, t is defined as a threshold on the prediction and unrelated to the actual timing of S ′ .The AP metric weighs the precision scores with the difference in recall, so that all tumor volume predictions S(t) are taken into account from the tumor onset to the threshold where the recall is one, which means that the ground-truth segmentation is completely encompassed by the prediction.Although it is possible that the prediction never reaches a perfect precision, e.g.due to poor initialization, we assume that a perfect recall is always possible since each voxel would be assigned some t < ∞.
To compare the AP to an overlap-based metric, we consider using the dice similarity coefficient (DSC) at a threshold of equal volume.An evaluation based on a single threshold t would represent a point on the PR curve.If we take a sample at the threshold t v where the estimated tumor volume equals the observed tumor volume ( , the recall is equal to the precision (R(t v ) = P(t v )) and therefore the equal-volumebased DSC (DSC v ) can be expressed as: When comparing two models in terms of their prediction, the AP takes into account the precision and recall at each possible cut-off value in T (x).Even if the binary prediction of a voxel is the same for both models in terms of the DSC v , a change in its rank T (x) will still affect the AP metric.On the other hand, voxels that are wrongly predicted according to the DSC v , but are ranked close to the decision threshold t v , will have a relatively small effect on the A P compared to voxels that are either ranked very early but negative (causing the PR curve to drop early) or ranked very late but positive (causing a low tail of the PR curve).This is illustrated in Fig. 1.
Formulating the problem as a ranking and using the AP has a number of additional qualitative advantages.First, the ranking T is correlated to the speed of the tumor boundary.If the ranking is smooth, the gradient of T represents the local movement of the visible tumor boundary.Second, we can quantify the agreement between T and S locally, by using the rank of each voxel T (x) as a threshold on the PR curve and calculating the local precision P(t = T (x)) and recall R(t = T (x)).Fig. 2 illustrates the computation of the AP metric and the resulting local precision and recall.

D. Simulations
To illustrate these models and their effect on tumor shape and growth speed, simulations were performed with different seed locations both in the deep white matter and cortical gray matter of a healthy brain atlas.The tumor growth was tracked in terms of mean tumor diameter (MTD) (M T D = 2(3V /4π ) 1/3 , where V is the volume |S(t)|), which is known to increase approximately linearly with time, to compare the effective growth rates in simulations.The effective relative diffusivity can be estimated from the local gradient of c(x), which will be steeper in less diffuse models.Therefore the cell densities c(x) were compared at the time t where the volume |S(t)| was approximately 20 mL for all models.To quantify the effect on tumor shape, the resulting tumor shapes S from the different models were compared in terms of the A P and DSC v .For this comparison, the B AS E model was used as a ground truth S ′ to evaluate the rankings T (x) generated by the DT I and T I SSU E models.In order to investigate the effect of the timing of the follow-up scan, several different cut-off volumes were used to generate S ′ .

E. Patient Data
A retrospective dataset was selected from Erasmus MC of patients referred for awake craniotomy who a) were diagnosed with a low-grade, IDH-mutant glioma and b) were treated with surgical resection, but received no chemo-or radiotherapy.Three MRI scans were selected: a pre-operative scan (t 0 ) used for treatment planning, which includes a 3D T1-weighted (T1w) scan and a 2D or 3D T2-weighted (T2w), and two scans acquired during follow-up (t 1 and t 2 ) both with a pre-and postcontrast T1w scan, a T2w scan and a T2w-FLAIR scan (2D or 3D).A DTI scan was not required as the healthy brain template was provided by an atlas.
Patients treated with radiotherapy were excluded due to the possibility of treatment effects, making it impossible to accurately distinguish recurrent tumor growth from treatmentinduced abnormalities.The pre-operative scan was used for the initial tumor shape S ′ 0 , while the first available post-operative scan was used to measure the residual volume (|S ′ 1 |).The last available follow-up scan before the start of a new treatment was selected to define the recurrent tumor S ′ 2 .Patients were excluded from the analysis if the measured tumor growth (|S ′ 2 | − |S ′ 1 |) was less than 5 mL.Any contrast enhancement at t 2 was noted but not a reason for exclusion.
This study design was reviewed by the Erasmus MC Medical Ethics Committee (MEC-2016-419) and performed according to the declaration of Helsinki and the Dutch regulations on medical research.

F. Image Processing and Annotation
The application and evaluation of a tumor growth model requires a healthy brain template, since the tumor infiltration affects the diffusion properties and appearance of the brain anatomy.This is commonly approximated by registering to an atlas [3], [6], [21].In this study, the IIT Human Brain Atlas was used [23], [24] as a template for the DTI tensor and tissue probabilities ( p w , p g ).The patient-specific imaging was used to establish the tumor boundaries and to exclude resected regions of the brain.
The T1w scans at t 0 and t 2 were registered to the atlas using Elastix [25], [26].As the tumor growth and resection may cause large deformations of the brain, a non-rigid registration was applied after the initial rigid and affine deformations.All registration steps used the mutual information as a metric, and the final non-rigid registration was performed using a b-spline transformation with a 25mm grid spacing.As a sanity check, and to estimate the effect of the non-rigid registration, the change in tumor volumes due to non-rigid registration Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.was computed.A decrease in volume would indicate that the registration has compensated for mass effect in the tumor, while an increase in volume would be unexpected as it represents a shrinking of the tumor-infiltrated tissue.Additionally, the tumor outlines after registration were compared to the original images to visually inspect the registration quality.A heatmap was generated to visualize the combined spatial distribution of the dataset.
Voxels outside of the brain (i.e.CSF) were masked from the evaluation, which means that those voxels were excluded entirely in the computation of the DSC v and AP.In preoperative imaging (t 0 ) the boundaries of the brain were determined from the atlas as described in section II-A.For the post-operative imaging (t 2 ), a patient-specific CSF mask was needed to capture the resection cavity.For this purpose, FSL FAST [27] was used to extract the CSF from the T2w-FLAIR scan.FAST can be used to cluster brain voxels in a predetermined number of classes, depending on the tissue contrast and presence of lesions.Due to the variation in intensity in the lesion and healthy tissue within the T2w-FLAIR image, we found a total of 6 tissue classes to be optimal to separate the CSF.
For the pre-operative images, which did not include a T2w-FLAIR sequence, the initial tumor S ′ 0 was segmented manually on the T2w scan.Tumor segmentations S ′ 1 and S ′ 2 for consecutive images were produced using HD-GLIO [28], [29] and corrected manually, if needed, using ITK-Snap [30].

G. Evaluation
The three models, BASE, TISSUE, and DTI, were applied to the patient data by setting the initial location x s to the center of gravity of the initial tumor segmentation S ′ 0 .The rankings T (x) generated by the growth models were compared to the initial tumor segmentation S ′ 0 and follow-up segmentation S ′ 2 , in terms of the AP and the DSC v .The reference segmentations S ′ 0 and S ′ 2 and the volume threshold used for the DSC v metric were computed after non-rigid registration to the atlas and masking of voxels outside the atlas brain volume or in the CSF (for S ′ 2 ), as described in section II-F.The performance metrics were compared using the Wilcoxon signed rank test, using p< 0.05 as a threshold for significance.To investigate the effect of the brain boundaries on model performance, we repeated the evaluation with p b = 0.5 and p b = 0.9.

H. Data Availability
All code was made publicly available through Gitlab. 1he imaging data and patient-specific results, such as the segmentations, registered volumes and predictions, were made publicly available through Health-RI XNAT. 2

A. Simulations
The resulting growth patterns of the simulation experiments, started from three different seed locations, are visualized in Fig. 3 in terms of the cell density distribution c(x) and in Fig. 4 in terms of the growth speed.We can observe that the effective growth speed (MTD over time) and the effective diffusivity, which is represented by the gradient of c(x) at the edge of the tumor, are similar between models.Fig. 5 compares the isodensity contours of the different models at the three seed locations.Fig. 6 shows the error with respect to the B AS E model in terms of the DSC and AP metrics as it depends on the volume of the ground-truth segmentation S ′ .Based on these simulations, the estimated difference between the models is in the order of 5% for DSC v or 1% for AP, that does not change much beyond a reference volume of 20mL.

B. Patient Data
A total of 14 patients were included.The general characteristics of these patients are listed in Table II.The volumes of S ′ 0 , S ′ 1 and S ′ 2 are visualized with the relative scan date in Fig. 7.Note that the time difference between the resection and the scan selected for follow-up (t 2 ) varied from two to ten years.Two patients presented with a small nodular contrast    III.
The different tumor locations at baseline are visualized as a heatmap in the atlas space in Fig. 8, which shows that the tumors included in this dataset are mostly located in or near eloquent areas.

C. Image Registration
In all patients, the non-rigid transformation to atlas space caused a reduction in tumor volume with respect to the affine transformation (10-47%, mean 29%).For the followup scans, deformations were caused mostly by the resection     (DSC v : p=0.17,AP: p=0.08).When considering S ′ 2 , the DTI model was significantly better than the BASE and TISSUE model (p<0.001) in terms of both DSC v and A P. The TISSUE model showed a significantly higher performance than the BASE model in terms of the AP (p=0.035), but not in terms of the DSC v (p=0.43).Fig. 10 shows a visualization of the model result for patient 02.

TABLE III RANGE (MIN, MAX) OF SCANNER PARAMETERS FOR EACH SEQUENCE, PREOPERATIVELY (t 0 ) AND AT FOLLOW-UP (t 1 AND
The results for different values of p b are listed in Table IV.Although the threshold affects the performance of different

IV. DISCUSSION
In this work we aimed to evaluate the predictive value of tumor growth models in LGG after resection and thereby propose a novel approach that formulates tumor growth as a ranking problem.Using this approach, we investigated whether a diffusion-proliferation model based on a tissue segmentation or DTI measurements improves the prediction with respect to a baseline of homogeneous isotropic diffusion throughout the entire brain.Special attention was given to a careful comparison that is not biased through individual parameter tuning, in order to test the general hypothesis underlying the models.From the results in patient data, we found a significant improvement in the prediction of the recurrent tumor shape when using a DTI-informed anisotropic diffusion model.Although the model informed by structural tissue segmentation improved upon the baseline of homogeneous diffusion, it was less effective than the DTI-informed model.When looking at the initial tumor, the only clear difference was found between the DTI-and tissue-informed model, while neither were significantly better or worse at fitting the initial tumor shape than the baseline.In interpreting these results we have to consider the selection of the seed location, which was at the center of the initial tumor segmentation.This choice may have caused a bias towards the baseline model with homogeneous isotropic growth.The fact that DTI information provides an improved prediction of growth is in line with existing research on the direction of glioma growth [17], [31].In this work we have shown that this improvement is apparent even without any individual parameter tuning, and with respect to a baseline model.
Due to the strict requirements for inclusion (availability of imaging, no treatment), our dataset was relatively small, limiting the detection sensitivity for subtle performance differences between models.Collection of larger datasets, if possible in a prospective setting with harmonized image acquisition protocols, would therefore be desirable, in order to enable verification of our findings and further in-depth comparative studies of tumor growth models using our proposed evaluation framework.Furthermore, for clinical relevance of these methods it is important that they can be applied to patients treated with radiotherapy.If we are able to predict the pattern of tumor growth accurately, this could help to diagnose treatment-associated changes appearing outside the expected pattern of recurrence.However, it is currently impossible to make a reliable distinction between treatment-associated changes and progressive tumor.By excluding patients treated with radiotherapy we can be certain that the ground-truth only consists of tumor growth, and is not polluted by treatmentassociated changes.Therefore, in order to use this model for patients treated with radiotherapy in the future, we think it is better to exclude them in this stage of model development.
In terms of the growth model, we may expect further improvement of the prediction when we tune the model parameters to individual patients, especially if we optimize the initial location together with the other parameters.The fitting of growth model parameters, including the point of onset, has been an active research topic for many years [2].The general consensus is that the problem is ill-posed when using a single observation, and it is therefore impossible to estimate all parameters with any degree of certainty [20].However, the main motivation for not tuning model parameters in this work Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply. is the risk of bias.In terms of model optimization, we know that models with more degrees of freedom will be better able to fit individual cases.As a strict separation of fit and prediction is not possible in personalized growth models, considering that S ′ 0 is at least partially included in S ′ 2 , this could lead to a bias towards more complex In work, we used the information embedded in S ′ 0 only to estimate the seed location, and by using the center of gravity the main risk would be a bias towards the baseline model with homogeneous isotropic diffusion.We expect that further improvement can be made by an optimization of the initial location and model parameters, as well as more refined models, but to present a novel model with optimal prediction is not the aim of this work.
Besides the evaluation of tumor growth models in clinical data, this work also proposes a novel problem formulation for tumor growth, which uses the AP as an evaluation metric.The main benefit of the AP over distance-or overlap-based evaluation is that it matches the spatiotemporal nature of the problem and does not rely on a specific cutoff in volume or time.Although the advantages are qualitative, we think that this problem formulation is a useful step forward.In the patient-wise results, we do notice that the choice of metric can affect the relative performance of models.The benefit of the DTI-based model is more consistent in the AP metric than in the DSC v metric, possibly due to its effects at the more distant infiltration that is not captured by any of the models when using a volume-based threshold.This is also reflected when testing for significant differences between models, where the AP metric shows a higher level of significance in some of the comparisons.This could indicate that the AP metric is more sensitive to differences between models, and less affected by other factors, although we should be careful in attaching strong conclusions to a difference in p-values.In general, there is no way of objectively comparing one evaluation metric to the other, as the metric should match the (clinical) purpose of the prediction.If the aim is to design, for example, a clinical target volume (CTV) for radiotherapy, a different metric could be preferred.In the design of the CTV the aim would be to keep the irradiated volume to a minimum while covering as much of the (potential) recurrent tumor as possible, so a high recall is required [5].When using the DSC v , the cutoff is chosen so that precision and recall are equal, while the AP does not assume a preferred trade-off between precision and recall.
As optimization plays an important role in this field, both in parameter inversion and statistical models for growth [18], [32], [33], the AP could thus be considered as an alternative target for optimization in future work.Although the biophysical growth models in this work derive the ranking from a moving boundary, linking the ranking explicitly to the predicted time-to-invasion, the same evaluation could be applied to the output of statistical model such as a neural network.In future work, we envision that a loss term based on the AP, combined with a larger longitudinal dataset, could be used to develop solutions based on deep learning.
With most research on growth models being aimed at HGG, it is unclear how well those methods generalize to LGG.From the results in this work, we can conclude that the models informed by structural imaging and DTI do show a clear improvement with respect to a baseline of isotropic growth in this patient group.However, we must also conclude that discrepancies between models in terms of prediction are small compared to the actual error.This raises the question whether the cause of the error is in the model or due to other factors.One important factor in the error is the registration to the atlas space.By using a non-rigid deformation, we tried to capture the deformation caused by mass effect and resection.Although a ground-truth is not available for the registration problem, we can observe from visualizations that the non-rigid step improves the alignment.However, it is clear that a more accurate non-rigid atlas registration would improve the accuracy of the model and its evaluation.A quantitative estimate of the error due to misalignment is difficult to achieve, but research on the topic reports a mean error of up to 3 mm for state-of-the-art deformable models [34].
The dataset used in this research was a selection of patients that underwent surgical resection, but no radio-or chemotherapy.Although it is fair to assume that the diffusive behavior of the tumor is not affected during the surgery, so the model parameters would stay the same, the future growth pattern can be affected by the removal of tumor tissue.Including the resection in the model, e.g. by removing tissue during the simulation, could further improve the predictive performance.The decompression that occurs at resection also complicates the registration of post-operative imaging, leading to registration errors.However, with surgical resection being the recommended primary treatment for LGG [35], this is a complicating factor that cannot be avoided.It is also possible to include treatment effects in a biophysical growth model, including chemo-and radiotherapy [17], but in this work we have chosen to exclude patients who received radiotherapy due to its effect on imaging.For the evaluation of growth models across treatment, it is essential to accurately distinguish treatment effects from tumor growth.A limitation of excluding patients with further treatment is that it results in a selection bias towards patients with a relatively good prognosis.Furthermore, due to the selection of patients referred for awake surgery, there is a bias towards tumors located in or near eloquent areas.
LGG are known to remain stable for long periods of time, or grow at a slow and constant speed, but also to show sudden accelerated growth as a result of malignant transformation.In this work, we have aimed the evaluation exclusively at the spatial growth pattern, therefore allowing for variations in growth speed.However, it is likely that malignant transformation also affects the spatial growth pattern.Although two patients presented with nodular contrast enhancement at progression, indicating potential malignant transformation, we were not able to assess the tumor grade at progression due to the absence of tissue verification.Also, it is possible that patients were included who would be considered grade 4 according to the CNS-WHO 2021 classification [36] due to the homozygous deletion of CDKN2A/B, since this molecular assessment was not available in these patients.In general, it would be interesting to study the effect of differences in grade and changes in malignancy over time on the spatial growth pattern.Growth models can provide an important tool to answer such research questions as they offer a general framework to quantify growth patterns.

V. CONCLUSION
To conclude, in this work we presented a novel formulation of the tumor growth problem that fits the spatiotemporal nature of the prediction.From this formulation, the use of average precision (i.e.area under the precision-recall curve) as an evaluation metric follows naturally.This metric was used to compare common diffusion-proliferation models in their prediction of LGG growth after surgery.By avoiding individual parameter tuning, we are able to make an unbiased comparison to a baseline model of homogeneous isotropic diffusion.In this comparison, we conclude that there is a significant improvement in the prediction of the recurrent tumor shape when using a DTI-informed anisotropic diffusion model as opposed to an isotropic diffusion model.Through a novel evaluation method and the publication of code and data, we enable a better comparison of growth models.

Fig. 1 .
Fig. 1.Illustration of the AP metric for two model predictions A and B, evaluated with the same ground-truth segmentation (shown in grey).Dotted outlines indicate predictions at different times, so thresholds on T(x).The red dotted outline illustrates the volume-based threshold for DSC v .The purple (outer) dotted outline shows the threshold at maximum recall.The orange (inner) dotted outline indicates the maximum threshold with perfect precision.The precision-recall curve for both models is illustrated on the right, with the changes at the three thresholds indicated by an arrow in their respective colors.The separate lesion does not affect the DSC v between model A and B, but it does have an effect on the AP.Arrows indicate the early drop in precision and tail of low recall in model A and their corresponding causes.

Fig. 2 .
Fig. 2. Left: cross-section with thresholds on the prediction T(x) generated by a tumor growth model (patient ID 04, BASE model).The ground-truth segmentation S ′ is indicated by a red overlay.Middle: corresponding precision-recall curve with the same thresholds indicated by vertical lines.The value of DSC v , achieved by thresholding the prediction at equal volume, is indicated by a red x.Right: quantification of agreement by R(T(x)) outside S ′ and P(T(x))) for voxels inside S ′ .The boundary of S ′ is indicated by a red outline.

Fig. 3 .Fig. 4 .
Fig. 3. Comparison of models in terms cell density distribution c(x) at t = 900, where |S(t)| = 20mL approximately, with the isocontour c v indicated by a blue line.(Left: BASE, Middle: TISSUE, Right: DTI.) enhancement at time of follow-up.The overview of MR scan parameters is shown in Table

Fig. 5 .
Fig. 5. Comparison of models in terms of shape of the resulting segmentation obtained in simulations with three different seed locations x s .Isodensity contours of c v at 20mL, shown as an outline on the healthy brain atlas (T1w).Intersection of the white dotted lines is the seed location.

Fig. 9
Fig. 9 shows summarized measures of the model performance in patient data.The mean DSC v scores in the initial tumor (S ′ 0 ) were 0.72, 0.70 and 0.72 for the BASE, TISSUE and DTI model respectively, and 0.59, 0.60 and 0.63 for

t 2 )Fig. 6 .
Fig. 6. of DTI and TISSUE models with respect to the simulated ground truth from the BASE model measured in DSC v (top) and AP (bottom) versus volume, for each of the three seed locations shown above.

Fig. 7 .
Fig. 7. Tumor volumes measured at pre-operative imaging (S ′ 0 ), at postoperative imaging (S ′ 1 ) and at follow-up (S ′ 2 ).Volumes from the same patient are connected by a line.The time difference with respect to the resection is indicated on the x-axis.If nodular enhancement was found at follow-up, this is indicated by an enlarged 'x' marker.

Fig. 8 .
Fig. 8. Heatmap of the registered baseline segmentations S ′ 0 to the atlas space for all patients combined.

Fig. 10 .
Fig. 10.Individual model result for patient 02 at S ′ 2 .Top left and top middle: ground-truth segmentation and model results for an axial and coronal slice.The registered ground-truth segmentation is shown in a red overlay; the model rankings thresholded at equal volume are shown as outlines.The voxels omitted from the evaluation, due to being classified as CSF in the atlas or on the follow-up T2w-FLAIR scan, are masked.Top right: precisionrecall curves of the three models, with the threshold at equal volume, where precision and recall equal the Dice similarity coefficient (DSC v ), indicated by a black 'x'.Bottom: local precision and recall values for different models, as described in section II-C.Three locations with notable differences between models are indicated by a blue arrow.

TABLE II CLINICAL
DETAILS OF INCLUDED PATIENTS.TIME IS RELATIVE TO THE DATE OF RESECTION.CE = CONTRAST ENHANCEMENT

TABLE IV PERFORMANCE
METRICS OF MODELS IN PATIENT DATA, MEASURED IN AP (MEDIAN [IQR]) FOR DIFFERENT VALUES OF p b models, the general trend was unchanged.The DTI model performed significantly better in terms of AP than BASE and TISSUE for each threshold value at S ′ 2 .S 0 , the DTI model performed significantly better than TISSUE at each threshold value, but never significantly different from BASE.