Improving the Timing Resolution of Positron Emission Tomography Detectors Using Boosted Learning -- A Residual Physics Approach

Artificial intelligence (AI) is entering medical imaging, mainly enhancing image reconstruction. Nevertheless, improvements throughout the entire processing chain, from signal detection to computation, potentially offer significant benefits. This work presents a novel and versatile approach to detector optimization using machine learning (ML) and residual physics. We apply the concept to positron emission tomography (PET), intending to improve the coincidence time resolution (CTR). PET visualizes metabolic processes in the body by detecting photons with scintillation detectors. Improved CTR performance offers the advantage of reducing radioactive dose exposure for patients. Modern PET detectors with sophisticated concepts and read-out topologies represent complex physical and electronic systems requiring dedicated calibration techniques. Traditional methods primarily depend on analytical formulations successfully describing the main detector characteristics. However, when accounting for higher-order effects, additional complexities arise in matching theoretical models to experimental reality. Our work addresses this challenge by combining traditional calibration with AI and residual physics, presenting a highly promising approach. We present a residual physics-based strategy using gradient tree boosting and physics-guided data generation. The explainable AI framework SHapley Additive exPlanations (SHAP) was used to relate learned patterns to known physical effects. In addition, the models were tested against basic physical laws. We were able to improve the CTR significantly (by more than 20%) for clinically relevant detectors of 19 mm height, reaching CTRs of 185 ps (450-550 keV).


I. INTRODUCTION
ARTIFICIAL intelligence (AI) is finding its way more and more into medical imaging [1,2], including the research field around positron emission tomography (PET) [3]. In contrast to computed tomography (CT) [4] or magnetic resonance imaging (MRI) [5], PET is a functional imaging technique that does not reproduce anatomical structures but can visualize metabolic processes in the body. PET uses the effect of electron-positron annihilation to obtain information about processes within the object of interest. A radioactive tracer is administered to the patient, accumulating in highly metabolic regions and emitting positrons [6]. These positrons annihilate with the surrounding tissue, producing two γ-photons emitted back-to-back, defining a line-of-response (LOR). The γ-photons are subsequently registered in coincidence by a PET scanner (see Fig. 1) equipped with scintillation detectors, which convert the γ-photons into many optical photons in the visible light range that can be measured with a photosensor [7]. Analog silicon photomultipliers (SiPMs) utilize an application-specific integrated circuit (ASIC) to digitize the signal pulses, whereas digital SiPMs perform digitization at the single photon avalanche diode (SPAD) level.

Figure 1. Principle of PET. A radioactive tracer (red blob) is administered to a patient (bluish). By detecting the created back-to-back γ-photons, defining many LORs (red lines), with the detector ring (grayish), spatial and temporal information can be inferred to reconstruct a PET image. Modern scanners utilize TOF information to estimate the annihilation point on the LOR.

Based on the detection information, especially the spatial and the temporal information, a PET image can be reconstructed. While the application of neural networks in medical imaging usually focuses on the image reconstruction process [8-10], improvements along the complete imaging chain, from detecting physical signals [11] to processing the resulting data, can improve the resulting image to facilitate medical diagnoses. In this work, we show the application of learning algorithms in the context of residual physics at the detector level to significantly improve the achievable coincidence time resolution (CTR). Especially in medical and physical applications, it is desired to get insight into the inner workings of models to ensure that the algorithms can capture meaningful relations. Therefore, we use eXplainable AI (XAI) [12-15] methods to check whether trained models are able to understand simple physical constraints implied by the data generation. State-of-the-art clinical PET scanners combine high spatial resolution with precise time-of-flight (TOF) information (see Fig. 1). Including the timing information in the image reconstruction process provokes an improvement in the signal-to-noise ratio (SNR) of the image [16] without increasing the radioactive dose and therefore also improves lesion detectability [17]. Most PET systems [18,19] utilize segmented scintillator topologies (see Fig. 2)
due to the readout simplicity and very good timing performance. Contrary to this, (semi-)monolithic detector concepts spread the light over multiple channels. Recently, they have gained attention [20,21] as they provide high spatial resolution [22-24] but also offer intrinsic depth of interaction (DOI) capabilities [25,26], thus reducing parallax errors at reduced costs compared to segmented topologies. While, e.g., the γ-positioning strongly profits from the spread detection information, it creates disadvantages for the timing performance due to an enhancement of timewalk effects [27-29] and jitter in signal-propagation times [30,31], which deteriorate the CTR. Therefore, monolithic detector concepts demand advanced readout algorithms and calibration routines to infer the needed quantities from the detected optical information. Due to the light-spreading characteristic of (semi-)monolithic detectors, many approaches use machine or deep learning techniques, e.g., to infer the γ-interaction position within the scintillator volume. This strategy suggests itself since the detected optical photons represent abstract patterns that can easily be recognized by learning algorithms. However, applying machine learning for time skew calibration and estimation remains an experimentally hard task since skew effects can vary in magnitude and nature on an event basis, without necessarily exhibiting a spatial relation. Besides this, supervised learning demands labeled data, which is a priori not accessible without using simulation techniques, and unsupervised learning is often used in the context of clustering and association algorithms [32] unsuitable for the proposed problem. Recently, we proposed an analytical timing calibration technique [33] suitable for traditional segmented and light-sharing-based scintillator topologies. This analytical calibration aims to sequentially reduce major skew effects by using a convex optimization of a matrix equation. When applying the technique, one observes that the skew effects are iteratively reduced. Within each iteration, the experimenter can address different characteristics of the time skews, e.g., by choosing a different separation into sub-volumes (called voxels) of the scintillation crystals. At a certain number of iterations, we see that the reported correction values $\vec{c}$ oscillate around the baseline and that the CTR does not improve further, indicating that the linear formulation of the problem, $M\vec{c} = \vec{\Delta t}$, with $M$ denoting the matrix and $\vec{\Delta t}$ the estimated mean time differences between the calibration objects, has limited capability of completely describing the physical situation. This challenge can theoretically be addressed by changing the mathematical formulation to also represent higher-order effects. However, this requires prior knowledge of the precise optical processes [34] taking place in the chosen scintillator topology in order to change the mathematical formulation [35]. Furthermore, depending on the readout infrastructure and the detector concept, the problem might depend on numerous variables and parameters [36,37] which are hard to determine in advance. Hence, covering higher-order effects can become arbitrarily complicated. In addition, detectors can vary in response such that an optimized representation might only fit the specific detector. A statistical approach using maximum likelihood was presented by Van Dam et al. [38], focusing on differences between timestamps.
We propose to use a machine learning approach instead and furthermore utilize a special way of experimental data generation to provide simple prior physical knowledge to the model by shifting a radiation source to different known positions [39]. We intend to apply this technique on top of the conventionally used analytical approach, forcing the algorithm to learn the higher-order effects, which we understand as residual physics [40,41]. By following this, we free ourselves from precisely modeling and capturing all non-linear effects in the complete scintillation and detection process. In this work, we employed gradient boosted decision trees (GBDT) as the learning algorithm since it can handle missing data [42] and allows usage in (near) real-time processing systems [43] due to the simplicity of the model's architecture. The proposed approach is studied using experimental data acquired with a coincidence setup equipped with a semi-monolithic (3.9 mm × 31.9 mm × 19.0 mm) and a one-to-one coupled (3.9 mm × 3.9 mm × 19.0 mm) detector array concept. We trained multiple models on the acquired data and studied their performance based on the physics-related learning task, the agreement with theoretical expectations and bias effects, and the obtained CTR values.

II. RELATED WORKS
A. Approaches towards Residual Physics
To the authors' knowledge, the first popular mention of 'residual physics' in the context of artificial intelligence was by Zeng et al. [40]. In their work, they investigated whether a robotic arm is able to pick up arbitrary objects and throw them into selected target boxes. While the problem of throwing can be described sufficiently well in theory by Newtonian physics, the real-world implementation for arbitrary objects is very challenging due to numerous additional variables that affect the throw. Similar works have been [44-48] and are still being published [49] in the context of 'hybrid controllers'. All of these studies have in common that they exploit the residuals between well-understood idealized physics and actual measurements. Alternative approaches aiming to combine physics domain knowledge and artificial intelligence (AI) are given by 'physics-informed learning' [50-53], where the utilized loss function is often modified to guide the model toward physics-meaningful predictions.

B. Timing Capabilities of PET Detectors
In recent publications [42,54-56], it has been shown that (semi-)monolithic detectors are able to provide good performance. Especially their timing capabilities have been studied under various experimental settings. Van Dam et al. [38] were able to reach sub-200 ps CTR for a monolithic crystal (24 mm × 24 mm × 20 mm) using a maximum likelihood approach and a measurement temperature of −20 °C, which is challenging to implement in a PET system designed for the clinical domain. Sánchez et al. developed a new ASIC (HRFlexToT [57]) with a redesigned energy measurement for linear time-over-threshold (ToT) behavior, reduced power consumption, and improved timing response, and achieved 324 ps CTR for a large monolithic crystal (25 mm × 25 mm × 20 mm). In a recent simulation study, Maebe et al. [58] reported 141 ps. In their simulation, they used a monolithic detector (50 mm × 50 mm × 16 mm) and a convolutional neural network (CNN), where the network's input is given by the digitized waveforms truncated to a window of 3 ns using a step size of 100 ps. Zhang et al. [55] … In a proof-of-concept study performed by Berg et al. [60] using two small lutetium fine silicate crystals (5 mm × 5 mm × 10 mm) coupled to a single photomultiplier tube, a timing resolution of about 185 ps was achieved using CNNs. Onishi et al. [61] proposed a simple method for unbiased TOF estimation by applying a combination of a CNN and a leading edge discriminator (LED) to an oscilloscope equipped with a pair of single lutetium-yttrium oxyorthosilicate (LYSO) scintillation crystals of dimensions 3 mm × 3 mm × 10 mm, reaching 159 ps.

III. MATERIALS & METHODS
A. Gradient Boosted Decision Trees
While we utilized GBDT in this work, the presented calibration approach is also applicable to different learning architectures, e.g., deep neural nets. GBDT is a supervised learning algorithm based on an ensemble of binary decision trees, where each tree is trained on the residuals of the already established ensemble (additive training). In this work, we use the GBDT implementation of XGBoost [62], with the model ϕ given as the superposition of the K trees (weak learners) $f_k$,

$$\phi(x) = \sum_{k=1}^{K} f_k(x), \qquad f_k \in \Omega,$$

where each tree $f_k$ is an element of the CART [63] space Ω. In its design, GBDT is a relatively simple architecture compared to widely used deep neural networks [64-66].
However, it has proven high predictive power in many applications [67-71], and due to its simplicity, GBDT allows usage in high-throughput software [43] suitable for complete PET systems or even for application directly at the detector level [72,73]. Regarding the scope of this work, two hyperparameters of GBDT models are of particular importance, namely the maximal tree depth d, denoting the maximal number of successive decisions within a single tree of the ensemble, and the learning rate lr, controlling how strongly the residuals of the current ensemble influence the training of the following tree. The learning rate must be optimized in most cases to find a compromise between training duration and accuracy. A third prominent hyperparameter is the number of trees n in an ensemble. We excluded n from the hyperparameter search in this work since we used an early stopping criterion.
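To make the role of these hyperparameters concrete, the following minimal sketch trains one regressor of the hyperparameter grid with the scikit-learn API of XGBoost, using the early-stopping configuration described later in Section IV-C (up to 500 trees, ten early-stopping rounds, squared-error loss). It is an illustration under stated assumptions, not the authors' code; the file names and array contents are placeholders.

```python
# Minimal sketch (illustrative, not the authors' code): one grid point of the
# GBDT hyperparameter search with XGBoost and built-in early stopping.
import numpy as np
import xgboost as xgb

# Placeholder feature/label arrays; in the paper these are the prepared
# coincidence features and the expected time differences (labels).
X_train, y_train = np.load("X_train.npy"), np.load("y_train.npy")
X_val, y_val = np.load("X_val.npy"), np.load("y_val.npy")

model = xgb.XGBRegressor(
    objective="reg:squarederror",  # XGBoost's default squared-error loss
    n_estimators=500,              # maximal number of trees n
    max_depth=18,                  # maximal tree depth d (one grid point)
    learning_rate=0.1,             # learning rate lr (one grid point)
    early_stopping_rounds=10,      # recent XGBoost versions accept this here
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
y_pred = model.predict(X_val)      # predicted expected time differences
```

The grid search then simply repeats this training for every combination of d and lr and compares the validation metrics.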

B. Shapley Additive Explanations
The SHAP (SHapley Additive exPlanations) framework [74,75] is used as an explainable AI technique to analyze feature importance in order to search for correlations between physical effects and patterns the model has learned. In particular, in this work, we utilized the TreeExplainer implementation [76] because of the chosen learning architecture. The framework uses mathematical game theory. Each input sample and corresponding prediction is connected by assuming a coalition game. The players in the game are represented by the feature values of the input sample, where each feature influences the model's prediction. These influences are called contributions and are expressed in the same physical unit as the predictions. Contribution values are mathematically either positive or negative, while the model's output is equal to the sum over the contributions. The magnitude of a given contribution indicates the level of its importance. SHAP uses Shapley values [77], which are a measure to quantify the contribution of a feature regarding the specific model's output. In a mathematical sense, SHAP combines three concepts essential for providing a consistent picture concerning feature importance. Firstly, the SHAP values must satisfy local accuracy, meaning that for a given input sample, the sum of the estimated feature contributions must be equal to the corresponding model's prediction that should be explained. If a feature is missing, it cannot contribute to the importance, which is covered in SHAP by the concept of missingness. Lastly, consistency is required, ensuring that when changing a model such that a particular feature has a larger impact on the model, the corresponding attribution cannot decrease. Practically, for each feature value $f_k$ of a given input sample $X = \{f_k\}$, an associated SHAP value $SV(f_k)$ can be computed, reporting a local explanation that connects the feature value with its contribution to the model's output y. By combining many local explanations, one can conclude a global understanding of the model.
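As a concrete illustration of how local explanations are obtained and aggregated, the short sketch below applies the TreeExplainer to a trained model. It is a simplified example assuming a trained XGBoost regressor `model` and a feature matrix `X_subset` (placeholder names), not the analysis code used in this study.

```python
# Minimal SHAP sketch (illustrative): local explanations with TreeExplainer
# and a global importance measure as mean absolute SHAP value per feature.
import shap

explainer = shap.TreeExplainer(model)           # tree-model-specific explainer
shap_values = explainer.shap_values(X_subset)   # shape: (n_samples, n_features)

# Local accuracy: expected value plus the sum of contributions of one sample
# reproduces the model's prediction for that sample.
reconstructed = explainer.expected_value + shap_values[0].sum()
print(reconstructed, model.predict(X_subset[:1])[0])

# Global view (as in Fig. 11): mean absolute SHAP value per feature.
global_importance = abs(shap_values).mean(axis=0)
```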

C. PET Detectors
The study is conducted using two different detector types, where one detector is based on a one-to-one coupled scintillator design and the other detector is based on a semi-monolithic scintillator design (see Fig. 2). Each scintillator concept is glue-coupled (Meltmount, Cargille Laboratories) to a sensor tile holding 4 × 4 digital SiPMs (DPC3200-22, Philips Digital Photon Counting, Aachen [78]). Each SiPM is formed by 2 × 2 readout channels (also called pixels) and a twin time-to-digital converter, where one readout channel consists of 3200 SPADs. Each SiPM of a sensor tile works independently and follows a configured acquisition sequence if a predefined internal two-level trigger scheme is fulfilled. After the reception of a trigger, it is checked during the validation phase whether the geometrical distribution of discharged SPADs meets the configured requirement. If both trigger thresholds are fulfilled, the acquisition is continued. Each triggered SiPM provides information that encloses a timestamp and four pixel photon count values, called a hit. Both scintillators use LYSO as scintillation material (Crystal Photonics, Sanford). Concerning the scintillator architecture, an array of 8 × 8 LYSO segments of 4.0 mm pitch and 19.0 mm height is utilized in the one-to-one coupled design. Each segment is wrapped with enhanced specular reflector (ESR) foil and covers the pitch of one pixel. The semi-monolithic detector concept comprises eight monolithic LYSO slabs, each having a volume of 3.9 mm × 31.9 mm × 19.0 mm. Each slab aligns with one row of pixels. ESR foil is located between every second slab and on the lateral walls to reduce light sharing between trigger and readout regions. The slab detector is able to provide intrinsic DOI information due to its monolithic characteristics. Due to the independent operation of the SiPMs, not all SiPMs partly covered by a slab are necessarily triggered and send hit data corresponding to a γ-interaction.

D. Coincidence Setup
The experimental setup comprises a source mounting, in addition to the detectors, and is located in a tempered dark box. The source mounting is connected to a programmable translation stage system, allowing motion along all three spatial axes (see Fig. 3). The distance between the detector surfaces is 435 mm. The precision of the translation stage over the complete measurement range is 10 µm, which for a coincidence measurement translates to an uncertainty in the time domain of about 0.067 ps. The source mounting is equipped with a 22Na source with an activity of approximately 12 MBq and a diameter of 0.5 mm. Coincidences are acquired by utilizing flood irradiation and moving the source to various positions between the detectors.
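The quoted time-domain uncertainty follows directly from the stage precision: shifting the source by δz along the detector axis shortens one γ-path and lengthens the other, so the arrival-time difference changes by twice the positional error divided by the photon speed (a back-of-the-envelope check, assuming propagation at approximately c):

$$\delta(\Delta t) \approx \frac{2\,\delta z}{c} = \frac{2 \times 10\,\mu\mathrm{m}}{3.0 \times 10^{8}\,\mathrm{m/s}} \approx 0.067\,\mathrm{ps}.$$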

IV. EXPERIMENTS
A. Data Acquisition
The proposed calibration technique uses supervised machine learning and therefore demands labeled data. The labels are generated by moving the radiation source to specific positions between the facing detectors and measuring coincidences. Thanks to the known source position, one can calculate the expected time difference E({∆t}) resulting from the different path lengths the γ-photons have to travel until reaching the detectors. The source was moved to 47 different z-positions (see Fig. 4) with a step size of 5 mm, ranging from −130 mm to 100 mm, while at each z-position, a grid of 5 × 5 positions in the xy-plane with a step size of 6 mm was utilized to acquire coincidences. Both x and y positions ranged from −12 mm to 12 mm. At each grid point, a measurement time of 600 s was set, resulting in a total measurement time of about 8 d. The acquired measurement data, consisting of 6.82 × 10⁸ coincidences (≈ 5.80 × 10⁵ coincidences per position), is finally used to form three datasets for training, validation, and testing during the model-building process, comprising 3.29 × 10⁸, 1.56 × 10⁸, and 1.97 × 10⁸ input samples, respectively. We decided to evaluate the final CTR performance (see Section V-B) using data from a measurement conducted on a different day under the same conditions and with the same detectors, to demonstrate the predictive power and generalized applicability of the trained models. This dataset comprises 4.20 × 10⁶ coincidences acquired with the radiation source located near the iso-center of the setup, as is usually done for CTR evaluations. To allow a clean separation in the naming, the dataset used for the CTR performance evaluation is called the performance dataset, while the three other datasets keep their usual names (training, validation, testing). During both acquisitions, the sensor tile reported a constant temperature of 2.1 °C for the one-to-one coupled detector and 0.0 °C for the slab detector. Both sensor tiles were operated with a first-photon trigger [79]. The excess voltage was adjusted to 2.8 V, while the validation pattern was set to scheme 16 (0x55:AND), demanding on average 54 ± 19 optical photons [80].

B. Data Pre-Processing & Preparation
1) Coincidence Clustering: Data associated with one γ-interaction has to be clustered due to the independent readout of the DPCs. A cluster window of 40 ns, motivated by the distribution of the hits' uncorrected timestamp differences, is used to combine all hits correlated to the same γ-interaction into a cluster. The measured raw data were corrected for saturation effects, and the time-to-digital converters of each DPC were linearly calibrated against each other, assuming a uniform distribution of triggers over a clock cycle [31]. Clusters with fewer than 400 or more than 4000 detected optical photons were rejected for noise reduction, since the non-calibrated photopeak of the 511 keV γ-photons was located at about 2300 and 2800 detected photons for the slab and the one-to-one coupled detector, respectively. Coincidences were grouped on the cluster level using a sliding coincidence window of 10 ns considering the first timestamps of two clusters.
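The following sketch illustrates the clustering and coincidence search described above with the stated window and photon-count cuts. It is illustrative only, not the authors' processing chain; the hit representation and the naive pairing loop are simplifying assumptions.

```python
# Minimal sketch (illustrative) of hit clustering and the coincidence search.
# `hits` is assumed to be a time-sorted list of (timestamp_ps, photon_count)
# tuples from ONE detector, already corrected for saturation and TDC skews.
CLUSTER_WINDOW_PS = 40_000       # 40 ns cluster window
COINC_WINDOW_PS = 10_000         # 10 ns coincidence window
MIN_PHOTONS, MAX_PHOTONS = 400, 4000

def build_clusters(hits):
    clusters, current = [], []
    for hit in hits:
        # start a new cluster if this hit is outside the 40 ns window
        # relative to the first hit of the current cluster
        if current and hit[0] - current[0][0] > CLUSTER_WINDOW_PS:
            clusters.append(current)
            current = []
        current.append(hit)
    if current:
        clusters.append(current)
    # noise rejection on the summed photon count of each cluster
    return [c for c in clusters
            if MIN_PHOTONS <= sum(h[1] for h in c) <= MAX_PHOTONS]

def find_coincidences(clusters_a, clusters_b):
    # naive pairing using the first (earliest) timestamp of two clusters;
    # a real implementation would use a sliding window over sorted clusters
    return [(ca, cb) for ca in clusters_a for cb in clusters_b
            if abs(ca[0][0] - cb[0][0]) <= COINC_WINDOW_PS]
```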
2) Position & Energy Estimation: A subset of the features used during the proposed time skew calibration is given by the γ-interaction position inside the scintillator volume and by the deposited, calibrated energy in units of keV. To acquire the positioning and energy information of each event, dedicated calibrations already established in previous works [22,25] were performed. While the γ-positioning in the one-to-one coupled detector is given by the position of the pixel showing the highest photon count, the semi-monolithic slab detector requires a calibration procedure to estimate the 3D interaction location. For this purpose, GBDT models [22,25,62] are trained on data acquired with an external reference using a fan-beam setup [81], which irradiates the scintillator at known positions. While the positioning resolution of the one-to-one coupled detector is given to be 2 mm, the slab detector's resolution is 2.5 mm in the planar direction and 3.3 mm in the DOI direction. The positioning resolution is determined by the full width at half maximum (FWHM) of the positioning error distribution [25]. The energy value associated with a γ-interaction is estimated using a 3D-dependent energy calibration utilizing an averaged light pattern. The crystal volume is divided into n_x × n_y × n_doi voxels, where for each voxel the mean number of detected optical photons is estimated based on γ-events whose interaction positions were located inside the voxel volume. The slab detector is divided into 8 × 8 × 4 voxels, while the one-to-one coupled detector is divided into 8 × 8 × 1 voxels. The energy resolution of the one-to-one coupled detector was evaluated to be 10.4 %, while the energy resolution of the slab detector was estimated to be 11.3 %.
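The voxel-wise energy calibration can be pictured as in the sketch below. The scaling of the photon sum by the voxel-wise mean of photopeak events is an assumption for illustration (the text only states that an averaged light pattern per voxel is used); the crystal dimensions and variable names are likewise illustrative.

```python
# Minimal sketch (assumption-laden) of a 3D voxel-wise energy calibration.
import numpy as np

N_X, N_Y, N_DOI = 8, 8, 4          # voxel grid of the slab detector

def voxel_index(pos, crystal_size=(31.9, 31.9, 19.0)):
    """Map a 3D interaction position (mm) to voxel indices (illustrative sizes)."""
    return tuple(min(int(p / s * n), n - 1)
                 for p, s, n in zip(pos, crystal_size, (N_X, N_Y, N_DOI)))

def build_mean_pattern(photopeak_events):
    """photopeak_events: list of (position, photon_sum) from 511 keV events."""
    sums = np.zeros((N_X, N_Y, N_DOI))
    counts = np.zeros((N_X, N_Y, N_DOI))
    for pos, photons in photopeak_events:
        idx = voxel_index(pos)
        sums[idx] += photons
        counts[idx] += 1
    return sums / np.maximum(counts, 1)   # mean detected photons per voxel

def calibrated_energy(pos, photons, mean_pattern, e_photopeak_kev=511.0):
    # Assumed scaling: energy proportional to photon sum, normalized per voxel.
    return e_photopeak_kev * photons / mean_pattern[voxel_index(pos)]
```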

3) Analytical Timing Calibration:
The first part of the calibration is an analytical calibration, which has been studied in previous publications [31,82,83] and relies on well-known mathematical principles like convex optimization. In this work, our calibration formalism [33] was used. However, the principle of exploiting residual physics remains functional for every other analytical calibration as well. During the calibration process, multiple sub-calibrations are conducted, where in each sub-calibration different hyperparameters are applied such that many aspects of time skew effects are addressed. Within each sub-calibration, the same convex optimization process is used to find suitable corrections $\vec{c}$ from the matrix equation $M\vec{c} = \vec{\Delta t}$, with $\vec{c}$ and $\vec{\Delta t}$ denoting the calibration channel vector and the mean time difference vector, respectively, and $M$ encoding different channel combinations in the form of a matrix. After some number N of performed sub-calibrations, a convergence of the detector CTR value as well as of the estimated corrections $\vec{c}_i$ is observed, with i denoting the index of the applied sub-calibration. For this work, we used three sub-calibration iterations (based on time-to-digital converter (TDC) regions, readout channels, and voxels, as described in [33]), mainly addressing fixed skews due to differences in the signal propagation and time jitter introduced by the scintillator itself. At this point, it becomes inconvenient to add more and more sub-calibrations since the benefit decreases strongly.
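The core of each sub-calibration is a convex least-squares problem. The fragment below sketches how such a linear system could be solved; it is a generic illustration with NumPy (the file names are placeholders, and the actual formalism of [33] contains additional structure and constraints).

```python
# Illustrative least-squares solution of the linear calibration system
# M @ c = delta_t (generic sketch, not the exact formalism of [33]).
import numpy as np

M = np.load("channel_combinations.npy")       # hypothetical: one row per measured pair
delta_t = np.load("mean_time_differences.npy")

# Convex problem: minimize ||M c - delta_t||^2 over the correction vector c.
corrections, residuals, rank, _ = np.linalg.lstsq(M, delta_t, rcond=None)

# The corrections would then be subtracted from the corresponding channel
# timestamps before computing coincidence time differences.
```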

C. Residual Timing Calibration
We propose to use a data-driven approach on top of the conventionally used technique to explore new corrections that have not been covered by the analytical formulation and thereby improve the CTR. A suitable way of doing this is by using artificial intelligence to search for patterns in the acquired coincidence data. We decided to employ the supervised algorithm GBDT (see Section III-A), which was also used during the γ-positioning (see Section IV-B2). Using a supervised approach demands labeled data (input samples and corresponding target values known as labels) to train a model. However, for the proposed problem of non-static time skew effects, labeling is difficult, since it is a priori not clear how many worsening effects are present and how strongly they are pronounced in each measured coincidence. Using an analytical estimator to generate the ground truth would limit the capabilities of the trained model to the chosen estimator. In order to solve the problem of labeling, we propose to shift the radiation source to different positions and measure coincidences between the facing detectors.
The γ-photons travel varying path lengths to the detectors, resulting in different expected time differences per source position. The different path lengths (see Fig. 4) lead to different travel times t_1 and t_2. One can conclude the expected time difference, which is subsequently used as label y,

$$y = E[\{t_1 - t_2\}] = \frac{2 z_s}{c_{\mathrm{air}}},$$

with $c_{\mathrm{air}}$ denoting the speed of light in air and $z_s$ denoting the source offset under the assumption that the coordinate system z is located at the iso-center of the setup (see Fig. 4). For Gaussian distributions, the expectation value E is identical to the mean value of the distribution. Data acquired with the aforementioned scheme is further processed and finally used to train GBDT models. The input features F can be grouped into three categories: purely slab detector-related features F_s, purely one-to-one detector-related features F_o, and features associated with both detector concepts F_so. While F_so consists only of the difference ∆t_meas between the first timestamps from the slab and the one-to-one coupled detector, respectively, F_s and F_o can be separated into the subsets timestamp information F_T^{s/o}, energy information F_E^{s/o}, and position information F_Pos^{s/o} (see Fig. 5). Since the detector-specific feature sets F_s and F_o are symmetrical in their content, we explain the specific features in a generalized way in the following. The subset timestamp information F_T^{s/o} contains the four (three) first timestamp values reported by the slab (one-to-one coupled) detector. The different number of used timestamps is a trade-off between available information and needed memory: for both detectors, the cumulative distribution of the number of generated timestamps per cluster was analyzed, and the count was set to the value covering roughly 80 % of all clusters.
Let

$$T_j = \{t_{j,0}, t_{j,1}, \ldots, t_{j,i}, \ldots\}$$

be the set of timestamps provided within a cluster j by the photodetector, with $t_{j,i}$ denoting the i-th timestamp of cluster j. Since the photosensor consecutively reports the timestamp values throughout the measurement, they need to be processed after the coincidence search to be suitable as input for a machine learning algorithm. Therefore, the very earliest timestamp $t_{j,0}$ of a cluster j is subtracted from the following timestamps of this cluster, $\tilde{t}_{j,i} = t_{j,i} - t_{j,0}$, with $\tilde{t}_{j,i}$ denoting the processed timestamp i of cluster j employed as input. Furthermore, the origin of the respective timestamps is used, represented by their SiPM ID number. Besides this, information about the cluster's timestamp spread (the difference between the first and last timestamp) and the number of timestamps (equal to the number of hits) in the cluster is utilized. The subset energy information F_E^{s/o} contains information about the deposited energy as a calibrated energy value in keV and as the raw photon counts detected on the corresponding SiPMs. The γ-positioning set F_Pos^{s/o} holds information about the interaction position of the γ-photon within the scintillator volume. While this is given as a 3D position for the semi-monolithic case, the one-to-one coupled design provides only planar (2D) information. In order to find suitable hyperparameters regarding the learning task, a grid search was conducted considering the maximal tree depth d and the learning rate lr, with d ∈ {12, 15, 18, 20} and lr drawn from a discrete set of candidate values. During the model-building process, the maximal number of estimators n within an ensemble was set to n = 500, where the final number of used estimators was defined by the built-in early stopping criterion considering ten early stopping rounds to suppress possible overfitting. The learning task is performed using XGBoost's default squared error loss function [62].
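To summarize how one input sample is assembled from the feature sets above, the sketch below builds a flat feature vector from a slab cluster and a one-to-one cluster. The dictionary layout and helper names are illustrative assumptions, not the authors' data structures.

```python
# Minimal sketch (illustrative naming) of assembling one input sample from a
# coincidence of a slab cluster and a one-to-one coupled cluster.
# `cluster` is assumed to be a dict with sorted timestamps, SiPM IDs, photon
# counts, a calibrated energy (keV), and a reconstructed interaction position.

def detector_features(cluster, n_timestamps):
    ts = cluster["timestamps"]                            # sorted absolute timestamps
    processed = [t - ts[0] for t in ts[:n_timestamps]]    # t~_{j,i} = t_{j,i} - t_{j,0}
    processed += [float("nan")] * (n_timestamps - len(processed))  # GBDT handles missing values
    return (
        processed                                         # F_T: processed timestamps
        + cluster["sipm_ids"][:n_timestamps]              # F_T: origin of the timestamps
        + [ts[-1] - ts[0], len(ts)]                       # F_T: spread and number of hits
        + [cluster["energy_kev"]] + cluster["photon_counts"]  # F_E
        + list(cluster["position"])                       # F_Pos (3D slab, 2D one-to-one)
    )

def build_sample(slab_cluster, oto_cluster):
    dt_meas = slab_cluster["timestamps"][0] - oto_cluster["timestamps"][0]  # F_so
    return ([dt_meas]
            + detector_features(slab_cluster, n_timestamps=4)
            + detector_features(oto_cluster, n_timestamps=3))
```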

D. MAE Evaluation & Linearity of Predictions
The mean absolute error (MAE) is used to evaluate the performance of a trained GBDT model based on the testing data,

$$\mathrm{MAE}(z_s) = \frac{1}{N}\sum_{i=1}^{N}\left| y_i(z_s) - \hat{y}_i(z_s) \right|,$$

with $y_i(z_s)$ denoting the label of sample i belonging to the source position $z_s$ and $\hat{y}_i(z_s)$ denoting the corresponding model prediction. We utilize the test data prediction distributions to verify their Gaussian shape using a goodness-of-fit approach and to validate that the linearity between the source position and the expected time difference is preserved. This validation ensures that the trained models obey the physical principle and do not compress the time differences, since that would artificially improve the CTR. Therefore, a linear regression is performed for each trained GBDT model and each grid point $(x_s, y_s)$ in a range from −75 mm to 45 mm, considering the fitted mean value $\mu_s$ of the prediction distributions and the associated source position $z_s$. We assumed a linear dependence of the form

$$\mu_s = \varepsilon \cdot \frac{2 z_s}{c_{\mathrm{air}}} + \mu_0,$$

while in theory $\varepsilon = 1$. All fitting procedures are performed using SciPy's ODR package [84]. The uncertainty $\sigma_{\mu_s}$ on $\mu_s$ is the uncertainty on the mean reported by the fit procedure. Furthermore, the uncertainty on the translation stage position was taken to be the same for all source positions, $\sigma_{z_s} = 0.1$ mm. Finally, the global linearity performance is given by the averaged ε-value for each model.
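The per-grid-point linearity check can be sketched as follows with SciPy's ODR, which propagates the uncertainties on both axes. The arrays and uncertainty values below are placeholders for illustration, not data from the paper.

```python
# Minimal sketch of the linearity fit with orthogonal distance regression.
import numpy as np
from scipy import odr

C_AIR_MM_PER_PS = 0.2997               # approximate speed of light in air, mm/ps

z_s = np.arange(-75.0, 50.0, 5.0)      # source positions (mm) entering the fit
mu_s = 2.0 * z_s / C_AIR_MM_PER_PS     # fitted means of the prediction distributions (ps); placeholder
sigma_mu = np.full_like(mu_s, 1.0)     # uncertainties on mu_s from the Gaussian fits (ps); placeholder
sigma_z = np.full_like(z_s, 0.1)       # translation stage uncertainty (mm)

def linear_model(beta, z):
    eps, offset = beta                 # slope epsilon and an offset term
    return eps * 2.0 * z / C_AIR_MM_PER_PS + offset

data = odr.RealData(z_s, mu_s, sx=sigma_z, sy=sigma_mu)
fit = odr.ODR(data, odr.Model(linear_model), beta0=[1.0, 0.0]).run()
eps, eps_err = fit.beta[0], fit.sd_beta[0]
print(f"epsilon = {eps:.4f} +/- {eps_err:.4f}")   # theory expects epsilon = 1
```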

E. CTR Performance
To evaluate the timing performance, the FWHM of the predicted time difference distribution is estimated by fitting a Gaussian function. The error on the estimated timing resolution is given by the uncertainty on the fitted σ-parameter of the Gaussian. The input data is given by the performance dataset. The CTR is estimated for unfiltered data, for coincidences within a large energy window from 300 keV to 700 keV, and for coincidences within a smaller energy window from 450 keV to 550 keV.
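The CTR evaluation amounts to a Gaussian fit of the histogrammed time differences and the conversion of the fitted σ to a FWHM. The sketch below illustrates this with placeholder data; it is not the analysis code of the study.

```python
# Minimal sketch: histogram predicted time differences, fit a Gaussian,
# and report the FWHM (CTR) with its uncertainty.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

dt_pred = np.random.normal(0.0, 185.0 / 2.355, size=100_000)  # placeholder predictions (ps)
counts, edges = np.histogram(dt_pred, bins=200)
centers = 0.5 * (edges[:-1] + edges[1:])

popt, pcov = curve_fit(gaussian, centers, counts,
                       p0=[counts.max(), 0.0, np.std(dt_pred)])
sigma, sigma_err = popt[2], np.sqrt(pcov[2, 2])
fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0)) * abs(sigma)          # CTR (FWHM) in ps
fwhm_err = 2.0 * np.sqrt(2.0 * np.log(2.0)) * sigma_err
print(f"CTR = {fwhm:.0f} +/- {fwhm_err:.0f} ps")
```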

F. SHAP Analysis
Due to computational costs, the SHAP analysis was performed for the model showing the best MAE and CTR performance, using a subset of 23 500 samples of the performance data. The analysis was done without applying any filters.

V. RESULTS
A. MAE Evaluation & Linearity of Predictions
The MAE performance (see Fig. 6) is similar for all chosen hyperparameter configurations. The distribution shows a symmetrical behavior around the median value of z = −15 mm, with a slight skewness that can be observed going from negative offset positions toward positive ones. While the prediction quality strongly decreases at the borders of the presented data, the models' predictions work very well in the central region. In general, one observes that models with a lower learning rate perform slightly better than those with a learning rate equal to or higher than 0.3. Furthermore, Table I reveals that the MAE is reduced by restricting the allowed energy of the test data. The model with hyperparameter configuration (d = 18, lr = 0.1) achieved the best MAE performance. For the linearity analysis, we excluded the predictions located outside an interval of ±60 mm around the median (grayish areas in Fig. 6) in order to give an unbiased evaluation of the performance in the large central region of the data.
Figure 7 exemplarily shows the distribution of the predictions over the complete data range for the model (18, 0.1), together with the goodness-of-fit per number of degrees of freedom (χ²/ndf) for a Gaussian function. Both distributions are symmetrical. The model is able to infer the expected time difference on a coincidence basis according to the input data. Considering the goodness-of-fit, the shapes of the predicted distributions are in very good agreement with the expected Gaussian. A substantial deviation from the Gaussian shape is observed when moving toward the far left and far right source positions. A part of the linearity analysis for model (18, 0.1) is exemplarily depicted for the position (x_s, y_s) = (12, 0) mm in Fig. 8. The global ε performance of each model is shown in Fig. 9.
The estimated ε-parameters of all trained models agree with the theoretical value of ε = 1 within a 3σ-interval.

B. CTR Performance
The CTR performances of the trained models, as well as the performance obtained by applying only the analytical corrections, are listed in Table II. As one can see, the best CTR was achieved by the model (18, 0.1), which also performed best in the MAE evaluation. The model improved the CTR by about 50 ps, down to (185 ± 2) ps for an energy window of 450 keV to 550 keV. Except for the models with a maximal depth of d = 12, all models yield an improved CTR performance when using lower learning rates. A comparison of the time difference distributions before and after applying the model (18, 0.1) is depicted in Fig. 10. Regarding the shape of the emerging distribution, especially coincidences in the tails of the distribution have been recovered to smaller time differences.

C. SHAP Analysis
The model (18, 0.1) was chosen for the analysis using SHAP [74-76] since it provided the best performance regarding MAE and CTR. The mean absolute contributions of the different feature sets F are depicted in Fig. 11. The most important feature set is F_so, which consists of the measured time difference ∆t_meas. Besides this substantial contribution, timestamp information F_T^{s/o} and energy information F_E^{s/o} also appear crucial for good model performance. The feature group F_Pos^s shows a slightly higher contribution than F_Pos^o due to the introduction of DOI information. The specific contributions of the planar coordinates, however, differ only marginally between the slab and the one-to-one coupled detector. Furthermore, one observes a similar behavior when comparing the feature sets of the slab and the one-to-one coupled detector.
When looking at the progression of the SHAP values SV(∆t_meas) in dependence on the number of detected optical photons for the SiPM providing the first timestamp (#OP_0^{s/o}), one observes different developments. Figure 12a) shows a clear separation between different SHAP values for a given feature value of ∆t_meas, depending on the number of detected optical photons. This is not observed for the slab detector: in Fig. 12b), the strong separation regarding the number of detected optical photons is not present.

VI. DISCUSSION
All models have been trained successfully. The predictions follow a Gaussian function for a large part of the trained data range, as can be seen in Fig. 6, Fig. 7, and Table I. When moving to the borders of the presented data, the models' outputs deviate from the expected shape, and the prediction quality decreases. This effect is known for many machine learning algorithms and can be explained by their inability to extrapolate to values outside the training range. In future studies, we want to address this issue with different strategies, using a higher sampling rate at the edges being one of them.
Within the central region, where the models show stable behavior, the means of the prediction distributions follow the expected linear relation between µ_s and z_s (see Fig. 8). No systematic deviation from the linear relation between the offset position z_s and the predicted mean time difference µ_s could be observed, indicating that the trained models are capable of learning the given physical problem. The averaged ε-values are slightly larger than the expected value of ε_theo = 1 (max({ε_i − ε_theo}) ≤ 3.9 × 10⁻³), which consequently enlarges the time differences and therefore produces an overestimation of the determined CTR values, such that a re-scaled resolution might be even better than the one reported here. To compensate for this effect, one could introduce a scaling function s(µ_s) which would correct the slope to the desired value of ε = 1 for a given mean time difference. Since the observed effect is insignificant and the estimated slope factors ε agree with the theoretical value within a 3σ-interval for all models, this procedure is unnecessary for the GBDT models used in this work. All trained models improve the achievable CTR values, such that a sub-200 ps resolution could be reached for an energy window from 300 keV to 700 keV (see Table II). Minding the shape of the emerging distribution, especially coincidences in the tails of the distribution have been recovered to smaller time differences, indicating that the model can learn physical effects and correct for them. This observation underlines the capability of this new approach and shows that the timing resolution can be improved beyond the usage of purely analytical calibrations. We used explainable AI (XAI) techniques to understand which quantities the models rely on. The analysis of the SHAP values of the model (18, 0.1) reveals that mainly the reported timestamp difference ∆t_meas, but also timestamp and energy information, is of great importance. This observation agrees with human intuition, since ∆t_meas would represent a human's first estimator if one tried to solve the task given to the model. Furthermore, the results clearly indicate that the model is learning timewalk effects for the one-to-one coupled detector (see Fig. 12a), since for a given feature value ∆t_meas, the SHAP value is increased or decreased depending on whether a high or a low number of optical photons has been detected. If a timestamp is affected by timewalk, the exact moment of timestamping is delayed due to low deposited energy. Consequently, the importance of this timestamp has to be decreased since it would enlarge the reported time difference and worsen the CTR. This observation does not occur with the same clarity for the slab detector (see Fig. 12b). However, for the one-to-one coupled detector, the vast majority of information is contained in one channel, whereas for the semi-monolithic case, the information is spread across multiple channels, making it hard to display the effect in the chosen visualization. There is still an indication that timewalk effects are caught by the model for the slab detector as well, since the feature set using energy-related quantities shows a high absolute SHAP value (see Fig. 11) and since both tails of the time difference distribution are reduced.

VII. CONCLUSION & OUTLOOK
In this work, we demonstrated a new approach based on the combination of residual physics and machine learning to address real-world physics-based problems. We applied the concept to detector calibration. We hope the work highlights the potential for applications of learning systems along all computing steps of complex acquisition and processing systems and, thus, may inspire future research. Since the formalism builds on previously linearly corrected timestamps [33], it can be seen as a first approach towards residual physics in timing calibration. All models could be trained successfully and are in 3σ-agreement with the underlying physical relation. The first results indicate that this new calibration strategy provokes a strong improvement in the achievable CTR, from 238 ps down to 198 ps for an energy window of 300 keV to 700 keV, and from 235 ps down to 185 ps for a smaller energy window of 450 keV to 550 keV. The SHAP analysis offers a strong indication that the proposed technique has the capability to build physics-informed models. All results are based on experimentally acquired data from two clinically relevant detector arrays. This work and the corresponding promising first results represent a proof-of-concept for future time skew calibration techniques relying on AI. Nevertheless, several studies have to be performed before an application to a complete PET system is possible. The presented technique is currently implemented for a pair of detectors utilizing digital SiPMs. Research towards systems of multiple detectors will be addressed in future works. Besides this, we want to explore the performance of the concept in different environmental settings (e.g., higher measurement temperatures, different readout), potentially enlarging the learning system's importance. Furthermore, reducing and studying the influence of the needed data acquisition time and of the bias effects towards the edges of the training data is mandatory for a possible usage in a clinical scanner. A possible method to address this point would be an artificial enlargement of the available training data based on only a few measured data points. However, we expect the measurement time to increase less than linearly with the number of detectors since one source position can be used for many detectors.

Figure 2 .
Figure 2. Used scintillator topologies and photosensors. The incoming γ-photon, as well as a part of the optical photons, are illustrated as red lines. The sensor tile consists of 4 × 4 digital SiPMs (DPC3200-22, Philips Digital Photon Counting), each one holding four pixels and a twin time-to-digital converter. A triggered SiPM reports four pixel count values and a timestamp.

Figure 3 .
Figure 3. Used coincidence setup for the acquisition of labeled data. The source mounting is connected to the translation stage system, allowing motion along all three axes (indicated as red arrows).

Figure 4 .
Figure 4. Scheme of the labeling process to acquire data that can be used for supervised learning. The radiation source (red cube) is shifted to different positions z_s along the centered coordinate system z. Varying source positions lead to different travel times t_1 and t_2 of the γ-photons. The expected time difference E[{t_1 − t_2}] is used as label for the learning process.

Figure 5 .
Figure 5. Overview of the used features F to train the GBDT models. There are three different feature sets, given by purely slab detector-related features F_s, purely one-to-one detector-related features F_o, and features associated with both detector concepts F_so. F_so consists only of the difference ∆t_meas between the first timestamps from the slab and the one-to-one coupled detector, respectively. The sets F_s and F_o are symmetrical in their content and can again be grouped into the subsets timestamp information F_T^{s/o}, energy information F_E^{s/o}, and position information F_Pos^{s/o}. Information about the processed timestamps (denoted as TS), the SiPM IDs of those timestamps, the timestamp spread (difference between first and last timestamp of a cluster), and the number of generated timestamps is given. The latter also equals the number of hits within the cluster. Besides this, the photon counts of the corresponding SiPMs, the calibrated energy value, and the spatial interaction position are used.

Figure 6 .
Figure 6. Progression of the MAE for each source position z_s contained in the test dataset. No energy filter or restrictions on the measured light distribution were applied. Models utilizing a small learning rate show the best performances. Predictions located in the grayish areas (z ∉ [−75, 45] mm) are excluded from the linearity analysis since the MAE progression indicates the start of the transition into the artifact-dominated region for these points.

Figure 7 .
Figure 7. Depiction of the distribution of predictions of the model using the hyperparameters (d = 18, lr = 0.1) for all 47 source positions along the z-axis. The different source positions z_s are encoded in color. The upper plot shows the different histograms and Gaussian fits. The lower plot shows the goodness-of-fit per number of degrees of freedom (χ²/ndf) value for each position. For a large central region, the predictions are in very good agreement with a Gaussian function. When moving toward the edges of the data, the distribution becomes skewed and deviates from the Gaussian shape. No energy filter or restrictions on the measured light distribution were applied.

Figure 8 .
Figure 9 .
Figure 8. Linear regression and residual plot of the linearity analysis of model (18, 0.1) for (x_s, y_s) = (12, 0) mm. The Gaussian fitting procedure gives the uncertainty on µ_s, while the uncertainty on the z_s position is assumed to be 0.1 mm.

Figure 10 .
Figure 10. Time difference distributions before and after using the proposed machine learning time skew calibration. The model (18, 0.1) was used. No energy windows or restrictions regarding the light distribution are applied.

Figure 11 .
Figure 11. The mean absolute SHAP values mean(|SV(F)|) estimated from a subset of the performance dataset for the different feature sets F explained in Fig. 5. The strongest contribution comes from the shared feature set F_so, which consists of the difference between the first timestamps, ∆t_meas. Furthermore, detector-specific information about the timestamps (F_T^{s/o}) and energy information (F_E^{s/o}) are of great importance.

Figure 12 .
Figure 12. Progression of the SHAP values SV(∆t_meas) in dependence of the feature value ∆t_meas itself. The number of optical photons detected on the SiPM providing the first timestamp (#OP_0^{s/o}) is encoded in color, a) for the one-to-one coupled detector and b) for the slab detector.

ACKNOWLEDGEMENT
The work was funded by the German Federal Ministry of Education and Research under contract number 13GW0621B within the funding program 'Recognizing and Treating Psychological and Neurological Illnesses - Potentials of Medical Technology for a Higher Quality of Life' ('Psychische und neurologische Erkrankungen erkennen und behandeln - Potenziale der Medizintechnik für eine höhere Lebensqualität nutzen').

Table II
CTR PERFORMANCE OF THE TRAINED MODELS BASED ON THE PERFORMANCE DATASET. THE RESULTS OF APPLYING ONLY THE ANALYTICAL TIMING CALIBRATION ARE DENOTED AS 'BEFORE ML'.