An Investigation of Stochastic Variance Reduction Algorithms for Relative Difference Penalized 3D PET Image Reconstruction

Penalised PET image reconstruction algorithms are often accelerated during early iterations with the use of subsets. However, these methods may exhibit limit cycle behaviour at later iterations due to variations between subsets. Desirable converged images can be achieved for a subclass of these algorithms via the implementation of a relaxed step size sequence, but the heuristic selection of parameters will impact the quality of the image sequence and algorithm convergence rates. In this work, we demonstrate the adaption and application of a class of stochastic variance reduction gradient algorithms for PET image reconstruction using the relative difference penalty and numerically compare convergence performance to BSREM. The two investigated algorithms are: SAGA and SVRG. These algorithms require the retention in memory of recently computed subset gradients, which are utilised in subsequent updates. We present several numerical studies based on Monte Carlo simulated data and a patient data set for fully 3D PET acquisitions. The impact of the number of subsets, different preconditioners and step size methods on the convergence of regions of interest values within the reconstructed images is explored. We observe that when using constant preconditioning, SAGA and SVRG demonstrate reduced variations in voxel values between subsequent updates and are less reliant on step size hyper-parameter selection than BSREM reconstructions. Furthermore, SAGA and SVRG can converge significantly faster to the penalised maximum likelihood solution than BSREM, particularly in low count data.

subsequent updates. We present several numerical studies based on Monte Carlo simulated data and a patient data set for fully 3D PET acquisitions . The impact of the number of subsets, different preconditioners and step size methods on the convergence of regions of interest values within the reconstructed images is explored. We observe that when using constant preconditioning, SAGA and SVRG demonstrate reduced variations in voxel values between subsequent updates and are less reliant on step size hyper-parameter selection than BSREM reconstructions. Furthermore, SAGA and SVRG can converge significantly faster to the penalised maximum likelihood solution than BSREM, particularly in low count data.
Index Terms-Positron emission tomography, image reconstruction, stochastic optimisation, variance reduction algorithms, relative difference penalty.

I. INTRODUCTION
P ENALISED maximum likelihood image reconstruction algorithms for Positron Emission Tomography (PET) have demonstrated improved desired image properties due to the inclusion of accurate statistical models and a priori information for promoting sharp edges and preserving smoothness in uniform regions [1], [2]. To find the (unique) Penalised Maximum Likelihood (PML) solution for PET reconstruction, there are various classes of optimisation algorithms, e.g., preconditioned gradient ascent [3], primal-dual algorithms [4] and surrogate methods [5]. However, these algorithms remain computationally intensive for clinical data as a consequence of the repeated projection operations between the image and measured data spaces. Hence, there has been interest in accelerating these algorithms.
Iterative optimisation algorithms may be accelerated with the use of subsets, also known as blocks or batches, e.g., Ordered Subset Expectation Maximisation (OSEM) [6], Block Sequential Regularized Expectation Maximisation (BSREM) [7] and Ordered Subset Separable Paraboloidal Surrogates (OSSPS) [5] for PET reconstruction. There are several subset methodologies, e.g., list-mode events subsets [8], [9] and projection bin based subsets [6], [10]. In this study, we consider the latter methodology. In these algorithms at each image update, the projection operations are performed only over a fraction ("subset") of the data set, which greatly reduces the computational cost per update. In current practice, This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the aforementioned PET algorithms typically cycle through the subsets in a deterministic Ordered Subset (OS) sequence [6], [10]. Initial acceleration for these methods is almost linear with respect to the number of subsets employed, if subsets are reasonably "balanced" [6]. Nevertheless, the variance existing between subset data leads to image estimate voxel value variations between successive updates, which is commonly referred to as limit cycle behaviour [11].
This variance is especially pronounced when large numbers of subsets are utilised. With a constant step size, after many updates, the iterates may only oscillate in a neighbourhood of the PML solution, without convergence. The size of this neighbourhood depends on the step size [6]. While OSEM is generally stopped before a limit cycle is observed, this behaviour is undesirable for PML, and may incur practical consequences in quantitative imaging [12]. Subset algorithms can converge to the PML solution with the use of a relaxed step size sequence [3], [12]. However, in practice the choice of relaxation parameters may greatly impact convergence behaviour and convergence rate may be impeded if poorly selected [1].
We assess the application of two gradient-based Stochastic Variance Reduction (SVR) algorithms to PET reconstruction to reduce variations in successive image updates and to allow the image to quickly converge to the PML solution under suitable criteria [13], [14]. These algorithms reduce the variance of the gradient estimate by storing and utilising previously computed subset gradients in the determination of the update direction. A number of SVR algorithms have been proposed in recent years, including SAGA [14], Stochastic Variance Reduced Gradient (SVRG) [13], Stochastic Average Gradient (SAG) [15] and StochAstic Recursive grAdient algoritHm (SARAH) [16]; see a recent review for an overview of SVR techniques [17]. We elected to use the Relative Difference Penalty (RDP) in this work due to its smoothness and edge preservation properties coupled with its extensive use in clinical evaluation [12], [18], [19]. However, due to the lack of a convex conjugate or a surrogate function, we limit ourselves to gradient-based reconstruction algorithms and forgo comparison with other non-gradient-based algorithms, e.g., stochastic Primal-Dual Hybrid Gradient (sPDHG) [20] or Stochastic Variance Reduced Expectation Maximisation (SVREM) [21].
We aim to demonstrate that SVR optimisation algorithms are at least as fast and stable as the BSREM algorithm when applied to a 3D non-TOF PET objective function regularised using the RDP [2]. We investigate different algorithm implementations using various step sizes, number of subsets and preconditioner configurations. Aside from our own preliminary work [22], [23], to the best of our knowledge, this is the first study of this kind. In Section II, the problem of iterative PET reconstruction is formulated and properties of gradientbased SVR algorithms are explained. Section III describes the experimental setup used to evaluate the reconstruction algorithms. Section IV contains the results of these experiments and Section V provides a discussion of the key findings. The primary conclusions of this research are presented in Section VI.

A. Penalised Maximum Likelihood PET Reconstruction
PML PET reconstruction is founded on an affine transform that maps the unknown tracer distribution , where N v and N b are the number of image voxels and detector bins, respectively [1]. This is achieved via a system matrix operator , where the entry a j i models the probability of an emission from the i th voxel being detected by the j th detector element. Models of detector normalisation and attenuation may be included in A. The model is complicated by the presence of random and scattered events, which may be accounted for using an additive expected backgroundb = {b j } N b j =1 . Hence, the acquisition model, which is also known as the forward model, is given bȳ j =1 is the expected data for a tracer distribution given by x. In PET, a good approximation for the measured data y is that y j are statistically independent and follow a Poisson distribution with meanȳ j (x).
PML methods optimise an objective function: where L(x) measures the goodness of fit with the measured data y, R(x) imposes a penalty on the image and β ≥ 0 is a global scale factor between the two functions. Given the Poisson assumption, it is natural to use the Poisson log-likelihood function L(x), given by as the goodness of fit. The function is concave with respect tō y j (x)and x. PET PML methods typically optimise the objective function (x) with the tracer distribution x constrained to be a non-negative vector. For most common penalties R(x), the objective function (x) := L(x) − β R(x), β ≥ 0, has a unique maximiser, i.e., the PML solutionx, under mild conditions [11]. Gradient based algorithms aim to converge to the PML solutionx by iteratively adding a weighted objective function gradient ∇(x k ) = ∇L(x k ) − β∇ R(x k ) to the current estimate x k , and often take the form where k is the update number index, P x≥0 [·] is a voxelwise non-negativity projection operation, α k > 0 is a scalar step size and D(x k ) ∈ R N v ×N v is a strictly positive definite preconditioner. One example algorithm of this form is Maximum Likelihood (ML) Expectation Maximisation (EM) (MLEM) [24] written in the additive form [25], when α k = 1, where the operator diag[·] constructs a diagonal matrix from a vector and the vector division is element-wise [26].

B. Subset Methods
The computation of the log-likelihood gradient ∇ L(x) generally dominates the computational effort for iterative methods, even when utilising an efficient vectorised implementation [1]. This is due to the application of the forward and backward projection operators, A and A respectively. Image reconstruction may be accelerated by evaluating only sub-likelihood gradients at each update. This is achieved by projecting into and from a subset of the data. The measured projection data y may be divided into a set of M subsets {S m } M m=1 that are pairwise disjoint, that is, S m ∩ S n = ∅, ∀m = n, and complete, that is, The objective function (x) may be written as a sum of sub-objective functions m (x) as where L m is the log-likelihood of the m th subset. The gradients are given by The additive preconditioned subset update may be recast as where D m (x k ) is a preconditioner of the m th subset. Many PET reconstruction algorithms utilise this format, e.g., (additive) OSEM [6], BSREM [7], [11], OSSPS [5] and Row-Action ML Algorithm (RAMLA) [3], each of which differ by the choice of D m (x) and α k . These algorithms generally use a deterministic cyclical sequence to select the subset index m at each image update [6], [10].
Computing only a subset gradient at each update leads to almost linear acceleration during early iterations if the subsets are balanced [6], [11]. The subset balancing amounts to The diagonal preconditioner may be used to further balance the weighted subset gradients, e.g., OSEM's subset dependant EM preconditioners. If the subsets and preconditioners are chosen properly, the subset balancing condition holds when x k is far from the solutionx. However, the assumption of subset balance does not generally hold due to various physical factors, e.g., projection angle geometry, attenuation and data noise. Discrepancies between subset data will lead to variations between subset functions and gradients. Thus, while a subset reconstruction algorithm efficiently optimises (x) when the iterate x k is far from the solutionx, after a number of passes through the data (epochs), the performance generally suffers when approachingx. This is exacerbated when large numbers of subsets are utilised.
As an alternative to deterministic cyclic subset selection, stochastic algorithms select m randomly at each update, usually with uniform probability. Stochastic Gradient Ascent (SGA) is a popular stochastic algorithm used to optimise a finite sum of separable functions, in the form of (4), using an update equation comparable to (6) with D(x) = I ∈ R N v ×N v , the identity matrix [27]. This results in an unbiased estimate of (x) and ∇(x), i.e., E[M m (x)] = (x) and E[M∇ m (x)] = ∇(x), where the expectation is taken with respect to the random subset index m [27]. Note the inclusion of M in these expectations and (6). These are included to allow for comparable gradient magnitudes between algorithms implemented with varying numbers of subsets.

C. Step Size Relaxation
Subset methods, in the form of (6), may converge to the PML solutionx if a relaxed step size sequence {α k } ∞ k=0 is applied [3], [7], [11]. A common relaxation method is to diminish step sizes over the iterations k so that lim k→∞ α k = 0. The sufficient conditions on relaxation sequences for global convergence are given by [3], [7], [28] Deterministic algorithms, e.g., BSREM [3] and RAMLA [7], utilise this methodology. However, the selection of hyperparameters for the relaxation sequence can greatly impact reconstruction performance.

D. Stochastic Variance Reduction Algorithms
Variance between subset gradients may impede reconstruction performance as x k approachesx. Recently, a novel class of stochastic variance reduction techniques have been proposed to combat this. In this work, we investigated the application of SAGA (Algorithm 1) [14] and SVRG (Algorithm 2) [13] to PET image reconstruction. These algorithms approximate the full gradient ∇(x k ) with a low computational cost gradient estimate∇ k,m (x k ), but with a lower variance than the (randomly selected) subset gradient ∇ m (x k ) [17]. This is achieved by computing ∇ m (x k ) for a randomly selected subset index m at each update and utilising M previously computed subset gradients g m stored in memory. Under certain convexity assumptions, SVR methods converge linearly in expectation where as traditional subset algorithms, e.g., (6), are sub-linear [17].
Generally, SVR algorithms formulate the update directioñ where ξ = M for SAGA and SVRG [29], which results in an unbiased estimator of the gradient, i.e., ∇( [17]. Different choices for ξ result in a biased estimator, e.g., ξ = 1 for SAG [15]. Compared with SAGA, SVRG does not store the subset gradient g m k ← ∇ m k (x k ) at each iteration and instead periodically updates each of the g m terms from an anchor imagex. This periodic computation occurs every γ epochs, i.e., γ passes over the entire data set. For convex optimisation problems, γ = 2 is heuristically suggested [13]. This periodic computation of M g m terms is equivalent to the computation of the full gradient (i.e., an epoch) with a single image update applied. This allows for the evaluation of the objective function without additional projection operations and, to the best of our knowledge, this is a unique feature for a subset algorithm.
Note, the M factor in (9) is not present in the works of [13] and [14] because the objective functions in the said works are formulated as (x) = 1 M M m=1 m (x), whereas this work uses (4).

E. Adaptation to PET
The PET problem tends to be ill-conditioned. Thus, direct application of the aforementioned standard SAGA or SVRG algorithms may result in slow reconstructions. We replace the update step of Algorithms 1 and 2 with a preconditioned update, given by where the positive-definite diagonal preconditioner D k ∈ R N v ×N v can depend on k. Furthermore, this modification also includes the projection operation onto non-negative vectors and thus the optimisation problem is constrained [14]. The resulting update closely resembles (6). We will denote these preconditioned algorithms with their original abbreviations (SAGA and SVRG) in the remainder of this paper.
There are several possible choices for the preconditioner D k and step size α k , which can lead to different convergence behaviour. The simplest choice is to set the preconditioner D k to be constant, diagonal and positive definite, and the step size constant as well. Then SAGA and SVRGthe SVRG and SAGA will converge almost surely to the PML solutionx provided that the step size α is upper bounded by a critical value α L D , which depends on the global Lipschitz constant L D > 0 (that is assumed to exist) of the preconditioned system [21,Theorem 4.1]. 1 The convergence result remains valid if the preconditioner and step size are fixed after a finite number of epochs.
Due to the challenge of determining the optimal step size a priori, we also investigated a diminishing step size sequence of the form (8) for these stochastic algorithms. Building upon the convergence proof of [11, III.B], for a diagonally scaled incremental gradient method of the type (6), we expect that the aforementioned almost sure convergence remains valid for relaxed step size sequences satisfying (8) and a constant preconditioner, although a complete proof is still unavailable. Nonetheless, the experiments below confirm the convergence.

A. Synthetic Data Generation and System Modelling
To assess the applicability of SVR algorithms for PET reconstruction, a numerical OpenGATE (GATE) simulation [30] of the GE Discovery 690 scanner [31], with uniform crystal efficiencies, was performed via the STIR-GATE-Connection [32]. A photon emission simulation was performed using back-to-back 511 keV photon emissions from a voxelised XCAT torso phantom [33] with activity concentrations representative of an 18 F-FDG study (see emission and attenuation distributions in Fig. 1a and Fig. 1b, respectively). A 1 cm diameter, 1 cm long, cylindrical hot lesion, with 2.5:1 lesion to lung contrast, was inserted into the lung of the XCAT emission. Cardiac and respiratory motion, along with radioactive decay, were not modelled. The resulting list mode data were re-binned into sinograms with 288 projection angles.
Normalisation factors are required for an accurate system model. A cylindrical phantom, the size of the scanner's Field of View (FOV), was forward projected in both GATE and Software for Tomographic Image Reconstruction (STIR) [34] without attenuation using ray-tracing with 10 rays per bin. The normalisation was computed, as per component-based normalisation factors with crystal efficiencies and "geometric" effects, using maximum likelihood computation with STIR-GATE-Connection tools [32], [35]. However, as the GATE simulation did not contain any variability between blocks, the crystal efficiency aspect of the normalisation may be ignored.
An estimation of random coincidence events was also made with STIR-GATE-Connection tools. Singles were estimated 1 A modified likelihood function is required ifb j = 0 for any j [21], but for any realistic PET reconstruction (like those included in this work)b j > 0 ∀ j. The proof also requires ∇ m to be bounded upon and m (x) to be concave over this bounded region. from the delayed coincidence events (recorded by GATE) using a maximum likelihood algorithm and then multiplied to find the randoms rates [36]. However, we observed a discrepancy between the total number of delayed coincidence events and the true number of random coincidence events. Therefore, the estimated randoms were globally scaled to match the true randoms count [32].
An estimation of the scatter contribution was similarly computed using STIR's iterative single scatter simulation and estimation utilities [34], [37]. Image reconstructions were performed using custom reconstruction software, based on STIR.
To assess reconstruction algorithm performance, the list mode file was sampled to acquire sinogram data with 50, 250, and 1200 million events with a true-to-background ratio of 0.93:1, where background is random plus scattered events. This resulted in data sets with low to high Signal-to-Noise Ratio (SNR), respectively. The aforementioned data corrections were computed individually for these data sets. Attenuation factors and normalisation factors were incorporated within the system matrix A.

B. Patient Data
A 150 second 18 F-FDG static patient scan was acquired on a 5 ring GE Discovery MI (272 projection angles) [38] and 143 million events were recorded. Non Time Of Flight (TOF) quantitative data corrections were computed using vendorprovided procedures but image reconstructions were performed similarly to the aforementioned simulated data. The patient was part of a prospective study approved by the ethical committee (BASEC-Nr 2018-01012) investigating whole body dynamic PET. The patient gave informed consent for further use of their data. The patient was diagnosed with a bronchial carcinoma in the left upper lobe (Fig. 2).

C. Subset Methodology
Equally spaced rows of a sinogram were binned into a subset. Subsequent subsets were constructed similarly but were offset from one another by their respective subset index m. BSREM and OSEM utilise the cyclical and deterministic subset sampling methodology, as discussed in Section II-B. The stochastic algorithms select m uniformly at random at each update.

D. Algorithm Warm Starting
The proposed SVR algorithms have been observed to be sensitive to initial conditions [39]. This is because the initially stored subset gradients, which are computed when x k is far from the PML solutionx, are not representative of the subset gradients when x k approachesx. These gradients are retained in memory and can lead to undesirable update directions once the image has begun to resolve. Therefore, we initialised the investigated algorithms from an image computed by one epoch of OSEM (x OSEM ). A 24 subset OSEM was performed for the simulated data (Fig. 1c) while 17 subsets were used for the patient data (Fig. 1a).

E. Penalty
Ideally, the penalty function R(x) in (2) reduces the impact of noise in the data while promoting structural edges [1]. Due to its use in modern clinical PET scanners, we implemented a spatially-variable [40] RDP [2] to achieve this, given by where N i is a 3 × 3 × 3 neighbourhood about the i th image voxel, and ω = 2 is a parameter used to control edge preservation. δ i and δ l are spatially variable penalty strengths with the image κ = {δ i } N v i computed by The computation of κ requires two forward projections and one backward projection but this is ignored in our analysis. The RDP encourages smooth cold background (low activity) regions and can improve lesion detectability when compared with regularisers that are less edge-preserving, such as the quadratic penalty [12], [18].

F. Diagonal Preconditioner
The EM preconditioner D( is used in the MLEM and BSREM algorithms. However, this preconditioner is not positive definite and prevents a voxel with zero value (x i = 0) from updating, regardless of the gradient's value. The authors of [11] proposed a modification to the update equation to prevent this occurrence by projecting x i ≤ 0 to a small positive value. In this work, we implemented a different modification of the EM preconditioner, given by where δ is a uniform image of heuristic pixel value 10 −6 , which is included to allow voxels with value 0 to be updated.
The δ is small enough that it does not significantly impact reconstruction performance in this work. We investigated the impact of different preconditioner inputs on the convergence of SAGA and SVRG. These included D(x k ), freezing the preconditioner at D(x OSEM ), and freezing the preconditioner after 5 and 10 epochs, D(x 5M ) and D(x 10M ) respectively.

G. Step Sizes
As per Section II-E, SAGA and SVRG are guaranteed to converge for constant step sizes (if small enough) but the sharp upper bound on the step size magnitude is difficult to compute. Regardless, we investigated the application of α k = 1 with preconditioner given by (13), a configuration that has been explored in previous studies [22], [23]. We additionally investigated the application of a relaxed step size sequence that satisfies the conditions stated in (8), given by where the relaxation factor η is scaled by the number of subsets M so that step sizes of reconstructions, using different numbers of subsets, are comparable at similar epochs.

H. Converged Comparison and Performance Metrics
In PML, it is assumed that the penalty term R(x) encourages local smoothness and edge preservation so the PML solution x is desirable [1]. Therefore, we assessed reconstruction performance by comparing each algorithm updates x k to the converged imagex, which was obtained by reconstructing the data using (10) with∇ SVRG k,m k and the modified-EM preconditioner (13). The reconstruction algorithm's step size α k was reduced if (x k ) ≤ (x (k−γ M) ) when measured periodically at each epoch.
The algorithm was terminated andx = x k when the estimate satisfied the Karush-Kuhn-Tucker conditions (within numerical precision) [41].
The following metrics were employed to assess the reconstruction performance. A Region Of Interest (ROI) was drawn exactly over the XCAT data's inserted lung lesion. Similarly, a (20 × 26 × 20) mm 3 ROI was drawn over the bronchial carcinoma and Standardized Uptake Value peak (SUVpeak) values [42] were computed. The ROI contains the entire carcinoma hot spot. Furthermore, a (50 × 60 × 60) mm 3 ellipsoidal ROI was drawn within the patient's liver with a minimum 10mm distance was maintained from the organ's boundary. ROI percentage error was computed for each update k, with respect tox, given by where θ k andθ are mean ROI voxel values for the k th image estimate and converged image, respectively. Furthermore, the objective value (x) was computed after every epoch of the reconstructions. Unlike deterministic algorithms, SVRG and SAGA generate additional noise in the reconstruction performance metrics due to stochastic subset sampling at each update. Thus, a set of 15 independent reconstructions S were run for each stochastic algorithm configuration. The mean and standard deviation values of each metric were computed over the stochastic realisations. Additionally, the general convergence performance of these algorithms was assessed using a percentage Normalised Root Mean Squared Error (NRMSE) metric at each update k, given by

IV. RESULTS
Sections IV-A to IV-C, we evaluate the impact of the number of subsets, various preconditioner configurations and various step size relaxation parameters with the SAGA and SVRG gradient estimators. Reconstructions in these sub-sections utilise the noisy XCAT data with 50 million prompt events. The impact of noise levels on SAGA's and SVRG's performance is investigated in Sections IV-D and IV-E demonstrates the algorithms' reconstruction performance when applied to a patient data.

A. Varying the Number of Subsets
Standard PET subset algorithms (e.g., BSREM) are linearly accelerated but, at later iterations, limit cycle behaviour may be observed in the reconstruction sequence. An example of these variations is exhibited by the BSREM profile in Fig. 3a, even with step size relaxation. SVRG reconstructions using various numbers of subsets, the D(x k ) preconditioner (13), and no step size relaxation are shown in Fig. 3c. Note, SVRG's full gradient re-computation parameter γ was fixed at γ = 2. The usage of fewer subsets resulted in slower convergence of the mean performance but the standard deviations were greatly reduced. The mean values, for each of the configurations, converged to 0% lung lesion error with the deviations reducing dramatically at later epochs. SAGA reconstructions exhibited similar behaviour to SVRG, albeit with larger deviations. Based upon these results, Figs. 4, 5, 6 and 7 plot the performance of SAGA and SVRG using 72 subsets, which demonstrated a reasonable trade-off between convergence rate and stochastic variation.

B. Constant Preconditioner and Step Size
In Section IV-A, the SVR algorithms were implemented with the diagonal preconditioner D(x k ), which was updated at each iteration k. In Fig. 4 we investigate the impact of anchoring input x for the preconditioner D(x) after various periods of computation by plotting the NRMSE%.
The SVR reconstruction NRMSE% performances are comparable for the D(x k ), D(x 5M ) and D(x 10M ) preconditioners. However, the D(x OSEM ) lesion values do not converge to the solutionx. SVRG appears to converge closer to the PML solutionx than SAGA for similar configurations.
In Fig. 5 we plot ROI values of a region outside of the torso. The voxels in this region do not converge to zero when the preconditioner D(x OSEM ) is implemented without step size relaxation.

C. Relaxed Step Size and Constant Preconditioner
The global convergence of the SVR algorithms is not guaranteed when using a constant step size that is too large.
Here we investigate the impact of using step size relaxation when a poor performing anchored preconditioner is utilised. Fig. 6 indicates that the lesion ROI value converges to the PML solutionx with relatively little relaxation, even when the preconditioner is fixed at an un-optimised one.

D. Data Noise
Impact of data noise on the performance of the investigated algorithms is shown in Fig. 7. The stochastic algorithms were implemented with a combination of previously observed advantageous configurations. The SVRG reconstructions demonstrate similar performance when optimising (x), regardless of step size relaxation, for all data sets. The SAGA reconstruction profiles with η = 1.5 result in performance comparable to that of SVRG but the algorithm is sensitive to the selected step size parameters.

E. Patient Data
Multiple reconstructions of the selected patient data set were performed with various step size relaxation parameters η. Algorithm configuration performance was evaluated in terms of the average MSE obtained over the first 10 epochs (10M), i.e., 1 M 10M k=0 (θ k −θ) 2 , where θ is the mean bronchial carcinoma ROI value for this data. This was done in an attempt to identify optimal η values with fast SUV convergence within 10 epochs. Using this grid search, the best performing BSREM (17 and 34 subsets) and SAGA and SVRG (68 subsets) reconstructions are plotted in Fig. 8. The mean bronchial carcinoma values (Fig. 8a) converged quickly to an SUV peak value of 6.138 for both SAGA and SVRG with minimal deviations after 3 epochs. The liver mean ROI value converged to 7.82 kBq/mL using SAGA and SVRG within a similar number of epochs (plots not shown). Additionally, the standard deviation of voxel values within the liver ROI converged to a value of 0.83 kBq/mL, see Fig. 8b.

A. Impact of Number of Subsets
As observed in Fig. 3, increasing the number of subsets of the SAGA and SVRG algorithms results in greater variations in metric performance between stochastic realisations during early epochs. However, the magnitude of these variations are still comparable to the BSREM reconstruction inter-update variations, which utilised a step size relaxation sequence, while SAGA and SVRG used a constant α k = 1. SVRG demonstrated reduced variations between stochastic realisations but does exhibit slower convergence. This is likely due to periodically recomputing the g terms fromx and keeping much of the∇ k,m (x k ) computation constant, improving stability, but incurring the periodic additional computational cost.
The stochastic variance observed during early epochs of Fig. 3 and Fig. 4 appears greater for SAGA than SVRG. For SAGA, although it is expected that a subset index m ∈ {1, 2, . . . , M} is selected once every epoch, the expected number of epochs required to select each m once grows as O(log(M)) [43,Sec. 4]. Hence, a number of SAGA's stored subset gradients may not have been computed from recent iterates. Thus, for some m, g m ≈ ∇ m (x k ). This will result in increased variance between stochastic realisations. This issue is exacerbated by SAGA's initialisation of g m = 0, ∀ m because the variance reduction properties of these algorithms are obtained when g m approaches ∇ m (x k ) as k grows [17]. In contrast, SVRG recomputes all g m terms periodically and therefore g m ≈ ∇ m (x k ).
The 72 subset SAGA and SVRG reconstructions demonstrated a good balance between faster initial performance than BSREM and reduced stochastic influence after about 5 epochs (Fig. 3).fast initial performance (in mean) and reduced stochastic influence after about 5 epochs. This evidence suggests that while a larger number of subsets can be used for with SAGA and SVRG than BSREM, too great a number may result in issues, particularly for SAGA.

B. Convergence
In order to ensure convergence to the PML solution, we applied a modified version of the EM preconditioner, frozen at different epochs and investigated several step size relaxation parameters η. We observed a systematic error in the later epochs of reconstructions utilising D(x OSEM ) without step size relaxation, both in the lung lesion (Figs. 4a, 6b and 6b) and the region surrounding the patient (Fig. 5). We hypothesise that is related to a step size that is too large coupled with the non-negativity constraint. A large step size induces oscillations between updates of the voxel values. However, in the background region this leads to a positive overall bias due to the truncation to non-negative values. This positive bias then leads to a (small) negative bias inside the torso. Our results indicate that this is resolved by using a diminishing step size sequence (e.g., (14)) as well as by deferring the anchoring of the preconditioner until later epochs (e.g., D(x 5e )). Note that the latter achieves a smaller effective step size in the background, reducing the need for relaxation. The implementation of a delayed anchoring of the preconditioner might therefore allow for significantly improved performance. Our experiments indicate that the implementation of a moderately relaxed step size sequence, along with an anchored preconditioner, allows for fast and (almost sure) convergence, but this remains to be theoretically confirmed. We note that these observations are specific to EM-like preconditioners. Performance of these algorithms with a preconditioner that does not iteratively reduce effective step size near zero values, e.g. the OS-SPS "precomputed denominator" [11] and related row-sum of the Hessian preconditioners [26], remains to be investigated but would likely suffer due to the non-negativity constraint.
C. Impact of Data Noise Fig. 7 indicates that SVR algorithms are particularly effective in reducing the impact of subset imbalance for noisier data. The SAGA and SVRG reconstructions generally achieve higher objective function values faster than BSREM with the 50 million event data set.SVRG and SAGA reconstructions resulting in generally higher objective function values much faster than BSREM for 50 million events. Yet, as the number of events increased, BSREM reconstructions became competitive in optimising the objective (x). Moreover, as the data set SNR increases, all reconstruction algorithms appear less sensitive to relaxation parameter η.

D. Patient Data
Commonly, noise and bias trade-offs are evaluated in PML reconstruction analysis [12], [44]. This evaluation is sensitive to the reconstruction algorithm, epoch number and objective/ penalty function. Due to the criteria used to optimise η in Section IV-E, the bronchial carcinoma converged quickly with BSREM, although some inter-update variations were present (Fig. 8a). However, the liver standard deviation did not converge within 10 epochs in Fig. 8b. Both of the BSREM reconstructions exhibited standard deviation values that were consistently greater than the converged value and significant inter-update variations. Therefore, usage of SAGA and SVRG effectively removes a source of potential error due to selection of the reconstruction algorithm, number of subsets or reconstruction length (after 4 epochs).
SAGA's performance appears more sensitive to step size selection than SVRG's. This is evidenced by the discrepancy between SAGA's and SVRG's optimal η values and SAGA's larger standard deviations in Fig. 6. A plausible explanation for this phenomenon is the initial high frequency fluctuations in the SAGA reconstructions, which are not present for SVRG, that are likely caused by the lack of pre-set g m terms. Yet, even with the larger η value, SAGA converges with similar speed to SVRG and this indicates that the SVR algorithms are less sensitive to step sizes than compared to BSREM.

E. Practical Implementation for PET
For this study, we found that the memory requirement for storing M subset gradients g m terms was insignificant on a modern computing system with each g m corresponding to approximately 20MB in memory. Furthermore, the SAGA and SVRG algorithms do incur some additional computational cost in the computation of∇ k,m (x k ), compared to BSREM due to the additional image manipulations associated with (9). However, in an efficient software implementation, the impact of this should be negligible with respect to the computation of ∇ m (x), which requires subset forward and backward projection operations. Therefore, an epoch of the SAGA or SVRG algorithms is expected to require slightly increased, but comparable, computational effort to gradient-based algorithms (e.g., BSREM) [21].
An alternative SVRG implementation that benefits from reduced memory requirements can also be employed. Only the full gradient G = 1 M M m=1 g m and anchored positioñ x are stored in memory and ∇ m (x k ) and ∇ m (x) are computed at each update. We note that this implementation should generally be unnecessary for PET reconstruction due to the relatively small memory requirements for PET distribution volumes and the additional, but significant, computational cost of evaluating ∇ m (x) at every update.

F. Comparisons to Other Work
While the studied SVR algorithms were designed to utilise stochastic subset selection, results from a preliminary study indicated that the application of an OS selection methodology resulted in superior performance during early iterations but the convergence rate was diminished after 10 or so epochs [23]. In these works, the SAG algorithm [15] was also investigated. SAG utilises a similar algorithm methodology as SAGA but, as aforementioned in Section II, sets ξ = 1 in (9). This makes∇ SAG k,m (x k ) a biased estimator of ∇(x k ). SAG was not evaluated in this work because it was observed to be less stable than SAGA under the same conditions [22], [23].
The SARAH algorithm [16] is closely related to the previously described alternative SVRG algorithm (Section V-E). SARAH also periodically recomputes all g m terms from a common iteratex but only storesx and G in memory. The algorithm evaluates two subset gradients at each update, ∇ m (x k ) and ∇ m (x k−1 ). This additional computation is the reason that SARAH is not included in this work.
Another SVR algorithm is SVREM, which was developed and applied to PET reconstruction in a preliminary study [21]. This algorithm is a member of the EM family involving a closed-form formula for iterate updates based on parabolic surrogates and features non-negativity preservation.
Furthermore, the sPDHG algorithm has been recently developed [20] and applied to PET image reconstruction [45]. This algorithm guarantees (almost sure) convergence to the PML solutionx by solving a saddle point problem and has been applied to PET image reconstruction [45]. When combined with preconditioning, sPDHG reconstructions can utilise a large M and have shown great promise for 3D PET reconstruction [46]. However, the selection of sPDHG step sizes is generally heuristic and may impact performance.
This study focused exclusively on optimisating (x) with the RDP, which enjoys differentiability but the associated proximal operator and surrogate function are currently unknown. Consequently, neither SVREM or sPDHG were compared to SAGA and SVRG in this work.. In the future, we aim to compare each of these recently proposed stochastic algorithms for clinical PET reconstruction, although an alternative regulariser may be required.

G. Future Work
Clinical reconstructions often only require a few epochs of computation before the algorithm is terminated. In this work, an epoch of the OSEM algorithm preceded all the PML reconstructions, which provided a warm start for SAGA and SVRG (see Section III-D). During this first epoch, when initialising from x 0 = 1, standard subset gradient algorithms (e.g., OSEM and BSREM) are highly effective and demonstrate almost linear acceleration [1]. However, in future work, it may be beneficial to investigate alternative strategies for warm starting, e.g., performing a fraction of an epoch of initial OSEM computation. Alternatively, one may consider pre-populating g m terms for the SAGA algorithm to improve its stability, akin to SVRG's initial computation. This has been discussed in [27].
Regarding alternative step size relaxation methodologies, an adaptive step size selection methodology based on the periodic evaluation of objective function values could be beneficial. Such a methodology would be particularly suitable for SVRG-like algorithms that can compute the objective function value, almost for free, every few epochs.
In this work, we observed that the performance of stochastic algorithms was impacted by the choice of step size and preconditioner. The optimal preconditioner is the inverse Hessian of the objective. However, computation of this is infeasible for the 3D PET problem [1]. Yet, a number of limited memory stochastic second order methods have been proposed in the literature that are combinations of the Limited-memory-Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm [47], [48] and SGA [49] or SVRG [50]. These algorithms approximate the inverse Hessian based on the gradient information in the last few epochs. Alternatively, one might consider the use of nonuniform step sizes [51], the Barzilai-Borwein method [52] or momentum methods [53]. Using these methods, one may be able to further improve the convergence rate.
While in this work we focused on an azimuthal projection angle subset methodology, other methods of dividing the data exist and could be considered for SVR algorithms. List mode event based sampling and subset construction may yield additional reconstruction benefits, e.g., improved subset balance with large numbers of subsets by constructing subsets that are not limited by the geometry of the scanner [8], [9] or as an alternative to sparse projection data [54]. Therefore, construction and usage of subsets over alternative sinogram data dimensions (e.g., TOF bins, oblique angles ("segments"), or axial positions) may be viable with the presented SVR algorithms. In addition, in the context of motion compensated reconstruction, motion gates can also be used as subsets [55].
An initial investigation using sPDHG demonstrated promising results for stochastic algorithms [56]. Moreover, the acceleration that these SVR algorithms realise and the reduced sensitivity to subset selection indicates the potential benefit of their application to other low event reconstructions, such as and motion compensated parametric reconstructions. However, the impact of sparsity or greater subset imbalance may lead to additional challenges.

VI. CONCLUSION
We have shown that SVR algorithms SAGA and SVRG can be successfully adapted (10) for 3D non-TOF PET reconstruction. The investigated lung lesion, inserted into the XCAT volume, and patient data bronchial carcinoma converge to the respective PML values within 20 epochs in the majority of tested configurations cases. We observed that increasing the number of subsets accelerates convergence. While this increases stochastic variability during early epochs, this variability reduces to virtually zero within 10 epochs in most cases.
Step size and preconditioner selection have dramatic impact algorithm performance, particularly in the region surrounding the emission object. Best results were obtained using moderate step size relaxation and allowing the image in the EM preconditioner to vary for a short period before anchoring.
In the tested configurations, SVRG generally optimises the objective function faster than SAGA and, when compared to the BSREM reconstructions, the investigated algorithms demonstrated superior convergence properties by every investigated metric.
Implementation of these algorithms in future studies, involving the variation of objective function parameters (e.g., priors and penalty strength), may realise improved quantification results, at lower computational cost, especially for low count data reconstructions.