A Joint Row and Column Action Method for Cone-Beam Computed Tomography

The inversion of linear systems is fundamental in computed tomography (CT) reconstruction. Computational challenges arise when trying to invert large linear systems, as limited computing resources mean that only a part of the system can be kept in computer memory at any one time. In linear tomographic inversion problems, such as X-ray tomography, even a standard scan can produce millions of individual measurements and the reconstruction of X-ray attenuation profiles typically requires the estimation of a million attenuation coefficients. To deal with the large data sets encountered in real applications and to efficiently utilize modern graphics processing unit based computing architectures, combinations of iterative reconstruction algorithms and parallel computing schemes are increasingly applied. Whilst different parallel methods have been proposed, individual computations currently need to access either the entire set of observations or estimated X-ray absorptions, which can be prohibitive in many realistic applications. We present a fully parallelizable CT image reconstruction algorithm where each computation node works on arbitrary partial subsets of the data and the reconstructed volume. We further develop a nonhomogeneously randomized selection criterion which guarantees that submatrices of the system matrix are selected more frequently if they are dense, thus maximizing information flow through the algorithm. We compare our algorithm with block alternating direction method of multipliers and show that our method is significantly faster for CT reconstruction.


A Joint Row and Column Action Method for Cone-Beam Computed Tomography Yushan Gao and Thomas Blumensath
Abstract-The inversion of linear systems is fundamental in computed tomography (CT) reconstruction.Computational challenges arise when trying to invert large linear systems, as limited computing resources mean that only a part of the system can be kept in computer memory at any one time.In linear tomographic inversion problems, such as X-ray tomography, even a standard scan can produce millions of individual measurements and the reconstruction of X-ray attenuation profiles typically requires the estimation of a million attenuation coefficients.To deal with the large data sets encountered in real applications and to efficiently utilize modern graphics processing unit based computing architectures, combinations of iterative reconstruction algorithms and parallel computing schemes are increasingly applied.Whilst different parallel methods have been proposed, individual computations currently need to access either the entire set of observations or estimated X-ray absorptions, which can be prohibitive in many realistic applications.We present a fully parallelizable CT image reconstruction algorithm where each computation node works on arbitrary partial subsets of the data and the reconstructed volume.We further develop a nonhomogeneously randomized selection criterion which guarantees that submatrices of the system matrix are selected more frequently if they are dense, thus maximizing information flow through the algorithm.We compare our algorithm with block alternating direction method of multipliers and show that our method is significantly faster for CT reconstruction.
Index Terms-CT image reconstruction, parallel computing, gradient descent, coordinate descent, linear inverse problems.

I. INTRODUCTION
I N TRANSMISSION computed tomography (CT), standard scan trajectories, such as rotation based or helical trajectories, allow the use of efficient analytical reconstruction techniques such as the filtered backprojection algorithm (FBP) [1], [2] and the Feldkamp Davis Kress (FDK) [3], [4] method.However, in low signal to noise settings, if scan angles are under-sampled or if nonstandard trajectories are used, then less efficient, algebraic reconstruction methods can provide significantly better reconstructions [5]- [8].These methods model the Manuscript received April 23, 2018; revised July 4, 2018; accepted July 12, 2018.Date of publication July 18, 2018; date of current version November 6, 2018.This work was supported in part by EPSRC under Grants EP/K029150/1 and EP/R002495/1, in part by the University of Southampton PGR scholarship, in part by the Faculty of Engineering and the Environment Lancaster Studentship, and in part by the China Scholarship Council.The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ali Bilgin.(Corresponding author: Yushan Gao.) The authors are with the Faculty of Engineering and Environment, University of Southampton, Southampton, SO17 1BJ, U.K. (e-mail:, yg3n15@ soton.ac.uk;Thomas.Blumensath@soton.ac.uk).
Digital Object Identifier 10.1109/TCI.2018.2857446 x-ray system as a linear system of equations [9]- [11]: where y = [y 1 , . . ., y r ] T , x = [x 1 , . . ., x c ] T and e = [e 1 , . . ., e r ] T are projection data, reconstructed image vector and measurement noise respectively.The system matrix A ∈ R r ×c , with non-negative elements a cd (1 ), can be computed by Siddon's method [12].In reality, industrial CT is often used in cases where y and x can have millions of entries each [13] and where A, even though it is a sparse matrix, can have billions of entries.In these situations, solving Eq. 1 directly by calculating A −1 is infeasible.More feasible approaches use A and A T directly (instead of their inverse) to iteratively find an approximate solution of Eq. 1.
There are many mature and efficient algorithms to solve Eq. 1, including the conjugate gradient (CG) method and LSQR.These methods can be applied in small scale CT reconstructions when the system matrix A can be stored in computer memory [14], [15].When A is too large to be kept in memory, it is often more efficient to re-compute it on the fly during each iteration, which can be done relatively efficiently using modern graphical processor unit (GPU).However, GPUs have limited internal memory.A thus has to be broken into smaller subsets so that the GPU only operates on a subset of the data at a time.Whilst it is possible to sequentially process all data in this fashion, this requires constant data transfer to the GPUs internal memory.An alternative is the use of optimisation algorithms that only work on subsets of the matrix A at any one time.The CG and LSQR methods, whilst having their own block forms [16], [17], do not work with a single subset of A per iteration.As a result, other algorithms for CT reconstructions have been proposed.Currently, most of these methods can be divided into two categories: row action methods, which operate on subsets of the observations y at a time and column action methods, which operate on subsets of the voxels x at a time [18]- [22].
Row action methods divides the matrix A into several row blocks and the system equation thus becomes ⎡ ⎢ ⎣ y I 1 . . .
where A I i ∈ R m i ×c is the row block of system matrix A and number is M and M i=1 m i = r.The general iteration scheme in row action method can be summarised as where x k is the k th iteration result, R is a relaxation matrix to control the iteration step length and I ∈ {I i } M i=1 .In CT reconstruction, most mature methods are row action methods.These include the classical Kaczmarz family of algorithms (also known as the Algebraic Reconstruction Technique (ART)) [7], [23]- [25], the Simultaneous Algebraic Reconstruction Technique (SART) [19], [26], [27], the Simultaneous Iterative Reconstruction Technique (SIRT), [28]- [31] and component averaging (CAV) and its block form (BICAV) [32]- [34].
Column action methods, as a counterpart of row action method, divide the system matrix A into several column blocks.Eq. 1 thus is divided as where A J j ∈ R r ×n j is the column block of system matrix A and x J j ∈ R n j ×1 is the block of reconstructed vector x.The total block number is N and N j =1 n j = c.The general iteration scheme in column action methods is of the form where J ∈ {J j } N j =1 and the initial r = y.The application of column action methods in CT reconstruction can be traced back to 1990s [35]- [37].The preliminary column action method updates one voxel each time and it is easy to extend the preliminary method into group form [38]- [43].Similar to row action methods, the selection criteria on which column blocks to update can be randomised either uniformly [44], [45] or based on specific probability [46], [47].
Row and column action methods also allow for parallel computation.For parallel row action methods, each processor (or node) needs to store the whole reconstructed image vector x since each update is on the entire image, whilst for column action methods, each node needs to store all of y.As a result, the largest volume they can reconstruct is limited by the storage capacity of the computation nodes.An important exception to this is discussed in [48], where a row action method SIRT is discussed in which each node only requires parts of the reconstructed image vector x.However, this method only works for circular scan trajectories and its scalability is limited due to the requirement that overlap between projections of adjacent image blocks should be small.
General sub-block methods are proposed in machine learning (ML).There has been interest in the development of methods that use more general updates, updating subsets of x with only subsets of y at a time [49], [50].In this paper, these algorithms will be called "stochastic block coordinate descent" (SBCD).In particular, [51] and [52] independently proposed similar SBCD algorithms and both explored the application of variance reduction technique [53] to further accelerate the convergence rate.[54] proposed a semi-stochastic coordinate descent method to combine the stochastic gradient method and coordinate descent method to minimise a strongly convex problem.[55] mathematically proved the convergence rate of SBCD when the step length is decreasing and when the update strategy adopts a Gauss-Seidel type approach (updating the current column block depends on the previously updated column block), showing that the SBCD method has the same convergence rate as stochastic gradient methods when the objective function is convex.[50], [56] separately proposed the optimal sampling method in the SBCD method by randomly selecting column blocks and selecting row blocks based on a calculated probability.
The partition of A in SBCD is the same as our method.To define this, we partition A into M × N blocks.Let I i (1 ≤ i ≤ M ) be an index set that indexes m i ( M i=1 m i = r) rows in A and J j (1 ≤ j ≤ N ) be an index set that indexes n j ( N j =1 n j = c) columns in A .The matrix A J j I i thus is a sub-matrix of A with row indexes I i and column indexes J j .Thus we can divide the linear system into M × N blocks: Note that the index sets can be arbitrary partitions of the columns and rows and do not necessarily have to be consecutive.To facilitate the latter discussion, we also define residual r = y − Ax and let block residual r I be the subset of r defined as y I − A I x, where I ∈ {I i } M i=1 .With this notation, the general update scheme is where J ∈ {J j } N j =1 , μ is the step length and f I (x) = y I − A I x 2  2 , where the • is the l 2 norm of a vector.Thus the gradient ∇ J f I (x) = −2(A J I ) T (y I − A I x).There are two difficulties when applying SBCD in large scale CT reconstructions.The first one is that the step length μ is determined by calculating the Lipschitz constant of the gradient ∇ J f I (x), which is computationally challenging.The second issue is that the computation of the gradient in SBCD requires the calculation of the block residue r I = y I − A I x, which needs the whole of x.In other words, the system matrix A is not actually separable in column direction due to the need to calculate r I .This drawback makes the SBCD algorithm difficult to apply in a totally distributed network where each computation node only has partial access to both x and y.
Another method that can be used for linear systems is the multisplitting (MS) method [57].If MS method only partitions the A into column blocks [58], it always converges as long as A is of full column rank.However, If MS method partition the A into both row and column blocks.it becomes only applicable when A meets certain conditions.For example, when A is an "H-matrix" [59] or is a positive definite matrix [60].The system matrix A in CT reconstruction does not meet these requirements and MS dividing A in both dimensions did not work in initial experiments we conducted for CT reconstruction.
Our CT reconstruction setting leads to a convex optimization problem.The alternating direction method of multipliers (ADMM) is a popular tool in the parallelization of convex optimization probelms as it allows the decomposition into several smaller sub-problems [61].Several versions of ADMM were designed to work on small subsets of either x or y [62].In [63], a block ADMM method was proposed that works by breaking the problems into small subsets of x and y.This version of block ADMM introduces three sets of additional nuisance variables and each sub problem requires the solution to a smaller linear inverse problem, that can often be solved approximately using a small number of CG iterations.The new nuisance variables increase the overall storage requirements and the requirement for each node to solve a linear inverse problem in each iteration leads to relatively slow convergence and a high computation burden in large scale CT reconstruction, which will be illustrated in our paper.

II. INTRODUCTION OF CSGD
In this paper, we consider large scale CT reconstruction problems in a distributed networks, where each computation node only has limited access to both y and x.Inspired by SBCD and block ADMM, our goal here is to develop a parallelisable algorithm that works with generic x-ray tomographic scanning trajectories and that is faster and more memory efficient than block ADMM.Our novel algorithm, called coordinate-wise stochastic gradient descent (CSGD), also introduces nuisance variables, but we use fewer variables than block ADMM.Furthermore, CSGD simplifies the step length calculation as well as the residue update scheme of SBCD type algorithms and thus enables a full column decomposition for A. The algorithm is scalable so that it can be run on a range of computing platforms, including low memory GPU clusters and high performance CPU based clusters.

A. Derivation of the Algorithm
The proposed CSGD is similar to SBCD.The main goals of CSGD are to find a simpler strategy to compute the step length μ and to efficiently approximate the residue r I without having to compute the product A I x in each iteration.
Similar to SBCD, after selecting a row block I ∈ {I i } M i=1 , CSGD operates on the object function with gradient A coordinate descend update scheme is adopted and only those elements whose indices are in the set J ∈ {J j } N j =1 are updated, so that the descent operator is Along with this new modified direction, the update on voxels becomes where μ is the gradient step length, J is the complement to the set J and g J is a sub set of − 1 2 g.The steepest descent idea is to make the direction of ∇f I (x k ) perpendicular to the direction of g, i.e.
Use the fact that x k = x k −1 + μ g and A I g = A J I g J , the μ leading to the maximum descent is This calculation does not require the corresponding computation node to have access to the whole row or column block of matrix A and does not have to calculate the Lipshitz constant to determine the step length.When the matrix A cannot be stored and need to be generated on the fly, CSGD iteration only requires to use a sub matrix A J I and its transpose.Furthermore, GPUs can calculate the forward projection A J I g J and backward projection (A J I ) T r I very efficiently to achieve GPU-accelerated projection operations [64], [65].
The step size derived from Eq. 13 is chosen to reduce r I instead of r .We observed experimentally that our choice of μ can lead to instability in the algorithm.One method to avoid this is to introduce an additional relaxation parameter β < 1, which led CSGD to converge to a precision level similar to that of SIRT and CAV.The alternative of using a constant step length μ also led to convergence, but required a careful choice of step length (thus making parameter tuning difficult) and led to relatively slow convergence.r I plays an important role in CSGD since it determines the update direction of x.If we had access to all x j , then we could simply compute r I = y I − j x j .In our setting, the node only has access to one x j , thus the above computation is not possible.Instead, each node computes a quantity z j I = A J j I x J j .However, as discussed later, we might not compute all z j I in each iteration.In this case, we use an older versions of z j I to approximate r I = y I − N j =1 z j I .For smoothly varying cost functions, using old residual information when calculating gradients is similar to the use of old gradients in parallel methods that can often be shown to converge [66], [67].
The pseudo-code of the basic computation blocks is shown in the Fig. 1.
To parallelize the algorithm, we do not have to update all sub-blocks before updating r.We define percentages α and γ to specify the faction of row and column blocks to be updated.The algorithm is shown in Algo.2.In the following discussion, when operations from line 4 to line 16 are performed for one time, the algorithm is said to perform for "one epoch".Initialization: select system matrix's row index CSGD Algorithm which Parallelize Both Row and Column Blocks.1: Input: y, α, β, γ, {I i } M i=1 and {J j } N j =1 .2: Initialisation: x = 0, (i.e., {x J j } N j =1 = 0), {x i } M i=1 = x, {z j } N j =1 = 0 and r = y.3: while stopping criterion is not met do 4: for j = 1,2, . . ., γN in parallel (J loop) do 5: randomly draw J from {J j } N j =1 with replacement 6: xi (1 ≤ i ≤ M ) = 0 7: for ĩ = 1,2, . . ., αM in parallel (I loop) do 8: randomly draw I from {I i } M i=1 with replacement 9: g J = (A J I ) T r I 10: end for 14: end for 15: r = y − j z j 16: for all updated J, x J = i (x i J )/(αM ) 17: end while 18:

B. Comparison CSGD With Block ADMM
Block ADMM has the same parallel computing architecture as CSGD.To facilitate the comparison, we briefly present the basic operation of block ADMM algorithm [63], as shown in Algo.3.Here avg is the element-wise averaging operator.The Algorithm 3: Block ADMM Iteration.
for j = 1,2, . . ., N in parallel (J loop) do 7: randomly draw J from {J j } N j =1 with replacement 8: for ĩ = 1,2, . . ., M in parallel (I loop) do 9: randomly draw I from {I i } M i=1 with replacement 10: end for 12: end for 13: where I d is an identity matrix whose size is determined by the size of A J I .To solve this equation, the CG method can be used.Standard techniques to speed up block ADMM include the early termination of the CG iteration and a variable ρ-update scheme [61].The exchange operator exch(c, {c j } N j =1 ) is given by When applying block ADMM and CSGD in a distributed network, they share the same architecture.For M = N = 2, parallel block ADMM is shown diagrammatically in Fig. 1.Here, each computation node (ovals) stores one image block x J (J ∈ {J j } N j =1 ) and one data block y I (I ∈ {I i } M i=1 ).There are two types of master node (square boxes).One type performs the "avg" procedure for column blocks and the other type performs the "exch" procedure for row blocks.One set of master nodes J and xk J and the other set of master nodes stores y k + 1 2 I , y k I and ỹk I .The same distributed architecture of Fig. 1 is also suitable for CSGD with α = γ = 1.For each J, CSGD requires the exchange and summation of z j I (i.e. over columns in Fig. 1) and for each I, CSGD requires the exchange and summation of xi J (i.e. over rows in Fig. 1).These operations Fig. 3.A standard 2D scanning geometry with a Shepp-Logan phantom, where the P is the x-ray source, O is the centre of the object and the rotation centre.D is the centre of the detector.Source and detector rotate around the centre and take measurements at different angles.The linear detector is evenly divided into to sub-areas DE and DF , which will be used in importance sampling discussed later.
are similar to the "avg" and "exch" procedure of ADMM respectively.Note that these operations can be implemented efficiently using a message passing approach.The storage demand is different between block ADMM and CSGD.Overall, CSGD requires storage of vectors y ∈ R r ×1 , r ∈ R r ×1 and x ∈ R c×1 and of the set of N vectors {z j } N j =1 ∈ R r ×1 , so the total memory requirement for CSGD is (N + 2)r + c.Block ADMM requires more storage, as it requires storage of x, x, z, z, y and the M vectors {x i } M i=1 and {x i } M i=1 and the N vectors {z j } N j =1 , leading to a total storage demand of (N + 3)r + (2M + 2)c.

C. Partition Methods in CT Case
In our CT reconstruction problem, the partition of A along columns is equivalent to the partition of the image.Two examples of column partition are 1) to cut the image along one dimension and 2) to cut the image along all dimensions.Two 2D examples are shown in Fig. 2.
To understand how partitioning row blocks of A relates to the tomographic imaging setting, we take the 2D scanning model shown in Fig. 3 as an example.When detector and source are at a given location, we obtain projections along a range of paths from the source to the different detector elements.For one source/detector location, these measurements will be called one "projection".We could form the row blocks of A with random projections, but then keep them fixed during different epochs.A non-random version of this will be called "deterministic partitioning" in which one row block contains sequential projections from successive projection angles.We can further divide the detector into several sub-areas and treat projections from each sub-area as a "sub-projection".An illusion of this in 3D is shown in Fig. 4, where one projection is divided into 4 sub-projections.When forming the row blocks, different sub-projections from different projection angles can be grouped together.This will be seen to be advantageous in the importance sampling strategy discussed next.

D. Importance Sampling
When slicing the detector into several sub-areas, we can also form row blocks dynamically.It means that the sub-areas forming row blocks are changing for different epochs.Based on the proposed algorithm, it is straightforward to develop a random sampling strategy that goes through all projection views and all detector sub-areas arbitrarily.Looking at Algo.2, for one image block x J j (1 ≤ j ≤ N ), we could randomly select sub-areas to form one row block I i (1 ≤ i ≤ M ), with each sub-areas being chosen with equal probability as long as the sub-area receives x-rays passing through x J j .This sample method will be called "uniform sampling".Considering the sparsity of A, we further develop an importance sampling strategy.Fig. 4 illustrates that the projection of a sub-image block only intersects a small part of the detector.When sampling sub-projections for each sub-image block x J j (1 ≤ j ≤ N ), a vector P j representing the probability of choosing each detector sub-area is calculated.The calculation is based on projection areas of x J j on each sub-detector area at different projection angles.Then α * 100% of the subareas are sampled with probability P j and are grouped into αM row blocks.These row blocks are assigned to different nodes to perform line 9-12 in Algo.2.In this case, the detector sub-areas receiving more projections from the current sub-image block x J j are more likely to be chosen.

E. Computational Complexity
There are several important aspects when comparing computational efficiency of the methods.The methods are designed to allow parallel computation.We envisage this to be performed in a distributed network of computation nodes 1 .Computation nodes produce two outputs, as displayed in line 11 and line 12 in Algo.2.These are then either sent to larger, but slow storage or directly to other nodes, where they are eventually used to compute line 15 and line 16 in Algo.2, which can be performed efficiently using message passing interface reduction methods.
Three aspects affect performance: 1) Computational complexity in terms of multiply-adds.
2) Data transfer requirements between data storage and a processing unit as well as between different processing units.3) Data storage requirements, both in terms of fast access RAM and in slower access (e.g.disk based) data storage.Each of these costs are dominated by different properties: 1) Computational complexity is dominated by the computation of matrix vector products involving A J I and its transpose, especially as A is not generally stored but might have to be re-computed every time it is needed.The computational complexity is thus O(|I| * |J|), though computations performed on highly parallel architectures, such as modern GPUs, are able to perform millions of these computations in parallel.
2) Data transfer requirements are dominated by the need for each of the parallel computation nodes to receive r I and x J and transmit xi J and z j I .Note that the size of the required input and output vectors are the same, the data transfer requirement is thus O(|I| + |J|).
3) Central data storage requirements are dominated by the need to store the original data and the current estimate of x.We also need to compute and store averages over xi J and z j I .These computations can be performed efficiently using parallel data reduction techniques.Our approach would mean that each node would thus require O(|I| + |J|) local memory.

III. SIMULATIONS
The simulations are divided into two parts, the first part explores the performance of CSGD in CT reconstruction.The second part compares CSGD with block ADMM.

A. CSGD in CT Reconstruction
We used two criteria to evaluate the performance.1) signal to noise ratio (SNR): 20log 10 ( x tr ue / x tr ue − x est ), 2) relative distance (RD): 20 log 10 ( x lsq / x lsq − x est ) .The x tr ue , x est and x lsq are the true phantom image vector, the reconstructed image vector and the least square solution respectively.
In the following simulations, we used the Shepp-Logan phantom and, unless stated otherwise, K in Fig. 3 is 16 and the side length of each pixel is 1mm, the length of OP and OD are always 100 mm.The rotation interval for source and detector is 10 • .The detector contains 30 pixels of size 1mm.The e in Eq. 1 is Gaussian white noise with variance σ of 0.1 to the simulated observations.The SNR of projection data y (20log 10 ( y / e )) is 25.8 dB.The default partition method of projection data y uses the static type "deterministic partitioning".The default partition in the image domain uses the method shown in Fig. 2a.

B. Comparison to Other Methods
We start by comparing our method to a range of other algorithms popular in CT reconstruction.For CSGD we used M = 8, N = 4, β = 0.23 and α = γ = 1.We compared our method to CG, steepest gradient descent (GD), SIRT, ART and CAV.The results are shown in Fig. 5.It can be seen that for noisy observations, the CSGD and the other method (except for ART method) all obtains nearly the same SNR.However, whilst CG and steepest GD converge to the least square solution, CSGD, SIRT, ART and CAV do not.This indicates that similar to SIRT, ART and CAV, CSGD also iterates the x towards to a weighted least square solution rather than the least square solution itself (see our fixed point analysis in the Appendix).

C. Influence of Algorithm Parameters
Our algorithm has three parameters that have to be set, α, β and γ.Due to space constraints we only discuss α and β and set the γ ≡ 1.Smaller γ lead to similar results to those observed with smaller α.We start by an empirical evaluation of β.To determine the range of suitable parameters β, we explored the performance by varying M , N and β.To show the largest value of β that can be used for different values of N and M , we run the algorithm for 800 epochs for different values of β.The results are shown in Fig. 6 where we see that the largest β value leading to highly accurate solutions is approximately 1 N .Increasing N reduces the β range, and there is also a relationship on M which is less clear.We next turn to the influence of the α.Reducing α reduces the computation of CSGD within one epoch as only some of the updates are computed.We set M = 12 and N = 2 and generated the data as before.Simulation results show that reducing α increase the acceptable β range, as shown in Fig. 7.
Convergence speed is influenced by α, β, M and N .In general, convergence is slower for larger values of N and M and for smaller β and α.The comparisons of β and M are shown in Fig. 8.It can be seen that in general, increasing β was found to lead to faster convergence, up to a point where the algorithms   started to diverge.When the β is constant, increasing the M (or N , which is not provided here) slows down the convergence speed.We here also provide the steepest GD results as a reference to show that CSGD can obtain the same precision level as steepest GD does.Before discussing the influence on convergence speed caused by α, we introduce a new concept "effective epoch".The number of sub-matrices A J j I i (in Eq. 6) that CSGD used per epoch is proportional to the parameter α.As the computational speed is dominated by matrix vector products, when we compare the difference in convergence speed for methods using different αs (for example, α = 1 and α = 0.5), we took account of the reduction in computational effort when using smaller α.The epoch count is multiplied by α and is called as the effective epoch.As a result, for different α, the computation amount after one "effective epoch" is the same with each other.We need to point out that the "effective epoch" does not consider the data transfer influence.When α is small, one "effective epoch" includes more Fig.9. (a) Shows that when β is fixed, reducing the α to around 0.5 increases the convergence speed.However, when the α is reduced too small as 2  12 , the convergence speed is slowed down.(b) Shows that from "effective epoch" point of view, reducing the α is always helpful to increase the convergence speed and the increased convergence speed can be similar to the steepest GD when α ≤ 0.5.Fig. 10.Comparisons of different sample methods.The M = 24, N = 2, σ = 1."DP" is deterministic partitioning (defined in II-C), "US" is uniform sampling method (defined in II-D) and "IS" is importance sampling method (defined in II-D).The uniform sample or importance sampling both show faster convergence speed and wider acceptable β range than previous deterministic partitioning.Importance sampling methods outperform the others when α is epochs, which means more data transfer between nodes.Here we mainly focus on cases when the data transfer is much more efficient than calculating matrix-vector multiplications and thus is negligible.The comparison of convergence speed with different α is shown in Fig. 9. Interestingly, by looking at effective epochs, we see that CSGD can converge faster than gradient descent for smaller α.

D. Partitioning the Image and the Detector
In this section, larger simulations are conducted.The CT scanning geometry still uses the form shown in Fig. 3, but with K = 64.The length of OP and OD is set to 115 mm.The rotation interval for the point source and the detector is 3 • .The detector contains 130 pixels.The e in Eq. 1 is Gaussian white noise with variance σ of 1 to the simulated observations.The SNR of projection data y (20log 10 ( y / e )) is 33.8 dB.In  Under different α, the uniform sampling strategy always obtains higher SNR after the same epochs than deterministic partitioning.Besides, the uniform sampling method has an even wider β range than deterministic partitioning, which is easier for parameter tuning.The uniform sampling method can be further improved when the detector is sliced into two sub-areas.This method avoids choosing the sub-detector areas which do not receive any projections from the currently selected sub-image block.However, when α = 1, the two sub-detector situation does not increase the convergence rate.This is because that those sub-detectors which do not receive projections from the current sub-block still have to be selected since the α is too large.When the α decrease to a value (e.g. 6  24 ) that guarantees all selected sub-detectors are those who receive projections, slicing the detector area into two sub-areas obtains higher SNR than not slicing situation.Furthermore, when α is small, the importance sampling strategy can obtain higher SNR than uniform sampling method at the initial iterations.

E. Comparison of CSGD and Block ADMM
As discussed previously, block ADMM allows for the same partitioning of A. To compare CSGD and block ADMM, a random system matrix A 256×128 and a random vector x 128×1 are used to reduce the computation requirements in our simulations.The linear system is a noise free model here.
As each CSGD iteration and each block ADMM iteration require different amounts of computational effort, we here do not compare SNR after each epoch.Instead, we plot SNR against the number of times the algorithm has computed a matrix-vector product involving the sub-matrices of A, as this is the dominating computational cost here.Both CSGD and block ADMM use all subsets of the matrix A and are stopped when their solutions reach 80 dB SNR.For block ADMM, we determined the best parameter values for variable ρ and also stopped the CG method after as few iterations as possible to allow algorithm convergence to the required level.
The convergence comparison for CSGD and block ADMM are shown in Fig. 11.The results demonstrate the significant speed advantage offered by CSGD, which requires significantly fewer matrix-vector multiplications compared to block ADMM.Furthermore, the fully distributed form of CSGD used here, i.e. α = 1, is not the optimal use of CSGD.According to the previous simulations, setting α < 1 can further increase convergence speed.
We have also compared CSGD and block ADMM on the CT simulation of the previous simulation data with K = 64 and a detector with 130 pixels.The image was partitioned again as in Fig. 2b.Both methods were stopped once the SNR of the reconstructed image had reached 20 dB.The reconstructed images under noise free situation are shown in Fig. 12.Although the SNR of both images is the same, the ADMM reconstruction in Fig. 12 b shows much clearer artefacts along with the boundaries of adjacent sub-image blocks whilst the CSGD results in Fig. 12 a do not show these effects.

IV. CONCLUSION
CSGD was designed for a distributed reconstruction of cone beam CT data under arbitrary scan trajectories.In the distributed network, each node is assumed to have limited storage capacity and thus all nodes operate with limited access to the projection data and reconstructed volume.Whilst the method does not converge to the least squares solution, the solution is found empirically to be comparable in quality to those found with other common tomographic reconstruction algorithms, such as SIRT and CAV, but at a significant computational advantage.The parallel architecture is the same as that of block ADMM, which is a general algorithm for separable convex optimization.However, for large scale CT reconstruction, block ADMM is less attractive compared to CSGD.One advantage of CSGD is that it requires less storage compared to block ADMM.Another significant advantage of CSGD is that it converges with significantly fewer matrix vector products as it avoids the calculation of matrix inverses.This means that CSGD converges much faster compared to block ADMM.
We have furthermore developed an importance sampling strategy, that has been shown to further increase initial convergence.A theoretical analysis of the algorithm's convergence properties is ongoing and so is the inclusion of regularisation terms in the method.APPENDIX Whilst we do not have a formal convergence proof of CSGD yet, it is instructive to analyse the fixed points of the algorithm.We here look at the deterministic version of the algorithm with α = 1.
The algorithm updates two quantities, x and z.Let (x k +1 , z k +1 ) = T (x k , z k ) define one iteration of the algorithm.Let x and z be fixed points of the operator T (x, z) defined by (x , z ) = T (x , z ).Similar results to the once derived here for the deterministic algorithm can also be obtained for the randomised versions and for α < 1 if we look at points for which (x , z ) = E{T r (x , z )}, where E{•} is the expectation with respect to the random iteration operator T r (x, z), given the current state.In the following demonstration, I and J are arbitrarily selected from {I i } M i=1 and {J j } N j =1 .The deterministic version of the algorithm computes updates of the form Eq. 18 can be expanded into the whole z z = (I + S T ) −1 Ax + (I + S T ) −1 S T y, (20) where S T is a block diagonal matrix: Note that (I + S T ) is a positive semi-definite matrix if the μ i j are positive.Define a diagonal matrix D j with diagonal entries μ i j where D j (h, h) = μ i j if h ∈ I i .Eq. 19 can thus be written as (D j A J ) T (y − z ) = 0 (22) Combine Eq. 20 and Eq.22 gives 0 = D j A J T (y − (I + S T ) −1 Ax − (I + S T ) −1 S T y) = (D j A J ) T (I + S T ) −1 (y − Ax ).
This equation has to hold for all J.As D j is a diagonal matrix, this implies that which shows is a weighted least squares solution.

Fig. 2 .
Fig. 2. (a) Shows the partition that along with one dimension for a 16-pixel image.(b) Shows a partition that along both image dimensions.

Fig. 4 .
Fig. 4. (a) Shows a 3D model of the cone beam setup with one block of the volume being projected on a detector plane.(b) Shows the projection area of the volume block.

Fig. 5 .
Fig. 5. Comparison of CSGD with other methods.The performance of CSGD, including the final SNR and RD level, is similar to the SIRT and CAV methods.

Fig. 6 .
Fig. 6.SNR value after 800 epochs for different values of β and for different N and M .Increasing the M or N reduces the acceptable β range and the range is influenced more by N value instead of M .In this simulation, the α ≡ 1.

Fig. 8 .
Fig.8.(a) Shows that within the acceptable β range, increasing the β is found to lead to faster convergence, up to a point where the algorithms is about to diverge.(b) Shows that when N is fixed, increasing M slows down the total convergence rate.

Fig. 11 .
Fig. 11.Under noise free assumption, CSGD uses much fewer matrix-vector multiplications to achieve predefined SNR.The β used in CSGD is 1 2 N .

Fig. 12 .
Fig. 12.(a) Comparison of reconstruction results using CSGD and (b) block ADMM.Although the SNR of both images is the same, (b) shows much clearer inner artefacts than (a).

=
A I x k + S I y k I − z k I ,.This implies that, at the fixed point x and z , we havez I = (I + S I ) −1 A I x + (I + S I ) −1 S I y I J I i ) T (y I i − z I i ) = 0.