Scatter Correction in X-Ray CT by Physics-Inspired Deep Learning

Scatter due to interaction of photons with the imaged object is a fundamental problem in X-ray Computed Tomography (CT). It manifests as various artifacts in the reconstruction, making its abatement or correction critical for image quality. Despite success in specific settings, hardware-based methods require modification in the hardware, or increase in the scan time or dose. This accounts for the great interest in software-based methods, including Monte-Carlo based scatter estimation, analytical-numerical, and kernel-based methods, with data-driven learning-based approaches demonstrated recently. In this work, two novel physics-inspired deep-learning-based methods, PhILSCAT and OV-PhILSCAT, are proposed. The methods estimate and correct for the scatter in the acquired projection measurements. Different to previous works, they incorporate both an initial reconstruction of the object of interest and the scatter-corrupted measurements related to it, and use a deep neural network architecture and cost function, both specifically tailored to the problem. Numerical experiments with data generated by Monte-Carlo simulations of the imaging of phantoms reveal advantages over our implementation of a recent purely projection-domain deep neural network scatter correction method.


I. INTRODUCTION
C T is widely used for imaging internal structures of the body, for pre-clinical imaging, and for non-destructive evaluation [1].However, scatter occurring due to the interaction of radiation with the imaged object degrades the reconstruction causing streaks, cupping, shading artifacts and decrease in contrast.These severe artifacts due to scatter make its prevention or correction a critical component in any CT system.
De-scattering methods fall into two main categories: hardware-based; and software-based.Hardware-based methods include collimation, introducing a bow-tie filter in front of the X-ray source, increasing the distance between the detector and scattering object, the use of an anti-scatter grid, etc. [2].These methods are successful in particular settings.However, they are either costly to implement or they subject the patient to a greater X-ray dose, which may pose a health risk.This motivates the great interest in software-based scatter estimation and correction methods, which in turn, fall into two classes: (i) methods that estimate the scatter for an initial, scatter corrupted reconstruction of the object using a forward solver, and subtract out the computed scatter contribution ; and (ii) methods that work directly on the total projection data.
We review Class (i) methods and their forward solvers first.One of the main approaches to model scatter in X-ray CT are Monte-Carlo (MC) solvers, which stochastically sample photon propagation to estimate the scatter for a given object.However, despite their potential to produce gold-standard estimates of scatter when a large number photons is used [3], for clinical purposes the resulting computational costs and runtimes of MC-based methods are prohibitive.
Unlike the MC stochastic modeling of scatter, the linear Boltzmann transport equation (LBTE) specifies a solution for the expected value of the scatter.To mitigate the computational cost of solving the LBTE, which is an integro-differential equation in a seven dimensional space, analytical-numerical forward solvers use simplifying assumptions to find an approximate solution [4], [5].Although faster than MC, this approach involves a trade-off between discretization and the maximum order of scatter that is modeled, and the accuracy of the scatter estimate, therefore it too can be computationally expensive.
The third method for the forward problem is the sliceby-slice method [6], which models the scatter as a distancedependent incremental blurring effect represented by analytically computed kernels.This method has a reduced computational cost compared to the first two.However, it is unable to keep track of the angular distributions of the incoming photons to a slice and does not account for any multiple order scattering events in the medium, which limit its effectiveness.
In contrast to the forward-solver based methods, the Class (ii) methods work directly on the total, scatter corrupted projection data.They can be further classified, in turn, into two categories: kernel-based scatter estimation; and data-driven scatter estimation in the projection domain.
Given scatter-corrupted total measurements τ (t, θ), kernelbased scatter estimation methods attempt to determine the scatter in τ (t, θ) by convolving its weighted version τ (t, θ) with a specific function of t -a "kernel" [7]- [9] for each view θ.These estimation methods are computationally efficient but prior assumptions such as neglecting the contribution of scatter to τ and pre-defined kernels with few degrees of freedom restrict their effectiveness.A hybrid method combining a MC and a kernel-based approach is studied in [10].
Instead, data-driven approaches utilize neural networks to estimate scatter.Image domain methods that correct scatter using scatter-corrupted reconstructions include a method with two-step registration of CT-CBCT pairs [11], and a method that estimates and subtracts out the scatter corruption in the image domain using a deep residual CNN [12].Projection domain methods that estimate the scatter component from total scatter-corrupted measurements include a CBCT method using a scatter library of breast CT to estimate scatter [13], a twonetwork approach that learns scatter in the projection domain by separating it into high and low frequency components [14], and a method called Deep Scatter Estimation (DSE in short) [15], [16] that operates on the projection domain and uses a modified U-net [17] architecture with an additional average pooling path for better extraction of features.A similar approach can also be found in [18].The image-domain methods do not have direct access to the scatter-corrupted measurements, nor do they estimate the scattered X-rays directly, and therefore are not interpretable, and are difficult to relate to the physics of the problem.The projection domain methods are less subject to these limitations, but because they use little or no information of the 3D object structure, which ultimately determines the scatter, their effectiveness is limited.
Contributions: In this work (see also [19], [20]), unlike previous Deep Neural Network (DNN)-based approaches, we present scatter correction algorithms for X-ray CT based on a deep CNN that use both the raw projection data and an initial reconstructed image simultaneously.The data processing pipelines, network architectures, and loss function design for training of the proposed methods, are all inspired by the physics of X-ray scatter.The tailored loss function expresses the norm of a reconstruction domain error in the projection domain, avoiding the need to compute gradients (backpropagate) across the filtered backprojection algorithm for every training sample, resulting in efficient network training.This loss function may therefore be of independent interest in other work on deep-learning methods in tomography.As a benchmark for comparison, we use the projection domain DSE method, which is also physics inspired, in that it can be interpreted as a learned kernel-based scatter estimation method [15], [16].
We study not only the widely used polychromatic Xray CT, but also the monochromatic case, for two reasons.First, it provides substantial insight into the problem and its solution, because in the monochromatic case, the only deviation from the ideal (post-log) linear measurement model is due to scatter.Hence, the effect of scatter and its mitigation can be clearly evaluated.In contrast, in the polychromatic case, beam hardening (another nonlinearity) confounds the problem and the interpretation of the results.In fact, the distinct subject of mitigating beam hardening has received much research attention.Second, the monochromatic case is of great independent practical interest, thanks to the unique applications it enables, and the recent availability of compact low-cost monochromatic sources [21], [22].Likewise, while the cone-beam geometry is common in 3D CT, the parallelbeam geometry arises in monochromatic synchrotron CT imaging with its unique and important applications [23]- [26], and is therefore of considerable practical interest.We present a specialized version of our scatter reduction algorithm for the parallel-beam geometry, which takes advantage of additional structure in the scatter physics, to further mitigate the scatter.
The paper is organized as follows.We set up the scatter correction problem in Sec.II, describe the proposed algorithms in Sec.III, and provide the methodology and framework for the numerical experiments in Sec.IV, with the results of the various numerical experiments in Sec.V. We conclude and indicate possible future directions for research in Sec.VI.

II. PROBLEM SETUP
1) X-Ray CT: In a 2D setting with a parallel beam source, shown in Fig. 1-(a), let f denote the object (i.e., the desired image) with f (x) the linear attenuation coefficient at position x ∈ R 2 .The line integral of f along the ray parametrized by offset (detector position) t and angle θ is denoted by g(t, θ).For fixed θ, the function g(•, θ) is a projection of f at view angle θ, and the mapping from f to the complete set of line integral projections g {g(t, θ), θ ∈ [0, π), t ∈ R} is the 2D Radon transform, a linear operator denoted by R, Using the standard setup, we assume projections are measured at a finite uniformly spaced set of view angles, θ ∈ Θ = {2kπ/K, k = 0, 1, . . ., K−1}), and at a finite set of ray offsets T {t = i∆ t , i = −(d − 1)/2, . . .0, . . .(d − 1)/2} per view, resulting in a discrete set of projections, g {g(t, θ), θ ∈ Θ, t ∈ T }.We use g θ ∈ R d to denote the discrete projection at angle θ -the vector of d uniformly-spaced samples of g(t, θ) along the detector position coordinate, t.
The reconstruction problem is to compute the inverse Radon transform f (x) = (R −1 g)(x).In the discrete data case, with the usual assumption that f is essentially bandlimited and supported on a bounded set, and the sampling in θ and t is dense enough, the discrete-index filtered backprojection (FBP) is a good numerical approximation to the inverse of the Radon transform [27].We denote the FBP by R −1 , to emphasize that we assume that the conditions for accurate reconstruction by FBP are satisfied, and focus on the error due to scatter.To account for the dependence on the energy spectrum the source, we consider the two types of sources: polychromatic (emitting photons with a broad range of energies), and monochromatic (emitting essentially monoenergetic rays).
Consider a 2D object with a parallel beam source.Denoting the energy-dependent linear attenuation coefficient [28] of the object at source energy E by f E (x), its projection is Using an energy-integrating detector, the primary measurement is, by Beer's law [29], where I 0 is the vacuum (or bright field) fluence measurement, and c(E) is a function of the source spectrum and energydependent detector response, with c(E)dE = 1.
With a monochromatic source with photon energy E 0 , (2) reduces to (1) and (3) reduces to p(t, θ) = I 0 e −g(t,θ) , ( where the dependence on E 0 is suppressed to simplify notation, and the line integral projection is readily extracted from the measurement by a logarithm, g(t, θ) = − ln[p(t, θ)/I 0 ].
On the other hand, in the polychromatic case, the mapping (3) from g E to the primary measurements p involves another nonlinearity in addition to the exponential, which cannot be inverted by taking the logarithm.This nonlinearity, unless corrected, may manifest as beam hardening artifacts in the reconstruction.To focus on scatter correction only, we limit the discussion in the remainder of this section and in Secs.III-IV to the monochromatic case, where the only deviation from the ideal measurement model is due to scatter.However, the general approach can be extended to the polychromatic setting to handle scatter and beam hardening simultaneously.This is demonstrated in the numerical experiments in Sec.V. Given the primary measurements p, the projections g determined by inverting (4) suffice to obtain an accurate reconstructions by FBP.However, as discussed next, due to X-ray scatter, the primary measurements are corrupted by an additive scatter component, which unless blocked in the first place by physical means, or corrected, results in artifacts in the reconstruction.
2) X-Ray Scatter: The only significant source of scatter at X-ray energies of 30 keV -450 keV used in pre-clinical and medical CT (<140 keV) and non-destructive tomography (NDT) (<450 keV), is Compton scatter, in which an incident photon is scattered (Fig. 1-(a)) by an electron, with both energy and the propagation direction of the photon significantly modified.With many such scattering events, the total measurement (detector reading) at angle θ that is obtained is where s(t, θ) is an additive scatter term, which is a nonlinear function of the object.It is the contribution of this additive term that leads to artifacts in conventional reconstruction where FBP is directly implemented using the total, scatter corrupted, measurement τ instead of the ideal primary measurement p to obtain an estimate f * ∈ R d 2 of the image.While any practical CT measurements also include random noise due to finite number of photons and electronic noise at the detector, in our discussion we assume sufficient photon counts and sufficiently small electronic noise in the measurements that the reconstruction error is dominated by the deterministic bias due to scatter.
3) Problem Statement: We assume that we are given a set of total measurements τ {τ θ , θ ∈ Θ} which, in the absence of the scatter component s, would suffice for accurate reconstruction of the object f by FBP.Our goal is to produce a reconstruction f * that approximates the FBP reconstruction f = R −1 g that would be obtained from g(t, θ) = − ln[p(t, θ)/I 0 ], where p is the, scatter-free, set of primary measurements.

III. PHYSICS-INSPIRED SCATTER CORRECTION
To provide invariance to source intensity or exposure time, the problem and method are formulated in terms of the normalized quantities, τθ τ θ /I 0 , sθ s θ /I 0 , and pθ p θ /I 0 .We denote the set of normalized total measurements by τ {τ θ , θ ∈ Θ}, with τθ ∈ R d .Given τ , the problem is to produce an estimate p * of the scatter-free primary, which is then used to reconstruct the estimate f * using FBP.

A. PhILSCAT: Physics-Inspired Learned Scatter Correction AlgoriThm
The key idea in the proposed approach [19], illustrated in Fig. 2, is that because scatter in any one direction depends on the entire object in a nonlinear fashion, the measurement in one direction cannot be used to fully determine the scatter in that direction.Instead, information about the entire object, which aggregates the information of all views is required.Thus, given a set of total measurements, τ , an initial reconstruction estimate f ∈ R d 2 of the image is computed by FBP using the initial estimate of the line integral projection where τ (t, θ) is used as a surrogate for the primary measurement p(t, θ).By the inherent physics in (3) or (4), p(t, θ) must be smaller than the mean bright field fluence I 0 .Hence, this physical constraint is used in (7) to improve the initial reconstruction, by clipping of τ (t, θ) to 1.
The deep CNN (DCNN), N γ with network parameters γ, operates view-by-view.It accepts two different inputs: (i) A normalized post-log total measurement − ln τθ at view angle θ.Here, unlike (7), τθ at the input to the DCNN is not upper bounded by 1, because values greater than 1 are physically possible after the normalization by I 0 and they provide useful information for the estimation of the normalized scatter term sθ .(ii) A version fθ ∈ R d 2 of the initial reconstruction estimate that is rotated by the same angle θ of the projection being processed.As a consequence of the rotation, a projection of fθ at zero angle yields the projection of f at angle θ, allowing the DCNN to be agnostic to θ.The DCNN returns an estimate s * θ ∈ R d of the scatter The normalized primary measurement is estimated as where is a small constant to prevent photon starvation artifacts.The clipping of p * is again due to the physical constraint imposed by (3) or (4).Finally, an estimate g * θ of the projection is obtained as g * (t, θ) = − ln(p * (t, θ)).Once every view has been processed, the reconstructed image is obtained from g * by FBP with Shepp-Logan filtering [30], B. OV-PhILSCAT: Opposite-View processing PhILSCAT This algorithm [20] is a variation on PhILSCAT, utilizing one more physical aspect of the tomographic measurement.We take advantage of the following property of the 2D Radon transform: the projections in opposite directions (π-opposite projections) coincide up to a sign reversal (flip) in t: g(t, θ) = g(−t, θ + π) ĝ(t, θ).Combining with ( 3) and ( 4) we have It follows that the difference ∆s θ of π-opposite scatter components can be determined exactly from the available (scatter-corrupted) total measurements.Thus, estimating the average bθ (s θ + ŝθ+π )/2 suffices to fully determine the individual scatter components.This approach is motivated by the following hypothesis: H1 Because the average bθ is typically smoother than sθ and ŝθ+π , bθ should be easier to learn by a neural network.
In contrast, the difference ∆s θ typically has higher bandwidth, however, we do not need to estimate it, since it can be extracted exactly from the total measurements (the measured data).The smoothness of bθ relative to ∆s θ can be observed in Fig. 3, which depicts the normalized cummulative energy in the respective power spectral estimates computed over 27 different phantoms used in the 3D monochromatic parallel beam experiments (Sec.IV-A).As seen in Fig. 3, ∆s θ has a power spectrum close to that of white noise, with cumulative energy fraction increasing linearly with frequency.In contrast, b θ has a large fraction of its energy concentrated in the first few frequency components: the DC component alone accounts for about 45%, and the first 11 components capture about 75% of the energy in b θ .
As in PhILSCAT, the DCNN N γ , operates view-by-view, but instead of K times, , it is used here only K/2 times, for The network takes two inputs: (i) the average (τ θ + τθ+π )/2 of normalized pre-log πopposite total measurements at view angle θ, (ii) the initial reconstruction estimate rotated by θ, fθ ∈ R d 2 .Different than PhILSCAT, f is obtained by using The normalized primary projection pθ is then estimated as and, as in Section III-A to reflect the physical constraints, we constrain pθ to be non-negative.Similarly, τθ is not clipped to 1, since τθ > 1 is physically possible.Finally, the reconstruction estimate is obtained using FBP with Shepp-Logan filtering [30], as in (10), this time only using estimates of the projection g * θ from half of the angular range Θ = {2kπ/K, k = 0, 1, . . ., K/2 − 1}.
Regardless of scatter correction, in scenarios with unavoidable subpixel misalignments, calibration and correction are required to obtain reasonable reconstructions.We propose that these corrections be applied as a first step, prior to the scatter correction.If successful, the initial (scatter-corrupted) and the scatter-corrected reconstructions can be obtained without calibration problems.However, the difference of conjugate projections ∆s θ could be more sensitive to residual subpixel misalignments than the sum signal b θ , or a regular reconstruction.Such misalignments could result in imperfect cancellation of the primary component in ∆s θ leading to "spike-like" artifacts at locations with sharp edges in the projection.We propose to add a step of median filtering of ∆s θ to mitigate these effects in such scenarios.Because without misalignment ∆s θ should contain only scatter signal, which would be relatively smooth, the median filtering will not affect it.

C. Loss Function
The networks in both algorithms are trained by minimizing a common loss function with respect to the network parameters γ.In other work [15] on learning-based scatter estimation, the loss function is expressed in terms of estimation error p − p * of the primary measurements.However, because the image produced by FBP depends logarithmically on p, it is the relative error in p * that matters.Therefore, an error with the same magnitude in p * has greater impact on the reconstruction f * for smaller p than for larger p.It follows that measuring the loss in terms of the error p * would not reflect the error in the reconstructed image, with the effect manifesting most strongly for rays passing through highly attenuating regions.
Instead, because our goal is to approximate the scatter-free FBP reconstruction f by f * , a reasonable choice for a loss function to represent the relevant deviation averaged over M training samples would be where In ( 14) g * θ (γ) = − ln p * θ (γ) and g θ = − ln pθ are the estimated and the true line integral projections of the object, with the dependence on γ arising from ( 8)-( 9) for PhILSCAT, or (12)-( 13) for OV-PhILSCAT.Operator Q represents possible perceptually-oriented weighting of the reconstruction error to emphasize visually salient image features in the image domain.In particular, to account for the perceptual significance of edges in CT images, we consider Q that corresponds to an image-domain filter with radially symmetric high-pass frequency response Q(ω) (in polar coordinates).
A drawback of the loss function ( 14) is that to compute its gradients, it requires back-propagation across the FBP operator R −1 for every training sample, which can be computationally expensive.Instead, we show in the next subsection that the loss ( 14) can be expressed directly and exactly in terms of the projections, without the need for an FBP.This leads to the following modified loss function The filtered l2-norm term is exactly equivalent to the loss in (14).The filter h is the simple two-tap filter resulting in a very efficient implementation of the loss function and its gradients.Since h is a high-pass filter, the l 1 -norm term (with a small constant, λ > 0) is used to recover the zero frequency component of the projections.The specific choice of filter h is explained in detail next.

D. Image 2-Norm in Projection Space
In this subsection we derive a simple expression for the reconstruction error in image space, in terms of the error in projection space.This enables to express the image-domain loss for training the neural network in terms of the estimated projections, and to obtain the gradients needed for training while avoiding the need to back-propagate across the FBP.The result in this subsection may therefore be of independent interest for other applications of deep learning in tomography. Assuming ) is a finite energy image having line integral projection at angle θ denoted by g(t, θ), let G(ω, θ) denote the one-dimensional Fourier transform of g(t, θ) with respect to the first variable.Then by Parseval's relation for the Radon transform [27], the 2-norm of the image f is expressed in terms its line integral projections Note that although the inverse Radon transform to compute f from g involves applying the ramp filter |ω| to the Fourier transform G(ω, θ) of the projections, Parseval's relation in (17) involves applying the same ramp filter to the square of |G(ω, θ)| instead.Now, we wish to express the 2-norm of an image f , obtained from the projections by FBP using a weighted ramp filter W (ω)|ω|.Because the only difference to the inverse Radon transform is in the additional weight W (ω) included with the ramp filter to filter G(ω, θ), it follows that f 2 2 is given by replacing G(ω, θ) in ( 17) by its weighted version W (ω)G(ω, θ).Finally, incorporating the perceptual weighting filter Q with radially symmetric frequency response in the image domain, is equivalent, by the Radon convolution Theorem [27] to filtering the projections G(ω, θ) by the filter with frequency response Q(ω).It therefore follows that In order to obtain an expression in terms of filtered projections, ( 18) can be rewritten as where can be considered as the frequency response of a filter.The phase of H has no effect on the result owing to the absolute value in (19).Thus, φ(ω) can be chosen arbitrarily -for example as identically zero.Using the standard Parseval's identity for the second integral in (19), the filtering implied by ( 19) can be expressed and implemented as with where h(t) is a filter with frequency response H(ω) and * denotes convolution in the t variable.The implication of these results is that a loss function defined in terms of a Q-weighted L 2 error of an image reconstructed using FBP with ramp filter weighting W (ω), can be expressed in terms of a loss function defined on the projections, using ( 21) - (22).
For practical implementation, we derive the discretized version of ( 21) - (22).When f has bounded support and W is bandlimited such that the reconstruction f is bandlimited, and P uniformly spaced view angles over [0, 2π] are used, ( 18)-( 22) can be replaced by their discrete analogues, yielding where and where the convolution is over n and ĥd is a discrete-time filter with frequency response Considering the special case of Q(ω) = |ω| 0.5 , we obtain Ĥd (ν) = W d (ν)|ν|.However, since we use the magnitude in (23), the phase of Ĥd does not matter, and we may use The advantage of this particular filter selection is that it removes the jump discontinuity in the derivative of Ĥd at the origin, and can result in a short filter ĥd [n].For instance, when W d (ν) = sinc(ν) is selected, which corresponds to a Shepp-Logan ramp filter [30], we obtain Ĥd (ν) = sin(ν), for which the impulse response ĥd [n] = 0.5δ[n + 1] − 0.5δ[n − 1] has only two non-zero values.This enables an easy time-domain implementation of the computation in (23) and (24).

E. Extension to 3D
In the 3D geometry (Fig. 1-b) the object axial coordinate z is the rotation axis, perpendicular to the cross-sectional (x, y) plane.In the case of a parallel beam source, we reconstruct a stack of 2D slices using 2D FBP, and for a divergent (cone beam) source we use the Feldkamp (FDK) reconstruction algorithm [31], both using the same Shepp-Logan ramp filtering.
While the derivation of the image 2-norm in projection space in Sec.III-D and the resulting formulation of the loss function in (15) hold strictly only for the parallel beam case, they are applicable, to a good approximation, to CBCT with sufficiently small cone and angles.In fact, for cone angle small enough to avoid artifacts using the FDK reconstruction algorithm, the same approximation as in the FDK algorithm, which uses the 2D weighting and filtering, can be justified.Furthermore, for sufficiently small fan angles, the same filter response can be used, to a good approximation [32].Accordingly, denoting the 2D projections by g θ (t, v), where v is the detector coordinate parallel to z, the loss function (15) is extended to 3D by performing the convolution with h along the transverse t coordinate, and computing the 2-norms and 1-norms over the respective 2D projections.
One input of the DNN is now the 3D initial reconstruction fθ ∈ R d×d×d axially rotated by angle θ.Coordinate u, which is perpendicular to the detector plane, selects coronal planes within fθ , each of which is indexed by detector coordinates t, v.In the case of the parallel beam geometry, the u coordinate coincides with the direction of photon propagation.
The second input to the DNN, indexed along t, v, is the 2D log total measurement − ln τθ ∈ R d×d at one view angle for PhILSCAT, or the average of π-opposite views for OV-PhILSCAT, which is only used for the parallel beam geometry.This specific structure is inspired by the slice-by-slice approach [6], which divides the object into layers perpendicular of the primary X-ray propagation, and based on the Klein-Nishina formula [33], models scatter by a distance-dependent incremental blurring effect at each layer with pre-specified kernels to obtain the scatter estimate for the next layer.Accordingly, because after rotating the initial reconstruction f by angle θ the slices of the input fθ (•, •, u) correspond to object layers (coronal slices) perpendicular to propagation of primary X-rays, the network performs 2D convolutions (blurring) in the plane of these slices, and contracts along the channel u.

F. Network Structures
At the m-th step, for m = 1, . . ., log 2 d, the network has a "ladder" of convolutions compressing the input information by diadic factors to a set of outputs with number of channels equal to {2 −m d, 2 −(m+1) d, . . ., 1} using different convolutional layers, and conveys the output of those to the following steps via skip connections, concatenating all intermediate outputs from previous steps that have the same number of channels (e.g.s 14 is concatenated with every s m4 , m < 4 and s 4 at the end of the 4th step in the diagram, each having 2 −4 d channels).After concatenation in the m = 2, 3, . . .step, another convolutional layer reduces the number of channels back to 2 −m d.

G. Computational Cost in the Inference Phase
The computation in the DCNN is dominated by the cost of convolutions.For 3D reconstruction, 2D convolutions with filters of size L × L are used.Assuming d × d detector panel readouts, convolutions in the network require about L 2 d 2 operations per input and output channel in the convolution.The total cost of the DCNN can be shown to be bounded by 3L 2 d 4 .The DCNN is used K times -once for each projection.This results in total costs of approximately 3L 2 Kd 4 and 3L 2 Kd 4 /2 for PhILSCAT and OV-PhILSCAT, respectively.The cost of FBP is cd 4 with c ≈ 1 since it is performed for each slice of the d×d×d dimensional phantom.Consequently, the total cost of PhILSCAT and OV-PhILSCAT in the 3D case can be bounded by 3L 2 Kd 4 and 3L 2 Kd 4 /2, respectively.

A. Data Generation and Training
In order to train the algorithms, total measurements τ of objects with known ground-truth primary measurements p are needed.Obtaining such data requires a fully characterized CT device and many different phantoms with known material composition and geometry.In our case, it was impractical to obtain such data.Instead, τ for training, validation, and testing data were generated by MC-based simulations.
For each phantom, 2D d × d detector panel readouts, i.e., projection total measurements τ θ , were simulated at K view angles θ uniformly spaced in [0, 2π), each having P photons.The simulated data was divided into training and test sets phantom-wise, grouping all measurements belonging to a phantom into the same set.Then, the total projections measurements and their corresponding initial reconstructions in the training set were used in a randomized order for training.The proposed algorithms were trained and tested using the loss function L in (15).

1) Parallel Beam CT Experiments:
To obtain the total measurements τ θ for training and validation for the parallel beam CT experiments, we used the GATE [34] software package.GATE encapsulates the GEANT4 MC [35] simulation libraries, which perform the simulation of particle propagation through matter.The modeling by GATE included scatter in the detector panel.On the other hand, the primary projections p θ of voxelized phantoms were generated by numerical forward reprojection to compute g(t, θ) and using Beer's law (4).
As illustrated in Fig. 1(b), for these experiments we used a 200 keV monoenergetic parallel-beam source with rays perpendicular to the rotation axis z, and a flat detector panel, both of width W = 128 cm and height H = 128 cm, with the detector panel perpendicular to the rays of the source, and divided into (d = 128) × (d = 128) pixels.The source was not collimated, extending over the entire detector.Simulated objects (mathematical phantoms) were generated as a composition of distinct object components, all contained in a cylindrical air volume of diameter D = 128 cm and height L = 128 cm.The parallel-beam projections of the objects were fully contained in the detector plane for each view, with a margin of approximately 24 cm from the boundaries of the cylindrical geometry in Fig. 1.
The object components in the phantoms were: (i) rectangular prisms; (ii) cylinders with their long axis aligned with the z-axis; and (iii) spheres; with material assigned randomly as water, aluminium, or titanium.Positions and dimensions were randomized, ensuring the components are contained in the phantom volume.
We found in initial experiments that high noise in the simulated training measurements could lead to the networks learning to denoise the projection data rather than just estimating, as intended, the scatter component.This came at the price of reduced resolution of the reconstruction.To mitigate this effect without an expensive increase of the photon counts used in simulation, we introduced a simple pre-processing step, whose details are described in the Appendix, to decrease the level of noise in the simulated measurements.The idea is to identify the areas in the obtained 2D total measurements τ θ that contain only stochastic noise and little scatter, and smooth them to reduce the noise, without affecting the areas that contain object or significant scatter information.
A parameter setting of λ = 5 • 10 −2 was used in the loss function L in (15) for training of both parallel beam settings, after observing the best performance among all values used in a log scale sweep.All FBPs used the Shepp-Logan filter [30].
2) Cone Beam CT (CBCT) Experiments: For the CBCT experiments, we used MC-GPU [36], which is a GPUaccelerated X-ray photon transport MC simulation code for CBCT.The code performs the transport simulation in a voxelized geometry and uses CUDA programming supporting multiple GPUs.The code provides access to the total measurement τ θ , scatter signal s θ , and primary measurement p θ for each view θ, eliminating the need to compute p θ separately.
For training and testing each method in the polychromatic CBCT experiments, we used R −1 g with g θ = − ln[p θ /I 0 ] as the "gold reference" image, and similarly used g, and g * for the corresponding reconstructions, treating the measurements as in the monochromatic case.Thus, the primary measurement (scatter-free) reconstructions R −1 g include some beamhardening artifacts, and in our experiments the scatter correction methods do not perform beam-hardening correction, instead focusing on estimating and correcting the scatter.
The geometry for the CBCT experiments is shown in Fig. 1(b) with a divergent beam point source at sourceto-detector distance, d sd = 180 cm, and source-to-origin distance, d so = 130 cm and with a 64 × 64 cm 2 flat panel detector gridded to 128×128 pixels.The source had a 90 keV monoenergetic spectrum for the monochromatic experiments, and the photons were sampled from a 120 keV tungsten spectrum with 4.3mm Al filter for the polychromatic case.
Thanks to the small fan beam span angle ξ = 0.1233 ≈ sin ξ = 0.1230 for the given CBCT geometry, we expect the extension of the loss function to 3D described in Sec.III-E to hold to a good approximation.
Again, λ = 5(10 −2 ) was used for the CBCT setting after conducting a similar sweep as in the parallel beam case.For each algorithm, the FDK method [31] with a Shepp-Logan [30] ramp filter was used for reconstructions, with a total of K = 360 view angles for each phantom.
The algorithms were compared on two types of phantoms.The first consisted of titanium rods with 2 × 2 cm 2 crosssections placed in parallel with the z-axis with random center locations in the x-y plane sampled from a uniform distribution U[−D/4 cm, D/4 cm], where D = 64 cm.Intersections between the objects were allowed with intersecting regions also consisting of titanium.The second type of phantoms were used for experiments on medical data.Thirty phantoms extracted from the CT Lymph Nodes dataset [37] from the cancerimaging archive [38] were tissue-mapped to five different materials and tissue types: (1) air, (2) lung, (3) adipose, (4) soft tissue and ( 5) bone.
Since the GPU-accelerated MC code allowed us to simulate photon numbers sufficient to obtain acceptable noise levels in the simulated measurements, a noise reduction scheme as in parallel beam case was not used for CBCT experiments.

B. Setup
The proposed algorithms were compared among themselves and with the recent data-driven projection-based correction method DSE [15], [16] as described in Section I. Since DSE was shown to perform consistently better than a classic kernel-based scatter correction method, direct comparison with such a method was not performed.The DSE method was implemented as described in [15] and trained using the same total measurements τ θ and primary measurements p θ used in the training of the proposed methods.The network N γ in DSE seeks to minimize the MAE (mean absolute error) between the estimated and ground-truth scatter signals, and estimates the primary measurement as where γ * = arg min Reconstruction quality, as compared to the reference images (FBP of numerically computed projections of test phantoms) is quantified using four metrics: PSNR (in dB) as the ratio of peak reconstruction value to root mean square error with higher values indicating better performance; MAE; PE (peak error), equal to the infinity norm of the error; and SSIM (structural similarity index) -higher values for greater similarity.
The networks for DSE and the proposed methods were implemented in Pytorch, and the Adam optimizer [39] was used for all methods for training.No additional regularization was used for training, as we observed close validation and training errors, indicating the absence of overfitting.Convergence for different algorithms was determined by flattening of training and validation loss curves as a function of training iterations.
Computations were performed on an NVIDIA GeForce GTX TITAN X GPU, and an Intel Core i7-4770K CPU with 32 GB RAM.For parallel beam reconstruction, both FBP and image rotations were implemented on the CPU, whereas for CBCT the FDK algorithm was implemented on GPU.The split of run-time averages are shown in Table I.Once both FBPs and image rotations are migrated to the GPU, the DCNN runtimes, which in these experiments account for only a small fraction of the total runtime, will dominate, resulting in total runtimes for the algorithms of 2 -4 seconds per reconstructed volume.V. EXPERIMENTS

A. Monochromatic 3D Parallel Beam CT
To compare PhILSCAT, OV-PhILSCAT, and DSE [15], [16], each algorithm was trained for 100 epochs and tested on the same 27 and 3 phantoms, respectively, randomly generated as in Sec.IV-A.Each phantom had 360 uniformly spaced views with 8 × 10 6 photons each.
The synthesized measurements display an expected strong correlation between attenuation along a (t, θ)-ray and the corresponding scatter-to-primary ratio s(t, θ)/p(t, θ): when averaged over the areas in the test phantoms with p(t, θ) < √ I 0 we observed AVG t,θ s(t, θ)/p(t, θ) ≈ 59% .The peak value over all detector pixels is max t,θ s(t, θ)/p(t, θ) = 956%.This shows that the higher values of scatter-to-primary ratio are not restricted to few detector pixels.The strong scatter scenario is also evident in the corruption of the τ -reconstructions from uncorrected total measurements with peak reconstruction error of 1557 HU, or 46% of the peak density of 3407 HU of the p-reconstructions from primary p.
Average reconstruction accuracies are reported in Table II for the three test phantoms.DSE improves on the uncorrected case as expected.However, consistent with our observations, which will be discussed next, PhILSCAT and OV-PhILSCAT perform significantly better than DSE in these experiments on all metrics.OV-PhILSCAT provides not only a slightly better PSNR than PhILSCAT, but is also twice as fast.
The FBP p-reconstruction of a test phantom and the absolute reconstruction errors in HU obtained by the algorithms are shown in Fig. 5. Line profiles extracted from the reconstructions at the same location as in Fig. 5 are compared in Fig. 6.
As expected from the high scatter/primary ratios in the projections, uncorrected τ -reconstructions displaying large magnitude scatter artifacts such as decreased contrast for the axial and sagittal slices, are the worst.DSE corrects these magnitude errors to a significant extent, which is reflected as an overall improvement on the metrics compared to τreconstructions.However, as seen in Fig. 5, with DSE there are remaining artifacts around the object edges.Also, DSE does not suppress the streaking artifacts, density errors over highly attenuating objects, and errors in the background very well.This is also seen in the line profiles in Fig. 6.
Both PhILSCAT and OV-PhILSCAT suppress the streaks better than DSE, with better reduction of errors around the object edges, which is especially visible for the titanium slabs in Fig. 5.As a consequence, they both achieve substantially better metrics than DSE.Finally, compared to the total measurement reconstructions, both proposed algorithms reduce the peak error by ≈ 70%, whereas DSE by only ≈ 20%.

B. 3D Cone Beam CT -Ti Rod Phantoms
In this subsection, we study using both monochromatic and polychromatic sources, a setting more typical of a nondestructive evaluation (NDE) application, with phantoms consisting of randomly placed titanium rods as described in IV-A.These phantoms include high object densities resulting in high ray attenuations.Moreover, the scatter signals corresponding to various views have higher frequency content and there is an increased dependence of the scatter signal on the angle of the measurement.These factors all have the potential to make the problem more challenging.Several authors (e.g., [15]) estimated scatter at reduced spatial resolution to take advantage of its characteristic smoothness compared to the primary.However, having high contrast sharp objects as in Sections V-A and V-B, we have chosen not to do so since smoothness of the scatter may not hold uniformly.DSE and PhILSCAT were each trained on 27 and tested on 3 such phantoms, respectively, using K = 360 uniformly spaced views with P = 10 8 photons each for each phantom.
As expected with this high density object scenario, the peak scatter-to-primary ratios s(t, θ)/p(t, θ) are extremely high due to photon scarcity along highly attenuating paths for both source settings.For the test phantoms, the overall average ratio AVG t,θ s(t, θ)/p(t, θ), and the corresponding average over areas in the projections where p θ < √ I 0 were found to be 10% and 54%, respectively for the monochromatic case, and 17% and 74%, respectively, for the polychromatic case.These high scatter levels are indicative of the difficulty of the reconstruction task in these experiments.
For the monochromatic setting, the observed peak error between the τ -reconstructions and the true p-reconstructions is 3711 HU, or 42% of the peak density of 8761 HU in the p-reconstructions.For the polychromatic setting, the corresponding values are 4229 HU or 40% out of the peak preconstruction density of 10560 HU.The HU reconstructions for all methods in the polychromatic setting were obtained after performing HU calibration to adjust the water linear attenuation level to 0 HU in the p-reconstructions, which were computed as explained in Sec.IV-A.
These numbers indicate a stronger effect of scatter in these reconstructions than in the other CBCT experiments.The average reconstruction accuracy metrics are reported in Tables III and IV   Reference reconstructions computed using primary measurements p, tighter HU window scatter-corrected reconstructions, and error magnitude plots are given in Figures 7 and 8, for the monochromatic and polychromatic cases, respectively.To facilitate comparison, line profiles for each method and maximum intensity projections (MIP) of the absolute error volumes for DSE and PhILSCAT are provided together in Figures 8 and 9 for the polychromatic case.A MIP in a particular direction is obtained by assigning to the projection the maximum absolute values the rays encounter in any voxel they traverse.Since they accumulate peak errors in many slices, it is possible that MIP values have considerable spatial variation, resulting in a noise-like map.
As could be expected from the strong scatter, the τreconstructions contain large magnitude errors of 1000 -2000 HU over the objects.The projection-domain DSE algorithm improves considerably on these magnitude errors.However, there still remain relatively large errors on the highly attenuating paths.PhILSCAT corrects scatter-related artifacts noticeably better over those regions while not suffering any setbacks on any other part of the reconstruction compared to DSE.Additionally, streaking artifacts are suppressed to a greater extent using PhILSCAT.This can also be observed in reconstructions with tighter HU window focusing on lower densities in Fig. 7. Finally, tighter HU window reconstructions for higher HU values in Fig. 8 show the better performance of PhILSCAT in reducing shading artifacts and increasing contrast.Thanks to these differences, PhILSCAT performs better than DSE in terms of PSNR, SSIM and the peak error, as revealed by the metrics reported in Tables III and IV.
The peak errors are over the entire reconstructed volume and a single voxel with a large outlier error can determine this quantity.To provide further insight into the larger error values, we note that the total number of voxels for the test phantoms that have error magnitudes larger than 500 HU is reduced by PhILSCAT by ≈ 8 fold from their number (3,900) in DSE.Also, thanks to the loss function described in Sec.III-C, which minimizes the error in the reconstruction domain, although PhILSCAT has 12% larger MSE in the estimates of scatter, the reconstruction PSNR of PhILSCAT is 1.8 dB better than DSE.

C. Anthropomorphic Phantoms
For the polychromatic 3D CBCT reconstructions, DSE and PhILSCAT were each trained on 27 and tested on 3 anthropomorphic phantoms as described in Section IV-A with the imaging geometry shown in Fig. 1-(b).
The peak error between the τ -reconstructions of the three test phantoms and the p-reconstructions was 1699 HU, or 84% of the peak density of 2034 HU in the p-reconstructions, indicating significant degradation due to uncorrected scatter.
The average and the peak scatter-to-primary ratios for the three test phantoms were AVG t,θ s(t, θ)/p(t, θ) = 29%, and max t,θ s(t, θ)/p(t, θ) = 212%, with AVG t,θ s(t, θ)/p(t, θ) = 79% over areas in the projections where p(t, θ) < √ I 0 .The closer scatter-to-primary ratios in areas with p(t, θ) < I 0 vs. those with p(t, θ) < √ I 0 indicate that the scatter signals tend to be smoother for these phantoms compared to the Ti rod experiments in Section V-B.Owing to this characteristic, the difference in reconstruction quality between PhILSCAT and DSE, as expressed by the average PSNR and SSIM results for the 3 test phantoms in Table V, is reduced compared to Sec.V-B.Still, PhILSCAT suppresses the peak error 20% better than DSE, and does better in terms of PSNR and MAE on average.
Consistent with the results of the other experiments, as seen in the comparison of the magnitude of the reconstruction errors in Fig. 10, PhILSCAT performs visibly better than DSE, especially in highly attenuating regions of the phantoms.To highlight differences in the sagittal slice reconstructions, they are shown with a tighter HU window, and the green-boxed regions are zoomed-in at the bottom left of Figs.10(d) -10(h).It is seen that in a highly attenuating region PhILSCAT recovers bone densities better than DSE.

D. Testing on Different I 0
To further test the generalizability, the algorithms trained on vacuum fluence I 0 for the polychromatic Ti rods experiment in Sec.V-B were tested on the same test phantoms imaged using I 0 /4 and average performances are shown in Table VI.
Although results deteriorate significantly for both cases, the performance gap between PhILSCAT and DSE enhances, indicating that PhILSCAT is better able to generalize for smaller photon count (lower SNR) settings.This observation is also consistent with the parallel beam CT results where the photon count is considerably lower than CBCT experiments.

E. Ablation Studies 1) Network Architecture:
To check the advantage of using the proposed network architecture of Sec.III-F over a U-Net architecture [17] as used in DSE [15], [16], two alternatives were compared.The initial 128 channel reconstruction fθ was compressed to a single channel "image" using two consecutive 2) Input: To evaluate contribution of the initial reconstruction to PhILSCAT without modifying the architecture of the network, the initial scatter-corrupted reconstruction was replaced with a fixed random input (which, just as the initial reconstruction, is rotated by the view angle for each training projection), and training and testing was performed on the polychromatic Ti rods CBCT setting.Table VI, showing an improvement in all metrics, indicates the advantage of using the initial reconstruction as an additional input.Also, we observed ≈ two-fold reduction (over the 3 test phantoms) in the number of voxels with errors larger than 500 HU.
3) Loss Function: To evaluate the contribution of the tailored loss function of Sec.III-C to PhILSCAT, we compared the performance (in the polychromatic Ti rods CBCT setting) to the same algorithm but without the application of filter h, that is, using a standard MSE loss on the scatter estimate.As seen in Table VI, the tailored loss provided an improvement on three of the four metrics.A more dramatic improvement was a ≈ two-fold reduction (over the 3 test phantoms) in the number of voxels with errors larger than 500 HU, similar to Sec.V-E2.

VI. CONCLUSIONS
We proposed two novel physics-inspired deep learningbased algorithms for scatter correction in X-ray CT images.The empirical results for the proposed methods demonstrate their advantage in various settings to another recent projectionby-projection-based data-driven de-scattering method.The proposed algorithms use scatter-corrupted measurements and an initial reconstruction of the object obtained from these measurements to estimate and correct the scatter in the projection domain.Unlike previous data-driven methods, the proposed algorithms incorporate constraints that are motivated by the physics of the CT imaging.The cost function for training the algorithms is tailored to express the norm of an imagedomain error in the projection domain, but without the need for using the filtered backprojection.The results of numerical experiments are promising, indicating the potential of the proposed algorithms.
Possible directions for future work include theoretical analysis of the factors limiting the performance of learningbased scatter reduction algorithms, and the use of a neural network architecture search to improve on the network architecture.It will also be of interest to compare the proposed approach to learning-based image-domain methods rather than the projection-domain.Evaluating the proposed approach on real experimental CT data is another important future study.Training and testing on real CT data would also involve scatter from other objects (e.g.detector and backplane), which would be learned and mitigated by the proposed approach (scatter from the detector is modeled in our monochromatic parallel beam experiments).It would also be of interest to evaluate (in addition to the experiments described) how well a trained network generalizes to testing (use) scenarios different from the training scenario -such as different objects, anatomies, or different scanner settings.Our experiments reported here, and experience from other deep learning tasks, suggests that limited generalizability is to be expected of any learning-based approach.However, as a practical matter, we note that it is standard practice to use different CT scanning protocols for different classes of subjects -e.g, male, female, or pediatric, even fine tuning the protocol to subject weight, let alone different animals or objects, or CT geometries and settings.We envision that similarly, different network models would be trained for such different classes (or even CT protocols) so that the scatter correction network does not have to generalize over an extremely wide range of scenarios.We do not regard this as a significant limitation of the proposed approach.

Fig. 3 :
Fig. 3: Cumulative fraction of the total energy contained in the frequency components [0, q].
The d × d × (d + 1) input to the DCNN (illustrated for d = 64) in Fig.4is the axially rotated initial reconstruction fθ concatenated along the channel coordinate u with − ln τθ (resp.− ln[(τ θ + τθ+π )/2] for OV-PhILSCAT).The DCNN provides the d × d 2D output s * θ (resp.b * θ for OV-PhILSCAT).The DCNN operates one view angle θ at a time, and has 2D convolutions in the (t, v) input planes, and a contracting structure in the channel coordinate u.The network architectures for PhILSCAT and OV-PhILSCAT are identical except for the input-output pairs as described earlier.

Fig. 4 :
Fig. 4: Network architecture for PhILSCAT for 3D reconstruction (illustrated for d = 64).Each red solid arrow represents two consecutive convolutional layers with ReLU nonlinearity modules.The number of channels in each intermediate output sij ∈ R d×d×d/2 j is provided next to its label in the diagram.

Fig. 10 :
Fig. 10: Polychromatic CBCT reconstructions of anthropomorphic test phantom (Numerical values in HU): (a), (b) p-Reconstruction; (c) Error magnitude and (d) reconstruction using total measurements τ ; vs. (e), (f) using primary measurements estimated by DSE; or (g), (h) estimated by PhILSCAT.(i) and (j) Sagittal maximum intensity projections (MIPs) of the error volumes using p * estimated by DSE vs. estimated by PhILSCAT, respectively.convolutional layers with 16 and 1 output channels, respectively, and then the single channel "image" was concatenated with the total measurement τ θ , resulting in a 2-channel input to the U-Net.Besides having a 2-channel input, the rest of U-Net was identical to DSE.The proposed loss function in Sec.III-C was used to train the network.The results, for Polychromatic CBCT Ti rod case shown in Table VI demonstrate the clear advantage of the proposed architecture.2) Input: To evaluate contribution of the initial reconstruction to PhILSCAT without modifying the architecture of the network, the initial scatter-corrupted reconstruction was replaced with a fixed random input (which, just as the initial reconstruction, is rotated by the view angle for each training projection), and training and testing was performed on the polychromatic Ti rods CBCT setting.TableVI, showing an improvement in all metrics, indicates the advantage of using the initial reconstruction as an additional input.Also, we observed ≈ two-fold reduction (over the 3 test phantoms) in the number of voxels with errors larger than 500 HU.3) Loss Function: To evaluate the contribution of the tailored loss function of Sec.III-C to PhILSCAT, we compared the performance (in the polychromatic Ti rods CBCT setting) to the same algorithm but without the application of filter h, that is, using a standard MSE loss on the scatter estimate.As seen in TableVI, the tailored loss provided an improvement on three of the four metrics.A more dramatic improvement was a ≈ two-fold reduction (over the 3 test phantoms) in the number of voxels with errors larger than 500 HU, similar to Sec.V-E2.
Generalization and ablation studies (on polychromatic CBCT reconstruction of Ti rod phantoms, averages and standard deviations for 3 test phantoms): (a) DSE and PhILSCAT trained on I0 but tested on I0/4; (b) PhILSCAT with U-Net; (c) Initial reconstruction replaced by a random image; (d) Scatter MSE loss (as in DSE) used for training; (e) Original PhILSCAT.Standard deviation on all SSIM entries smaller than next significant digit.

TABLE I :
Average run times.
, for the monochromatic and polychromatic cases, respectively.

TABLE III :
Monochromatic CBCT: average reconstruction accuracy results and standard deviations for different algorithms for 3 Ti rod test phantoms.