Artifact Reduction in 3D and 4D Cone-Beam Computed Tomography Images With Deep Learning: A Review

Deep learning based approaches have been used to improve image quality in cone-beam computed tomography (CBCT), a medical imaging technique often used in applications such as image-guided radiation therapy, implant dentistry or orthopaedics. While deep learning methods have been applied to reduce various types of CBCT image artifacts arising from motion, metal objects, or low-dose acquisition, a comprehensive review summarizing the successes and shortcomings of these approaches, with a primary focus on the type of artifacts rather than the architecture of neural networks, is lacking in the literature. In this review, the data generation and simulation pipelines, as well as artifact reduction techniques, are specifically investigated for each type of artifact. We provide an overview of deep learning techniques that have successfully been shown to reduce artifacts in 3D, as well as in time-resolved (4D) CBCT through the use of projection- and/or volume-domain optimizations, or by introducing neural networks directly within the CBCT reconstruction algorithms. Research gaps are identified to suggest avenues for future exploration. One of the key findings of this work is an observed trend towards the use of generative models including GANs and score-based or diffusion models, accompanied by the need for more diverse and open training datasets and simulations.


I. INTRODUCTION
Cone-beam computed tomography (CBCT) is an imaging technique used to acquire volumetric scans in medical domains such as implant dentistry, orthopaedics, or image-guided radiation therapy (IGRT). In particular, in the case of IGRT, onboard imaging mounted directly on radiotherapy machines is used to assess a patient's current anatomy before radiation treatment sessions. Changes in anatomy during the treatment period and since the acquisition of the planning CT (pCT) can lead to inefficiencies in the treatment process. Recent research has demonstrated that utilizing 3D or 4D (volumetric data with additional time dimension to track motion) CBCT scans in IGRT [2] improves patient positioning and dose calculation for radiotherapy sessions.
The quality of CBCT scans suffers from similar types of artifacts as spiral/helical CT scans, including those arising from beam hardening and scatter effects, metal implants, and patient motion. In addition, new artifacts arise due to the cone-beam geometry. Further, minimizing the radiation dose in radiotherapy is important for the safety of the patients. However, reducing the imaging dose per scan, acquiring fewer X-ray projections, or acquiring projection data from a limited angle can result in streak artifacts.
This paper provides an overview of the current body of research on artifact reduction in 3D and 4D CBCT with applications including, but not limited to, IGRT, aiming to improve scan quality while also minimizing the imaging radiation dose. The significant variation in the methods and techniques used to mitigate different types of artifacts suggests organizing the literature by the type of artifact. For instance, sparse-view artifacts can be addressed in the projection domain by interpolating new projections, but refining the original projections is not beneficial; however, motion artifact mitigation is possible through projection refinement. Further, the survey aims to present a clear picture of all necessary steps in the artifact mitigation process for all relevant types of artifacts individually.
FIGURE 1: Incorporating acquired temporal and motion information provides the opportunity to apply a projection binning which can be used to reconstruct 4D CBCT images (3D images at various states of motion). During the course of CBCT reconstruction, several types of artifacts (e.g., arising from cone-beam geometry, low dose, sparse view or limited angle scans, scatter, metal or beam hardening) can be mitigated through DL-based optimization in the projection and/or volume domain, or by improving (parts of) the reconstruction algorithm itself using neural networks. The illustration of a commercial radiotherapy system is adapted from [1].
In particular, we review the current state-of-the-art research which uses deep learning (DL) [3] to reduce various artifacts in CBCT scans, and we categorize the research based on the types of artifacts they address. While Ref. [4] focuses broadly on the use of DL methods in IGRT, the closest literature reviews to our work are presented in references [5]-[7]. The first survey [5] is focused on synthetic CT generation from various types of input scans, including CBCT, with the aim to enhance the scan quality. Its content partially overlaps with what we present in Section III. However, it does not cover all the other artifacts which can degrade CBCT image quality as discussed after Section III. Ref. [6] discusses supervised, self-supervised, and unsupervised techniques for artifact reduction in CT scans, and it covers unrolling the reconstruction, as well as optimization methods in both the projection (raw 2D X-ray images) and volume (reconstructed 3D images) domains. However, it is essential to note that Ref. [6] primarily focuses on CT scans, which differs from the main focus of this work, namely CBCT scans. The third survey [7] provides an in-depth literature analysis, considering criteria such as anatomy, loss functions, model architectures, and training methods for supervised learning specifically applied to CBCT scans. In our work, instead of dividing the literature based on the deep learning methods, we group the research based on the type of artifacts, discussing results employing projection- and/or volume-domain optimization, dividing the methods based on the type of supervision, and also including research addressing time-resolved 4D CBCT reconstruction.
Artifacts in CBCT images can principally be reduced by optimizations in the projection, volume, or dual-domain (both projections and volumes), as well as by DL-enabled reconstruction. This survey presents an overview of deep learning techniques able to reduce artifacts in 3D as well as time-resolved 4D CBCT using optimizations in the above domains, and through novel CBCT reconstruction methods. Furthermore, it addresses the challenges and limitations associated with these approaches and provides recommendations for future research directions. This survey organizes the literature according to the type of artifact addressed, and presents and contrasts the methodologies used within each specific artifact group (see Figure 2). The remainder of this paper is organized as follows: Section II briefly summarizes the basic aspects of CBCT acquisition and the assessment of scan quality. Thereafter, the literature is discussed based on different types of artifacts (as outlined in [8], [9]) as follows: Section III presents methods attempting to improve CBCT image quality by reducing artifacts generated because of the cone-beam geometry and by bringing the CBCT quality closer to that of CT scans. The subsequent sections focus on various methods to address artifacts resulting from reduced acquisition dose. Firstly, Section IV discusses techniques that lower the dose per X-ray projection to achieve dose reduction. This is followed by Section V, which explains methods for artifact reduction when acquiring fewer projections by uniformly dropping some of them (sparse-view reconstruction). Section VI explores artifact reduction methods specifically for CBCT scans acquired from a limited angular range. The paper then proceeds to discuss methods targeting scatter and beam hardening artifacts in Section VII. Section VIII is dedicated to research on reducing metal artifacts. Section IX focuses on motion compensation techniques for 3D and 4D CBCT. Further, the main trends in the recent literature on using deep learning-based architectures for CBCT artifact mitigation are presented in Section X, complemented with a discussion concerning the connections amongst the methods used for various types of artifacts and recommendations for future work. Finally, the paper concludes with Section XI.

II. PRELIMINARIES
This section briefly reviews the basics of CBCT reconstruction and evaluation methods employed in artifact reduction and scan quality assessment.

A. CONE-BEAM GEOMETRY RECONSTRUCTION AND DEEP LEARNING
CBCT scans are acquired by means of an imaging system consisting of an X-ray source and a flat-panel (2D) detector mounted on a gantry system which rotates around the body region of interest. Several hundred 2D X-ray images are acquired at various angles. These projections can be acquired from a limited angular range (so-called short scan) or a full 360° trajectory (full scan). Following the acquisition, a volumetric 3D image is reconstructed from the 2D projection images. Several methods exist to solve this ill-posed inverse problem. The most popular one is based on an analytic method developed by Feldkamp, Davis, and Kress (FDK [10]) which provides a fast and reliable approximation of the inverse Radon transform. Alternatively, iterative algebraic reconstruction techniques (ART [11]) have become popular as well. Moreover, by tracking the patients' motion, e.g. by capturing an external or internal breathing signal, and dividing the projections based on the motion state, it is possible to reconstruct 4D (motion-resolved) volumetric images. 4D scans include both the 3D volumetric information as well as their temporal dynamics.
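The division of projections by motion state described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a method taken from the reviewed literature: the function name `bin_projections_by_phase` is hypothetical, the breathing phase of every projection is assumed to be already known as a value in [0, 1), and equal-width phase bins are an arbitrary choice.

```python
def bin_projections_by_phase(phases, num_bins):
    """Group projection indices into breathing-phase bins.

    phases   -- breathing phase per projection, each in [0, 1) (assumed known)
    num_bins -- number of motion states to reconstruct (hypothetical choice)
    """
    if num_bins < 1:
        raise ValueError("num_bins must be positive")
    bins = [[] for _ in range(num_bins)]
    for index, phase in enumerate(phases):
        # Wrap the phase into [0, 1) and map it to an equal-width bin.
        bin_id = int((phase % 1.0) * num_bins)
        # Guard against floating-point rounding at the upper edge.
        bins[min(bin_id, num_bins - 1)].append(index)
    return bins


# Toy example: 12 projections with made-up phase values.
phases = [(i * 0.17) % 1.0 for i in range(12)]
bins = bin_projections_by_phase(phases, num_bins=4)
```

Each bin would then be reconstructed separately, so every motion state sees only a fraction of the acquired projections.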
In a nutshell, deep learning based approaches can be deployed at various stages of the CBCT reconstruction process. Firstly, deep neural networks can be trained to correct the acquired 2D projections (projection domain correction); secondly, they can be used to correct the reconstructed CBCT volumetric images (volume domain correction); and thirdly, the two approaches can be combined into a dual-domain correction. Another approach is to augment or replace (parts of) the 2D-3D CBCT reconstruction itself with deep learning based components. The components of the FDK algorithm were mapped into a deep neural network by means of a novel deep learning enabled cone beam back-projection layer [12], [13]. The backward pass of the layer is computed as a forward projection operation. This approach thus permits joint optimization of correction steps in both volume and projection domain. An open source implementation of differentiable reconstruction functions is available [14]. The networks are often trained in a supervised fashion by comparing reconstructed CBCT images with an artifact-free ground truth. Unsupervised [15], [16] and self-supervised [17], [18] learning approaches have been employed as well.
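The statement that the backward pass of a back-projection layer is a forward projection follows from the chain rule: if the layer computes x = Aᵀp, then the gradient with respect to p is A(∂L/∂x). The sketch below checks this numerically on a toy problem. It is not the implementation of [12]-[14]; the dense matrix `A`, the function names, and the squared-error loss are assumptions made for illustration, and the filtering and weighting steps of FDK are omitted.

```python
def forward_project(system_matrix, volume):
    """Toy forward projection p = A x with a dense system matrix A."""
    return [sum(a * x for a, x in zip(row, volume)) for row in system_matrix]


def back_project(system_matrix, projections):
    """Toy (unfiltered) back-projection x = A^T p."""
    num_voxels = len(system_matrix[0])
    return [
        sum(row[j] * p for row, p in zip(system_matrix, projections))
        for j in range(num_voxels)
    ]


def back_projection_layer_backward(system_matrix, grad_volume):
    """Gradient w.r.t. the layer input: dL/dp = A (dL/dx)."""
    return forward_project(system_matrix, grad_volume)


# Hypothetical 3-ray, 2-voxel system (values chosen only for illustration).
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
p = [0.3, -1.2, 0.7]
target = [1.0, -0.5]


def loss(projections):
    """Squared error between the back-projected volume and a target."""
    x = back_project(A, projections)
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


# Analytic gradient using the forward projection as backward pass.
x = back_project(A, p)
grad_x = [xi - ti for xi, ti in zip(x, target)]
grad_p = back_projection_layer_backward(A, grad_x)

# Central finite differences as an independent check.
eps = 1e-6
grad_p_numeric = []
for i in range(len(p)):
    plus = p[:i] + [p[i] + eps] + p[i + 1:]
    minus = p[:i] + [p[i] - eps] + p[i + 1:]
    grad_p_numeric.append((loss(plus) - loss(minus)) / (2 * eps))
```

In a real system the matrix is never stored explicitly; projector and back-projector are implemented as matched operators, which is what makes end-to-end training through the reconstruction feasible.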
While datasets of 3D or 4D CBCT scans obtained from phantoms, animals or human subjects are available for training, they generally lack ground truth information required for deep learning based artifact mitigation employing supervised learning. To overcome this, artificial or simulated CBCT data is often used, obtained e.g. by means of forward projecting existing CT scans in a CBCT setup and manual incorporation of artifacts. For example, motion artifacts can be included by sampling CBCT projections at scan angles and time steps matching interpolated phases of a given 4D CT scan.
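As a rough illustration of the motion simulation idea, the sketch below assigns to each simulated projection a gantry angle, an acquisition time, and the two neighbouring 4D CT phases with a linear interpolation weight. Everything here is an assumption made for the example: the function name `projection_schedule`, strictly periodic breathing, constant gantry speed, and linear interpolation between phase volumes are not prescribed by the reviewed works.

```python
import math


def projection_schedule(num_projections, scan_time_s, breathing_period_s,
                        num_ct_phases, angular_range_deg=360.0):
    """Return (angle_deg, lower_phase, upper_phase, weight) per projection.

    The anatomy used for forward projection would be
    (1 - weight) * ct[lower_phase] + weight * ct[upper_phase].
    All parameters are hypothetical; periodic breathing is assumed.
    """
    schedule = []
    for k in range(num_projections):
        angle = angular_range_deg * k / num_projections
        time_s = scan_time_s * k / num_projections
        # Position within the breathing cycle, in units of 4D CT phases.
        position = (time_s / breathing_period_s) % 1.0 * num_ct_phases
        lower = int(math.floor(position)) % num_ct_phases
        upper = (lower + 1) % num_ct_phases  # wrap around the cycle
        weight = position - math.floor(position)
        schedule.append((angle, lower, upper, weight))
    return schedule


schedule = projection_schedule(
    num_projections=8, scan_time_s=8.0, breathing_period_s=4.0, num_ct_phases=10
)
```

Forward projecting the interpolated volume at each listed angle yields projections that are mutually inconsistent in the same way as a scan of a breathing patient.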
The general acquisition and reconstruction process of CBCT scans, including deep learning based corrections, is summarized in the visual abstract in Figure 1.

B. EVALUATION METRICS
Several metrics have been utilized in the literature to evaluate the quality of CBCT scans enhanced by deep learning-based techniques. The main quantitative evaluation metrics, computed between a reconstructed volume (with artifacts) and the ground truth reference, can be divided into two main groups, according to [7]:
• Image Similarity Metrics: These metrics compute the similarity between scans and include (mean) absolute error (ME and MAE), (root) mean squared error (MSE and RMSE), (peak) signal-to-noise ratio (SNR and PSNR), structural similarity (SSIM) [19], and Dice coefficient [20].
• Dosimetric Similarity Metrics: These metrics measure the consistency in dosimetry using a pair of scans, such as dose difference pass rate (DPR), dose-volume histogram (DVH), and gamma pass rate (GPR).
In addition to the metrics mentioned above, metal artifact index (MAI [21]), and streak index (SI [22]) have been used in the literature to measure the level of specific artifacts in CT and CBCT scans.For motion, visual information fidelity (VIF) [23] or autofocus (sharpness) metrics have been employed, among others.
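Several of the image similarity metrics listed above have short closed-form definitions. The following sketch uses the common textbook definitions on flattened lists of voxel values; the function names and the `data_range` parameter are choices made for this example, and individual studies may normalize differently.

```python
import math


def mae(image, reference):
    """Mean absolute error between two equally sized voxel lists."""
    return sum(abs(a - b) for a, b in zip(image, reference)) / len(reference)


def rmse(image, reference):
    """Root mean squared error."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    return math.sqrt(mse)


def psnr(image, reference, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


corrected = [0.0, 1.0, 2.0, 3.0]
ground_truth = [0.0, 1.0, 2.0, 5.0]
```

For the toy data, a single voxel differs by 2, so the mean absolute error is 0.5 and the root mean squared error is 1.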

C. CLINICAL EVALUATION
The numerical evaluation metrics mentioned above compute the similarity of the improved CBCT compared with a reference, or report the level of the presence of artifacts, scan sharpness, or other quality criteria. Ideally, these metrics should reflect the scan quality; hence, they should correspond to the preference of the experts in using the scans in clinical routine. However, it is essential to note likely inconsistencies between simulated (where ground truth references exist) and real-world clinical data, so clinical evaluations are necessary to ensure the applicability of the presented methods for practical applications. A clinical evaluation can be conducted by completing surveys with experts such as medical doctors or radiation physicists to directly assess the level of artifacts and the performance of the artifact reduction techniques, and the applicability of the improved images in various clinical tasks such as dose calculation, soft-tissue segmentation, and patient positioning [24].

III. CONE-BEAM IMAGE QUALITY
Cone-beam geometry and the size of the flat-panel detector result in the coverage of larger body areas but at lower resolution and degradation in scan quality compared to fan-beam CT scan acquisition. Consequently, significant attention and extensive research has been directed at improving the quality of CBCT scans, often referred to as removing cone-beam or geometry artifacts in the literature. One of the initial approaches to enhance CBCT quality involves employing supervised learning and training a 39-layer deep convolutional neural network (CNN) to map input CBCT scans to the corresponding planning CT as ground truth (reference) volumes [25]. This mapping of CBCT images to match corresponding CT images is often called synthetic CT (sCT) from CBCT.

Applications based on CNNs
Researchers have explored several CNN-based architectures with various supervised training objectives to enhance CBCT quality. For instance, denoising has been targeted through solving the multi-agent consensus equilibrium (MACE) problem and multi-slice information fusion techniques [26]. CNN models have demonstrated the ability to reduce ring artifacts from flat-panel CBCT scans using pre-corrected and artifact-free scans as ground truth [27]. Geometric artifacts caused by misalignment of the CBCT system were reduced using a modified fully convolutional neural network (M-FCNN), without using any pooling layers [28]. A further approach used a 3D block-based residual encoder-decoder convolutional neural network (RED-CNN) architecture coupled with a bilateral 3D filter and a 2D-based Landweber iteration to successfully remove Poisson noise while preserving the image structure at tissue edges [29]. Training 3D models using a multi-task learning objective improved the quality of CBCTs by producing high-quality synthetic CT (sCT) scans from noisy and artifact-ridden scans for segmenting organs-at-risk (OARs) [30]. More recently, using InceptionV3 [31] as a backbone has proven beneficial in reducing the artifacts observed in CBCT short scans due to the misalignment of the detection plane around the z-axis [32].

GANs and conditional GANs
Researchers have used self-supervised and unsupervised techniques to eliminate the need for paired CBCT and CT scans in supervised learning and to consider anatomical changes between the acquisition of planning CT (pCT) and CBCT. These techniques mainly involve training auto-encoders, (conditional) generative adversarial networks (GANs [33]), and cycle-consistent generative adversarial networks (Cycle-GANs [34]). Combining auto-encoders and GANs as a complementary approach to reweighting in analytical and iterative reconstruction methods has improved the quality of CBCT scans [35]. Training conditional GANs has shown promising results in enhancing the quality of CBCT through style transfer, effectively removing artifacts and discrepancies between CBCT and pCT for average tumor localization [36] and adaptive therapy [37]. Moreover, a more advanced GAN variant called temporal coherent generative adversarial network (TecoGAN) also improves the quality of simulated 4D CBCT scans by considering the time dependencies and motion for quality enhancement [38], [39].

Cycle-GANs
Using Cycle-GANs for unpaired translation from CBCT to pCT has received significant attention among researchers. Notably, Cycle-GANs have successfully generated high-quality synthetic CT scans from CBCT for various organs, including prostate [40], lung [41], and abdominal scans [42].
A novel architecture inspired by contrastive unpaired translation (CUT [43]), trained in an unsupervised manner, improves the quality of CBCT scans by addressing fringe artifacts and noise degradation for dose calculation in adaptive radiotherapy [15]. The combination of binary cross-entropy, gradient difference, and identity losses with Cycle-GANs has further improved the quality of head and neck CBCT scans [44].
Introducing the residual block concept in the implementation of Res-Cycle-GAN has demonstrated advancements in the quality of sCT scans [45]. Moreover, researchers have explored the combination of a Cycle-GAN with classical image processing techniques [46] and U-Net [47] architectures [16] in two-step approaches. These approaches aim to initially reduce artifacts and subsequently generate sCT scans to improve the quality. Ultimately, researchers demonstrated that trained Cycle-GANs enhance the quality of CBCT scans and achieve high accuracy in volumetric-modulated arc photon therapy (VMAT) [48].

Alternative methods
In addition to adopting mainstream trends and computer vision architectures for artifact reduction in CBCT scans, researchers have explored creative methods specifically tailored to CBCT reconstruction using deep learning and neural networks. For instance, U-Nets have been optimized for spectral blending of independently reconstructed sagittal and coronal views to enhance the CBCT quality [49]. Neural networks have also been integrated into the core of the reconstruction algorithms in the Feldkamp, Davis and Kress (FDK) technique to introduce the NN-FDK technique for CBCT quality improvement [50]. Another novel architecture, known as the iterative reconstruction network (AirNet), incorporates several variants in selecting projections based on random-phase (RP), prior-guided (PG), and all-phases (AP) for reconstruction [51]. Geometry-guided deep learning (GDL [52]) and its multi-beamlet-based approach (GMDL [53]) are additional examples of leveraging deep learning to enhance the reconstruction geometry effectively. Finally, CNNs have been employed to predict the quality of the scans and accordingly dynamically adapt the C-arm source trajectory in the imaging acquisition process to avoid generating artifacts in the final scans [54].

IV. LOW DOSE
The reduction of the acquisition dose in CBCT scans, which leads to the increased presence of artifacts, has been addressed through various approaches such as adjusting the radiation dose per X-ray projection [55], increasing the acquisition speed or collecting fewer projections [56]. Early research focused on low-dose artifact reduction primarily by removing artifacts in the volume domain using deep CNNs with U-Net architectures. The studies demonstrated the potential of decreasing the overall radiation dose through both dose reduction methods mentioned above [55], [56]. Moreover, a combination of 2D and 3D concatenating convolutional encoder-decoder (CCE-3D) with a structural sensitive loss (SSL) was employed to denoise low-dose CBCT scans and remove artifacts in both projection and volume domains. This approach showed promising results in improving the quality of CBCT scans based on several metrics, such as PSNR and SSIM, with greater improvements reported in the projection domain compared with the volume domain [57]. In addition, a CNN-based iterative reconstruction framework was integrated with a plug-and-play proximal gradient descent framework to leverage DL-based denoising algorithms and enhance CBCT reconstruction [56]. Training models inspired by self-supervised learning approaches for inpainting and denoising Poisson and Gaussian noise have shown promising results in removing low-dose artifacts [58]. Similarly, models optimized for removing Gaussian noise and addressing view aliasing artifacts through 2D iterations with 3D kernels have been developed [59]. Furthermore, researchers combined a non-subsampled contourlet transform (NSCT) and a Sobel filter with U-Net architectures, referred to as NCS-Unet, to improve the quality of low-dose CBCT scans by enhancing both low- and high-frequency components [60].

V. SPARSE-VIEW
This section summarizes research aiming at reducing artifacts in CBCT reconstruction occurring from using uniformly downsampled full-scan (360°) projections, primarily with the goal of dose reduction. Sparse-view artifact reduction is closely related to mitigation of artifacts caused by limited angle acquisition and breathing-phase-correlated 4D reconstruction, which will be reviewed in the upcoming Sections VI and IX, respectively. While the underlying motivations for sparse-view (acquisition dose reduction), limited angle (geometric constraints), and 4D (time resolved imaging) acquisition are different, in all cases artifacts are created due to the lack of projections from various angles. Decreasing the number of projections and the resulting data insufficiency for the reconstruction algorithm results in artifacts appearing in the shape of symmetric and uniform streaks, as depicted in Figure 3.
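The difference between sparse-view and limited-angle sampling can be made concrete by looking at which projection angles survive. The sketch below is purely illustrative: the helper names, the one-view-per-degree full scan, and the 60° range are assumptions for the example, not settings taken from any cited study.

```python
def sparse_view(angles_deg, keep_every):
    """Keep every `keep_every`-th projection angle of a full scan."""
    return angles_deg[::keep_every]


def limited_angle(angles_deg, start_deg, stop_deg):
    """Keep only projection angles inside [start_deg, stop_deg)."""
    return [a for a in angles_deg if start_deg <= a < stop_deg]


def largest_angular_gap(angles_deg):
    """Largest gap between neighbouring views on the 360-degree circle."""
    ordered = sorted(angles_deg)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(360.0 - ordered[-1] + ordered[0])  # wrap-around gap
    return max(gaps)


full_scan = [float(a) for a in range(360)]      # one view per degree (assumed)
sparse = sparse_view(full_scan, keep_every=6)    # 60 views, evenly spread
limited = limited_angle(full_scan, 0.0, 60.0)    # 60 views, one contiguous arc
```

Both subsets contain 60 views, but the sparse-view set leaves uniform 6° gaps while the limited-angle set leaves a single 301° gap, which is one way to see why the resulting artifacts look different even though the number of projections is the same.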

Projection and volume domain optimization
The body of literature on sparse-view artifact reduction using deep learning has been consistently growing since 2019, when initial research demonstrated the opportunity to reproduce the original image quality using as few as one-seventh of the projections, with symmetric CNNs as a post-processing operation in the volume domain [61]. Similarly, using a multi-scale residual dense network (MS-RDN) successfully improved the quality of CBCTs reconstructed from one-third of the projections [62]. In addition to training in the volume domain, the intensities of under-sampled projections can be corrected using deformation vector fields (DVFs) to match the original data, resulting in negligible streak artifacts after reconstruction [63]. Similarly, symmetric residual CNNs (SR-CNN) can enhance the sharpness of the edges in anatomical structures reconstructed from sparse-view projections with total variation (TV) regularization in half-fan scans [61]. Furthermore, a counter-based total variational CBCT reconstruction using a U-Net architecture enhances the smoothed edges in lung CT scans reconstructed from half-fan projections [64]. In Ref. [65], a Reconstruction-Friendly Interpolation Network (RFI-Net) is developed, which uses a 3D-2D attention network to learn inter-projection relations for synthesizing missing projections, and then introduces a novel Ramp-Filter loss to constrain a frequency consistency between the synthesized and real projections. The authors of [66] developed a dual-domain attention-guided network framework (Dual-AGNet) which works in both projection and reconstruction domains, featuring spatial attention modules and a joint loss function.

Dual-domain optimization
Though interpolating missing data in the projections and removing artifacts in the volume domain are straightforward approaches to sparse-view artifact reduction, combining both and backpropagating the error through the reconstruction algorithm is not trivial. Despite the complexity involved, researchers attempted to unroll the proximal gradient descent algorithm for reconstruction and backpropagate the gradient through a U-Net architecture to reduce streak artifacts in [67].
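The structure of an unrolled proximal gradient scheme can be sketched on a toy linear problem. This is not the method of [67]: the dense matrix `A`, the step size, and the use of soft-thresholding as the proximal operator are assumptions for illustration. In a learned variant, the `prox` argument would be replaced by a trained network such as a U-Net, and the fixed number of iterations would form the layers of the unrolled model.

```python
import math


def forward_project(system_matrix, volume):
    """Toy forward projection p = A x with a dense system matrix."""
    return [sum(a * x for a, x in zip(row, volume)) for row in system_matrix]


def back_project(system_matrix, projections):
    """Toy back-projection x = A^T p."""
    return [
        sum(row[j] * p for row, p in zip(system_matrix, projections))
        for j in range(len(system_matrix[0]))
    ]


def soft_threshold(values, threshold):
    """Proximal operator of the l1 norm (stand-in for a learned network)."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def unrolled_proximal_gradient(system_matrix, projections, num_iterations,
                               step, prox):
    """Run a fixed number of gradient-step-plus-prox iterations."""
    volume = [0.0] * len(system_matrix[0])
    for _ in range(num_iterations):
        estimate = forward_project(system_matrix, volume)
        residual = [q - p for q, p in zip(estimate, projections)]
        gradient = back_project(system_matrix, residual)  # A^T (A x - p)
        volume = prox([x - step * g for x, g in zip(volume, gradient)])
    return volume


# Hypothetical 3-ray, 2-voxel system; the true volume is [2, 0].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
measured = forward_project(A, [2.0, 0.0])
reconstruction = unrolled_proximal_gradient(
    A, measured, num_iterations=100, step=0.3,
    prox=lambda v: soft_threshold(v, 0.003),
)
```

With these toy values the iterates settle near [1.995, 0]; the small bias in the first voxel comes from the l1 penalty. Training through such a scheme requires differentiating every `forward_project` and `back_project` call, which is where the difficulty mentioned above arises.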
Since optimization in the volume domain and projection interpolation are regression problems with different or the same data channels as input and output, autoencoder-decoder architectures have also gained popularity for artifact reduction [68]. To avoid complications regarding backpropagation through the reconstruction (back-projection) algorithm, DEER was introduced as an efficient end-to-end model for directly reconstructing CBCT scans from few-view projections [69]. Furthermore, DeepOrganNet could fine-tune the lung mesh by skipping the reconstruction step and avoiding sparse-view artifacts appearing on the organ mesh [70]. More recently, the deep intensity field network (DIF-Net) model uses the latent representation (feature maps) of the 2D projections coupled with a view-specific query for extracting information from the projections. This information is then fed through cross-view fusion and intensity regression models to reconstruct a volume without artifacts [71].

Information fusion, prior-based and physical modeling
Recent research trends seek to minimize sparse-view artifacts by incorporating multi-slice [72] and multi-scale [73] information fusion techniques, as well as combining information from different scan views (coronal, axial, and sagittal) [74]. As the computational resources have become more powerful, deep learning for sparse-view artifact reduction has extended from 2D models for single slice processing to 3D models and processing of 4D CBCT scans [72]. The use of prior (planning) CT and CBCT volumes to enhance the trained models, such as regularized iterative optimization reconstruction (PRIOR-Net [75]) and merge-encoder CNN (MeCNN [73]), has recently become popular for sparse-view artifact reduction. Researchers have also investigated using perception-aware [76] and physics-based [75] methods. The learning paradigm has expanded beyond purely supervised learning to different tasks, such as denoising (DRUNet [77]), artifact reduction [78], self-supervised learning by dropping projections [18] and unsupervised learning through training conditional and generative adversarial networks (GANs) [79].

VI. LIMITED ANGLE
Besides lowering the imaging dose through uniformly downsampled projections, another approach to reducing the number of acquired projections and scanning dose is scanning the body from a limited angle. Such scan settings are especially common when using a full-fan acquisition technique in a short-scan, where reconstruction is performed using projections from an angular range covering less than 360 degrees. Although Parker weights [80] can be utilized to compensate for the loss of mass in the resulting CBCT scans, artifacts still appear due to the smaller number of acquired projections when scans are acquired from limited angles. One of the initial attempts used learnable Parker weights in the projection domain to address the mass loss in the angular range from 180° + θ to 360° (θ being the fan angle) [12]. A subsequent study optimized a deep artifact correction model (DAC) using a 3D-ResUnet architecture to create high-quality scans and improve artifacts in limited-angle circular tomosynthesis (cTS), confirming the potential for quality enhancement in the volume domain [81]. Further research demonstrated that combining FDK-based reconstruction with a neural network can achieve outstanding performance in 3D CBCT reconstruction from projections acquired from only 145° [82]. Supervised learning, frequently implemented through training U-Net architectures, for shading corrections in CBCT volumes with a narrow field of view (FOV) notably improved the quality of reconstructed CBCT scans, using CT scans as ground truth [83]. Another approach involves using a prior based on a fully sampled CT or CBCT and training a 2D3D-RegNet, which demonstrates the effectiveness of using a patient-specific prior for limited-angle sparseness artifact reduction [84]. A conventional method for 4D CBCT reconstruction is dividing the projections based on the breathing phases and then reconstructing the body volume in those phases. As a result of using only a subset of the projections for each motion state, sparseness artifacts are prevalent for this special case of limited angle acquisition. These artifacts have been addressed in the projection domain by interpolating the projections from different breathing phases [85]. In the volume domain, transfer learning, layer freezing, and fine-tuning have been employed to adapt the trained DL models to individual patients and mitigate sparseness artifacts [86].
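For reference, the classical (non-learnable) Parker weights mentioned above can be written in closed form for a fan-beam short scan. The sketch below uses one common textbook convention, with source angle `beta` in [0, π + 2δ], ray angle `gamma` in [−δ, δ], and half fan angle `delta`; sign conventions differ between references, and the function name is chosen for this example. It does not reproduce the learnable variant of [12].

```python
import math


def parker_weight(beta, gamma, delta):
    """Classical Parker weight for a fan-beam short scan.

    beta  -- source angle in radians, scan range [0, pi + 2 * delta]
    gamma -- ray angle within the fan in radians, in [-delta, delta]
    delta -- half fan angle in radians
    """
    if beta < 0.0 or beta > math.pi + 2.0 * delta:
        return 0.0  # outside the acquired angular range
    if beta < 2.0 * (delta - gamma):
        # Ramp-up region: the ray is measured again near the end of the scan.
        return math.sin(math.pi / 4.0 * beta / (delta - gamma)) ** 2
    if beta <= math.pi - 2.0 * gamma:
        return 1.0  # ray measured only once
    # Ramp-down region: complementary to the ramp-up of the conjugate ray.
    return math.sin(
        math.pi / 4.0 * (math.pi + 2.0 * delta - beta) / (delta + gamma)
    ) ** 2


def conjugate_weight_sum(beta, gamma, delta):
    """Sum of the weights of a ray and its conjugate (should equal 1)."""
    return (parker_weight(beta, gamma, delta)
            + parker_weight(beta + math.pi + 2.0 * gamma, -gamma, delta))
```

The key property is that the weights of a ray and its conjugate ray, measured from the opposite side at `beta + pi + 2 * gamma` with ray angle `-gamma`, add up to one, so redundantly measured rays are not counted twice.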

VII. SCATTER AND BEAM HARDENING
Large cone angles within the CBCT geometry setup have been observed to contribute to scatter artifacts, which have been addressed in the projection domain by leveraging Monte Carlo photon transport simulations to compute ground truth projections for supervised learning [89]. A CNN-based deep scatter estimation (DSE [89]) architecture, as well as a scatter correction network (ScatterNet [87]) are the results of research endeavors using supervised learning for artifact correction in the projection domain. The DSE model has demonstrated the potential to accurately emulate scatter artifacts and reduce the computational burden of using Monte Carlo simulations while being orders of magnitude faster [90]. ScatterNet is considerably faster than the classical methods and might allow for on-the-fly shading correction [87]. ScatterNet, in combination with shading correction, also showed satisfactory results for dose calculation using volumetric modulated arc radiation therapy (VMAT), but yielded unsatisfactory outcomes for intensity-modulated proton therapy (IMPT). Despite the abundant research work on scatter artifact corrections, studies tackling beam hardening are scarce. One such study involved training a U-Net-based architecture to predict monoenergetic X-ray projections from polyenergetic X-ray projections using supervised learning on Monte Carlo simulation-based ground truth in the projection domain [91].
FIGURE 3: Examples of different kinds of artifacts appearing in CBCT scans. Shown are several artifact-free motion states obtained with a simulated 4D CBCT scan at three distinct motion phases (1st row), sparse-view artifacts at various sub-sampling rates (2nd row; from left to right: 1/6, 1/18 and 1/48), limited-angle [12], scatter [87] and metal [88] artifacts (3rd row), as well as motion artifacts in simulated (left) and real (middle and right) CBCT scans [24] (4th row).
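The distinction between polyenergetic and monoenergetic projections mentioned above for beam hardening can be illustrated with the Beer-Lambert law for a single homogeneous material. The two-bin spectrum and the attenuation coefficients below are made-up numbers chosen only to show the effect; the function names are likewise hypothetical.

```python
import math


def polyenergetic_projection(thickness_cm, spectrum, attenuation_per_cm):
    """Log-transformed detector signal for a beam with several energy bins."""
    total = sum(spectrum)
    transmitted = sum(
        weight * math.exp(-mu * thickness_cm)
        for weight, mu in zip(spectrum, attenuation_per_cm)
    )
    return -math.log(transmitted / total)


def monoenergetic_projection(thickness_cm, attenuation_per_cm):
    """Ideal line integral for a single energy: linear in thickness."""
    return attenuation_per_cm * thickness_cm


# Hypothetical two-bin spectrum: a soft bin (strongly attenuated) and a
# hard bin (weakly attenuated), with equal incident weights.
spectrum = [0.5, 0.5]
attenuation = [0.4, 0.2]  # illustrative values in 1/cm

poly_10 = polyenergetic_projection(10.0, spectrum, attenuation)
poly_20 = polyenergetic_projection(20.0, spectrum, attenuation)
mono_10 = monoenergetic_projection(10.0, 0.3)  # spectrum-averaged coefficient
```

Doubling the thickness less than doubles the polyenergetic value because the soft part of the spectrum is absorbed first. Reconstruction algorithms assume the linear monoenergetic relation, so this mismatch shows up as artifacts, and a network that maps polyenergetic to monoenergetic projections aims to undo it.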
Compared with the classical fast adaptive scatter kernel superposition (fASKS) scatter reduction technique [92], a U-Net-based architecture achieved better scatter artifact reduction for both full-fan and half-fan scans based on several metrics [93]. Additionally, a U-Net-based model trained on simulated CBCT projections has shown comparable performance to a validated empirical scatter correction technique in dose calculation for correcting the scatter artifacts in head and neck scans, computing the corrected volumes in less than 5 seconds [94]. Besides classical approaches of scatter artifact reduction, CT scans have been used as ground truth volumes for training a modified U-Net architecture with a multi-objective loss function specifically targeting scatter artifact reduction in esophagus scans [95].
Apart from supervised learning methods, researchers have also trained Cycle-GAN models to improve the quality of CBCT scans, remove scatter artifacts, and generate sCT. In particular, Cycle-GAN has demonstrated superior performance compared to similar techniques using deep convolutional generative adversarial networks (DCGAN [96]) and progressive growing GANs (PGGAN [97]) [98].
VOLUME 11, 2023

VIII. METAL
Metal objects and implants in the patient's body result in scattered radiation reaching the detector, leading to streak artifacts. In early research addressing metal artifacts, a CNN-based regression model was trained to predict the detectability rank of metal implants in order to recommend out-of-plane angulation for C-arm source trajectories [99]. Further research in this area has proposed predicting the X-ray spectral shift after the localization of metal objects to define the optimal C-arm source-detector orbit [100]. The metal artifact avoidance (MAA) technique uses low-dose scout projections to roughly localize metal objects and then identifies a circular or non-circular C-arm source-detector orbit that minimizes variations in spectral shift and avoids metal artifacts [101].
Researchers have also employed supervised learning for reducing metal artifacts and estimating the deviation of the voxel values after inserting neuroelectrodes [102]. Self-supervised learning approaches, focused on training models for inpainting the regions affected by metal artifacts, have demonstrated improvements in reducing metal artifacts while preserving the essential anatomical structures near the inserted implants [88]. In addition to supervised and self-supervised techniques, various types of GANs have been employed in the literature for unsupervised metal artifact reduction. Optimized conventional GANs can reduce metal artifacts in high-resolution and physically realistic CT scans, with good generalization to clinical CBCT imaging technologies for inner-ear scans [103]. Conditional GANs, inspired by the pix2pix-GAN [104], have successfully reduced metal artifacts in spine CBCT scans, enabling precise recovery of fiducial markers located outside the C-arm's field-of-view (FOV) [105]. A Cycle-GAN has also been employed to efficiently reduce metal artifacts by generating synthetic CT (sCT) from megavoltage CBCT (MVCBCT) and improving the quality of CBCT scans [106].

IX. MOTION
Many of the state-of-the-art volumetric reconstruction techniques for CBCT rely heavily on the initial assumption that the projections are acquired from a stationary object. However, this assumption is often violated because of periodic respiratory and cardiac motion or involuntary, non-periodic movement of air bubbles in the abdominal area. When reconstructing CBCT volumes using projections acquired from various body states under motion, motion streak artifacts appear in the reconstructed volume, as shown in Figure 3. The severity of the resulting artifacts is positively correlated with the intensity of motion. The most common approach to tackle motion artifacts in CBCT scans is dividing the projections based on the motion state (motion-resolved [107]-[112]), periodic motion state (phase-resolved [111], [113], [114]) or acquisition time (time-resolved [115], [116]), and then reconstructing multiple volumes, one for each batch of projections, to generate a 4D CBCT.
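The division of projections into batches described above can be sketched as a simple phase binning step. This is a minimal illustration that assumes a respiratory phase in [0, 1) is already known for every projection; the function name, the 4 s breathing period and the number of projections are hypothetical choices, and each resulting index set would then be passed to a reconstruction algorithm of choice.

```python
import numpy as np

def bin_projections_by_phase(phases, num_bins):
    """Assign each projection to a phase bin.

    phases: respiratory phase in [0, 1) for every projection.
    Returns a list with one array of projection indices per bin.
    """
    phases = np.asarray(phases, dtype=float) % 1.0
    # Guard against rounding up to num_bins for phases just below 1.
    bin_ids = np.minimum((phases * num_bins).astype(int), num_bins - 1)
    return [np.flatnonzero(bin_ids == b) for b in range(num_bins)]

# Hypothetical acquisition: 600 projections over 60 s, 4 s breathing period.
times = np.linspace(0.0, 60.0, 600, endpoint=False)
phases = (times / 4.0) % 1.0

bins = bin_projections_by_phase(phases, num_bins=10)
for b, idx in enumerate(bins):
    print(f"phase bin {b}: {len(idx)} projections")
```

Each bin contains only a fraction of the acquired projections, which is why the per-phase volumes of a 4D CBCT suffer from sparse-view streak artifacts that the methods in the following subsections aim to reduce.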

Motion-resolved methods
A novel approach using CNNs to predict the missing projections in motion-resolved 4D-CBCT, combined with a bin-sharing technique to accelerate the acquisition process, substantially removed streak artifacts compared with standard conjugate gradient reconstruction [107]. Training a residual U-Net also reduces the streak artifacts appearing in 4D-CBCT by addressing the sparseness of the projections acquired in each breathing phase [108]. Residual dense networks (RDNs [110]) have successfully reduced sparseness artifacts using an in-house lung and liver dataset, as well as a public dataset of the SPARE challenge [117], [118]. Similar research demonstrates that combining the information of the different breathing phases to train a prior-guided CNN can effectively reduce artifacts in motion-resolved 4D-CBCT scans [109]. In addition to training single models, researchers attempted to optimize a cascade of spatial and temporal CNN models to combine spatial and temporal information for maximum artifact removal and to avoid errors in the tomographic information [112]. A dual-encoder CNN (DeCNN) architecture simultaneously processes and combines the information of 4D motion-resolved volumes and the averaged volume, thereby improving the sharpness of the edges in moving and fixed tissues in 4D-CBCT [119].

Phase- and time-resolved methods
Phase-resolved CBCT is a specific case of motion-resolved CBCT, where projections are selected based on the different phases of body volume under periodic, respiratory, or cardiac motion. Motion Compensation Learning-induced sparse tensor constraint reconstruction (MCL-STCR) was shown to improve 4D-CBCT scans for all motion phases [120]. 3D-CNNs have been shown to effectively mitigate sparse-view artifacts in motion-compensated 4D-CBCT scans reconstructed using FDK, thereby enhancing the overall quality [114]. N-Net uses the prior volume reconstructed using all projections to remove streak artifacts. CycN-Net combines the temporal correlation among the phase-resolved scans to reduce streak artifacts that are caused by sparse-view sampled motion-resolved projections [111]. Furthermore, training a patient-specific GAN-based model on phase-resolved 4D-CBCT to reproduce CT quality using CBCT scans demonstrates improvements when applied to test set projections acquired from the same patient [113]. In addition to motion- and phase-resolved methods, training a U-Net can remove sparseness artifacts from time-resolved 4D-CBCT without requiring any prior information [115]. GANs have also demonstrated the capacity of estimating sCT scans from time-resolved 4D-CBCT and the average 3D-CBCT volume, resulting in a comparable improvement in dose calculation using both strategies [116].

Biomechanical and physical modeling
In addition to phase-, motion-, and time-resolved techniques, researchers have also explored targeting motion artifacts by physically modeling the motion using a deformation vector field (DVF) and by optimizing an autofocus metric (i.e., maximizing some measure of sharpness). The Simultaneous Motion Estimation and Image Reconstruction (SMEIR) model, as well as its biomechanical modeling-guided version (SMEIR-Bio), are examples of models developed for motion effect prediction in lung 4D CBCT scans [121]. In the same research work, these models have also been enhanced using a U-Net-based DVF optimization technique, leveraging a population-based deep learning scheme to improve the accuracy of intra-lung DVF prediction (SMEIR-Unet). Training a 4D U-Net for motion estimation, with the reference phase of the 4D CBCT incorporated as an extra input channel and the estimated DVFs subsequently fine-tuned, further increases the performance of SMEIR models for motion artifact reduction [122]. CNN-based architectures have been optimized to estimate deformable motion and predict the motion intensity on 8 × 8 grids covering the axial slice, followed by a preconditioning technique to favor more likely motion intensities [123]. CNNs have also been trained for motion compensation in CBCT scans to solve the high-dimensional and non-convex problem of optimizing the autofocus metric [124].
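The idea of optimizing an autofocus metric can be illustrated with a toy image-domain example. The sketch below is a strong simplification under stated assumptions: motion is a rigid shift between two motion states, sharpness is measured by the mean squared gradient, and the search is exhaustive over a few candidate shifts. All names and numbers are hypothetical; the methods cited above optimize far higher-dimensional motion models and may use different sharpness measures.

```python
import numpy as np

def gradient_sharpness(image):
    """Mean squared gradient magnitude; larger values mean sharper edges."""
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    return float(np.mean(gx ** 2 + gy ** 2))

def autofocus_shift(state_a, state_b, candidate_shifts):
    """Return the shift of state_b (along axis 1) whose compensation
    maximises the sharpness of the averaged image."""
    scores = [
        gradient_sharpness(0.5 * (state_a + np.roll(state_b, -s, axis=1)))
        for s in candidate_shifts
    ]
    return candidate_shifts[int(np.argmax(scores))]

# Toy phantom: a bright square observed in two motion states.
phantom = np.zeros((64, 64))
phantom[24:40, 24:40] = 1.0
true_shift = 3
moved = np.roll(phantom, true_shift, axis=1)

uncorrected = 0.5 * (phantom + moved)  # motion-blurred average of both states
estimated = autofocus_shift(phantom, moved, list(range(-6, 7)))

print("sharpness without compensation:", gradient_sharpness(uncorrected))
print("sharpness of motion-free phantom:", gradient_sharpness(phantom))
print("estimated shift:", estimated)
```

Even in this toy setting the metric has to be evaluated once per candidate motion; with deformable motion the search space becomes high-dimensional and non-convex, which motivates the learned approaches discussed in this subsection.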

Alternative methods
The autofocus metric has also been replaced with the Context-Aware Deep Learning-based Visual Information Fidelity (CADL-VIF) image similarity metric to optimize multiresolution CNNs [125]. This approach aims to reduce motion degradation and compute sharp scans while preserving the tissue structures by optimizing visual information fidelity (VIF) without requiring motion-free ground truth. An alternative to the autofocus metric is using contrastive loss to train GAN architectures to enhance the quality of 4D-CBCT scans and to reduce streak and motion artifacts [15]. To address the slow speed of reconstruction and to compensate for the errors of 4D-CBCT due to the severe intraphase undersampling, a feature-compensated deformable convolutional network (FeaCo-DCN [126]) model has been proposed. It achieves nearly real-time reconstruction and accurate CBCT, outperforming the previous method applied to the SPARE Challenge [117], [118]. Besides the numerous research studies addressing motion in 4D CBCT, which requires recording the patient's breathing curve, researchers have also simulated motion in CBCT scans based on the estimation of DVFs according to 4D CT ground truth scans [127]. They subsequently trained a dual-domain model to mitigate 3D CBCT motion artifacts in the projection and volume domains. The clinical validation on real-world CBCT images yielded positive feedback from clinical experts, demonstrating the effectiveness of their approach for motion compensation [24]. Beyond methods that reduce motion artifacts, researchers have successfully used an artifact-driven slice sampling technique to avoid artifacts caused by moving air bubbles in the segmentation of the female pelvis [128].

X. DISCUSSION AND RECOMMENDATIONS
The previous sections have outlined the methodology and the complete workflow employed for deep learning based mitigation of artifacts in CBCT scans, addressing each specific type of artifact separately. This section presents a summary, emphasizing the central role of various deep learning approaches. The objective is to offer a comprehensive review of the architectures employed for different artifact types, highlighting both the promising aspects and the limitations in the current literature.
In general, a trend is observed in shifting from conventional supervised learning with CNNs and U-Net-type architectures to exploring more modern learning paradigms such as GANs, and investigating self-supervised and unsupervised methods, leveraging e.g. Cycle-GANs, as depicted in Figure 4a. In particular, Cycle-GAN-based architectures offer the appealing feature of enabling model training without needing paired labeled data [131]. However, they come with high data requirements, drawing attention toward methods and projects for data collection, synthetic data generation, dataset merging from diverse sources, and data homogenization. This trend suggests a rise in research works adapting generative models, including GANs, Cycle-GANs, and score-based models [132], [133], in upcoming research endeavors. A recent example [129], which employs denoising diffusion probabilistic models [134], [135] for sparse-view CBCT reconstruction, demonstrates considerable potential for future research, albeit at the expense of substantial compute resources (up to 128 GPUs, see also Table 1). On the other hand, less computationally intensive U-Net-based architectures have demonstrated their merit in successfully addressing artifacts across all categories, making them a highly recommended and robust baseline approach for artifact mitigation.
In the context of this survey, the primary DL-based architectures used in the literature can be divided into four key categories: CNNs, U-Nets, GANs, and Cycle-GANs. Here, we categorize architectures with multi-scale information fusion, i.e. including connections from the network's input (encoding) layers to output (decoding) layers (such as [67]), under the category U-Net, while those without such direct connections (such as autoencoders [136]) are categorized as CNNs. DL-based models generally require medium to large datasets for training, validation and testing through clinical evaluation. While medium-sized datasets, including multiple patients, can serve as starting points for training CNNs and U-Nets [83], GANs perform better using datasets containing at least dozens of patient scans [42]. This trend generalizes to 3D and 4D reconstruction, where larger input sizes and a higher number of scans become essential, in particular for 4D [122]. A review of the studies presented in Table 1 reveals that the majority of research was conducted with fewer than 50 patients. This relatively small number of patients can pose challenges for validating the approach across a diverse population. Consequently, the robustness of these models warrants further scrutiny to ensure their ability to generalize well across various human anatomies. CNN architectures, known for their stable convergence and versatility, demonstrate a wide range of applications for artifact reduction through adapting different vision backbones [32] and incorporating diverse architectural components such as attention blocks [24]. However, in terms of multi-scale information fusion, they are inferior to U-Nets and their variants (e.g., U-Net++ [137]), which demonstrate fast convergence in supervised learning due to the internal architectural connections between different layers enhancing the multi-resolution information fusion [7]. Since CNNs and U-Nets are predominantly trained in a supervised manner, their learning technique necessitates explicitly labeled data to define the task. On the other hand, generative models (GANs), incorporating an adversarial loss, also offer potential applications in generating high-quality synthetic scans to meet the data needs of deep learning-based architectures [36]. Moreover, Cycle-GANs compute the inverse path of artifact reduction automatically, using a cycle-consistent loss, and are thus able to learn artifact reduction without the need for paired artifact-free ground truth [48].
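The cycle-consistent loss mentioned above can be written down compactly. The following sketch shows the commonly used L1 form of the cycle-consistency term for two mappings between an artifact-affected and an artifact-free domain. The two "generators" are trivial stand-in functions rather than neural networks, and all names and array shapes are hypothetical; a complete Cycle-GAN objective additionally contains adversarial losses that are omitted here.

```python
import numpy as np

def cycle_consistency_loss(g, f, x, y):
    """L1 cycle-consistency: x -> g(x) -> f(g(x)) should return to x,
    and y -> f(y) -> g(f(y)) should return to y."""
    forward = np.mean(np.abs(f(g(x)) - x))
    backward = np.mean(np.abs(g(f(y)) - y))
    return float(forward + backward)

# Stand-in "generators" (hypothetical): intensity mappings instead of networks.
def g(cbct):  # artifact-affected domain -> artifact-free domain
    return 2.0 * cbct

def f(ct):    # artifact-free domain -> artifact-affected domain
    return 0.5 * ct

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))  # unpaired batch from the artifact-affected domain
y = rng.random((4, 8, 8))  # unpaired batch from the artifact-free domain

loss_consistent = cycle_consistency_loss(g, f, x, y)
loss_inconsistent = cycle_consistency_loss(g, lambda ct: ct, x, y)
print(loss_consistent, loss_inconsistent)
```

Note that `x` and `y` are never compared with each other, only with their own round-trip reconstructions, which is why no paired ground truth is needed.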
Only four of the papers presented in Table 1 provide a public code repository to reproduce their results. This highlights a considerable shortage of open science practices, such as sharing code, to promote transparency and reproducibility in research. It is strongly recommended for researchers to share their code publicly to enhance the credibility and reproducibility of their work and accelerate scientific progress in this field.

XI. CONCLUSIONS
We presented a survey on the application of deep learning and convolutional neural networks to reduce various types of artifacts in CBCT scans. We categorized the existing literature based on the type of artifacts addressed as well as the methodology employed. Figure 4b illustrates the amount of recent research work per type of artifact. After 2021, research targeting specific artifacts has grown considerably compared with work focusing more generically on scan quality. The opportunity to reduce the imaging dose by compensating for artifacts in low-dose, sparse-view, and limited-angle acquisitions has gained substantial attention, owing to the ease of simulating these artifacts and computing the ground truth, especially for sparse-view and limited-angle approaches.
However, metal and scatter artifacts have received less attention. This may also be due to the challenges involved in computing the ground truth for metal artifacts, or the high computational cost of Monte Carlo simulation for scatter artifacts. We expect that the research community could profit from open-source, accurate and fast artifact simulations for training models (as before with XCAT [138]). The development of such simulations could also serve as a driving force for physics-based artifact modeling or training physics-informed neural networks (PINN) [139] for artifact reduction. These simulations would benefit from GPU implementations for data generation to enable on-the-fly integration into the training pipelines with neural networks. In addition to simulations, there is a research gap for open-source data augmentation techniques, such as [140], [141], also based on incorporating simulated artifacts into real datasets.
In addition to simulation and augmentation tools for modelling, the research community would benefit from the availability of open-source datasets. Researchers are still reporting results on phantoms and cadavers, indicating a need for more diverse and realistic publicly available datasets. Nevertheless, despite the lack of open-source 4D CBCT datasets with raw projections and breathing curves, motion artifact reduction research has increased in recent literature. The collection and sharing of up-to-date benchmark datasets on a large scale, similar to the SPARE [117], [118] and SynthRAD [142] challenges, would enhance the quality of many research works and provide the opportunity for fair and accurate comparison of different approaches. Furthermore, many studies suffer from a lack of clinical evaluation. The availability of open-source standard clinical evaluation platforms would be of significant help in addressing this issue.
In terms of methodology, there has been a noticeable trend of moving beyond supervised learning towards self-supervised, unsupervised, and domain adaptation methods in recent years. Researchers have started incorporating more physically inspired ideas into the neural networks and utilizing prior patient knowledge to personalize the models for specific anatomies. One of the drawbacks often observed in the current literature is the absence of ablation studies. For example, in the case of approaches employing dual-domain optimization in both projection and volume domains, the performance gained in each domain should be estimated separately. Besides artifact reduction after the CBCT acquisition, adapting the acquisition process itself using neural networks, such as C-arm trajectory adjustments applied to metal artifact reduction, presents a further exciting avenue for future research.
In summary, substantial progress has been made in recent years transferring state-of-the-art methods from deep learning based computer vision to the domain of CBCT imaging and in particular the amelioration of prevalent imaging artifacts, with a clear potential to improve diagnosis and treatment in clinical practice.

FIGURE 1: Visual Abstract: An illustration of the CBCT acquisition process in IGRT for lung CBCT and the application of deep learning for artifact correction. The diagram depicts the acquisition of 2D projections (initial corrections such as scatter corrections have already been applied), including (optionally) time- and motion-related information (e.g. breathing amplitude signal), standard CBCT reconstruction (typically 2D→3D), and DL-based components for image enhancement. Incorporating acquired temporal and motion information provides the opportunity to apply a projection binning which can be used to reconstruct 4D CBCT images (3D images at various states of motion). During the course of CBCT reconstruction, several types of artifacts (e.g. arising from cone-beam geometry, low dose, sparse view or limited angle scans, scatter, metal or beam hardening) can be mitigated through DL-based optimization in the projection and/or volume domain, or by improving (parts of) the reconstruction algorithm itself using neural networks. The illustration of a commercial radiotherapy system is adapted from [1].

FIGURE 2: Visualisation of the content of this survey and the literature covered.

FIGURE 4: A visual summary of the distribution of the covered research literature in CBCT artifact mitigation using deep learning, separately for two time periods, (a) based on three generic deep learning architecture categories given a broad categorization by artifact type, and (b) based on the distribution according to the type of artifact.

TABLE 1: Summary of a subset of studies, selected based on recency and number of citations. The table provides details about artifact category, publication year, study title, anatomic site, model type, number of patients, GPU hardware, and whether the code was published.