Unpaired-paired learning for shading correction in cone-beam computed tomography

Cone-beam computed tomography (CBCT) is widely used in dental and maxillofacial imaging applications. However, CBCT suffers from shading artifacts owing to several factors, including photon scattering and data truncation. This paper presents a deep-learning-based method for eliminating the shading artifacts that interfere with diagnosis and treatment. The proposed method involves a two-stage generative adversarial network (GAN)-based image-to-image translation that operates on unpaired CBCT and multidetector computed tomography (MDCT) images. The first stage uses a generic GAN along with the fidelity difference between the original CBCT and the MDCT-like images generated by the network. Although this approach is generally effective for denoising, it at times introduces additional artifacts that appear as bone-like structures in the output images, because the weak input fidelity between the two imaging modalities makes it difficult to preserve the morphological structures in the presence of complex shading artifacts. The second stage of the proposed model addresses this problem: paired training data, excluding inappropriate samples, are collected from the results obtained in the first stage, and the fidelity-embedded GAN is then retrained using the selected paired samples. The results obtained in this study reveal that the proposed approach substantially reduces the shading artifacts and the secondary artifacts arising from incorrect data fidelity while preserving the morphological structures of the original CBCT image. In addition, the corrected image obtained using the proposed method facilitates more accurate bone segmentation than both the original CBCT image and the corrected image obtained using the unpaired method.


I. INTRODUCTION
Dental cone-beam computed tomography (CBCT) is being increasingly used for diagnosis and treatment in implant, dental, and maxillofacial surgery [2], [12]. Three-dimensional (3D) models of the maxillofacial bones extracted from CBCT images are commonly used during preoperative analysis and planning. Therefore, accurate bone segmentation from CBCT images is important for sophisticated treatment planning. However, most dental CBCT devices are designed to reduce the radiation dose by limiting the scan field of view (FOV) [35]. Moreover, they frequently lack the pre- and post-patient collimation that greatly reduces the amount of scattered radiation [8]. The photon scattering and data truncation due to the limited FOV of CBCT cause bright and dark shadows in the reconstructed image, significantly degrading the image quality of CBCT and interfering with accurate bone segmentation. In contrast, multidetector computed tomography (MDCT) images are considered nearly scatter-free [23] and free of data-truncation errors, as depicted in Fig. 1(b). This suggests the possibility of developing an artifact-correction function that maps CBCT images to MDCT-like images containing negligible shading artifacts.
Several approaches for reducing shading artifacts in CBCT images have been reported in previous studies. Scatter correction methods are based on Monte Carlo (MC) simulations [5], [15], beam blockers [22], [30], [36], and scatter kernels [4], [29], whereas truncation correction methods are based on projection extrapolation [9] and iterative reconstruction using various handcrafted priors [16], [31], [34]. Although these methods have demonstrated the potential to reduce shading artifacts, their use in clinical applications faces several limitations, such as high computational cost, additional scan time, and additional radiation exposure.
Recently, deep-learning techniques have been successfully employed for CT-image enhancement, such as noise reduction [10], [18], [24], [32]. In clinical practice, acquiring (real-world) paired data is virtually impossible owing to the difficulty of simultaneously obtaining patient CT data and the corresponding ground-truth image. Therefore, some researchers have adopted the generative adversarial network (GAN) [6] to learn a correction function from unpaired sets of artifact-free and artifact-affected images. In the variational framework, the unpaired model can be viewed as an energy minimization problem for a correction function comprising a data fidelity term and a data-driven regularization term (referred to as a fidelity-embedded GAN), where the data fidelity refers to the difference between an input image (denoted by z) and the corresponding image (denoted by G(z)) generated by the GAN [24], [32]. The fidelity may also include an additional cycle loss [7], [10], [18].
This study adopts a fidelity-embedded GAN that uses unpaired CBCT and MDCT images to reduce the shading artifacts. In general, it is difficult to infer the desired MDCT image priors using unpaired training data from two different imaging modalities (i.e., CBCT and MDCT). Therefore, learned data-driven regularization can be used to generate plausible structures from complex shading artifacts. However, the roughly enforced data fidelity (i.e., the sum of the differences between the input z and the generated output G(z)) can fail to preserve the morphological structures of the input z; consequently, the fidelity-embedded GAN distorts some pixels. Fig. 1(c) reveals that the fidelity-embedded GAN creates unwanted bone-like artifacts in the output, and these artifacts vary with the input data. This problem is solved by retraining the fidelity-embedded GAN using a filtered paired dataset {(z_k, x_k) : k = 1, ..., K}, in which the labels x_k = G(z_k) are generated by selecting the correct output images produced by applying the fidelity-embedded GAN to the unpaired data. The resulting change in fidelity (i.e., the sum of the differences between G(z) and the labeled data x) dramatically improves the reduction of the bone-like and shading artifacts (Fig. 1(d)). The proposed unpaired-paired learning approach was validated using a clinical CBCT dataset, and its feasibility was further investigated by applying it to an image-segmentation task.

II. METHOD
In this paper, z represents the CBCT image and x the MDCT image. Let Z and X be the distributions of CBCT images and MDCT images, respectively. The goal is to learn a CBCT-to-MDCT transformation G : Z → X from unpaired training samples S_z = {z_i}_{i=1}^N and S_x = {x_j}_{j=1}^M drawn from Z and X, respectively. Note that the probability density functions p_z and p_x corresponding to Z and X are unknown.
Specifically, the proposed method is a two-stage GAN-based image-to-image translation method that trains the correction map G : z → x by minimizing the following loss model:

  min_G  dist(p_{G(z)}, p_x) + λ F_pixel-level(G(z)),   (1)

where p_{G(z)} is the density of G(z) in the distribution generated by the generator G; dist(p_{G(z)}, p_x) is a metric that measures the distance between the two probability density functions p_{G(z)} and p_x; and F_pixel-level(G(z)) represents the pixel-level fidelity of G(z), which constrains G to preserve the structure of the CBCT image z. Here, λ > 0 is the regularization parameter that controls the tradeoff between d_Pearson(p_{G(z)}, p_x) ≈ 0 and F_pixel-level(G(z)). Without paired training data (z, x), G cannot be trained with good pixel-level fidelity. When dist is chosen as the Pearson divergence, (1) can be converted to the least-squares GAN (LSGAN) framework [19], [20], in which G is trained simultaneously with a discriminator D in an adversarial relationship so as to improve their mutual abilities. For the pixel-level fidelity constraint F_pixel-level(G(z)) in (1), we use two different fidelities. First, we obtain an unpaired generator G_u using the fidelity ‖G(z) − z‖. Then, using a selected subset of the outputs of G_u as training data, we train a paired generator G_p using the fidelity ‖G(z) − G_u(z)‖, which is equivalent to supervised learning. We now explain the practical details of the proposed method, where the finite training samples S_z and S_x are used to train G.
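The loss model in (1), in its LSGAN form with a pixel-level fidelity term, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the L1 fidelity norm and the function names are assumptions.

```python
import numpy as np

def lsgan_generator_loss(d_fake):
    # Adversarial term for G: push discriminator scores on G(z) toward 1.
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_discriminator_loss(d_real, d_fake):
    # Adversarial term for D: real images toward 1, generated images toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def fidelity_embedded_loss(d_fake, g_z, target, lam):
    # Total generator objective: adversarial distance + lambda * pixel fidelity.
    # Stage 1 uses target = z (the input CBCT); stage 2 uses target = G_u(z).
    fidelity = np.mean(np.abs(g_z - target))  # L1 fidelity (assumed norm)
    return lsgan_generator_loss(d_fake) + lam * fidelity
```

In an actual training loop, `d_fake` and `d_real` would be discriminator outputs on batches of patches, and the two adversarial terms would be minimized alternately.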

A. STAGE 1: UNPAIRED MODEL
The first step of the proposed method is to generate MDCT-like images from unpaired CBCT images using the following LSGAN model:

  min_G  dist(p_{G(z)}, p_x) + λ E_z‖G(z) − z‖,   (3)

where dist is realized through the adversarial training of G with a discriminator D_u, and the term λ E_z‖G(z) − z‖ corresponds to the constraint of the pixel-level fidelity F_pixel-level(G(z)).
With this pixel-level fidelity ‖G(z) − z‖, it is generally difficult to train G, which results in plausible artifacts that do not exist in MDCT images. To explain the reason, suppose that x* is the target MDCT image corresponding to the input z. If ‖x* − z‖ is reasonably small, then the method of (3) can generate G_u(z) ≈ x*. On the other hand, if ‖x* − z‖ is not small, then the method of (3) may produce an unwanted G_u(z) that differs from the target x*. Hence, the performance of G_u depends heavily on the input z and the parameter λ. Fig. 2 depicts various G_u(z) for different values of λ. The case of G_u(z) with λ = 0 in (3) exclusively reflects the overall characteristics of MDCT images but loses the detailed structures of z, thereby generating bone-like structures owing to the imperfect data-driven priors. In contrast, the case of G_u(z) with a small value of λ = 10^-2 shows reduced bone-like artifacts, albeit the morphological structures (Fig. 2) are nonetheless partially lost. The case of G_u(z) with λ = 10^1 demonstrates an effective reduction in the scattering artifacts while preserving the morphological structures of the tissues. However, when z is severely corrupted by the shading artifacts (i.e., ‖x* − z‖ ≫ 0), bone-like artifacts may remain in G_u(z) (yellow arrows in Fig. 2). Hence, the model in (3) may fail to preserve the morphological structures of tissues and bones under complex shading and may introduce plausible artifacts as well.

B. STAGE 2: PAIRED MODEL
Stage 2 uses the observation from Stage 1 that the performance of G_u depends on the input z. For each z_i ∈ S_z, we examine whether G_u(z_i) is an MDCT-like image or not. We choose a subset of S_z, denoted by S_z^{G_u}, such that {G_u(z) : z ∈ S_z^{G_u}} consists only of MDCT-like images. The proposed method uses the set S_pair := {(z, G_u(z)) : z ∈ S_z^{G_u}} as paired training data for supervised learning. We then train G using the selected paired dataset S_pair through the following paired model:

  min_G  dist(p_{G(z)}, p_x) + λ E_z‖G(z) − G_u(z)‖,   (4)

where z is drawn from S_z^{G_u}. In our experiment, for the trained G_u and G_p, it follows that dist(p_{G_u(z)}, p_x) = 0.24 and dist(p_{G_p(z)}, p_x) = 0.19. This implies that, using the selected paired data, the trained G_p generates images that reflect the characteristics of the MDCT domain more accurately than G_u.
Here, for the trained G_u and G_p, the distances dist(p_{G_u(z)}, p_x) and dist(p_{G_p(z)}, p_x) are approximated using the finite samples S_z, S_z^{G_u}, and S_x, respectively [20], [25].
In summary, the operation of the proposed method involves the following steps:
1) Training of G_u in (3) using the unpaired datasets S_z and S_x.
2) Training of G_p in (4) using the selected paired dataset S_pair.
The diagram of the proposed method is illustrated in Fig. 3.
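The two-stage procedure can be sketched as follows. This is a hypothetical pure-Python outline, not the authors' code: `train_unpaired`, `is_mdct_like`, and `train_paired` are placeholder callables standing in for stage-1 training, the expert screening step, and stage-2 training, respectively.

```python
def two_stage_training(S_z, S_x, train_unpaired, is_mdct_like, train_paired):
    # Stage 1: train the unpaired generator G_u on the unpaired sets (eq. (3)).
    G_u = train_unpaired(S_z, S_x)
    # Screening: keep only inputs whose stage-1 output looks MDCT-like
    # (performed by a clinical expert in the paper).
    S_pair = [(z, G_u(z)) for z in S_z if is_mdct_like(G_u(z))]
    # Stage 2: train the paired generator G_p on the selected pairs (eq. (4)).
    G_p = train_paired(S_pair)
    return G_u, G_p, S_pair
```

Note that the screening predicate is the only place where human judgment enters; the rest of the pipeline is fully automatic.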

C. DATASETS
This study was approved by the institutional review board of our institution, and the requirement for informed consent was waived. We collected CBCT and MDCT images of 20 and 28 patients, respectively. The CBCT images were acquired using a circular-trajectory CBCT scanner (Xoran CAT, Xoran Technologies, USA) with a pixel size of 0.40 mm × 0.40 mm, an image size of 400 × 400, a tube voltage of 120 kVp, and a tube current of 48 mA. Note that several dental CBCT scanners, such as i-CAT [3], [21], KaVo 3D eXam [28], and ProMax 3-D Max [14], reconstruct images with a pixel size of 0.4 mm × 0.4 mm. In contrast, the MDCT images were acquired using a helical-trajectory MDCT scanner (SOMATOM Definition Flash, Siemens Healthineers, Germany) with a pixel size of 0.42-0.49 mm, an image size of 512 × 512, a tube voltage of 120 kVp, and a tube current of 100-225 mA. The slice thicknesses of CBCT and MDCT were 0.4 mm and 0.6 mm, respectively. Before training, the MDCT images were adjusted visually to match the CBCT images. First, each MDCT image was rescaled to a 0.4 mm pixel size along the horizontal and vertical axes. Second, the rescaled MDCT image was cropped to a size of 400 × 400 centered on a specific point. This center was manually selected based on anatomical knowledge; specifically, we selected a point located approximately 2 mm anterior to the sella [11] (see the red dot in the top right-hand image of Fig. 4). The selected point was also used as the center for all the CBCT images of each patient. Finally, an ROI mask of the same size as the CBCT image was applied to the preprocessed MDCT image. The preprocessing of the MDCT image is shown in Fig. 4.
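The MDCT preprocessing described above (rescale to the CBCT pixel size, crop around a chosen center, apply an ROI mask) can be sketched as follows. This NumPy sketch uses nearest-neighbor resampling for brevity and assumes a circular ROI matching the CBCT field of view; the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def preprocess_mdct(mdct, src_pixel_mm, center_rc, out_size=400, dst_pixel_mm=0.4):
    """Rescale an MDCT slice to the CBCT pixel size, crop around a
    manually chosen anatomical center, and apply a circular ROI mask."""
    # 1) Rescale to dst_pixel_mm per pixel (nearest-neighbor resampling).
    scale = src_pixel_mm / dst_pixel_mm
    h, w = mdct.shape
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = np.clip((np.arange(new_h) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    rescaled = mdct[np.ix_(rows, cols)]
    # 2) Crop an out_size x out_size window around the (rescaled) center.
    r0 = int(center_rc[0] * scale) - out_size // 2
    c0 = int(center_rc[1] * scale) - out_size // 2
    crop = rescaled[r0:r0 + out_size, c0:c0 + out_size]
    # 3) Apply a circular ROI mask matching the CBCT field of view.
    yy, xx = np.mgrid[:out_size, :out_size]
    roi = (yy - out_size / 2) ** 2 + (xx - out_size / 2) ** 2 <= (out_size / 2) ** 2
    return crop * roi
```

In practice a higher-order interpolation (e.g., bilinear) would be preferable to nearest-neighbor, and the crop bounds would need guarding when the center lies near the image border.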
Overall, 11000 CBCT (S_z) and 11422 MDCT (S_x) images obtained from 18 and 28 patients, respectively, were used for the unpaired training (G_u in (3)). Meanwhile, 1100 CBCT images obtained from the remaining two patients comprised the test dataset. For efficient learning, we randomly selected 1024 CBCT and MDCT images per epoch and used them to update the networks. Upon completion of the unpaired training, 6831 corrected CBCT images unaffected by the bone-like artifacts, together with the 6831 corresponding original CBCT images, were selected as the paired dataset (S_pair) and used for the paired learning (G_p in (4)).
Instead of learning entire images at once, the proposed networks learn individual images in a patch-by-patch manner. This is because the patch manifold is generally characterized by a low-dimensional structure compared to the image manifold [26], [27]. This allows the networks to learn the generated distribution efficiently. In this study, given a CT image, patches with a size of 128 × 128 were extracted considering strides of 16 along each direction in the image domain. Subsequently, the patches were corrected in the same manner as described in Sections II-A and II-B. Finally, the corrected CT images were synthesized using the corrected patches. The pixel values in the overlapping patch regions were averaged.
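The patch-wise correction and overlap averaging described above can be sketched as follows. This is an illustrative NumPy sketch: `correct_patch` stands in for the trained generator applied to a single patch.

```python
import numpy as np

def correct_by_patches(image, correct_patch, patch=128, stride=16):
    """Correct an image patch-by-patch and average overlapping regions."""
    h, w = image.shape
    acc = np.zeros_like(image, dtype=float)  # sum of corrected patch values
    cnt = np.zeros_like(image, dtype=float)  # number of patches covering each pixel
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            acc[r:r + patch, c:c + patch] += correct_patch(
                image[r:r + patch, c:c + patch])
            cnt[r:r + patch, c:c + patch] += 1
    cnt[cnt == 0] = 1  # guard against pixels never covered by a patch
    return acc / cnt
```

With a 400 × 400 image, a 128 × 128 patch, and a stride of 16, the patch grid covers the image exactly, so every pixel is averaged over at least one patch.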

D. NETWORK ARCHITECTURES
Throughout this paper, we used the same architecture for both G_u and G_p, and likewise for both D_u and D_p. For the generator, we adopted the deep convolutional framelets [33], a multi-scale convolutional neural network (CNN). For the discriminator, we adopted a standard CNN without a fully connected layer at the end of the network. The network architectures are exactly the same as those in [24], [25].
In our experiments, both the generator and the discriminator were updated using the Adam solver [13] with a learning rate of 0.0002, a mini-batch size of 20, and 100 epochs. The network weights were initialized following a Gaussian distribution with a mean of 0 and a standard deviation (SD) of 0.01. We chose λ = 10 in (3) and λ = 100 in (4). Training was implemented using TensorFlow [1] on a CPU (Intel Core i9-9980XE, 3.0 GHz) and GPU (NVIDIA Titan RTX, 24 GB) system. It took approximately 29 hours to train each of G_u and G_p.

III. RESULTS
Fig. 5 compares the performance of the proposed method with that of the unpaired learning methods for the test patients. Further comparisons were performed using the unpaired model with an additional cyclic loss [37] (called cycle-GAN). More specifically, we solved the following unpaired problem:
  min_{G,T}  dist(p_{G(z)}, p_x) + dist(p_{T(x)}, p_z) + λ E_z‖T(G(z)) − z‖ + µ E_x‖G(T(x)) − x‖,   (7)

where G_cycle denotes the generator G trained with (7), and T : x → z is its approximate inverse, satisfying T(G(z)) ≈ z and G(T(x)) ≈ x. During the experiment performed in this study, the parameter values were selected as λ = 10 and µ = 10; the parameters with the best bone-segmentation performance (maximum Dice coefficient (DC)) were manually selected. Numerous recently published papers [7], [10], [17], [18] have adopted cycle-GAN or its variants to train on unpaired datasets. As shown in Fig. 5, both G_u and G_cycle substantially reduce the shading artifacts while enhancing the contrast between bone and tissue. The intensity profiles depicted in Fig. 6 confirm the contrast enhancement. However, the weak data fidelity in both unpaired-learning models causes bone-like artifacts to appear in the images generated by both G_u and G_cycle. In contrast, the proposed method substantially reduces the bone-like artifacts while preserving the morphological structures of bones and tissues. In addition, the proposed method does not appear to introduce unexpected new artifacts; the fidelity based on the paired dataset S_pair selected in the second stage appears to prevent their creation.

[Fig. 5 caption (fragment): ... (G_u in (3)). Fourth column: results of unpaired learning with the cyclic loss (G_cycle in (7)). Fifth column: results of the proposed method (G_p in (4)). (WW/WL = 3000/−500 for wide window, WW/WL = 1500/250 for narrow window)]
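The cyclic fidelity used in the cycle-GAN baseline can be sketched as follows. This is an illustrative NumPy sketch with assumed L1 penalties; the adversarial terms of the cycle-GAN objective are omitted, and the function name is hypothetical.

```python
import numpy as np

def cycle_consistency_loss(z, x, G, T, lam=10.0, mu=10.0):
    # Forward cycle: z -> G(z) -> T(G(z)) should return to z.
    forward = np.mean(np.abs(T(G(z)) - z))
    # Backward cycle: x -> T(x) -> G(T(x)) should return to x.
    backward = np.mean(np.abs(G(T(x)) - x))
    return lam * forward + mu * backward
```

Unlike the pixel-level fidelity ‖G(z) − z‖ of the unpaired model, this term only constrains the round trip through both generators, which is why it can still admit bone-like artifacts in G(z) itself.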
We computed the mean and SD of the bone intensity, where the bone mask was manually delineated, excluding the teeth. Table 1 summarizes the computed mean and SD values for the CBCT and corrected CBCT images on the test dataset. As listed in Table 1, the mean HU values of the corrected CBCT images are increased over those of the uncorrected CBCT images. This is because the deep-learning methods reduce the shading artifacts while enhancing the contrast between bone and tissue. Fig. 7 compares the 3D bone-segmentation results between the original CBCT images and the corrected images obtained using the unpaired and proposed methods. The 3D modeling was performed using the Mimics software (Materialise, Belgium). The threshold value (tv) for bone segmentation was chosen to maximize the DC on the training dataset; the selected tv was then used for bone segmentation in the two test patients. Here, manual segmentation performed by a 3D bone-segmentation expert was considered the ground truth. The DC value between two binary images (or sets) x and y was calculated as

  DC(x, y) = 2|x ∩ y| / (|x| + |y|),

where |·| denotes the number of pixels in a set. The chosen tvs for the uncorrected and corrected 3D CBCT images are indicated in Fig. 7. The results were visualized using the Visualization Toolkit (Kitware Inc.). As depicted in Fig. 7, compared to the original CBCT image and its corrected version obtained using unpaired learning, more accurate bone segmentation can be realized in the corrected image obtained using the proposed method. Moreover, the segmentation quality was objectively assessed by evaluating the mean DC for the two test patients; Table 2 presents the corresponding results. As shown in Table 2, the corrected image obtained using the proposed method achieves a higher DC than the original CBCT image and that obtained using the unpaired-learning model.

FIGURE 6. Intensity profiles of the corrected CT images along the dotted line in Fig. 5.
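The Dice coefficient used for the segmentation evaluation can be computed as in the following NumPy sketch; the convention of returning 1 for two empty masks is an assumption for completeness.

```python
import numpy as np

def dice_coefficient(x, y):
    """DC(x, y) = 2|x ∩ y| / (|x| + |y|) for binary masks x and y."""
    x = np.asarray(x).astype(bool)
    y = np.asarray(y).astype(bool)
    denom = x.sum() + y.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement (assumed)
    return 2.0 * np.logical_and(x, y).sum() / denom
```

A DC of 1 indicates perfect overlap with the expert's manual segmentation, and 0 indicates no overlap.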

IV. DISCUSSION AND CONCLUSION
This paper presents an unpaired-paired CBCT-to-MDCT translation method to alleviate photon scattering and truncation errors, which are among the main factors that degrade the quality of dental CBCT images. Although fidelity-embedded GAN approaches for unpaired learning have demonstrated promising results for denoising CT images, their performance has remained limited owing to the use of a naive fidelity despite the significant differences between CBCT and MDCT. This naive fidelity, which constrains the output image to match the corresponding input spatially and anatomically, demonstrates inconsistent performance that varies with the input image quality, creating unwanted bone-like artifacts during image-to-image translation. For similar reasons, despite recent advances in image-to-image translation, the application of GAN-based techniques such as cycle-GAN to high-quality medical-imaging applications remains a challenge. These problems can be resolved using paired learning. Based on domain knowledge of the morphological structure of the human skull, a clinical expert constructed paired data by manually collecting the accurate results obtained using the unpaired-learning model. Using the selected paired data, the same GAN-based method, with its fidelity term driven by the pair difference, demonstrated significantly improved performance. Fig. 5 illustrates the empirical effect of the data fidelity. Note that even the input images that produced bone-like artifacts during unpaired learning did not generate them during the secondary paired learning; it is very interesting to observe that the CBCT image correction that failed in the first stage succeeded in the second stage with paired learning. In addition, the proposed approach provided more accurate results for 3D bone segmentation than unpaired learning. It takes approximately 200 ms to obtain a single 400 × 400 corrected CBCT image; this time can be reduced by using, for example, parallel computation and a multi-GPU system.
The proposed method can be improved in the following aspects. First, in the fidelity-embedded GAN, incorporating a more accurate data fidelity based on CT physics can improve the network performance in terms of shading correction; for example, a data fidelity that accounts for photon noise in low-dose CT images can be combined with the GAN [25]. Second, the manual selection in the second stage of the proposed method can be time-consuming and subjective. Therefore, further research to improve the robustness of the proposed approach should be undertaken in the near future.

A. IMPACT OF DIFFERENT IMAGE RESOLUTIONS OF CBCT
In this section, we investigate the impact of the CBCT image pixel size. Training was performed using CBCT scans of 30 patients and MDCT scans of 28 patients, acquired using a circular-trajectory CBCT scanner (NewTom, pixel size of 0.25 mm) and a helical-trajectory MDCT scanner (SOMATOM Definition Flash, pixel size of 0.42-0.49 mm), respectively. Note that this CBCT scanner has a smaller pixel size than the MDCT scanner used as the label. Fig. 8 shows the results of the proposed approach in axial, coronal, and sagittal views. As shown in Fig. 8, the proposed method effectively reduces the shading artifacts; even for the CBCT image with a smaller pixel size, resolution degradation is rarely seen in the corrected image.