Regularized Three-Dimensional Generative Adversarial Nets for Unsupervised Metal Artifact Reduction in Head and Neck CT Images

The reduction of metal artifacts in computed tomography (CT) images, specifically for strong artifacts generated from multiple metal objects, is a challenging issue in medical imaging research. Although there have been some studies on supervised metal artifact reduction through the learning of synthesized artifacts, it is difficult for simulated artifacts to cover the complexity of the real physical phenomena that may be observed in X-ray propagation. In this paper, we introduce metal artifact reduction methods based on an unsupervised volume-to-volume translation learned from clinical CT images. We construct three-dimensional adversarial nets with a regularized loss function designed for metal artifacts from multiple dental fillings. The results of experiments using a CT volume database of 361 patients demonstrate that the proposed framework has an outstanding capacity to reduce strong artifacts and to recover underlying missing voxels, while preserving the anatomical features of soft tissues and tooth structures from the original images.


I. INTRODUCTION
Medical procedures such as diagnosis, surgical planning, and radiotherapy can be seriously degraded by the presence of metal artifacts in computed tomography (CT) imaging. Metal objects such as dental fillings, fixation devices, and other electric instruments implanted in patients' bodies inhibit X-ray propagation [1], preventing accurate calculation of the CT values during image reconstruction and yielding dark bands or streak artifacts in the CT images [2], [3]. To correct the images, missing CT values for the underlying anatomical features must be compensated at the same time as the artifacts are removed. Although doctors make clinical efforts to manually correct such artifacts, this is a labor-intensive and time-consuming task. Many researchers have studied image filtering or reconstruction methods [2], [4]-[7], but metal artifact reduction (MAR) remains a challenging problem [8]-[10], and no standard algorithm for strong, complex artifacts with missing pixels derived from multiple metal objects has yet been established.
The MAR methods commonly applied after image acquisition are filtering or normalization approaches in the projection domain [4], [5], [8]. Traditional image interpolation and iterative reconstruction approaches require physical models of the CT scanning process, and do not achieve sufficient artifact reduction for metals of various shapes and material characteristics. In recent decades, statistical compensation techniques using prior knowledge of the artifacts have been investigated [11]-[14]. The application of deep learning to medical images has gained significant interest and has been actively studied in recent years [15]-[18]. Supervised learning for artifact reduction requires an artifact-free image that corresponds to each image with artifacts; in practice, the preparation of such paired images is clinically difficult. Thus, sinograms or CT images generated with simulated typical metal artifacts are used as training data [19]-[21]. The use of synthesized images enables high-quality image reconstruction under the condition that the three-dimensional shape and position of the metal are assumed to be known. However, this approach struggles to generate realistic artifacts that fully cover the complexity of real physical phenomena encountered during X-ray propagation, specifically in cases with multiple metal objects. Determining how far simulated artifacts deviate from real ones in clinical CT images also remains a challenge.
Recently, generative adversarial networks (GANs) [22] and their extensions [23], [24] have been extensively studied as a framework for unsupervised image-to-image translation. Unsupervised learning in the GAN framework does not require paired images, as a mapping function from the input to the target image domain is obtained by constructing a generator with the ability to transfer the image features. The generator is trained adversarially using a discriminator that attempts to distinguish whether the input is a synthesized image, leading to elaborate image-to-image translation. Extensive research on GAN-based medical image synthesis has been conducted for various clinical applications [25]. For instance, low-dose CT denoising [26]-[28], super-resolution [29], cross-modality transfer [30]-[33], and applications to organ segmentation [34] have been actively studied.
The application of GANs to artifact reduction is a relatively new challenge, as clinical CT images contain a wide variety of low-quality images affected by strong metal artifacts, which poses technical difficulties. Recent studies have applied GANs to MAR in small regions of CT images of the ear [35]-[38]. Du et al. [39] presented preliminary results from GAN-based MAR for images with dental fillings, while Liao et al. [40] proposed a CycleGAN-based artifact disentanglement network and compared quantitative evaluation results against existing supervised/unsupervised MAR methods using synthesized datasets. However, the performance of GAN-based MAR remains to be clarified when the network is trained with clinical CT images containing complex artifacts derived from multiple dental fillings. Image correction in MAR should target artifact-affected regions and recover the underlying image features, while preserving the other regions with the native anatomical structures of the patients. To address these issues, we have focused on adversarial training with real patient datasets and on the importance of learning three-dimensional (3D) features from the CT volumes.
In this paper, we introduce MAR methods based on volume-to-volume translation learned from unpaired clinical CT images. The proposed methods are established on an unsupervised learning scheme in the absence of synthesized images or simulated artifacts. 3D GANs are developed as an extension of the image-to-image translation framework of CycleGAN [24], and a mapping function for artifact reduction is trained using a patient CT volume database (see Fig. 1). The database is constructed from clinical CT volumes of 361 patients: metal-free CT volumes and volumes with various patterns of metal artifacts derived from multiple dental fillings. There are infinitely many mappings that translate artifact-affected volumes to the artifact-free domain, and we demonstrate that an adversarial objective with cycle consistency loss alone often fails to preserve anatomical features. We therefore seek a regularized loss function that captures the characteristics of metal artifacts, thus addressing the main issues faced by GAN-based MAR: recovering the underlying image features while preserving the native anatomical structures in artifact-free regions. We compare the proposed method against existing unsupervised approaches to clarify the MAR performance. Quantitative and qualitative evaluations show that the proposed framework outperforms the baselines for a wide range of clinical images with artifacts. Experiments involving the opinions of expert surgeons confirm the clinical applicability of the proposed 3D adversarial nets.
The contributions of this study can be summarized as follows: 1) 3D generative adversarial nets are developed for unsupervised MAR in head and neck CT images, 2) a regularized loss function is designed for stable learning of the target features of metal artifacts, 3) a feasibility study of volume-to-volume translation directly learned from a clinical image database is conducted, and 4) the quantitative performance of the proposed framework is evaluated and clinically validated with expert surgeons.

II. METHODS
The goal of the proposed unsupervised learning method is to learn mapping functions between two domains: X, containing CT volumes with artifacts, and Y, containing artifact-free CT volumes, given the training samples {x_i} ∈ X and {y_j} ∈ Y. This section describes the details of our training datasets, the 3D adversarial training scheme, the regularized mapping function, and the volume-to-volume translation process.

A. CLINICAL CT VOLUME DATABASE FOR 3D ADVERSARIAL TRAINING
Deep learning and GAN-based approaches have mostly targeted 2D image slices for artifact reduction. As no clinical database of CT volumes is directly available for learning the characteristics of real metal artifacts, we constructed an original CT volume database of dental fillings for 3D adversarial training. We collected clinical CT volumes consisting of head and neck images from 257 patients from The Cancer Imaging Archive (TCIA) [41] and CT images [42], [43] measured from 104 patients who underwent treatment in the Department of Oral and Maxillofacial Surgery of Nara Medical University, Japan. The CT images were scanned on a Siemens SOMATOM Definition AS CT scanner at 120 kV and 200 mAs. This study was performed in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Nara Medical University Hospital (approval number: 2296).
To prepare the training data, slice images containing teeth structures from each patient's CT volume were visually checked, and the existence or otherwise of metal artifacts was determined. Artifact-free volumes were classified into image domain Y . The CT volumes partly containing metal artifacts were divided into 3D regions with or without metal artifacts. Artifact-free subvolumes were then classified into domain Y and those with metal artifacts were classified into domain X . Thus, artifact-free volumes were obtained from both metal-free patient volumes and other patient volumes by excluding the regions with metal artifacts.
A total of 56 volumes (12 artifact-free CT volumes and 44 volumes with metal artifacts) were randomly selected from the database and used solely as test data for evaluation. The remaining CT subvolumes, consisting of 539 artifact-free volumes (10,491 images) and 320 volumes (5,655 images) with various patterns of real metal artifacts, were used for adversarial training. Each volume consists of 5-43 image slices with 512 × 512 pixels. There are no paired data from corresponding 3D regions that belong to both image domains. Adversarial training for all subsequent experiments was performed using this database under the unsupervised setting. As the proposed MAR targets a wide range of artifacts that appear in soft tissues and bones, the CT value range [−1000 HU, 1000 HU] was normalized to [−1, 1].
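For concreteness, the following is a minimal sketch of this intensity normalization. The clipping of out-of-range values is our assumption; the text states only that [−1000 HU, 1000 HU] was mapped to [−1, 1].

```python
import numpy as np

def normalize_ct(volume_hu: np.ndarray) -> np.ndarray:
    """Map the CT window [-1000 HU, 1000 HU] to [-1, 1].
    Clipping values outside the window is an assumption of this sketch."""
    return np.clip(volume_hu, -1000.0, 1000.0) / 1000.0

def denormalize_ct(volume: np.ndarray) -> np.ndarray:
    """Invert the normalization back to Hounsfield units."""
    return volume * 1000.0
```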

B. 3D GENERATIVE ADVERSARIAL NETS
The proposed volumetric GAN was designed as an extension to CycleGAN's image-to-image translation framework [24]. There is a possibility that the 3D distributions of metal artifacts or anatomical structures would not be learned sufficiently well using the conventional 2D framework. We argue that 3D features learned from unpaired clinical datasets are effective for MAR, and build a 3D generative adversarial net using local CT volumes for adversarial training. As illustrated in Fig. 1, our volume-to-volume translation model includes two mapping functions, G_Y : X → Y and G_X : Y → X. Two adversarial discriminators D_X and D_Y are also introduced, where D_X aims to distinguish between volumes x and G_X(y), and D_Y aims to distinguish y from G_Y(x).
Here, a training sample x or y is an unpaired local volume that consists of N spatially continuous image slices. Fig. 2(a) shows training samples x and y of clinical CT data in the case of two image slices forming one training unit (N = 2). The six image sets on the left show the X → Y translation, and the right-hand images show Y → X. It can be confirmed that the local volume contains a spatially continuous distribution of metal artifacts. Fig. 2(a) is a successful case in which the metal artifacts in the original volume x have been reduced in the translated volume G_Y(x), and G_X(y), translated from the metal-free sample y, includes "fake" artifacts generated by G_X.
As there are infinitely many mappings G that map the input volumes to the target domain, the objective should be designed to fit the mapping functions for the purpose of MAR; the translation should reduce the metal artifacts, recover underlying image features, and preserve the native anatomical structures in the original image. The basic objective of CycleGAN can be described as

  \mathcal{L}_{cgan}(G_X, G_Y, D_X, D_Y) = \mathcal{L}_{adv}(G_Y, D_Y, X, Y) + \mathcal{L}_{adv}(G_X, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G_X, G_Y).  (1)

Here, L_adv refers to the adversarial loss, defined as

  \mathcal{L}_{adv}(G_Y, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[\log D_Y(y)] + \mathbb{E}_{x \sim X}[\log(1 - D_Y(G_Y(x)))],  (2)

where G_Y tries to generate volumes G_Y(x) that are similar to volumes in the target domain Y, while D_Y aims to distinguish between G_Y(x) and real samples y; the adversarial loss thus measures the performance of D. L_cyc refers to the cycle consistency loss, expressed as

  \mathcal{L}_{cyc}(G_X, G_Y) = \mathbb{E}_{x \sim X}[\| G_X(G_Y(x)) - x \|_1] + \mathbb{E}_{y \sim Y}[\| G_Y(G_X(y)) - y \|_1],  (3)

where the first term quantifies the reconstruction error between the original image x and the image G_X(G_Y(x)) generated through the translation X → Y → X. Similarly, the second term evaluates the cycle consistency of the translation Y → X → Y. The weight λ controls the relative strength of the adversarial and cycle consistency losses. G_X and G_Y are trained such that L_cgan is minimized, while D_X and D_Y are adversarially trained to maximize L_cgan. The objective function in (1) and its extensions can perform successful translations in a variety of medical applications [32], [33]. However, it fails to learn an effective translation that targets the metal artifacts, and often modifies native anatomical structures at the same time. This tendency became significant in the volume-to-volume translations conducted in a preliminary study, and so we explored improved objectives for volumetric MAR.
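For concreteness, the following is a minimal PyTorch sketch of the losses in (1)-(3). PyTorch is an assumption (the paper states only that Python was used), and the binary cross-entropy form of the adversarial loss is one common choice; least-squares variants also appear in CycleGAN implementations.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, src, tgt):
    """L_adv for one translation direction, Eq. (2): D scores real samples
    from the target domain against generated samples G(src). Returns the
    discriminator and generator loss terms."""
    real_logits = D(tgt)
    fake_logits = D(G(src).detach())  # detach: no gradient into G for the D update
    loss_D = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    gen_logits = D(G(src))
    loss_G = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return loss_D, loss_G

def cycle_consistency_loss(G_X, G_Y, x, y):
    """L_cyc, Eq. (3): L1 reconstruction error over both translation cycles."""
    return (G_X(G_Y(x)) - x).abs().mean() + (G_Y(G_X(y)) - y).abs().mean()
```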

C. REGULARIZED OBJECTIVE FUNCTION FOR METAL ARTIFACT REDUCTION
To fit the mapping functions targeting metal artifacts for translation, we consider the following two loss functions.

1) INTENSITY LOSS
As shown by training sample x in Fig. 2(a), the regions affected by the metal artifacts are often limited or sparse with respect to the entire image space. Although strong artifacts affect a wide range of pixels in a 2D image slice, their 3D distribution is not dense in the volumetric space. We consider these characteristics of the metal artifacts and introduce an intensity loss to reduce the space of possible mapping functions. The intensity loss is defined as a regularization term using the L1 norm, which penalizes the difference in CT values between the original image x and the translated image G_Y(x):

  \mathcal{L}_{int}(G_Y) = \mathbb{E}_{x \sim X}[\| G_Y(x) - x \|_1].  (4)

Fig. 2(b) shows an example of the mapping function G_Y(x) and the role of regularization. Under the intensity loss, the image x is not mapped to some distant point (yellow), but to a closer point (green), as the difference between x and G_Y(x) should be kept small and sparse. Thus, this regularization induces an output distribution that is close to the input distribution in CT values, and aims to ensure that the generators translate the sparse artifacts rather than the dense anatomical structures. Note that the intensity loss differs from the identity loss [32], [40], which penalizes |G_Y(y) − y|; the identity loss regularizes the generator toward an identity map when a sample y from the target domain is provided as the input to the generator G_Y. The performance of the identity loss for MAR is investigated in the experiments.
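Under the same assumed PyTorch setting as above, the intensity loss in (4) and, for contrast, the identity loss it is distinguished from:

```python
def intensity_loss(G_Y, x):
    """L_int, Eq. (4): L1 penalty on the CT-value change introduced by G_Y,
    encouraging sparse corrections confined to artifact regions."""
    return (G_Y(x) - x).abs().mean()

def identity_loss(G_Y, y):
    """Identity loss used by CGAN+ID [23], [32]: penalizes G_Y for modifying
    samples that are already in the target domain."""
    return (G_Y(y) - y).abs().mean()
```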

2) FEATURE LOSS
The distribution of metal artifacts varies with the number of dental fillings, their locations, and other conditions of X-ray propagation. Nevertheless, similar feature patterns, such as dark bands, streaks, and bright areas around metal objects [2], can be observed. The generator G_Y should remove such artifact-derived features from the original images, whereas the generator G_X should add similar features to the artifact-free images, as illustrated in Fig. 2(c). We consider these symmetric characteristics of the translation required for MAR and further constrain the mapping functions by introducing a feature loss. In this study, the feature loss is defined as an L2 norm penalty on the difference, in feature space, between the values subtracted in the X → Y translation and those added in the Y → X translation:

  \mathcal{L}_{fea}(G_X, G_Y) = \mathbb{E}_{x \sim X,\, y \sim Y}\big[\| (f(x) - f(G_Y(x))) - (f(G_X(y)) - f(y)) \|_2\big],  (5)

where f is a feature encoding function that converts input CT images to deep features. In this study, f is given by a convolutional layer of the pretrained VGG16 network [45]. L_fea evaluates the encoded deep image features and induces the generators to target location-independent, visually similar artifacts for translation.
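A sketch of the feature loss in (5) follows. The choice of VGG16 layer, the single-channel inputs, and the replication of CT channels to RGB are assumptions; the paper states only that f is a convolutional layer of the pretrained VGG16 [45].

```python
import torch
import torchvision

# Assumed encoder f: an early convolutional block of pretrained VGG16,
# frozen during training.
vgg_features = torchvision.models.vgg16(pretrained=True).features[:9].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def encode(images):
    """Map (B, 1, H, W) CT slices to VGG16 deep features; channel
    replication to RGB is an assumption of this sketch."""
    return vgg_features(images.repeat(1, 3, 1, 1))

def feature_loss(G_X, G_Y, x, y):
    """L_fea, Eq. (5): L2 penalty matching the features removed in the
    X -> Y translation against the features added in Y -> X."""
    removed = encode(x) - encode(G_Y(x))
    added = encode(G_X(y)) - encode(y)
    return (removed - added).pow(2).mean()
```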

3) FULL OBJECTIVE
The full objective L is defined as

  \mathcal{L}(G_X, G_Y, D_X, D_Y) = \mathcal{L}_{cgan}(G_X, G_Y, D_X, D_Y) + \lambda_{int} \mathcal{L}_{int}(G_Y) + \lambda_{fea} \mathcal{L}_{fea}(G_X, G_Y),  (6)

where λ_fea and λ_int are weights that control the importance of each loss function. The artifact reduction model G_Y^* is obtained by solving

  G_Y^* = \arg\min_{G_X, G_Y} \max_{D_X, D_Y} \mathcal{L}(G_X, G_Y, D_X, D_Y).  (7)

To implement the objectives, U-net [44] is employed as the generator and VGG16 [45] provides the discriminator. The training volumes were randomly selected from the clinical CT database and applied to the 2D U-net and VGG16 networks as spatially continuous N-channel images. These were applied to the developed framework at each epoch to adversarially train (G_X, D_X) and (G_Y, D_Y).
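Building on the previous sketches, the full objective (6) from the generators' perspective can be assembled as follows; the weights are the values reported in Section III.

```python
LAMBDA, LAMBDA_INT, LAMBDA_FEA = 10.0, 25.0, 1.0  # values used in Section III

def generator_objective(G_X, G_Y, D_X, D_Y, x, y):
    """Eq. (6) as minimized by the generators; the discriminators are
    updated adversarially on the corresponding loss_D terms."""
    _, adv_Y = adversarial_losses(D_Y, G_Y, x, y)  # X -> Y direction
    _, adv_X = adversarial_losses(D_X, G_X, y, x)  # Y -> X direction
    return (adv_Y + adv_X
            + LAMBDA * cycle_consistency_loss(G_X, G_Y, x, y)
            + LAMBDA_INT * intensity_loss(G_Y, x)
            + LAMBDA_FEA * feature_loss(G_X, G_Y, x, y))
```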
D. VOLUME-TO-VOLUME TRANSLATION
Given a patient's whole CT volume with metal artifacts, the trained artifact reduction model G_Y^* translates it to the target domain Y. In a preliminary study, we found that relatively moderate or weak artifacts could be effectively reduced, but the correction of strong artifacts with a wide range of missing pixels was case-dependent. To obtain better image quality, we introduce an improved translation model that considers the geometric properties of the metal artifacts, as illustrated in Fig. 3.
The translation starts from the top or bottom subvolume and sequentially updates the volume of interest. In the first step, the first subvolume is replaced by the translated output in the original volume of interest. Here, the artifacts contained in the first subvolume are expected to be weak or sparse, as only a few of the N slices are affected by small parts of the dental fillings. In the second step, the next subvolume is translated. This subvolume overlaps with the previous one, and therefore one new slice and N − 1 modified slices with reduced artifacts are used for the next translation. This sequential update process reduces the possibility of subvolumes consisting entirely of low-quality images with strong artifacts, which can occur in a single update process. Better image quality can be achieved from these sequentially modified subvolumes with moderate or reduced artifacts. The properties of this volume-to-volume translation model are quantitatively analyzed in the experiments, and a sketch of the procedure is given below.
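The sequential update can be summarized as follows; the one-slice stride and top-to-bottom order are our reading of Fig. 3.

```python
import numpy as np

def translate_volume(G_Y_star, volume: np.ndarray, N: int) -> np.ndarray:
    """Sequentially translate a whole CT volume of shape (slices, H, W)
    with the trained model G_Y_star, which maps an N-slice subvolume to
    its corrected version. Each window is written back before the next
    step, so every subsequent window reuses N-1 already-corrected slices
    plus one new slice."""
    out = volume.copy()
    for z in range(out.shape[0] - N + 1):
        out[z:z + N] = G_Y_star(out[z:z + N])
    return out
```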

III. EXPERIMENTS
Three experiments were designed to investigate the performance of the proposed methods: a quantitative comparison, 3D property analysis, and clinical evaluation with expert surgeons. The overall framework was implemented using Python 3.6.8, U-net [44], VGG16 [45], and the Adam optimizer. A computer with a graphics processing unit (CPU: Intel Core i7-9900X, Memory: 32 GB, GPU: NVIDIA TITAN RTX) was used throughout the experiments.
For the regularization parameters, values of λ = 10.0, λ_int = 25.0, and λ_fea = 1.0 were used after examining several parameter sets. The purpose of the regularization terms is to fit the mapping functions targeting metal artifacts for image translation and to prevent over-correction of the native anatomical structures. When the regularization parameters take higher values, the effect of the adversarial loss on the full objective function is weaker, and the diversity of image translation is restricted. Optimizing the regularization parameter set involves a problem-specific trade-off, requiring trial and error to determine the final values. A grid search was not considered a realistic approach because 3D adversarial training for CT slice images with 512 × 512 pixels requires 1-2 days. The value of λ for the cycle consistency loss was fixed to 10.0 based on the original CycleGAN settings. As we empirically found that the regularization parameter λ_int has a significant influence on the MAR performance, we first obtained the results using λ_int = 10.0, 25.0, 50.0, 20.0, 30.0, in that order, with λ_fea fixed to 0.0. As λ_int = 25.0 produced the best performance, this was selected as the optimal value. The translation results using λ_fea = 0.0, 1.0, 10.0, 0.5, 5.0, 2.0, in that order, were then obtained, and the optimal value was determined to be λ_fea = 1.0.

A. METAL ARTIFACT DATABASE
For a quantitative evaluation of the MAR performance, paired CT volumes (that is, artifact-free volumes as ground truths and corresponding volumes with metal artifacts) are required. To obtain such paired test data, CT-image artifacts were simulated for each metal-free clinical patient volume. To synthesize complex patterns of metal artifacts generated from multiple dental fillings, we manually created volumetric binary labels by extracting the 3D regions of eight teeth from the metal-free CT volumes. As shown in Fig. 4(a), the first and second teeth were randomly selected from the back teeth. The third and fourth were also extracted from the back teeth, close to the first/second teeth. This situation is often seen in real patient data, where two dental fillings adjacent to each other yield strong artifacts due to photon starvation. The fifth and sixth teeth were selected from the front side teeth, and the other two teeth were randomly chosen from the remaining side teeth. Consequently, eight volumetric metal labels representing 1-8 dental fillings were prepared by combining the selected eight teeth in order.
The metal artifacts were simulated based on the same procedure and parameters used in [19], where metal-inserted volumes are reconstructed using filtered back projection from simulated sinograms. The main differences lie in the volumetric datasets and the low-quality images simulated from the multiple dental fillings often found in clinical images, especially in elderly patients. Fig. 4(c) shows examples of the metal artifacts generated from the volumetric labels with eight virtual dental fillings. Image slices A-F correspond to the 3D region indicated in Fig. 4(b). The appearance of the metal artifact changes continuously slice-by-slice, and missing pixels or dark bands are generated according to the density of the metal regions. The volumetric distribution of the synthesized artifacts is visualized in Fig. 4(d). In this study, a total of 96 volumes containing different patterns of simulated artifacts were created from 12 metal-free CT volumes. The aim of the experiments was to investigate the 3D GAN-based MAR results with respect to 3D anatomical structures and complex metal artifacts.
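For orientation, a rough single-slice sketch of sinogram-domain artifact simulation in the spirit of [19] is given below. The actual procedure in [19] models the physics of beam hardening and photon starvation; the sinogram corruption used here (saturating the metal trace) is only an illustrative stand-in.

```python
import numpy as np
from skimage.transform import radon, iradon

def simulate_metal_artifact(slice_img, metal_mask, metal_value=3.0):
    """Insert virtual dental fillings into a 2D slice, corrupt the metal
    trace in the simulated sinogram, and reconstruct with filtered back
    projection. slice_img: 2D float array; metal_mask: boolean 2D array."""
    theta = np.linspace(0.0, 180.0, 180, endpoint=False)
    img = slice_img.copy()
    img[metal_mask] = metal_value  # high-attenuation metal insert
    sino = radon(img, theta=theta)
    trace = radon(metal_mask.astype(float), theta=theta) > 0
    # Crude proxy for photon starvation along the metal trace.
    sino[trace] = np.minimum(sino[trace], np.percentile(sino, 99))
    return iradon(sino, theta=theta, filter_name='ramp')
```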

B. QUANTITATIVE EVALUATION
The first experiment was designed to enable quantitative and qualitative comparisons between the image quality produced by the proposed methods and that given by existing methods. The artifact-free patient volumes were used as references, and the paired metal artifact database with 1-8 virtual dental fillings was used as the source of original volumes for this experiment. The root mean square error (RMSE) and the structural similarity (SSIM) index [46] were calculated as quantitative error metrics between the reference and the MAR results. SSIM is a good error metric for evaluating the recovery of anatomical structures and the remaining strength of artifacts [19], [40]. The SSIM index ranges between 0 and 1, with higher values indicating better image quality. We refer to the proposed methods as 3DGAN, with a suffix indicating the number of image slices used. For instance, 3DGAN5 means that a local volume with five images was used for volume-to-volume translation, and 3DGAN1 denotes image-to-image translation using the regularized objective proposed in this paper.
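The two metrics can be computed as follows; slice-wise SSIM via scikit-image is our implementation choice, with data_range reflecting the normalized [−1, 1] interval.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(reference: np.ndarray, corrected: np.ndarray) -> float:
    """Root mean square error over all voxels."""
    return float(np.sqrt(np.mean((reference - corrected) ** 2)))

def mean_ssim(reference: np.ndarray, corrected: np.ndarray) -> float:
    """Mean slice-wise SSIM for volumes of shape (slices, H, W); the
    normalized CT interval [-1, 1] gives data_range = 2.0."""
    return float(np.mean([
        structural_similarity(r, c, data_range=2.0)
        for r, c in zip(reference, corrected)
    ]))
```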

1) BASELINES
Liao et al. [40] recently reported extensive evaluation results from a quantitative comparison between their unsupervised MAR (called the artifact disentangle network, ADN) and existing supervised/unsupervised methods using simulated metal artifact images. As ADN produced comparable performance to the supervised methods [19], [35], we compared the proposed 3DGAN with the following existing CycleGAN-based methods, including ADN, in the context of an unsupervised learning framework. The main difference between the proposed 3DGANs and the three existing methods is the regularization term in the loss function design.
CGAN [24]: An unsupervised image-to-image translation method using cycle consistency loss for adversarial training.
CGAN+ID [23], [29], [32]: This model uses an identity loss that regularizes the generator to be an identity map when images of the target domain are provided as input.
ADN [40]: Similar to the CGAN+ID model, but with an artifact consistency loss in the image space to constrain the artifact difference.

2) COMPARISON AGAINST BASELINES
Fig. 5 shows examples of the corrected results, together with subtraction images that visualize the difference between the corrected results and the reference. Although the artifacts have been corrected in the results of CGAN and CGAN+ID, the subtraction images show that the edges of the soft tissues and mandibular structures were also modified. This could yield inadequate deformation of the anatomical shape. ADN and 3DGAN1 preserve the CT values of soft tissues; however, some teeth and mandibular structures were wrongly corrected, and residual artifacts remain in Fig. 5(a). 3DGAN5 achieved better image quality against different artifact patterns while preserving anatomical structures.

Table 1 lists the median values of RMSE and SSIM for the original and corrected images with respect to the reference images. The results for artifacts generated from different numbers of metals (m = 1, 4, or 7) are listed. 3DGAN1 achieved slightly superior performance over the baselines, and 3DGAN5 outperformed the other methods. There is a relatively large difference between the values given by 3DGAN5 and 3DGAN1, which may imply that volume-to-volume translation is robust to various patterns of the real artifacts. To further analyze the performance, the relationship between the SSIMs of the original images and the corrected results was investigated by introducing an error metric, the improvement rate of image quality R_s, defined as

  R_s = (SSIM_cor − SSIM_org) / (1 − SSIM_org),  (8)

where SSIM_org and SSIM_cor denote the SSIM values of the original and corrected volumes with respect to the reference. This index takes higher values when stronger artifacts are corrected and the reference CT values are adequately recovered. Additionally, we applied the loss functions of the baselines to volume-to-volume translation and compared the resulting performance with that of 3DGAN5. The baselines were originally developed for 2D images; their 3D extension is worth investigating to clarify the loss function design. Figs. 6(a) and 6(b) plot the RMSE and SSIM values of the original images and the corrected results, respectively. Fig. 6(c) shows the R_s values of the four image-to-image translation models with respect to the number of metals m. The performance of 3DGAN1 is slightly better when the CT volume has fewer than five metals. Figs. 6(d) and 6(e) plot the RMSE and SSIM values of the volume-to-volume translation models, and Fig. 6(f) shows their R_s values. The difference in performance between 3DGAN and the baselines becomes significantly larger. Specifically, 3DGAN5 achieved better image quality than its 2D counterpart, whereas the performance of the baselines became worse in the 3D setting. These results suggest that the loss functions of the baselines wrongly correct soft tissues and tooth structures, and that appropriate regularization is required for volume-to-volume translation with higher-dimensional inputs. The regularization terms of the proposed method contribute to proper image correction that targets metal artifacts while preserving anatomical structures.
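Given the definition above, R_s reduces to a one-line computation; this sketch follows our reconstruction of (8).

```python
def improvement_rate(ssim_org: float, ssim_cor: float) -> float:
    """R_s: normalized SSIM improvement. 1.0 means the corrected volume
    fully recovers the reference quality; 0.0 means no improvement."""
    return (ssim_cor - ssim_org) / (1.0 - ssim_org)
```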

C. 3D PROPERTY ANALYSIS
Subsequent experiments were designed to confirm the characteristics of volume-to-volume translation based on 3D adversarial training. Limitations on loading the multi-channel images into the GPU memory meant that we investigated the cases of N = 1, 3, 5, 7, 9, 11, 13 for the local input volume of 3DGAN. Fig. 7(a) shows box plots of the SSIMs obtained by different 3DGAN settings (number of slices and update model). Fig. 7(b) summarizes the improvement in image quality with respect to the number of metals m. The box plots show that the proposed sequential update model outperforms the single update model in volume-to-volume translation. The mean value of the SSIM increased for N ≤ 9, and 3DGAN9 achieved the best SSIM and improvement in image quality. 3DGAN11 and 3DGAN13 produced worse performance, specifically in the cases with few metals. Fig. 7(c) shows two examples of visually different results obtained by 2D and 3D translation for clinical CT volumes. As 3DGAN1 learns image-to-image translation for the CT image slices, it tends to mistranslate artifact regions and generate unnatural correction results around the teeth, tongue, and mandible, as illustrated by the yellow arrows. In contrast, 3DGAN9 provides satisfactory correction results (green arrows) while reflecting 3D anatomical structures, even for voxels that are difficult to recover. We also applied the trained model to a clinical CT volume containing orthodontic appliances. The results show that 3DGAN removed most parts of the appliances and associated metal artifacts while preserving 3D teeth structures. This example confirms the robustness and 3D property of the proposed 3DGAN, as such cases were not included in the training datasets.

2) EVALUATION BY EXPERT SURGEONS
We considered the clinical use of MAR for surgical planning, specifically in mandibular reconstructive surgery [42], [47], and compared the image quality of two MAR approaches: manual correction by a dental technician (with over 20 years of experience), as in the current clinical protocol, and 3DGAN-based MAR. Manual correction is widely employed for strong artifacts with missing pixels, which cannot be recovered by the conventional MAR functions implemented in commercial CT devices. (For instance, radiation oncologists also manually correct images for radiation dose planning in radiotherapy applications.) The results of the manual and 3DGAN-based MAR were displayed in random order. Each participant checked the volume-rendered image sets of the front, side, and bottom views obtained from the two results and compared their image quality. To evaluate the clinical availability of the results, we defined three evaluation criteria: quality of artifact reduction (QOAR), structural accuracy (SA), and duration of the overall image correction process. QOAR scores the degree of metal artifacts for clinical use, i.e., whether metal artifacts were adequately corrected. SA scores anatomical correctness, i.e., whether the structure of the mandible and the teeth were accurately represented. Both metrics were assigned one of four grades: Excellent: 4, Good: 3, Fair: 2, and Poor: 1. Good was defined as a level with clinically sufficient quality for preoperative planning, and Fair was defined as having a few problems, but with acceptable quality.

Fig. 9 shows volume-rendered images of the 3DGAN-based MAR results and those manually corrected by the dental technician. Table 2 summarizes the scores obtained from the three surgeons and the duration required for the overall process. The proposed MAR scored higher than the manual correction results in terms of both QOAR and SA for all data, with an average score of 3.5. Specifically, all participants considered the MA1 and MA4 results containing moderate artifacts to be excellent. Although MA2 and MA3 included strong artifacts, such as multiple dark bands and scattering noise, the scores show that 3DGAN achieves clinically sufficient image quality and outperforms manual correction by the dental technician. The comments from the surgeons suggest that the metal artifacts were visually corrected in both sets of images; however, 3DGAN generates visually plausible corrections of the 3D teeth structures, whereas manual correction produces an irregular appearance of the teeth surfaces. The average duration of image correction required for the two approaches was 11.9 s for 3DGAN and 33 min for manual correction. These results show that the developed framework rapidly provides corrected results for real CT volumes with strong artifacts and contributes to improved productivity in preoperative or radiotherapy planning.

IV. DISCUSSION
Recent studies have designed 3D adversarial nets for metal artifact reduction in CT images of the ear. Wang et al. [37], [38] performed a large validation study that showed improved MAR performance and segmentation results. To the best of our knowledge, this study is the first to build 3D adversarial nets with a regularized loss function for the reduction of metal artifacts derived from multiple dental fillings. The experimental results have shown that 3D adversarial training on unpaired patient CT datasets and volume-to-volume translation can achieve clinically acceptable MAR for clinical images with strong artifacts. To clarify the focus of this research, we have concentrated on quantitative comparison with unsupervised learning methods. For a quantitative comparison between supervised MAR using convolutional neural networks (CNNs) and state-of-the-art MAR methods [4], [11], refer to [17] and [19]. Reference [40] compares CycleGAN-based MAR and existing supervised/unsupervised methods.
To date, most artifact reduction approaches, including deep learning studies, have considered artifact reduction in images with few metal objects [19], [40]. Supervised or unsupervised learning based on synthesized images requires elaborate simulation of complex and abundant variations of artifact patterns, and the difference from real artifacts becomes problematic. There are no well-trained CNN models for the strong, complex artifacts that often appear in clinical CT volumes generated from multiple (for instance, more than four) dental fillings. The experiments using the metal artifact database showed that the proposed 3DGAN directly learned from unpaired patient images has robust artifact reduction ability, even for simulated artifacts not included in the training data.
The results of the 3D property analysis showed that 3DGAN achieves better image quality than 2D translation; however, the performance was slightly worse when using 11 or 13 image slices. This may be because the increased number of slices reduces the number of volumes available for adversarial training, or because the dimension of the input volumes might be greater than required to learn 3D anatomical features, resulting in overfitting. In some cases where complex artifacts with dark bands and bright scattering effects are observed near the anterior teeth, the native teeth are affected by overcorrection of the artifact (see the lower-right of Fig. 7(c)). As correct translation is achieved for such complex artifacts around the back teeth, we believe this overcorrection derives from insufficient training on the artifact variation around the anterior teeth, and could be mitigated by increasing the amount of training data.
To further improve the proposed method, the range of Hounsfield units could be optimized for the target regions. The exploration of adversarial training designs targeting specific artifacts and anatomical structures, as well as self-supervised learning for MAR, are interesting topics for future studies. In addition to medical CT images, there are many fields that require improved three-dimensional image quality. We believe that the developed volume-to-volume translation framework and experimental procedures can be directly applied to other volumetric images, such as industrial CT images [48], microscopic images [49], and hyperspectral images [50].

V. CONCLUSION
This paper introduced MAR methods based on unsupervised volume-to-volume translation learned from clinical CT images. The results of experiments using CT volumes from 361 real patients demonstrated that the proposed 3DGAN has an outstanding capacity to reduce strong artifacts. RMSE and SSIM were improved compared to three existing methods. The regularization terms and volume-to-volume translation design contribute to proper image correction that recovers underlying missing voxels while preserving the 3D features of soft tissues and tooth structures. In the clinical evaluation, the proposed MAR scored higher than the manual correction results in terms of two characteristic error metrics: structural accuracy and the quality of artifact reduction.
Unsupervised learning directly from unpaired clinical images has potential applications to various artifact patterns that are difficult to handle using filter-based or prior knowledge-based MAR approaches. Future challenges include improving the adversarial training and prediction framework and the application to other clinical fields, specifically radiotherapy planning.