Deep Learning Cross-Phase Style Transfer for Motion Artifact Correction in Coronary Computed Tomography Angiography

Motion artifacts may occur in coronary computed tomography angiography (CCTA) due to the heartbeat and impede the clinician’s diagnosis of coronary arterial diseases. Thus, motion artifact correction of the coronary artery is required to quantify the risk of disease more accurately. We present a novel method based on deep learning for motion artifact correction in CCTA. Because the image of the coronary artery without motion (the ground-truth data required in supervised deep learning) is medically unattainable, we apply a style transfer method to 2D image patches cropped from full-phase 4D computed tomography (CT) to synthesize these images. We then train a convolutional neural network (CNN) for motion artifact correction using this synthetic ground-truth (SynGT). During testing, the output motion-corrected 2D image patches of the trained network are reinserted into the 3D CT volume with volumetric interpolation. The proposed method is evaluated using both phantom and clinical data. A phantom study demonstrates comparable results to other methods in quantitative performance and outperforms those methods in computation time. For clinical data, a quantitative analysis based on metric measurements is presented that confirms the correction of motion artifacts. Moreover, an observer study finds that by applying the proposed method, motion artifacts are markedly reduced, and boundaries of the coronary artery are much sharper, with a strong inter-observer agreement ( $\kappa = 0.78$ ). Finally, evaluations using commercial software on the original and resulting CT volumes of the proposed method reveal a considerable increase in tracked coronary artery length.

FIGURE 1. (a) CT images of the phantom data that imitate the coronary artery. The degree of motion artifacts varies from minimal, when the heart motion is slow, to severe, when it is fast. Ground-truth data, with absolutely no motion, can be obtained when the phantom is completely stopped. (b) CT images of the coronary artery in clinical data. Ground-truth data cannot be obtained because the heart cannot be stopped.
To the best of our knowledge, previous image processing methods for CCTA motion compensation are based on motion estimation through image registration or minimizing a motion artifact metric. Methods based on 3D-3D nonrigid image registration [5]- [7] have demonstrated excellent motion compensation results. However, they require multiple phases of CT images, and it is possible that motion estimation is erroneous in the presence of motion artifacts, which in turn would lead to the degradation of motion compensation. Rohkohl et al. [8] presented the method of improving best-phase image quality using a single 3D reconstructed image; its motion estimation is based on minimizing motion artifact metrics. This method has produced superior results in more quiescent cardiac phases; in contrast, in more rapid cardiac phases, 3D-3D nonrigid image registration-based methods produce superior image quality because they use information from neighbor phases. Kim et al. [9] used a partial angle reconstructed (PAR) image, which is reconstructed using a smaller angular range than a short scan, to improve temporal resolution. They estimate motion using PAR images based on 3D-3D nonrigid registration, so they require multiple phases of CT images. Hahn et al. [10] presented a coronary motion-compensation method based on PAR images from short-scan data, that has benefits concerning dose efficiency. That method optimizes an image artifact measuring cost function for motion estimation, as proposed by Rohkohl et al. [8]. Using the motion vector field estimated by various approaches, such as registration and minimizing an artifact metric, the aforementioned methods require motion-compensated reconstruction from raw or projection data.
In this study, we present an image-based coronary motion correction method that does not require projection or back-projection steps and can be applied when raw data are not available. Our approach assumes that a physician can estimate motion artifacts and envision a motion-corrected image more accurately when he or she has more experience. Our hypothesis is that motion correction in CCTA can be learned with a large set of training images. Coincidentally, methodologies based on deep learning have demonstrated revolutionary performance in various domains [11], [12]. Deep learning-based methods effectively address relevant problems, including image superresolution [13], image denoising [14], and image deblurring [15]. These methods adopt a supervised learning approach where it is assumed that accurate ground-truth corresponding to the input data is available.
Critically, it is medically impossible to obtain precisely corresponding coronary CT images devoid of motion artifacts, which are the ground-truth images required in supervised learning. This problem and a comparison using phantom data are visualized in Fig. 1. It is one of many situations in which the ground-truth is unattainable, a frequent occurrence when learning from images, including urinary bladder segmentation [16], 3D pose estimation [17], the social behavior of honeybees [18], eye gaze direction estimation [19], and object detection in indoor scenes [20]. These methods offer effective ways to generate ground-truth images or realistic synthetic images appropriate for their particular problems: utilizing Positron Emission Tomography (PET) acquisitions for automatic urinary bladder segmentation in CT images [16], rendering various configurations of a single 3D model [17]-[19], or rendering a composite 2D image from various partial 2D images [20]. However, these processes are most likely insufficient for the motion correction of coronary arteries in 3D CCTA because a single 3D model cannot represent variations in the 3D structure, nor can a composite image of various CCTA slices be used to generate a 2D slice consistent with a realistic 3D volume.
We propose a novel method that uses style transfer, such as the methods proposed in [21], [22], to generate a synthetic ground-truth (SynGT). We apply style transfer to image patches from 4D CT volumes containing phases with large and small amounts of motion artifacts, as depicted in Fig. 2. Unlike using the patches directly, style transfer can account for the local deformations caused by the heartbeat motion. Our aim is to suppress the effect of genuine appearance change and isolate only the effect of motion artifacts. Using the SynGT, we can subsequently learn to generate images with reduced motion artifacts from the corresponding training input images.
FIGURE 2. Appearance of the motion artifacts of a coronary artery in different phases within a full heartbeat cycle, sampled from a 4D CT. Based on a 5-point Likert scale, as described in Sec. III-B4, the image patches are (a) completely unreadable or have (b) significant motion artifacts, (c) apparent motion artifacts, (d) minor motion artifacts, or (e) no motion artifacts.
The overall process of the proposed method is as follows. First, we compile a dataset of spatially corresponding 2D image patches, extracted from 3D CT volumes at different phases within a 4D CT. Next, we apply style transfer with the patch from the phase with significant motion artifacts as the source, and the patch from the phase with minimal motion as the target, to generate the SynGT. We then train the motion artifact compensation network (MAC-net) using the patches with significant motion and the SynGT. At the test stage, we assume a 3D CT volume with motion artifacts is given and that the centerline of the coronary artery has been annotated. We generate the 2D cross-sectional image patches along the centerline and feed them into the trained motion artifact correction network. The output motion-corrected 2D image patches are reinserted into the original 3D CT volume with volumetric interpolation to obtain the final motion-corrected 3D CT volume.
The three main contributions of our work are as follows: First, we propose a method for motion correction using deep learning, in which the ground-truth is synthesized using style transfer between corresponding 2D image patches of the coronary artery extracted at different phases within a 4D CT. Second, we propose a method to perform motion correction on the coronary arteries of the 3D CT volume by reinserting motion-corrected output 2D patches with volumetric interpolation. Third, we provide extensive quantitative and qualitative evaluations to demonstrate the degree of motion correction after applying the proposed method.
This paper is an extension of our previous work in [23], with extended quantitative and qualitative evaluations. We include a comparative analysis with other methods using the phantom dataset, which is publicly available, and the results are presented in Sec. III-A. In Sec. III-B, we evaluate the proposed method quantitatively by measuring motion artifact metrics [24] and image quality metrics for the clinical dataset. An observer study, a straightened curved planar reformation (CPR) example, and the tracking results of the proposed method are presented in Sec. III-B for qualitative analysis.

II. METHODS
Our approach is to solve the problem of motion correction using a deep neural network, such as a convolutional neural network (CNN). Thus, given a new 3D CT volume at the test stage, we want to generate a motion-corrected version of that volume as the output of our trained CNN. Accordingly, the proposed method is designed based on the following decisions: 1) The network input and output are defined in terms of corresponding image patches of the coronary artery with and without motion artifacts. 2) 4D CTs are used to achieve corresponding pairs of patches extracted from the same patient at relatively slower and faster heart motions within the heartbeat cycle. It is clinically impossible to achieve images with no motion artifacts because the heart motion cannot be arbitrarily stopped. 3) Style transfer is used to synthesize a patch with the corresponding phase, in which the appearance is preserved, but motion artifacts are reduced. The corresponding patch at the slower motion acts as a reference to guide artifact correction while preserving local appearance at that motion phase. This synthesized patch, rather than the patch at the slower motion, is defined as the ground-truth to learn motion correction. 4) A very deep CNN [13] is used as the deep learning network for motion correction. The deep structure helps to generalize the complex changes between the original and synthesized patches. 5) Motion-corrected patches are reinserted and interpolated into the original 3D CT volume to compensate for the motion artifacts of the entire coronary artery.
Based on these decisions, the overall framework comprises the following subprocesses: 1) extraction of corresponding coronary artery patches from 4D CT, 2) synthesis of morphed source patches using style transfer, 3) training and applying the motion artifact correction network, which we term MAC-net for the patch-wise motion artifact correction, and 4) reinsertion of motion-corrected patches to the 3D CT volume. It is important to note that the style transfer network for generating synthetic motion-corrected patches and MAC-net have different roles. Style transfer is performed between pairs of patches on corresponding points on the coronary arteries in different phases in a 4D CT and alters only the style (local texture), not the content (structure). Our assumption is that motion artifacts are closer to style and thus can be reduced using this process. In contrast, patches with motion artifacts are paired with the generated SynGT from style transfer for supervised learning of the MAC-net for motion compensation. While corresponding patch pairs from 4D CT are required for removing motion artifacts by style transfer, once the MAC-net is trained, any 3D CT patch with motion artifacts can be given as an input to generate a motion-compensated patch at test time. Fig. 3 provides a visual summary of the second and third subprocesses, which are technically the most critical. The subsequent subsections describe these processes and the original 4D CT data in detail.

A. EXTRACTING CORRESPONDING CORONARY PATCHES FROM 4D CT
We used 4D CT images acquired by retrospective gating using a dual-source CT scanner (SOMATOM Definition Flash, Siemens). The raw data were reconstructed at phases from 0% to 90% of the heartbeat cycle (R-R interval) in 10% increments. The images reconstructed at 40% and 70% usually have the highest image quality in terms of motion artifacts because they are around the end-systole and end-diastole, respectively. The images reconstructed at other phases mostly contain considerable motion artifacts of the coronary arteries. Without loss of generality, we henceforth focus only on the middle of the right coronary artery (mid-RCA) as the region of interest, which generally has the most motion. Given the temporally sampled 3D CT volumes, the mid-RCA is manually annotated by an experienced reader in each volume using commercial coronary analysis software (QAngioCT, Medis Medical Imaging Systems, Leiden, the Netherlands). The first right ventricle (RV) branch and the acute marginal branch are defined as the respective start and end points of the mid-RCA.
The mid-RCA centerline $C^{\varphi}$ of a 3D volume at phase $\varphi$ is represented as a discretized set of ordered 3D coordinates, $C^{\varphi} = \{c^{\varphi}_i \mid 0 \le i \le N^{\varphi}_c - 1\}$, where $N^{\varphi}_c$ denotes the total number of points within $C^{\varphi}$. The exact centerline is approximated as a piecewise linear function between the points in $C^{\varphi}$. Thus, the entire length of the mid-RCA centerline is defined as the sum of all distances between subsequent point pairs, denoted as $\ell^{\varphi} = \sum_{i=0}^{N^{\varphi}_c - 2} \| c^{\varphi}_{i+1} - c^{\varphi}_i \|$. To extract the corresponding 2D image patches on $C^{\varphi}$, the corresponding points must first be determined. We assume that the start and end points for all $\varphi$ correspond because they lie on the same anatomical landmarks: the first RV branch and the acute marginal branch. A fixed number $M$ of equidistant points $q^{\varphi}_j$, $0 \le j \le M - 1$, are sampled between the start and end points of $C^{\varphi}$. Because the mid-RCA centerline is approximated as a piecewise linear function, we applied interpolation to compute the exact equidistant point coordinates. Finally, we define the normal direction $n^{\varphi}_j$ of the planar patch centered at each $q^{\varphi}_j$ as the tangential direction of $C^{\varphi}$ at $q^{\varphi}_j$. Fig. 4 visualizes this process of determining the corresponding points and the 3D CT volumes at different temporal phases.
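The equidistant sampling described above can be sketched as follows. This is a minimal illustration rather than our exact implementation: the function name `resample_centerline` is hypothetical, the centerline is assumed to contain no repeated points, and the tangent at each sample is taken from the linear segment that contains it.

```python
import numpy as np

def resample_centerline(points, m):
    """Sample m equidistant points along a piecewise-linear centerline.

    points: (N, 3) ordered 3D coordinates without repeated points (assumption).
    Returns the (m, 3) sampled points and the (m, 3) unit tangents at those points.
    """
    points = np.asarray(points, dtype=float)
    seg = np.diff(points, axis=0)                       # segment vectors
    seg_len = np.linalg.norm(seg, axis=1)               # segment lengths
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])   # cumulative arc length
    targets = np.linspace(0.0, arc[-1], m)              # equidistant arc-length positions
    # Interpolate each coordinate linearly as a function of arc length
    q = np.stack([np.interp(targets, arc, points[:, d]) for d in range(3)], axis=1)
    # Tangent of the segment containing each sample, later used as the patch normal
    idx = np.clip(np.searchsorted(arc, targets, side="right") - 1, 0, len(seg) - 1)
    n = seg[idx] / seg_len[idx, None]
    return q, n

# Unevenly spaced input points on a straight line are resampled evenly
q, n = resample_centerline([[0, 0, 0], [1, 0, 0], [4, 0, 0]], 5)
```

Because the start and end points are kept fixed, the first and last samples coincide with the anatomical landmarks, so the sample with index j can be matched across phases.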
The corresponding patches $P^{\varphi} = \{P^{\varphi}_j \mid 0 \le j \le M - 1\}$ are extracted by sampling the voxel intensities on an $R \times R$ discrete grid centered at $q^{\varphi}_j$ with normal $n^{\varphi}_j$ within the corresponding 3D CT volume. To align the spatial distribution of the grid points physically, we constructed a 2D grid (on the xy-plane as a reference) with 3D coordinates considering the physical dimensions of the CT, and applied translation based on the center point and rotation based on the normal direction, to obtain the projected grid coordinates. Because these coordinates are not integers, bilinear interpolation is applied when assigning intensity values to each pixel in the extracted patch.
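A minimal sketch of the patch extraction is given below, under several assumptions that are not part of the description above: the function names are hypothetical, coordinates are expressed in voxel units with an isotropic `spacing`, the in-plane basis is built directly from the normal (equivalent to rotating an xy-plane grid up to an in-plane rotation), and intensities are interpolated linearly along all three axes of the volume.

```python
import numpy as np

def plane_grid(center, normal, r, spacing=1.0):
    """Return (r, r, 3) coordinates of an r x r grid centered at `center` and orthogonal to `normal`."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Choose a helper axis that is not parallel to n, then build an orthonormal in-plane basis (u, v)
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u = u / np.linalg.norm(u)
    v = np.cross(n, u)
    offsets = (np.arange(r) - (r - 1) / 2.0) * spacing
    a, b = np.meshgrid(offsets, offsets, indexing="ij")
    return np.asarray(center, dtype=float) + a[..., None] * u + b[..., None] * v

def sample_linear(volume, coords):
    """Linearly interpolate `volume` (indexed [x, y, z]) at non-integer voxel coordinates."""
    coords = np.asarray(coords, dtype=float)
    upper = np.array(volume.shape) - 1
    coords = np.clip(coords, 0, upper)                       # stay inside the volume
    i0 = np.minimum(np.floor(coords).astype(int), upper - 1)  # lower corner of each cell
    f = coords - i0                                          # fractional position in the cell
    out = np.zeros(coords.shape[:-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, f[..., 0], 1 - f[..., 0])
                     * np.where(dy, f[..., 1], 1 - f[..., 1])
                     * np.where(dz, f[..., 2], 1 - f[..., 2]))
                out += w * volume[i0[..., 0] + dx, i0[..., 1] + dy, i0[..., 2] + dz]
    return out

# Toy volume whose intensity is a linear function of position
x, y, z = np.meshgrid(np.arange(11), np.arange(11), np.arange(11), indexing="ij")
toy_volume = (x + 2 * y + 3 * z).astype(float)
grid = plane_grid(center=(5, 5, 5), normal=(0, 0, 1), r=3)
patch = sample_linear(toy_volume, grid)
```

In this sketch, the `grid` array plays the role of the projected grid coordinates, which are needed again when the corrected patches are reinserted into the volume.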

B. GENERATING SYNTHETIC MOTION-CORRECTED PATCHES USING CROSS-PHASE STYLE TRANSFER
The corresponding patches extracted from coronary arteries within different phases of a 4D CT do not differ only in the severity of the motion artifacts. The motion during the heartbeat also causes differences in their local appearance. We want to obtain the corresponding patch with identical local appearance but without motion artifacts, because we would like to train a CNN to remove only the artifacts. Because this is clinically unattainable, we aim to synthesize this same-phase-no-artifact patch, $\tilde{P}^{\varphi}_j$, by applying style transfer to the source patch $P^{\varphi}_j$ with a different-phase-no-artifact patch $P^{\varphi^*}_j$ as the target patch. We refer to this process as cross-phase style transfer, where $\varphi^*$ denotes the phase within the heartbeat when the motion is the slowest, resulting in the minimum amount of motion artifacts.
Style transfer is the process of converting only the style, not the contents, of a source image to the style of the target image. The aspects of the style include texture, color, and contrast, both local and global. Content, in contrast, generally refers to the outlines, textures, and colors required to recognize the scene, including the specific objects or persons. In our framework, we assume motion artifacts as part of the style but local appearance as content.
In the proposed framework, we applied a recent method for style transfer using deep neural networks [21], often referred to as the neural style transfer method. The central concept is as follows. First, a CNN (the Visual Geometry Group (VGG) network [25] pre-trained on the ImageNet database [26]) is used to compute local image features, which are subsequently defined as the numerical representation of the content. If we denote the tensors of the CNN features at layer $l$ as $F^l_x$ and $F^l_c$ for the synthesized image $I_x$ and content reference image $I_c$, respectively, the loss function for the content is defined as $$\mathcal{L}_{content} = \sum_l \left\| F^l_x - F^l_c \right\|^2_2.$$ Next, the numerical representation of the style is defined using the Gram matrix $G^l$, where each element is the inner product between different CNN features at layer $l$, as $$G^l_{ij} = \langle F^l_i, F^l_j \rangle,$$ where $G^l_{ij}$ denotes the element at row $i$, column $j$ of $G^l$, and $F^l_i$ and $F^l_j$ denote the $i$-th and $j$-th features, corresponding to the $i$-th and $j$-th convolutional kernels, respectively, at layer $l$. The loss function for style is subsequently defined as $$\mathcal{L}_{style} = \sum_l \left\| \frac{G^l_x}{N^l_x} - \frac{G^l_s}{N^l_s} \right\|^2_F,$$ where $G^l_x$ and $G^l_s$ are the Gram matrices, and $N^l_x$ and $N^l_s$ are the numbers of features at layer $l$, for $I_x$ and the style-reference image $I_s$, respectively.
Finally, $I_x$ is determined by using gradient descent to minimize the balanced loss, defined as $$\mathcal{L} = \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{style}, \qquad (4)$$ where $\alpha$ and $\beta$ are coefficients that balance the content and style loss terms. The process of optimizing the loss in Eq. (4) does not involve training the network, which is fixed. Instead, a modified version $I_x$ of the input image is generated. For further details, we refer the reader to [21].
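The content, style, and balanced losses can be illustrated with a small NumPy sketch. It operates on feature tensors that are assumed to have been extracted already (the VGG forward pass and the gradient descent on $I_x$ are omitted), the function names are hypothetical, and normalizing each Gram matrix by its number of features is an assumed convention; other implementations scale the terms differently.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (C, H, W) feature tensor: inner products between flattened channels."""
    flat = features.reshape(features.shape[0], -1)
    return flat @ flat.T

def content_loss(f_x, f_c):
    """Squared difference between the features of the synthesized and content images."""
    return float(np.sum((f_x - f_c) ** 2))

def style_loss(f_x, f_s):
    """Squared difference between Gram matrices, each normalized by its feature count (assumption)."""
    return float(np.sum((gram(f_x) / f_x.size - gram(f_s) / f_s.size) ** 2))

def balanced_loss(feats_x, feats_c, feats_s, alpha=1.0, beta=1.0):
    """alpha * content + beta * style, summed over the supplied layers."""
    content = sum(content_loss(x, c) for x, c in zip(feats_x, feats_c))
    style = sum(style_loss(x, s) for x, s in zip(feats_x, feats_s))
    return alpha * content + beta * style

# Toy check: flipping a feature map spatially changes the content but not the Gram matrix
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))
flipped = feat[:, :, ::-1]
```

The toy check reflects why the Gram matrix is treated as style: it discards the spatial arrangement of the features, whereas the content term compares features position by position.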
From the review above, $\tilde{P}^{\varphi}_j$, $P^{\varphi}_j$, and $P^{\varphi^*}_j$ correspond to $I_x$, $I_c$, and $I_s$, respectively. Whereas the phase $\varphi^*$ with the minimum amount of motion is determined manually, the patches from all other phases $\varphi$ can be assigned as the source, i.e., the reference patch for content $P^{\varphi}_j$.

C. TRAINING AND APPLYING THE MOTION ARTIFACT CORRECTION NETWORK
We adapted the Very-Deep network for Super-Resolution (VDSR) [13] to our problem of motion artifact correction. The VDSR is a specific type of CNN, configured with deep layers and gradient clipping suitable for high learning rates in the gradient descent of the training process. The input and output dimensions are set to be identical, and we apply supervised learning with our training dataset so that a motion-corrected version of the input image patch is given as the output. We chose VDSR because 1) our problem is primarily a noise reduction problem, and noise reduction is similar to achieving super-resolution, 2) the input data in VDSR are upscaled beforehand such that the patch sizes of the input and output are the same, which matches the configuration of our case, and 3) it exhibits acceptable performance and fast convergence during training.
The acceptable performance is primarily due to the deep structure of the network, which combines the very deep CNN model of [25] with the residual learning of [27]. Whereas skip connections were added at every other convolutional layer in [27], only a single skip-connection from the input to the output is created in the VDSR network. With this connection, the network learns the difference (residual) between the input and output, which helps to prevent the vanishing gradient problem. Furthermore, a gradient clipping scheme [28], which is often used in training recurrent neural networks, expedites training convergence. We adopt the adjustable gradient clipping scheme of [13], in which the gradients are clipped to $[-\theta/\gamma, \theta/\gamma]$ to boost the convergence, where $\gamma$ denotes the current learning rate and $\theta$ denotes the gradient clipping parameter. With adjustable gradient clipping, VDSR converges quickly despite its 20-layer depth. Also, to keep the sizes of all feature maps the same, we pad zeros before convolutions.
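The adjustable gradient clipping rule can be written in a few lines. The sketch below is framework-independent and uses a hypothetical function name and illustrative values for the clipping parameter and learning rate; it only shows that the clipping bound widens as the learning rate decreases, so that the effective step size stays bounded by the clipping parameter.

```python
import numpy as np

def clip_gradients(grads, theta, lr):
    """Clip every gradient entry to [-theta / lr, theta / lr] (adjustable gradient clipping)."""
    bound = theta / lr
    return [np.clip(g, -bound, bound) for g in grads]

# Illustrative values only: theta = 0.5 and a learning rate of 0.1 give a bound of 5
grads = [np.array([-10.0, 0.5, 7.0])]
clipped = clip_gradients(grads, theta=0.5, lr=0.1)
step = 0.1 * clipped[0]   # plain gradient-descent update, bounded in magnitude by theta
```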
The structure of the MAC-net follows the VDSR network, which comprises 20 convolutional layers and 19 rectified linear unit (ReLU) non-linear activation functions, as depicted in Fig. 5. We used 64 kernels of size 3 × 3 for each convolutional layer. The corresponding cross-phase style-transferred patch $\tilde{P}^{\varphi}_j$ is assigned as the GT output for the input patch $P^{\varphi}_j$. The loss function is defined as the mean squared error between the network output and this GT, $\frac{1}{2}\| \tilde{P}^{\varphi}_j - f(P^{\varphi}_j) \|^2$ averaged over the training set, where $f$ denotes the MAC-net. We can denote the training dataset as $\{(P^{\varphi_k}_{j,l}, \tilde{P}^{\varphi_k}_{j,l})\}$, where $j$ represents the index of the patch within a single coronary artery centerline, $k$ represents the index of the phase within the 4D CT volume, excluding the target phase $\varphi^*$, which requires no motion correction, and $l$ represents the index of the 4D CT volume. Thus, the total number of patches in our training data should be the product $M \times K \times L$ of the $M$ points in $K$ phases in $L$ 4D CT volumes.
FIGURE 5. Architecture of the MAC-net based on the VDSR network [13]. A pair consisting of a convolutional layer and an activation function is cascaded repeatedly. The last convolutional layer outputs a learned residual image. A single skip-connection from the input to the output is applied.
We used the Caffe [29] framework for our implementation. The hyperparameters for training were set as follows: a batch size of 64, a learning rate of 0.0001, 100 epochs, and a weight decay of 0.0001, with the Adam optimizer [30].
During testing, the MAC-net is applied after sampling M equidistant coronary patches based on a manually annotated artery centerline, as described in Sec. II-A. All patches are then fed independently into the trained MAC-net.

D. REINSERTION AND VOLUMETRIC INTERPOLATION OF MOTION-CORRECTED PATCHES INTO 3D CT VOLUME
The 2D patch outputs of the MAC-net are reinserted into the CT volume, as illustrated in Fig. 6, to apply motion correction to the entire 3D volume. Volumetric interpolation must be performed to propagate the motion correction to the 3D volume of the coronary artery and ensure a continuous appearance.
Because the center point q j and the normal n j of each patch are already known through the patch extraction process described in Sec. II-A, output patches P j are first reinserted into the 3D volume by the inverse of the known transform. We can denote the planar grid of 3D coordinates corresponding to the pixel coordinates, obtained from the projection as Q j .
Volumetric interpolation is performed for two adjacent patches $P_j$ and $P_{j+1}$. Within the bounding box enclosing the two reinserted patch coordinate grids $Q_j$ and $Q_{j+1}$, we define $R \times R$ vectors $v^k_{j,j+1} = q^k_j + t(q^k_{j+1} - q^k_j)$ defining 3D lines that pass through the corresponding grid coordinates $q^k_j$ and $q^k_{j+1}$, where $k$, $1 \le k \le R^2$, denotes the index within the reinserted 3D coordinate grid and $t$ is the parameter of the line equation. For each voxel with coordinate $q$ within the bounding box and in the volume between the two patches, we determine the vector $v^{k^*}_{j,j+1}$ among $\{v^k_{j,j+1}, 1 \le k \le R^2\}$ that is closest to the voxel coordinate, i.e., that has the minimum point-to-line distance to $q$. Then, we determine the two 3D coordinates $q^{\dagger}_j$ and $q^{\dagger}_{j+1}$ at which the line $q + t(q^{k^*}_{j+1} - q^{k^*}_j)$ intersects the planes corresponding to the grids $Q_j$ and $Q_{j+1}$. The final value for voxel $q$ is the weighted average $$I(q) = w^q_j P_j(\tilde{q}^{\dagger}_j) + w^q_{j+1} P_{j+1}(\tilde{q}^{\dagger}_{j+1}),$$ where $P_j(\tilde{q}^{\dagger}_j)$ is the intensity obtained from bilinear interpolation on $P_j$ at the 2D non-integer coordinate $\tilde{q}^{\dagger}_j$, and $$w^q_j = \frac{d_{p2p}(q, Q_{j+1})}{d_{p2p}(q, Q_j) + d_{p2p}(q, Q_{j+1})}$$ is the weight defined by the point-to-plane distances $d_{p2p}$ of $q$ to the planes containing $Q_j$ and $Q_{j+1}$. We compute $d_{p2p}$ using the equation $d_{p2p}(q, Q_j) = \frac{|(q - q_j) \cdot n_j|}{|n_j|}$, where $q_j$ and $n_j$ are the grid center coordinate and the normal vector of the plane containing the grid $Q_j$. The definitions are similar for $\tilde{q}^{\dagger}_{j+1}$ and $w^q_{j+1}$.
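The weighting step of the volumetric interpolation can be sketched as follows. The sketch covers only the point-to-plane distances and the weighted average; the search for the closest line and the bilinear lookup inside each patch are assumed to have been done already, so the two patch intensities are passed in directly. The function names are hypothetical.

```python
import numpy as np

def point_to_plane(q, center, normal):
    """Distance of point q to the plane through `center` with normal `normal`."""
    q, center, normal = (np.asarray(a, dtype=float) for a in (q, center, normal))
    return abs(np.dot(q - center, normal)) / np.linalg.norm(normal)

def blend_voxel(q, center_j, normal_j, value_j, center_j1, normal_j1, value_j1):
    """Weighted average of the two patch intensities; the nearer plane gets the larger weight."""
    d_j = point_to_plane(q, center_j, normal_j)
    d_j1 = point_to_plane(q, center_j1, normal_j1)
    w_j = d_j1 / (d_j + d_j1)          # weight of patch j
    w_j1 = d_j / (d_j + d_j1)          # weight of patch j + 1, so that w_j + w_j1 = 1
    return w_j * value_j + w_j1 * value_j1

# A voxel at z = 1 between patches lying in the planes z = 0 and z = 4
value = blend_voxel(q=(0, 0, 1),
                    center_j=(0, 0, 0), normal_j=(0, 0, 1), value_j=100.0,
                    center_j1=(0, 0, 4), normal_j1=(0, 0, 1), value_j1=200.0)
```

In this toy case the voxel is three times closer to patch j than to patch j + 1, so the weights are 0.75 and 0.25 and the blended intensity is 125.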

A. PHANTOM STUDY
We evaluated our method on the CAVAREV platform [31], which provides simulated dynamic projections of the 4D XCAT phantom with contrasted coronary arteries derived from patient data. We used the dataset $D_c$ (cardiac motion only) for the proposed method. Geometry calibration was obtained from a real-world clinical angiographic C-arm system.
To apply the proposed method, we reconstructed 10 C-arm CT volumes on a $256^3$ grid with an isotropic voxel size of 0.5 mm at 10 different target reconstruction heart phases. Similar to the previous studies on the CAVAREV website [32], we used the reconstructed volume at heart phase 0.9, which shows quiescent motion, as the target phase for cross-phase style transfer, as described in Section II-B. Based on the description in Section II-C, the training data comprise patches from M = 25 points in K = 10 − 1 − 2 = 7 phases in L = 1 4D CT volume, i.e., the 10 reconstructed phases minus the target phase and the two phases reserved for testing. A total of 175 pairs of mid-RCA patches were extracted, which were augmented to 1,050 using vertical and horizontal flips and rotations. Because of the lack of data, we sampled the M centerline points as densely as possible and used seven volumes for training and two volumes for testing. The testing data were constructed in the same manner as the training data.
To evaluate our motion correction results, we used the 3D metric $Q_{3D}$ defined by CAVAREV [31]. We evaluated the similarity using the Dice similarity coefficient (DSC), which measures the overlap of two binary images and ranges from zero (no overlap) to one (perfect match). The motion-corrected volume is binarized through thresholding and then compared with the ground-truth, which is the segmentation mask of the coronary artery within the volume reconstructed at the quiescent heart phase. The DSC of the proposed method is the mean of the DSCs of the two test volumes. A comparison of the Dice score with the other methods introduced on the CAVAREV website [32] is presented in Table 1. Because the processing time is also an important consideration for use in urgent medical settings, the computation time is compared for the right coronary artery (RCA). The proposed method requires less than 1 minute: 0.15 seconds for the test step of the MAC-net (Sec. II-C) and approximately 50 seconds for the reinsertion and volumetric interpolation (Sec. II-D). The computation time was measured on an Intel(R) Core(TM) i7-8700 CPU (3.19 GHz) with 16 GB of memory and an NVidia Titan XP GPU (12 GB). Among the comparison methods, [33], [34] are cost minimization-based approaches, and [35]-[37] are 2D-2D or 3D-3D registration-based approaches. In Schwemmer et al. [36], registration times were obtained on two Intel(R) Xeon(R) E5540 CPUs (2.53 GHz) with 16 GB of memory, and since [37] was their follow-up, the hardware specifications were presumably the same or similar. Considering the trade-off between the DSC and computation time, the proposed method is comparable to the other methods in terms of the DSC and outperforms them in terms of computation time. Fig. 7 illustrates examples of the test volumes with minor (φ = 80) and severe (φ = 50) motion artifacts. Although the image quality is not high, motion artifacts are reduced in the results of the proposed method.
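For reference, the binarization and Dice computation can be sketched as below. The function names and the threshold value are hypothetical, and the sketch is not the CAVAREV evaluation code itself.

```python
import numpy as np

def binarize(volume, threshold):
    """Binarize a volume by thresholding (the threshold value is an assumption of this sketch)."""
    return np.asarray(volume) >= threshold

def dice(a, b):
    """Dice similarity coefficient of two binary masks: 2 |A and B| / (|A| + |B|)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0   # two empty masks are treated as a perfect match
    return 2.0 * np.logical_and(a, b).sum() / total

# Toy example: two voxels per mask, one of them shared
corrected = binarize(np.array([0.9, 0.8, 0.1, 0.2]), threshold=0.5)
reference = np.array([True, False, True, False])
score = dice(corrected, reference)
```

Here the intersection contains one voxel and the masks contain two voxels each, so the score is 2 · 1 / 4 = 0.5.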
The proposed method, as a deep learning-based approach, demonstrates promising results quantitatively and qualitatively using the CAVAREV dataset.

B. CLINICAL DATA
1) DATASETS
Based on the description in Section II-C, the training data comprised patches from M = 10 points in K = 10 − 2 = 8 phases in L = 100 4D CT volumes. The number of phases K = 8 was determined by the number of temporal-phase quantizations, 10, minus the 40% and 70% phases designated as the targets for cross-phase style transfer. Several 4D CT volumes had K less than 8, in which phases with extreme motion artifacts were excluded because manual annotation of the coronary artery was impossible. The final dataset comprised a total of 5,868 pairs of mid-RCA patches, which were then augmented to 35,208 using vertical and horizontal flips and rotations. With R = 60, each patch sampled from the 3D volume was constructed to be of size 60 × 60 pixels. Validation and test sets, comprising 2,152 patches from L = 30 4D CT volumes and 734 patches from L = 10 4D CT volumes, were similarly constructed.
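The reported dataset sizes can be checked with simple arithmetic. In the sketch below the function name is hypothetical, and the augmentation factor of 6 is inferred from the reported counts (35,208 / 5,868); the individual flips and rotations that make up this factor are not enumerated here.

```python
def n_patch_pairs(m_points, k_phases, l_volumes):
    """Number of patch pairs if every phase of every volume contributes M patches."""
    return m_points * k_phases * l_volumes

# Clinical data: upper bound M x K x L versus the reported count after excluding
# phases whose coronary artery could not be annotated
clinical_upper_bound = n_patch_pairs(10, 8, 100)
clinical_reported = 5868
augmentation_factor = 35208 // clinical_reported

# Phantom data: M = 25 points, K = 7 phases, L = 1 volume
phantom_pairs = n_patch_pairs(25, 7, 1)
phantom_augmented = phantom_pairs * augmentation_factor
```

The upper bound of 8,000 pairs exceeds the reported 5,868, which is consistent with several 4D CT volumes contributing fewer than eight phases. The same factor of 6 also reproduces the phantom count of 1,050 from 175 pairs.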

2) QUANTITATIVE EVALUATION: MOTION ARTIFACT METRICS
In quantitative evaluation, because the nature of the metric makes it difficult to consider plaque, we cover only normal coronary arteries and exclude diseased coronary arteries. To quantitatively evaluate the proposed method, we first measure the isotropy of the vessel region to quantify the level of motion artifacts for the coronary artery. Because the cross-section shape of arteries is generally round, the segmented vessel area in the image patches should have an isotropic shape if it has not been corrupted by motion artifacts. We measured the isotropy using the ratio of the two eigenvalues (λ 1 , λ 2 ) of the segment shape, as in [38].
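One way to compute such an eigenvalue ratio is sketched below. The sketch assumes that the segment shape is summarized by the covariance matrix of its pixel coordinates and that isotropy is the ratio of the smaller to the larger eigenvalue; the exact definition in [38] may differ (for example, it may use square roots of the eigenvalues), and the function name is hypothetical.

```python
import numpy as np

def isotropy(mask):
    """Ratio of the smaller to the larger eigenvalue of the pixel-coordinate covariance (in [0, 1])."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)   # shape (2, number of pixels)
    cov = np.cov(coords)                        # 2 x 2 covariance of the region shape
    lam = np.linalg.eigvalsh(cov)               # eigenvalues in ascending order
    return lam[0] / lam[1]

# A disc is isotropic; an ellipse with a 2:1 axis ratio is not
yy, xx = np.mgrid[-20:21, -20:21]
disc = xx ** 2 + yy ** 2 <= 15 ** 2
ellipse = (xx / 2.0) ** 2 + yy ** 2 <= 10 ** 2
iso_disc = isotropy(disc)
iso_ellipse = isotropy(ellipse)
```

Under this definition a round cross-section scores close to 1, whereas the 2:1 ellipse scores close to 0.25, because the coordinate variances scale with the squares of the semi-axes.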
We adopt the motion artifact metrics proposed in [24]. We present the results of the three metrics having the highest agreement with three expert readers: the fold overlap ratio (FOR), the low-intensity region score (LIRS), and their product, the motion artifact score (MAS). Like isotropy, the FOR also measures the shape of the vessel region, but it measures the extent of mirror symmetry rather than the overall isotropy. Given the vessel segmentation, an axis $v$ that passes through the region centroid is defined, which subdivides the region in two. Then, one subregion is folded along the axis onto the other subregion, and the ratio of the intersection to the union of the two subregions is defined as $L^{FOR}_v$. In our implementation, we define the two orthogonal eigenvectors $(v_1, v_2)$ of the region shape, corresponding to $(\lambda_1, \lambda_2)$, as two different axes and determine the final FOR as $L^{FOR} = \min(L^{FOR}_{v_1}, L^{FOR}_{v_2})$. The LIRS is defined as the mean of the low-intensity region intensity-score (LIR-IS) and the low-intensity region area-score (LIR-AS). LIR-IS and LIR-AS respectively measure the intensity and the size of the low-intensity shading of the motion artifact that occurs on the myocardium. LIR-IS is defined as the ratio of the intensity values of the shaded area to those of the myocardium, and LIR-AS is defined as the ratio of the area of the shading to the size of the artery. A visual description and the specific mathematical notations and equations are presented in Fig. 8.
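The folding operation behind the FOR and the combination of the scores into the MAS can be sketched as follows. This is an illustration under assumptions rather than the reference implementation of [24]: the function names are hypothetical, mirrored pixel coordinates are rounded back to the pixel grid, and the LIR-IS and LIR-AS values are treated as precomputed inputs.

```python
import numpy as np

def fold_overlap_ratio(mask, axis_dir):
    """Fold the region along an axis through its centroid; return intersection / union of the halves."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    v = np.asarray(axis_dir, dtype=float)
    v = v / np.linalg.norm(v)
    normal = np.array([-v[1], v[0]])          # in-plane normal of the fold axis
    signed = (pts - centroid) @ normal        # signed distance of each pixel to the axis
    side_a, side_b = pts[signed > 0], pts[signed < 0]
    # Mirror side A across the axis; rounding to the pixel grid is a simplification
    mirrored = side_a - 2.0 * ((side_a - centroid) @ normal)[:, None] * normal
    a = {tuple(p) for p in np.rint(mirrored).astype(int)}
    b = {tuple(p) for p in np.rint(side_b).astype(int)}
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def motion_artifact_score(mask, v1, v2, lir_is, lir_as):
    """MAS = FOR * LIRS, with FOR the minimum over both axes and LIRS the mean of LIR-IS and LIR-AS."""
    fold = min(fold_overlap_ratio(mask, v1), fold_overlap_ratio(mask, v2))
    lirs = 0.5 * (lir_is + lir_as)
    return fold * lirs

# A disc is mirror-symmetric; adding a bump on one side breaks the symmetry
yy, xx = np.mgrid[0:41, 0:41]
disc = (xx - 20) ** 2 + (yy - 20) ** 2 <= 15 ** 2
lopsided = disc.copy()
lopsided[36:40, 18:23] = True
for_disc = fold_overlap_ratio(disc, (1, 0))
for_lopsided = fold_overlap_ratio(lopsided, (1, 0))
mas_disc = motion_artifact_score(disc, (1, 0), (0, 1), lir_is=0.8, lir_as=0.6)
```

For the symmetric disc the two halves coincide after folding, so the FOR is 1 and the MAS reduces to the LIRS; the bump lowers the FOR because its mirror image falls outside the opposite half.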
To compute these metrics, the specific vessel region, the regions where the low-intensity shading artifact occurs, and the myocardium region must be determined. To perform this process efficiently, we apply a simple seed-based region growing method to annotate the vessel and low-intensity artifact regions semi-automatically. Not only is region growing more efficient, but it also aids in delineating the segment boundary and maintaining the consistency of the region intensity. In contrast, the regions of the myocardium were manually annotated because the myocardium may have various appearances and shapes and is thus unsuitable for region growing.
For comparative evaluation, the four motion artifact metrics of the MAC-net input and output are presented in Fig. 9 for 100 patches randomly sampled from the test set. Higher values of isotropy, FOR, LIRS, and MAS indicate fewer motion artifacts, and all four increased after correction. The median and interquartile range (IQR) of each metric are presented in Fig. 9.

3) QUANTITATIVE EVALUATION: BACKGROUND PSNR AND SSIM
In the previous subsection, a quantitative evaluation for the coronary artery region is presented, and in this subsection, a quantitative evaluation for the background region is presented. To evaluate the degradation of the background of the 2D patches of the test set, we analyze two well-known image quality metrics: the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index, and the results are presented in Fig. 10. The mean of PSNR is 33.6 dB (standard deviation of 4.25), and the mean of SSIM index is 0.97 (standard deviation of 0.01). Both metrics illustrate that there was minimal to no image quality degradation.
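A background-restricted PSNR can be sketched as below. How the background region is delineated and which intensity range is used as the peak value are assumptions of this sketch (a boolean `mask` and a `data_range` argument, both hypothetical names); SSIM is omitted for brevity.

```python
import numpy as np

def psnr(reference, result, data_range, mask=None):
    """Peak signal-to-noise ratio in dB, optionally restricted to the pixels where mask is True."""
    reference = np.asarray(reference, dtype=float)
    result = np.asarray(result, dtype=float)
    if mask is None:
        mask = np.ones(reference.shape, dtype=bool)
    mse = np.mean((reference[mask] - result[mask]) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

# Toy patches with intensities in [0, 1]
before = np.zeros((8, 8))
uniform_shift = before + 0.1            # every pixel changes by 0.1
vessel_only = before.copy()
vessel_only[3:5, 3:5] = 1.0             # only the central "vessel" pixels change
background = np.ones((8, 8), dtype=bool)
background[3:5, 3:5] = False

psnr_shift = psnr(before, uniform_shift, data_range=1.0)
psnr_background = psnr(before, vessel_only, data_range=1.0, mask=background)
```

A uniform error of 0.1 on a unit intensity range gives 20 dB, whereas a change confined to the vessel leaves the masked background PSNR unbounded, which is the behavior expected of a correction that does not degrade the background.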

4) QUALITATIVE EVALUATION
Two experienced readers evaluated the degree of motion artifacts based on a 5-point Likert scale [39], where 1 = completely unreadable, 2 = significant motion, 3 = apparent motion, 4 = minor motion, and 5 = no motion. We performed a blind test in this inter-observer study: the observers did not know whether the patches were from before or after applying the proposed method. Both original and motion-corrected patches were presented to the observers in completely random order without any delay. Table 2 presents the ratio of frequencies of each category for test patches before and after motion correction using the MAC-net. The proportion of images rated as having completely unreadable, significant, or apparent motion (Likert scores 1, 2, and 3) was 98.5% before correction and decreased to 35% for the results of the MAC-net (p < 0.001). The mean±standard deviation of the Likert score improved significantly from 1.43±0.66 to 3.80±0.87 (p < 0.001). The inter-observer agreement for the motion score was calculated with the kappa (κ) statistic and was strong both before (κ = 0.85; 95% confidence interval (CI) 0.76-0.95) and after (κ = 0.70; 95% CI 0.61-0.81) correction.

VOLUME 8, 2020

[...] [39] (1 = completely unreadable, 2 = significant motion, 3 = apparent motion, 4 = minor motion, 5 = no motion) and source information (case number and phase φ in 4D CT) are presented above. For all patches in the test set, motion artifact metrics are measured quantitatively in Sec. III-B2.

FIGURE 12.
Qualitative examples of the input mid-RCA patch (left) and MAC-net results (right). Samples represent specific cases that distinguish the primary vessel from the branch. The third sample pair also illustrates correction for the artery plaque. Expert-evaluated scores and source information are presented above.
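The inter-observer agreement reported in this subsection can be illustrated with a small sketch of Cohen's kappa for two readers. Using the unweighted form is an assumption: the text does not say whether a weighted or unweighted kappa was computed, and the function name is hypothetical.

```python
import numpy as np


def cohens_kappa(reader_1, reader_2, categories=(1, 2, 3, 4, 5)):
    """Unweighted Cohen's kappa for two raters scoring the same patches."""
    index = {c: i for i, c in enumerate(categories)}
    table = np.zeros((len(categories), len(categories)))
    for a, b in zip(reader_1, reader_2):
        table[index[a], index[b]] += 1           # contingency table
    n = table.sum()
    observed = np.trace(table) / n               # observed agreement
    # Agreement expected by chance from the two marginal distributions.
    expected = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of 1 corresponds to perfect agreement and 0 to agreement no better than chance.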
Sample results of the motion-corrected patches are presented in Figs. 11, 12, and 13. Each pair of images shows a patch before and after applying the MAC-net. Changes in the Likert scores and the source information of each patch are presented above the patches. Fig. 11 provides a comparison of various levels of improvement. Improvements range from extremely positive (completely unreadable [score 1] improved to no motion [score 5]) to moderate (significant motion [score 2] improved to minor motion [score 4]) to incremental (apparent motion [score 3] improved to minor motion [score 4]). In all cases, after applying the MAC-net, the edge of the coronary artery is more visible and motion artifacts are reduced, while distortion of the local appearance is limited. Fig. 12 demonstrates the robustness of the MAC-net for special cases where the coronary artery diverges or contains plaques. The MAC-net performs only a gentle modification of the artery appearance to improve visibility. Samples of the worst cases, in which there was no change in the score, are depicted in Fig. 13. These cases may occur when motion artifacts are too severe or when the intensity of the coronary artery is indistinguishable from that of the right atrium or the right ventricle. There were no cases where the score decreased. Overall, the MAC-net is highly likely to improve the image quality with a very small probability of causing harm. We also present straightened curved planar reformations (CPRs) of case #7 in Fig. 14. These include straightened CPRs of a right coronary artery (RCA) at phase (φ = 60) with motion artifacts and the MAC-net results at the same phase and at the best-phase (φ = 40). The boundary of the RCA is blurred by motion artifacts in Fig. 14(a), whereas in Fig. 14(b), the border is clearer than before.
Furthermore, we present qualitative results of the volumetric motion correction, as described in Sec. II-D, in Fig. 15. From the ten test cases, we collected five example cases in which the automatic tracking method [40] provided by the commercial software QAngioCT failed to track the RCA due to motion artifacts. After applying the proposed method, the RCAs were tracked 62% longer on average. Again, there were no cases where the tracked length was reduced, supporting our claim that even in the worst case, the proposed method is unlikely to cause harm.

C. ABLATION: HYPER-PARAMETERS FOR STYLE TRANSFER
For most of the hyper-parameters of components applied off-the-shelf in the proposed method, such as the pre-trained network used with style transfer or the parameters for training the MAC-net, little or no tuning was required to obtain acceptable results. Thus, we present ablations regarding the hyper-parameters for iterative optimization within the style transfer process, namely the β value and the number of gradient descent iterations for the style loss before termination, in Fig. 16.
We measure the amount of motion correction using the MAS and the amount of change in local appearance using the SSIM [41]. As described in Sec. III-B2, MAS is measured on the regions containing the vessel and motion artifacts, annotated manually, whereas SSIM is measured on the remaining background regions. Although the results are unexpectedly stable for different values of β, the MAS increases and SSIM decreases during the iterations as expected. In all experiments, we used β = 100 and terminated the style transfer after 100 iterations to balance the motion correction and appearance change.
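For reference, the role of β in this ablation can be written in the form commonly used for iterative neural style transfer. This is a sketch under the assumption that the content weight is normalized to 1; the symbols $x$, $c$, and $s$ (the optimized patch, the content patch, and the style patch) are introduced here for illustration and are not taken from the text.

```latex
\mathcal{L}_{total}(x) = \mathcal{L}_{content}(x, c) + \beta \, \mathcal{L}_{style}(x, s)
```

Under this form, a larger β or more gradient descent iterations moves $x$ further toward the style target, which is consistent with the observed increase in MAS and decrease in SSIM over the iterations.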

IV. CONCLUSION
We have proposed a novel framework for motion correction of CCTA. With practicality and efficiency in mind, the proposed framework has the following characteristics: 1) actual learning is performed on 2D patches of the coronary artery, 2) corresponding patches are extracted from 3D volumes from different phases with strong and weak motion in a 4D volume, 3) style transfer [21] is used to generate synthetic motion-corrected ground-truth images, 4) a separate deep CNN is used to learn the motion correction from the synthesized ground-truth data, and 5) during testing, patches are extracted from the manually annotated coronary artery in an input 3D volume, motion-corrected, and reinserted and interpolated back into the volume.
We have performed quantitative and qualitative evaluations that confirm the effectiveness of the proposed method using phantom and clinical datasets. In the phantom study, the proposed method shows acceptable results when considering the DSC and computation time. In the experiments using the clinical dataset, in terms of the MAS (the product of the fold overlap ratio and the low-intensity region score, as proposed in [24]), the proposed method resulted in a 13.0% improvement in the median compared to the uncorrected input. Qualitatively, for the motion-corrected 2D patches, the mean±standard deviation of the 5-point Likert score graded by expert readers improved significantly from 1.43±0.66 (between completely unreadable and significant motion) to 3.80±0.87 (between apparent motion and minor motion). For the motion-corrected 3D volume after reinsertion and interpolation, the commercial software QAngioCT tracked the RCA 62% longer on average. Furthermore, we provide an ablation study concerning the settings of the style transfer method [21]. We found that the results were consistent across significantly varying values of β (the ratio of scale between the style and content losses) and that 100 iterations were sufficient to obtain acceptable results.
In future work, we plan to explore alternative methods to generate ground-truth motion correction images. The recent success of generative adversarial networks (GANs) in synthesizing realistic training data [19] suggests their potential to address the motion correction problem.