Investigation of in Vivo Human Cardiac Diffusion Tensor Imaging Using Unsupervised Dense Encoder-Fusion-Decoder Network

Diffusion tensor imaging (DTI) is currently the only imaging technique that can probe the structure of the in-vivo human myocardium non-invasively and without radiation. However, it is particularly sensitive to motion, especially respiratory motion, which causes serious signal loss in diffusion-weighted (DW) images and prevents accurate measurement of cardiac microstructural properties. To cope with this problem, this paper proposes an unsupervised dense-encoder-fusion-decoder network (DEFD-net) that compensates for signal loss in cardiac DW images, allowing in-vivo myocardium structure to be investigated more accurately. The DEFD-net consists of three modules: a dense encoder, a fusion module and a decoder. The dense encoder and decoder are first trained in an unsupervised manner with DW images acquired at different trigger delays, so as to extract local and global features. A fusion strategy is then designed to fuse the extracted features. Finally, the trained decoder reconstructs the fused DW image from the fused features. To validate the proposed method, it is compared with existing methods (PCATMIP, WIF and U2Fusion) on both simulated and acquired datasets. The experimental results show that the proposed method effectively compensates for motion-induced signal loss in DW images, leading to much better DW image quality than the existing methods. Moreover, the subsequently derived myocardial fiber structure is more regular.


I. INTRODUCTION
According to the latest WHO (World Health Organization) report, more than 11.7 million people die of cardiovascular disease each year, and 80% of cardiovascular deaths are caused by myocardial infarction and stroke [1]. Cardiovascular diseases are closely related to the structure of myocardial fibers. Therefore, investigating the structure of myocardial fibers can potentially make early diagnosis of cardiovascular diseases possible and is of great significance in explaining the causes of these diseases [2], [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Yakoub Bazi.
Diffusion tensor imaging (DTI) is currently the only imaging technique that can non-invasively detect the myocardial structure of the in-vivo human heart without radiation [4]-[7]. It estimates myocardial fiber orientations by measuring the diffusion displacement distribution of water molecules in fibrous tissue from diffusion-weighted (DW) images acquired along different diffusion gradient directions. Diffusion metrics, including fractional anisotropy (FA), mean diffusivity (MD) and helix angle (HA), are often calculated from DTI to describe the myocardium structure [6]-[8].
At present, most research on cardiac DTI focuses on ex-vivo human or small animal hearts [9], [10]. Since DTI infers the microstructural characteristics of tissues by detecting the diffusion motion of water molecules, it is particularly sensitive to motion. Under free breathing in particular, the periodic heart beating and respiratory motion can cause serious signal loss in DW images, which prevents DW images from being acquired correctly and, consequently, the myocardial microstructure from being inferred accurately. Therefore, dealing effectively with the influence of heart beating and free breathing on in-vivo cardiac DTI is of great importance. To remove motion effects, early acquisitions of in-vivo cardiac DW images required patients to hold their breath. This approach is difficult to apply to patients who have difficulty holding their breath or who have attention deficit hyperactivity disorder (ADHD). To solve this problem, researchers began to improve the imaging sequence to obtain high-quality in-vivo cardiac DTI. As early as 1994, Edelman et al. proposed acquiring in-vivo cardiac DTI with a stimulated echo acquisition mode (STEAM) sequence to reduce motion effects. Although this method does not require breath holding, the signal-to-noise ratio (SNR) of the DW images it produces is very low, which may affect the subsequent DTI reconstruction [4]. In 2007, Gamper et al. found that in-vivo cardiac DTI can also be achieved using a single echo planar imaging sequence combined with a flow compensation gradient and a reduced imaging field of view (FOV), with an SNR higher than that of the STEAM sequence [11]. However, narrowing the FOV limits this technique to the in-vivo hearts of small animals. To deal with this issue, researchers attempted to design high-order motion-compensated diffusion gradient waveforms to overcome the signal attenuation caused by cardiac motion. For instance, Christian et al. proposed a spin-echo DTI acquisition strategy based on second-order motion compensation [12]. Compared with existing spin-echo DTI, it reduces the influence of cardiac motion, especially during systole, on the imaging of in-vivo myocardial fiber structure. At the same time, other researchers proposed new acquisition sequences for correcting motion artifacts in in-vivo cardiac DTI, such as the velocity-compensated bipolar gradient spin echo (PGSE) sequence [13] and the dual-gated stimulated echo sequence [14]-[16].
Although the above sequence-based methods can effectively correct motion artifacts, most of these sequences require a longer echo time, which makes it impossible to obtain a high SNR. With the development of image processing methods, post-processing approaches were developed to perform motion correction in in-vivo cardiac DTI. Madore et al. proposed averaging multiple acquisitions to reduce noise and improve the SNR [17]. However, the acquisition time is very long, which easily leads to attenuation of the signal strength. To deal with this issue, Liu et al. proposed a DTI denoising algorithm that applies similarity theory in Riemannian geometric space [18], [19].
Although the denoising method can improve the image quality, it cannot generate satisfactory results if signal loss and noise coexist. To deal with signal loss and noise simultaneously, Pai et al. combined principal component analysis and temporal maximum intensity projection (PCATMIP) to improve the quality of in-vivo cardiac DTI [20]. Inspired by this work, Wei et al. used a method based on wavelet fusion to fuse DW images acquired at multiple time delays (TDs) to obtain a relatively complete in-vivo myocardial fiber structure [21]. They demonstrated that it is feasible to use post-processing methods combined with clinically available imaging sequences to correct motion effects in in-vivo cardiac DTI.
Traditional motion correction methods need to estimate a prior distribution, for example with a Bayesian Markov random field model [22], wavelets [23] or independent component analysis [24]. Recently, with the successful applications of deep learning in computer vision, and because deep learning approaches are more flexible and can learn from large amounts of data without any prior assumption, the use of deep learning for medical image processing has attracted increasing attention. For instance, in diffusion magnetic resonance imaging (dMRI), Golkov et al. proposed a q-DL model to reconstruct diffusion metrics from undersampled DW images in a single step, which allowed them to obtain high-quality diffusion metric maps with reduced acquisition time [25]. Tian [30]. Inspired by the above works, we propose a dense-encoder-fusion-decoder network (DEFD-net) to compensate for motion effects in in-vivo cardiac DTI and to restore high-quality DW images that describe in-vivo cardiac fiber structure more accurately. The main contributions of this work are: (1) an unsupervised deep learning model that compensates for signal loss in in-vivo cardiac DW images; (2) a maximum intensity fusion rule that effectively fuses the feature maps of DW images; (3) an independent feature fusion layer that improves the flexibility of the proposed model.

A. DENSE ENCODER-DECODER NETWORK FOR REMOVING MOTION EFFECTS
To deal with the motion effects on cardiac DW images, we propose a dense encoder-decoder network in this work. The encoder-decoder structure has demonstrated outstanding performance in the field of image processing [33]-[37]. However, existing network structures are not suitable for our task, namely fusing DW image information from multiple TDs to reconstruct more accurate in-vivo cardiac DW images. Therefore, a fusion layer is added in the middle of the encoder-decoder network to fuse the features extracted by the encoder. The detailed structure includes three modules: dense encoder, fusion layer and decoder, as illustrated in Fig. 2, in which the notation m × n above the modules denotes the number of input channels (m) by the number of output channels (n).
VOLUME 8, 2020
The encoder is used to extract deep features of DW images. It is composed of a convolutional layer and a dense block that contains three cascaded convolutional layers, in which the output of each convolutional layer serves as the input of the next one. For each convolutional layer in the encoder, the size of the convolution kernel is 3 × 3 and the stride is 1. Such a design makes it unnecessary to restrict the size of the input image during network training. To make the encoder retain as many detailed features as possible, dense blocks are used in the network. Denoting the feature maps obtained from the i-th convolutional layer as $f_i$, the (i+1)-th layer receives all the feature maps from the previous convolutional layers; that is, the output of the (i+1)-th convolutional layer is

$$f_{i+1} = S([f_1, f_2, \ldots, f_i]),$$

where $S(\cdot)$ represents the activation function and $[f_1, f_2, \ldots, f_i]$ designates feature map concatenation. Such concatenation allows merging local and global features, thus enhancing the expressive power of the extracted features and avoiding the problems of parameter explosion and vanishing gradients.
In the fusion layer, we design a specific fusion strategy based on the characteristics of DW images. As mentioned in the previous section, due to the influence of motion, there is signal loss in the feature maps extracted from the DW images acquired at different TDs. To compensate for the signal loss caused by motion, the following fusion rule is adopted in the fusion layer. Firstly, the feature maps extracted by the encoder from the DW images acquired at two consecutive TDs are compared pixel by pixel, and the smaller pixel value is replaced by the larger one to obtain the fused feature maps. Then, the fused feature maps and the feature maps extracted from the images acquired at the next TD are further fused in the same way. This process is repeated until all the TDs are traversed. The final fused feature maps are used as the input of the decoder.
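The fusion rule described above amounts to a running element-wise maximum over the feature maps of all TDs. The following NumPy sketch illustrates it under stated assumptions: the function name `fuse_max` and the toy 2 × 2 arrays are hypothetical, and real feature maps would come from the trained encoder with additional channel dimensions.

```python
import numpy as np

def fuse_max(feature_maps):
    """Fuse feature maps from several TDs by a sequential element-wise maximum.

    feature_maps: sequence of arrays with identical shape, one per TD
    (hypothetical layout; any shape works as long as all entries match).
    """
    maps = [np.asarray(fm, dtype=float) for fm in feature_maps]
    if not maps:
        raise ValueError("at least one feature map is required")
    fused = maps[0].copy()
    for fm in maps[1:]:
        if fm.shape != fused.shape:
            raise ValueError("all feature maps must share the same shape")
        # Keep the larger value at every position, i.e. replace the
        # smaller pixel value by the larger one.
        fused = np.maximum(fused, fm)
    return fused

# Toy example: zeros stand for positions with motion-induced signal loss.
td1 = np.array([[0.0, 0.8], [0.5, 0.0]])
td2 = np.array([[0.6, 0.1], [0.0, 0.0]])
td3 = np.array([[0.2, 0.3], [0.4, 0.7]])
fused = fuse_max([td1, td2, td3])
```

Because the maximum is associative and commutative, the sequential pairwise procedure gives the same result as a single maximum taken over all TDs at once, so the order in which the TDs are traversed does not matter.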
From the fused feature maps, the decoder attempts to reconstruct the DW images without signal loss or noise. It contains four convolutional layers, each with a convolution kernel of size 3 × 3. In both the encoder and the decoder, the activation function is ReLU. The detailed parameters of the network are given in Table 1.
In the DEFD-net, since there is a fusion layer, the training and testing processes differ slightly from those of traditional encoder-decoder networks, as detailed below.

B. TRAINING AND TESTING
In the training phase, the fusion layer is first discarded and only the encoder and decoder parts are trained, as shown in Step 1 of Fig. 2. The advantage of this operation is that different fusion strategies can be designed according to the characteristics of different datasets, which makes the network more practical and flexible. Besides, since we do not have the ground truth for in-vivo cardiac DW images without signal loss or noise, this operation makes unsupervised learning-based reconstruction possible. The input of the encoder is the DW images acquired at different TDs. The encoder extracts useful features from the input images, and the decoder reconstructs the original DW images from the features.
To achieve a better reconstruction and retain more textural and edge information, the MSE (mean square error) and SSIM (structural similarity index measure) losses are combined as the loss function of the entire network during training. The MSE loss is expressed as

$$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left\| F(Y_i) - X_i \right\|^2,$$

where $n$ is the number of training samples, $F(Y_i)$ represents the image reconstructed from the feature maps $Y_i$ by the decoder network, and $X_i$ represents the input image. The MSE loss is essentially a pixel-by-pixel regression, which is not sensitive to differences in image details and does not conform to human perception of image quality, so it cannot capture the internal characteristics of the human visual system. Therefore, this paper introduces an SSIM loss to promote the similarity between the input and decoded images. The SSIM loss is based on the SSIM index between the decoded and input images:

$$\mathrm{SSIM}(F(Y_i), X_i) = \frac{\left(2\mu_{F(Y_i)}\mu_{X_i} + C_1\right)\left(2\sigma_{F(Y_i)X_i} + C_2\right)}{\left(\mu_{F(Y_i)}^2 + \mu_{X_i}^2 + C_1\right)\left(\sigma_{F(Y_i)}^2 + \sigma_{X_i}^2 + C_2\right)},$$

where $\mu_{F(Y_i)}$ and $\mu_{X_i}$ are the mean values of the output and input images, $\sigma_{F(Y_i)}^2$ and $\sigma_{X_i}^2$ are the corresponding variances, and $\sigma_{F(Y_i)X_i}$ represents the covariance of the output and input images. $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ are two constants for maintaining stability; the defaults are $K_1 = 0.01$, $K_2 = 0.03$ and $L = 1$.

The model in this work was implemented with TensorFlow. During the training phase, 5200 DW images with a size of 90 × 160, acquired from 5 healthy volunteers at 10 TDs, were taken as inputs. The parameters of the model were initialized with the Xavier method [38], and the optimizer was Adam [39] with the following hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$. The mini-batch size was set to 2 and the maximum number of epochs to 700. The initial learning rate was $1 \times 10^{-4}$, decreased by 1/10 every 18 epochs. The model started to converge at 108 epochs.
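As an illustration of how an MSE term and an SSIM term can be combined into one reconstruction loss, the following NumPy sketch may help. It is not the TensorFlow implementation used in this work. It rests on several assumptions that are not stated in the text: the SSIM term is taken as 1 − SSIM, the two terms are added with a hypothetical `weight` of 1, and SSIM is computed from global image statistics rather than from local sliding windows. All function names are illustrative.

```python
import numpy as np

def mse_loss(recon, target):
    """Mean squared error between a reconstructed image and its input."""
    recon = np.asarray(recon, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((recon - target) ** 2))

def ssim_global(x, y, K1=0.01, K2=0.03, L=1.0):
    """SSIM index from global means, variances and covariance.

    Simplification: one window covering the whole image
    (library implementations usually average over local windows).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return float(num / den)

def combined_loss(recon, target, weight=1.0):
    """MSE plus weighted (1 - SSIM); the weighting is an assumption."""
    return mse_loss(recon, target) + weight * (1.0 - ssim_global(recon, target))

# Usage: a perfect reconstruction gives a loss close to 0,
# a noisy reconstruction gives a strictly larger loss.
rng = np.random.default_rng(0)
image = rng.random((90, 160))
noisy = np.clip(image + 0.1 * rng.standard_normal(image.shape), 0.0, 1.0)
loss_perfect = combined_loss(image, image)
loss_noisy = combined_loss(noisy, image)
```

The SSIM term rewards agreement in local structure (means, variances, covariance), which complements the purely pixel-wise MSE term.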
After the network training is completed, the fusion layer is added for testing, as shown in Step 2 of Fig. 2. In the testing phase, 1300 DW images acquired at 10 different TDs from one volunteer were used. The feature maps are generated by the encoder and then fused through the fusion layer. Finally, the fused features are input into the decoder for reconstruction.

C. EVALUATION CRITERIA
To further verify whether the proposed method can effectively compensate for the signal loss caused by heart motion, peak signal-to-noise ratio (PSNR) and SSIM were used to evaluate the performance of the different methods on the simulated DW images. For the acquired DW images, we used several no-reference image quality assessment (NRIQA) criteria, including Brenner [40], Laplacian [41], SMD [42], Energy [42] and Tenengrad [41]. The higher the value of these indicators, the better the image quality. Among them, the Brenner gradient function is the simplest gradient evaluation index; it sums the squared intensity differences between neighboring pixels (two positions apart in its standard form):

$$D(f) = \sum_{y} \sum_{x} \left| f(x+2, y) - f(x, y) \right|^2,$$

where $f(x, y)$ represents the image intensity at pixel $(x, y)$ and $D(f)$ represents the resulting image sharpness. The Tenengrad function uses the Sobel operator to extract the gradient values in the horizontal and vertical directions:

$$D(f) = \sum_{y} \sum_{x} \sqrt{G_x^2(x, y) + G_y^2(x, y)},$$

where $G_x$ and $G_y$ are the image gradients along the x and y directions at pixel $(x, y)$, respectively. The Laplacian gradient is basically the same as the Tenengrad gradient, except that it uses the Laplacian operator instead of the Sobel operator.
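The Brenner, Tenengrad and Laplacian criteria described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: Brenner is taken in its standard form with a two-pixel offset along the column axis, Tenengrad is taken as the sum of Sobel gradient magnitudes without a threshold, and the Laplacian criterion as the sum of absolute responses to a 4-neighbor Laplacian kernel. The helper `_filter3x3` and all function names are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def _filter3x3(f, kernel):
    """Valid-mode 3x3 correlation implemented with array slicing."""
    h, w = f.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * f[i:i + h - 2, j:j + w - 2]
    return out

def brenner(img):
    """Sum of squared differences between pixels two columns apart."""
    f = np.asarray(img, dtype=float)
    return float(np.sum((f[:, 2:] - f[:, :-2]) ** 2))

def tenengrad(img):
    """Sum of Sobel gradient magnitudes (no threshold; an assumption)."""
    f = np.asarray(img, dtype=float)
    gx = _filter3x3(f, SOBEL_X)
    gy = _filter3x3(f, SOBEL_Y)
    return float(np.sum(np.sqrt(gx ** 2 + gy ** 2)))

def laplacian_sharpness(img):
    """Sum of absolute Laplacian responses."""
    f = np.asarray(img, dtype=float)
    return float(np.sum(np.abs(_filter3x3(f, LAPLACIAN))))

# Usage: a box-blurred image should score lower than the sharp original.
rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
blurred = _filter3x3(sharp, np.full((3, 3), 1.0 / 9.0))  # 30 x 30 result
sharp_crop = sharp[1:-1, 1:-1]                           # same 30 x 30 support
```

All three scores grow with the amount of high-frequency content, so they are only meaningful when comparing images of the same size and intensity scale.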
SMD (sum of modulus of gray difference) is usually used to describe image clarity and is defined as

$$D(f) = \sum_{y} \sum_{x} \left( \left| f(x, y) - f(x, y-1) \right| + \left| f(x, y) - f(x+1, y) \right| \right).$$

To evaluate image sharpness in real time, the energy gradient function "Energy" is used:

$$D(f) = \sum_{y} \sum_{x} \left( \left| f(x+1, y) - f(x, y) \right|^2 + \left| f(x, y+1) - f(x, y) \right|^2 \right).$$

From the perspective of DTI, we evaluate the fiber structure, expressed by the helix angle (HA) and transverse angle (TA), and the diffusion metrics, including FA and MD. TA is the angle between the projection of the myocardial fiber on the short axis of the ventricle and the tangent plane of the ventricle, which reflects the direction of the myocardial fiber on the short axis of the ventricular wall (as shown in Fig. 3(a)); it ranges between −90° and 90°. HA is the angle between the projection of the myocardial fiber on the ventricular tangent plane and the short axis of the ventricle, reflecting the inclination of the myocardial fiber (as shown in Fig. 3(b)); it also ranges between −90° and 90°.
FA and MD are well-known indicators that describe the diffusion of water molecules in myocardial fibers and have been widely used in the study of cardiac DTI. The FA is a scalar value between 0 and 1, describing the degree of anisotropy of the diffusion process. MD reflects the overall diffusion level of molecules and the overall situation of diffusion resistance. Both FA and MD are defined in terms of the eigenvalues λ 1 , λ 2 and λ 3 of diffusion tensor.
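For reference, the standard definitions are $\mathrm{MD} = (\lambda_1 + \lambda_2 + \lambda_3)/3$ and $\mathrm{FA} = \sqrt{3/2}\,\sqrt{\sum_k (\lambda_k - \mathrm{MD})^2} / \sqrt{\sum_k \lambda_k^2}$. The following NumPy sketch computes both from eigenvalues. The function name `md_fa` and the example eigenvalues are illustrative assumptions, not values from this study.

```python
import numpy as np

def md_fa(eigvals):
    """Mean diffusivity and fractional anisotropy from tensor eigenvalues.

    eigvals: array whose last axis holds (lambda1, lambda2, lambda3);
    leading axes (e.g. image rows and columns) are preserved.
    """
    lam = np.asarray(eigvals, dtype=float)
    md = lam.mean(axis=-1, keepdims=True)
    num = np.sum((lam - md) ** 2, axis=-1)
    den = np.sum(lam ** 2, axis=-1)
    # Avoid division by zero for an all-zero tensor; FA is then 0.
    safe_den = np.where(den > 0, den, 1.0)
    fa = np.sqrt(1.5 * num / safe_den)
    return md[..., 0], fa

# Usage with illustrative eigenvalues (units of mm^2/s are assumed).
md_iso, fa_iso = md_fa([1.0e-3, 1.0e-3, 1.0e-3])   # isotropic diffusion
md_lin, fa_lin = md_fa([1.0e-3, 0.0, 0.0])         # diffusion along one axis
md_map, fa_map = md_fa(np.array([[1.5e-3, 1.0e-3, 0.8e-3],
                                 [1.2e-3, 1.1e-3, 1.0e-3]]))
```

FA is 0 for perfectly isotropic diffusion and reaches 1 when diffusion occurs along a single direction, which is why it lies between 0 and 1.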

A. DATA DESCRIPTION AND PREPROCESSING
To evaluate the performance of the proposed method, both simulated and acquired datasets were used. The simulated cardiac DW images were obtained with the method proposed by Wang et al. [31], in which 14 slices of CCBM heart (http://www.ccbm.jhu.edu/research/DTMRIDS.php) were selected for modeling and simulation. For each slice, DW images along 12 diffusion gradient directions with 10 time-delays were simulated. In the simulated dataset, we have 1680 images in total. The simulated dataset provides the ground-truth for cardiac fiber orientations, FA and MD, which enables us to evaluate different methods with respect to ground-truth.
As to the acquired dataset, the in-vivo cardiac DW images of five volunteers were acquired using a multiple shifted TDs acquisition strategy on a 1.5 T clinical scanner (MAGNETOM Avanto, Siemens AG) at the Neuro-Cardiology Hospital of Lyon in France. The main acquisition parameters were as follows: TE/TR = 51/100 ms, spatial resolution = 2.6 × 2.6 × 6 mm³, acceleration rate = 2, partial Fourier = 6/8, matrix size = 160 × 90, FOV = 420 × 236 mm², bandwidth = 1302 Hz, directions = 12, and b = 200 s/mm². For each volunteer, 10 slices (slice thickness = 6 mm) were acquired in the short-axis view without any interslice gap. At the first TD (e.g., 850 ms), we acquired one b0 image and DW images along 12 different diffusion gradient directions. By shifting the TD 9 times with an interval of 10 ms, 130 multi-phase images (10 time points × 13 images) were obtained for each slice. Therefore, the acquired dataset contains 6500 images in total. Under free-breathing conditions, the total scan time for each slice is approximately 2 minutes. Fig. 1 shows the b0 image and several DW images of 2 slices acquired from the 5 healthy volunteers (V1-V5) at the first TD. It can be seen that, because of cardiac motion, the SNR of the DW images is extremely low and partial or complete signal loss occurs, as marked by the yellow frames.
In this work, we attempted to compensate for the motion effects by fusing the DW images acquired at 10 TDs. To achieve this, we first registered the DW images acquired at the 10 different TDs using the non-rigid registration method embedded in FSL [32], and then extracted the myocardium as the ROI using a threshold segmentation method. After that, the designed network extracted the useful features for fusion. Finally, based on a fusion rule and the network, the fused DW images with little or no motion effect were derived. The process will be detailed in the next subsection.

B. THE FEATURE FUSION RESULT
To compensate for the signal loss caused by cardiac motion, the features extracted by the encoder at 10 TDs are fused according to the fusion rule designed in Section II. In Fig. 4, the left side shows the feature maps of the same myocardial slice extracted by the dense encoder at different TDs (TD1-TD10), and the right side shows the feature map produced by the fusion layer. From the fusion results for both the simulated (Fig. 4(a)) and acquired (Fig. 4(b)) datasets, we can see that the proposed fusion strategy compensates for most of the lost signal.

C. RESTORED IN-VIVO CARDIAC DW IMAGES
To better evaluate the performance of the DEFD-net, we compare it with PCATMIP, WIF and U2Fusion on both simulated and acquired in-vivo cardiac DTI datasets. The first row in Fig. 5(a) shows the simulated in-vivo DW images at 10 TDs, in which obvious signal loss exists. After correction by the different methods, the signal loss in the DW images is compensated. As shown in the second row of Fig. 5(a), the DW image obtained with the proposed DEFD-net is visually closer to the reference image than those obtained with the other methods. Such better performance can also be observed for the acquired dataset (Fig. 5(b)); the DW image fused by DEFD-net is visually better than the DW images produced by the other methods.
To quantitatively evaluate the performance of the different methods, SSIM and PSNR were calculated for the simulated dataset, and Brenner, Tenengrad, Laplacian, SMD and Energy were compared for the acquired dataset, as given in Table 2. We observe that, for the simulated dataset, DEFD-net yields the highest SSIM and PSNR. Compared with PCATMIP, WIF and U2Fusion, PSNR is increased by 49%, 16% and 15%, respectively, and SSIM is increased by 5%, 2% and 2%, respectively. For the acquired dataset, all the NRIQA metrics obtained by our proposed method are higher than those of the other methods. This validates the superiority of the DEFD-net in motion correction and noise removal.

D. RESTORED IN-VIVO CARDIAC DTI METRICS
It is necessary not only to guarantee the quality of the restored DW images, but also to ensure that the fiber orientation and diffusion metrics calculated from the restored in-vivo cardiac DW images are correct. Fig. 6 shows the reconstructed DTI metrics for simulated datasets, including fiber orientation, fiber tracking, FA and MD maps, helix and transverse angles.
From Fig. 6, it can be seen that the reference fiber orientation and fiber tracking maps are complete, and the fiber orientations of the left ventricle form a circular ring. With added motion and noise, signal loss is more pronounced, resulting in disordered fiber orientations and incomplete fiber tracking (as shown in the second column of Fig. 6(a) and (b)). Although PCATMIP, WIF and U2Fusion can recover signals, the difference between the corrected fiber orientations and the reference ones is still significant. Fig. 6(c) and (d) give the FA and MD maps. We observe that the FA and MD values obtained with DEFD-net are closer to the reference values. Fig. 6(e) and (f) show the helix and transverse angle maps. The reference HA maps show a positive-to-negative transition from endocardium to epicardium, which reflects the rotation characteristic of myocardial fibers. TA maps are predominantly close to 0, reflecting a circumferential arrangement of myocytes. Motion and noise greatly influence the HA and TA maps. We found that DEFD-net performs best in correcting HA and TA compared with the other methods.
To quantitatively assess the impact of DW image restoration on cardiac fibers, Table 3 gives several metrics related to fiber tracts, including tract length, number of tracts and tract volume. To compare the methods fairly, the same fiber tracking parameters were used for all of them. In this work, fiber tracking and tract statistics were performed with DSI_studio (http://dsi-studio.labsolver.org/). The tract length reflects the degree of connection of the fibers; a long tract indicates that the fiber direction changes smoothly. Compared with the other three methods, the tract length obtained with DEFD-net correction is longer. As to the number of tracts and the tract volume, the values obtained using DEFD-net are much closer to the reference ones, which further demonstrates that our method outperforms the others.
Fig. 7 shows the DTI metrics restored from the acquired dataset with the different methods. As can be seen in Fig. 7(a), the fiber orientation of the original in-vivo cardiac DTI without correction does not change regularly along the circumferential direction due to the influence of heart motion. Although the PCATMIP, WIF and U2Fusion methods correct the myocardial fiber directions to a certain extent, some wrong fiber directions remain in the left ventricle. In contrast, the fiber orientations after correction by the proposed method are more regular and smoother. To observe the changes of myocardial fibers before and after correction more intuitively, we performed 3D fiber tracking, as shown in Fig. 7(b). The orientation of the uncorrected myocardial fibers is clearly disordered. After correction by our method, the myocardial fibers of the left ventricle follow a more regular circular arrangement. Fig. 7(c) and (d) show the FA and MD maps. The FA values obtained by our method are larger at mid-myocardium than at endocardium or epicardium, which confirms the findings of Ariga et al. in 2019 [43]. However, the FA values obtained by the PCATMIP, WIF or U2Fusion method do not exhibit this phenomenon.
Concerning HA (Fig. 7(e)), we clearly observe that, after correction with our method, HA changes from positive to negative when going radially from endocardium to epicardium, which conforms to the normal helix structure of myocardial fibers [44]. However, HA corrected by the PCATMIP, WIF and U2Fusion methods shows negative values at the endocardium of the first volunteer, and there is no normal change pattern for the second volunteer. Regarding TA, which indicates the circumferential distribution of myocardial fiber aggregates, its values are usually distributed around 0° when not affected by motion. As can be observed in Fig. 7(f), the original in-vivo cardiac DTI is affected by motion, and TA deviates far from 0°. Compared with PCATMIP, WIF and U2Fusion, the TA values after correction by our method are closer to 0°. Table 4 provides quantitative results for FA and MD. After processing by DEFD-net, we observe higher FA values with respect to WIF (+8%), PCATMIP (+15%) and U2Fusion (+23%), and lower MD values with respect to WIF (−47%), PCATMIP (−56%) and U2Fusion (−52%).
To further illustrate the superiority of our proposed method over the other post-processing methods, Fig. 8 plots the mean ± SD of the helix angle, transverse angle and FA over all 5 volunteers obtained using the different methods. It can be clearly seen in Fig. 8(a) that, with our method, the mean HA for all the volunteers varies from 40° ± 5° at endocardium to −33° ± 7° at epicardium, passing through 8° ± 5° at mid-myocardium. This clearly brings out the helix angle variation that is characteristic of human cardiac fiber architecture. However, the HA range obtained by WIF, PCATMIP and U2Fusion is much smaller than that obtained with our method. The mean TA value of all volunteers obtained by our method is around 0°, while the mean TA values obtained by the other methods deviate far from 0°, especially at mid-myocardium (Fig. 8(b)). With regard to FA, it can be seen from Fig. 8(c) that the mean FA value after DEFD-net correction ranges from 0.31 at endocardium to 0.29 at epicardium, passing through 0.45 at mid-myocardium. This indicates that the FA value is higher at mid-myocardium than at endocardium or epicardium, which conforms to the observation of Ariga et al. [43].

IV. DISCUSSION
Due to free breathing, heart motion and the low SNR of DW images, it is still challenging to investigate the fiber structure of in-vivo human hearts. Some acquisition techniques are devoted to reducing the overall motion effect on in-vivo cardiac DTI, such as the gating technique for suppressing respiratory motion artifacts [15] and first-order [11], second-order [12] or high-order motion-compensated diffusion encoding schemes [47]. Some researchers have also proposed special magnetic resonance diffusion imaging sequences to compensate for physiological motion, but most of these sequences require a longer echo time, which makes it impossible to obtain a high SNR.
To deal with this issue, we proposed an unsupervised DEFD-net that fuses DW images acquired at different TDs to correct motion effects. Firstly, a dense encoder-decoder network is trained with the original DW images, which allows us to extract the local and global features of DW images acquired at different TDs with the trained dense encoder. Then, the features are fused according to a maximum intensity rule to obtain feature maps with little or no influence of motion. Finally, the trained decoder reconstructs DW images from the fused feature maps. Because the dense encoder is capable of extracting semantic features, more accurate information can be used to fuse the DW images at different TDs, which is why the proposed method is superior to the existing state-of-the-art methods, including PCATMIP, WIF and U2Fusion.
In addition, separating fusion and reconstruction into two steps makes the proposed network more flexible, since the fusion strategy can be designed according to the type of data.
The experimental results showed that the proposed method allows the in-vivo human myocardium structure to be examined using diffusion-weighted sequences available on almost all clinical scanners, with a multiple shifted TDs acquisition strategy during diastole. They also showed that the deep learning based post-processing restores the diffusion properties of in-vivo cardiac DTI and overcomes, to a large extent, the signal loss caused by cardiac motion. Even for DW images that are greatly affected by heart motion, our method can still clearly recover the circular structure of myocardial fibers.
Although the method proposed in this work outperforms the other post-processing methods, two limitations remain to be addressed in the future. Firstly, the STEAM (stimulated echo acquisition mode) and SE (spin echo) sequences are the most commonly used techniques for acquiring in-vivo cardiac DW images. With STEAM, the acquired DW images suffer from low SNR rather than from signal loss caused by cardiac motion; therefore, applying our method to data acquired with the STEAM sequence in order to improve the SNR will be our future work. Secondly, in this work, motion correction for in-vivo cardiac DTI was only implemented on images acquired at the end of diastole. However, it is well known that motion effects in early diastole are more serious, especially for the SE sequence. It would therefore be interesting to evaluate the proposed method on in-vivo cardiac DW images acquired in early diastole.

V. CONCLUSION
In this work, we proposed a DEFD-net for investigating in-vivo cardiac DTI properties from DW images acquired at the end of diastole under free breathing with multiple trigger delays. By training a dense encoder and a decoder in an unsupervised manner, useful features were extracted from the DW images. These features were fused according to the maximum intensity rule, and the restored DW images were then obtained from the fused features by the decoder network. The experimental results demonstrated that the proposed method effectively improves the quality of in-vivo DW images and the subsequent measurement accuracy of fiber structures and diffusion metrics. Compared with the existing methods PCATMIP, WIF and U2Fusion, DW image quality was improved by 21%, 74% and 7%, respectively, in terms of Brenner, and by 10%, 30% and 24%, respectively, in terms of Tenengrad. Moreover, the DTI metrics obtained with our method are more consistent with human myocardium structure than those obtained with the PCATMIP, WIF and U2Fusion methods. This suggests a new direction of using deep learning methods to investigate the myocardium structure of the free-breathing human heart.