MulViMotion: Shape-aware 3D Myocardial Motion Tracking from Multi-View Cardiac MRI

Recovering the 3D motion of the heart from cine cardiac magnetic resonance (CMR) imaging enables the assessment of regional myocardial function and is important for understanding and analyzing cardiovascular disease. However, 3D cardiac motion estimation is challenging because the acquired cine CMR images are usually 2D slices which limit the accurate estimation of through-plane motion. To address this problem, we propose a novel multi-view motion estimation network (MulViMotion), which integrates 2D cine CMR images acquired in short-axis and long-axis planes to learn a consistent 3D motion field of the heart. In the proposed method, a hybrid 2D/3D network is built to generate dense 3D motion fields by learning fused representations from multi-view images. To ensure that the motion estimation is consistent in 3D, a shape regularization module is introduced during training, where shape information from multi-view images is exploited to provide weak supervision to 3D motion estimation. We extensively evaluate the proposed method on 2D cine CMR images from 580 subjects of the UK Biobank study for 3D motion tracking of the left ventricular myocardium. Experimental results show that the proposed method quantitatively and qualitatively outperforms competing methods.

Cine cardiac magnetic resonance (CMR) imaging supports motion analysis by acquiring sequences of 2D images in different views. Each image sequence covers the complete cardiac cycle, containing the end-diastolic (ED) and end-systolic (ES) phases [24]. Two types of anatomical views are typically acquired: (1) the short-axis (SAX) view and (2) long-axis (LAX) views such as the 2-chamber (2CH) and 4-chamber (4CH) views (Fig. 1). The SAX sequences typically contain a stack of 2D slices (e.g., 9-12 slices) sampled from base to apex in each frame. The LAX sequences contain a single 2D slice per frame that is approximately orthogonal to the SAX plane. These acquired images have high temporal resolution, high signal-to-noise ratio and high contrast between the blood pool and myocardium. With these properties, cine CMR imaging has been utilized in recent works for 2D myocardial motion estimation, e.g., [5,39,40,56,58].
2D myocardial motion estimation only considers motion in either the SAX plane or LAX plane and does not provide complete 3D motion information for the heart. This may lead to inaccurate assessment of cardiac function. Therefore, 3D motion estimation that recovers myocardial deformation in the X, Y and Z directions is important. However, estimating 3D motion fields from cine CMR images remains challenging because (1) SAX stacks have much lower through-plane resolution (typically 8 mm slice thickness) than in-plane resolution (typically 1.8 x 1.8 mm), (2) image quality can be negatively affected by slice misalignment in SAX stacks as only one or two slices are acquired during a single breath-hold, and (3) high-resolution 2CH and 4CH view images are too spatially sparse to estimate 3D motion fields on their own.
In this work, we take full advantage of both SAX and LAX (2CH and 4CH) view images, and propose a multi-view motion estimation network for 3D myocardial motion tracking from cine CMR images. In the proposed method, a hybrid 2D/3D network is developed for 3D motion estimation. This hybrid network learns combined representations from multi-view images to estimate a 3D motion field from the ED frame to any t-th frame in the cardiac cycle. To guarantee an accurate motion estimation, especially along the longitudinal direction (i.e., the Z direction), a shape regularization module is introduced to leverage anatomical shape information for motion estimation during training. This module encourages the estimated 3D motion field to correctly transform the 3D shape of the myocardial wall from the ED frame to the t-th frame. Here anatomical shape is represented by edge maps that show the contour of the cardiac anatomy. During inference, the hybrid network generates a sequence of 3D motion fields between paired frames (ED and t-th frames), which represents the myocardial motion across the cardiac cycle. The main contributions of this paper are summarized as follows:
• We develop a solution to a challenging cardiac motion tracking problem: learning 3D motion fields from a set of 2D SAX and LAX cine CMR images. We propose an end-to-end trainable multi-view motion estimation network (MulViMotion) for 3D myocardial motion tracking.
• The proposed method enables accurate 3D motion tracking by combining multi-view images using both latent information and shape information: (1) the representations of multi-view images are combined in the latent space for the generation of 3D motion fields; (2) the complementary shape information from multi-view images is exploited in a shape regularization module to provide an explicit constraint on the estimated 3D motion fields.
• The proposed method is trained in a weakly supervised manner, which only requires sparsely annotated data in different 2D SAX and LAX views and requires no ground truth 3D motion fields. The 2D edge maps from the corresponding SAX and LAX planes provide weak supervision to the estimated 3D edge maps for guiding 3D motion estimation in the shape regularization module.
• We perform extensive evaluations of the proposed method on 580 subjects from the UK Biobank study. We further present qualitative analysis on CMR images with severe slice misalignment and explore the applicability of our method for wall thickening measurement.
II. RELATED WORK

1) Conventional motion estimation methods: A common method for quantifying cardiac motion is to track non-invasive markers. CMR myocardial tagging provides tissue markers (stripe-like darker tags) in the myocardium which deform with myocardial motion [57]. By tracking the deformation of these markers, dense displacement fields can be recovered in the imaging plane. The harmonic phase (HARP) technique is the most representative approach for motion tracking in tagged images [14,30,33]. Several other methods have been proposed to compute dense displacement fields from dynamic myocardial contours or surfaces using geometrical and biomechanical modeling [34,53]. For example, Papademetris et al. [34] proposed a Bayesian estimation framework for myocardial motion tracking from 3D echocardiography. In addition, image registration has been applied to cardiac motion estimation in previous works. Craene et al. [19] introduced continuous spatio-temporal B-spline kernels for computing a 4D velocity field, which enforces temporal consistency in motion recovery. Rueckert et al. [43] proposed a free-form deformation (FFD) method for general non-rigid image registration, which has been used for cardiac motion estimation in many recent works, e.g., [5,7,12,37,38,44,45,50]. Thirion [49] proposed the demons algorithm, which utilizes diffusion models for image matching and, by extension, cardiac motion tracking. Based on this work, Vercauteren et al. [52] adapted the demons algorithm to provide non-parametric diffeomorphic transformations, and McLeod et al. [32] introduced an elastic-like regularizer to improve the incompressibility of the recovered deformations.
2) Deep learning-based motion estimation methods: In recent years, deep convolutional neural networks (CNNs) have been successfully applied to medical image analysis, which has inspired the exploration of deep learning-based cardiac motion estimation approaches. Qin et al. [39] proposed a multi-task framework for the joint estimation of segmentation and motion. This multi-task framework contains a shared feature encoder which enables weakly-supervised segmentation. Zheng et al. [58] proposed a method for cardiac pathology classification based on cardiac motion. Their method utilizes a modified U-Net [42] to generate flow maps between the ED frame and any other frame. For cardiac motion tracking across multiple datasets, Yu et al. [56] considered the distribution mismatch problem and proposed a meta-learning-based online model adaptation framework. Different from these methods, which estimate motion in cine CMR, Ye et al. [55] proposed a deep learning model for motion tracking in tagged images. In their work, the motion field between any two consecutive frames is first computed, followed by estimation of the Lagrangian motion field between the ED frame and any other frame. Most of these existing deep learning-based methods aim at 2D motion tracking using only SAX stacks. In contrast, our method focuses on 3D motion tracking by fully combining multiple anatomical views (i.e., SAX, 2CH and 4CH), which makes it possible to estimate both in-plane and through-plane myocardial motion.
3) Multi-view based cardiac analysis: Different anatomical scan views usually contain complementary information, and the combined multiple views can be more descriptive than a single view.

Fig. 2: An overview of MulViMotion. We use a hybrid 2D/3D network to estimate a 3D motion field Φ_t from the input multi-view images. In the hybrid network, FeatureNet learns the multi-view motion feature F_M and the multi-view shape feature F_S from the input, followed by MotionNet, which generates Φ_t based on F_M. A shape regularization module leverages anatomical shape information for 3D motion estimation. It encourages the predicted 3D edge maps of the myocardial wall Ê_0/Ê_t (predicted from F_S using ShapeNet) and the warped 3D edge map Ê_0→t (warped from the ED frame to the t-th frame by Φ_t) to be consistent with the ground truth 2D edge maps defined on the multi-view images. Shape regularization is only used during training.

Chen et al. [13] utilized both SAX and LAX views for 2D cardiac segmentation, where the features of multi-view images are combined in the bottleneck of a 2D U-Net. Puyol-Antón et al. [37] introduced a framework that separately uses multi-view images for myocardial strain analysis. In their method, the SAX view is used for radial and circumferential strain estimation while the LAX view is used for longitudinal strain estimation. Abdelkhalek et al. [1] proposed a 3D myocardial strain estimation framework, where the point clouds from SAX and LAX views are aligned for surface reconstruction. Attar et al. [3] proposed a framework for 3D cardiac shape prediction, in which the features of multi-view images are concatenated in CNNs to predict the 3D shape parameters.
In this work, we focus on using multi-view images for 3D motion estimation. Compared to most existing works, which only combine the features of multi-view images in the latent space (e.g., [3,13]), our method additionally combines complementary shape information from multiple views to predict anatomically plausible 3D edge maps of the myocardial wall at different time frames, which provides guidance for 3D motion estimation.

III. METHOD
Our goal is to estimate 3D motion fields of the LV myocardium from multi-view 2D cine CMR images. We formulate our task as follows: Let I_SA = {I_t^sa ∈ R^{H×W×D} | 0 ≤ t ≤ T−1} be a SAX sequence which contains stacks of 2D images (D slices), and let I_2CH = {I_t^2ch ∈ R^{H×W} | 0 ≤ t ≤ T−1} and I_4CH = {I_t^4ch ∈ R^{H×W} | 0 ≤ t ≤ T−1} be LAX sequences which contain 2D images in the 2CH and 4CH views. H and W are the height and width of each image and T is the number of frames. We want to train a network to estimate a 3D motion field Φ_t ∈ R^{H×W×D×3} by using the multi-view images of the ED frame ({I_0^sa, I_0^2ch, I_0^4ch}) and of any t-th frame ({I_t^sa, I_t^2ch, I_t^4ch}). Φ_t describes the motion of the LV myocardium from the ED frame to the t-th frame. For each voxel in Φ_t, we estimate its displacement in the X, Y and Z directions.
To solve this task, we propose MulViMotion, which estimates 3D motion fields from multi-view images with shape regularization. The schematic architecture of our method is shown in Fig. 2. A hybrid 2D/3D network that contains FeatureNet (2D CNNs) and MotionNet (3D CNNs) is used to predict Φ_t from the input multi-view images. FeatureNet learns multi-view multi-scale features and is used to extract the multi-view motion feature F_M and the multi-view shape feature F_S from the input. MotionNet generates Φ_t based on F_M. A shape regularization module is used to leverage anatomical shape information for 3D motion estimation during training. In this module, 3D edge maps of the myocardial wall are predicted from F_S using ShapeNet and warped from the ED frame to the t-th frame by Φ_t. The sparse ground truth 2D edge maps derived from the multi-view images provide weak supervision to the predicted and warped 3D edge maps, and thus encourage an accurate estimation of Φ_t, especially in the Z direction. Here, a slicing step is used to extract the corresponding multi-view planes from the 3D edge maps in order to compare them with the 2D ground truth. During inference, a 3D motion field is directly generated from the input multi-view images by the hybrid network, without shape regularization.
A. 3D motion estimation

1) Multi-view multi-scale feature extraction (FeatureNet): The first step of 3D motion estimation is to extract internal representations from the input 2D multi-view images {I_j^sa, I_j^2ch, I_j^4ch | j = {0, t}}. We build FeatureNet to simultaneously learn motion and shape features from the input, because the motion and shape of the myocardial wall are closely related and provide complementary information to each other [15,39,48]. FeatureNet consists of (1) multi-scale feature fusion and (2) multi-view concatenation (see Fig. 3).
In the multi-scale feature fusion (Fig. 3 (a)), the input multi-view images are unified to D-channel 2D feature maps by applying 2D convolution to the 2CH and 4CH view images. Then three 2D encoders {f_ψi | i = {sa, 2ch, 4ch}} are built to extract motion and shape features from each anatomical view,

(F_M^i, F_S^i) = f_ψi(I_0^i, I_t^i). (1)

Here, i represents the anatomical view and ψ_i refers to the network parameters of f_ψi. F_M^i and F_S^i are the learned motion feature and shape feature, respectively. As these encoders aim to extract the same type of information (i.e., shape and motion information), the three encoders share weights to learn representations that are useful for and related to the different views.
In each encoder, representations at different scales are fully exploited for feature extraction. {f_ψi | i = {sa, 2ch, 4ch}} consists of (1) a Siamese network that extracts features from both the ED frame and the t-th frame, and (2) feature-fusion layers that concatenate multi-scale features from the pair of frames (Fig. 3 (b)). In the Siamese network, the last feature maps of the two streams are used as the shape features of the ED frame (F_{S_0}^i) and the t-th frame (F_{S_t}^i), respectively. All features across different scales from both streams are combined by the feature-fusion layers to generate the motion feature F_M^i. In detail, these multi-scale features are upsampled to the original resolution by convolution and upsampling operations and then combined using a concatenation layer.
With the single-view features {F_M^i, F_S^i | i = {sa, 2ch, 4ch}}, a multi-view concatenation generates the multi-view motion feature F_M and the multi-view shape feature F_S via channel-wise concatenation C(·, ·, ·), e.g., F_M = C(F_M^sa, F_M^2ch, F_M^4ch) (see Fig. 3 (c)). FeatureNet is composed of 2D CNNs which learn 2D features from the multi-view images and inter-slice correlations from the SAX stacks. The obtained F_M is first unified to D channels using 2D convolution and then used to predict Φ_t in the next step. The obtained F_S is used for shape regularization in Sec. III-B.
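As an illustration, the channel-wise concatenation C(·, ·, ·) simply stacks the per-view features along the channel axis. The NumPy sketch below uses illustrative array sizes (D = 64 channels, 128 × 128 in-plane); the channel-first layout and the random feature values are assumptions, not the paper's implementation:

```python
import numpy as np

# Hypothetical per-view motion features, one per anatomical view
# (channel-first layout: channels x height x width).
D = 64          # channels per view (illustrative)
H = W = 128
F_sa = np.random.randn(D, H, W).astype(np.float32)
F_2ch = np.random.randn(D, H, W).astype(np.float32)
F_4ch = np.random.randn(D, H, W).astype(np.float32)

def channelwise_concat(f_sa, f_2ch, f_4ch):
    """C(.,.,.): stack single-view features along the channel axis."""
    return np.concatenate([f_sa, f_2ch, f_4ch], axis=0)

F_M = channelwise_concat(F_sa, F_2ch, F_4ch)
print(F_M.shape)  # (192, 128, 128): 3 views x D channels
```

The concatenated feature keeps every view's representation intact and leaves it to the subsequent convolutions to mix information across views.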
2) Motion estimation (MotionNet): In this step, we introduce MotionNet to predict the 3D motion field Φ_t by learning 3D representations from the multi-view motion feature F_M. MotionNet is built with a 3D encoder-decoder architecture and predicts

Φ_t = g_θ(U(F_M)), (3)

where g_θ represents MotionNet and θ refers to its network parameters. The function U(·) denotes an un-squeeze operation which changes F_M from a stack of 2D feature maps to a 3D feature map by adding an extra dimension.
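The un-squeeze operation U(·) can be sketched as follows; the feature sizes are illustrative and the channel-first layout is an assumption:

```python
import numpy as np

H = W = 128
D = 64  # after the 2D convolution that unifies F_M to D channels

# F_M is a stack of D 2D feature maps, i.e., shape (D, H, W).
F_M = np.zeros((D, H, W), dtype=np.float32)

def unsqueeze(f):
    """U(.): reinterpret the D-channel 2D feature map as a
    single-channel 3D feature map of size D x H x W, so that a 3D
    encoder-decoder can also convolve along the slice (Z) axis."""
    return f[np.newaxis, ...]  # (1, D, H, W)

vol = unsqueeze(F_M)
print(vol.shape)  # (1, 64, 128, 128)
```

The operation changes only the interpretation of the tensor, not its values: the former channel axis becomes a spatial (through-plane) axis for the 3D network.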
3) Spatial transform (Warping): Inspired by the successful application of spatial transformer networks [10,27], the SAX stack of the ED frame (I_0^sa) can be transformed to the t-th frame using the motion field Φ_t. For a voxel with location p in the transformed SAX stack (I_0→t^sa), we compute the corresponding location p′ in I_0^sa by p′ = p + Φ_t(p). As image values are only defined at discrete locations, the value at p in I_0→t^sa is computed from p′ in I_0^sa using trilinear interpolation.

4) Motion loss: As true dense motion fields of paired frames are usually unavailable in practice, we propose an unsupervised motion loss L_mov to train the 3D motion estimation model using only the input SAX stack (I_t^sa) and the generated 3D motion field (Φ_t). L_mov consists of two components: (1) an image similarity loss L_sim that penalizes appearance differences between I_t^sa and I_0→t^sa, and (2) a local smoothness loss L_smooth that penalizes the gradients of Φ_t,

L_mov = L_sim + λ L_smooth. (4)

Here λ is a hyper-parameter. L_sim is defined by the voxel-wise mean squared error,

L_sim = (1/N) Σ_i ‖I_t^sa(p_i) − I_0→t^sa(p_i)‖², (5)

and L_smooth is the Huber loss used in [10,39] which encourages a smooth Φ_t,

L_smooth = √(ε + Σ_i ‖∇Φ_t(p_i)‖²), (6)

where ∂Φ_t(p_i)/∂x is approximated by the difference between neighboring voxels along the X axis, and we use the same approximation for ∂Φ_t(p_i)/∂y and ∂Φ_t(p_i)/∂z. As in [10,39], ε is set to 0.01. In Eq. 5 and Eq. 6, p_i is the i-th voxel and N denotes the number of voxels.
Note that L_sim is only applied to SAX stacks because 2D images in the 2CH and 4CH views typically consist of only one slice and cannot be directly warped by a 3D motion field.
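The warping and motion loss above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the paper's PyTorch implementation; the field layout (3, Z, Y, X), the forward-difference scheme, the gradient normalization, and the weight λ = 0.1 are assumptions:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, phi):
    """Warp `image` (Z, Y, X): the value at p in the warped image is
    sampled at p' = p + phi(p) in `image` via trilinear interpolation
    (order=1)."""
    grid = np.indices(image.shape).astype(np.float32)  # (3, Z, Y, X)
    return map_coordinates(image, grid + phi, order=1, mode='nearest')

def motion_loss(i_t, i_warped, phi, lam=0.1, eps=0.01):
    """L_mov = L_sim + lam * L_smooth (lam is illustrative).
    L_sim: voxel-wise mean squared error.
    L_smooth: Huber-style penalty on the spatial gradients of phi."""
    l_sim = np.mean((i_t - i_warped) ** 2)
    # Forward differences of each displacement component along Z, Y, X.
    grads = [np.diff(phi, axis=a, append=phi.take([-1], axis=a))
             for a in (1, 2, 3)]
    l_smooth = np.sqrt(eps + sum(np.sum(g ** 2) for g in grads)
                       / phi[0].size)
    return l_sim + lam * l_smooth

# Sanity check: an identity (zero) motion field leaves the image
# unchanged, so the similarity term vanishes.
img = np.random.rand(8, 16, 16).astype(np.float32)
phi0 = np.zeros((3,) + img.shape, dtype=np.float32)
warped = warp(img, phi0)
print(np.allclose(warped, img))  # True
```

A trained model would minimize this loss over mini-batches of frame pairs; the zero-field check above only verifies that the warping operator behaves as expected.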

B. Shape regularization
The motion loss (L_mov) on its own is not sufficient to guarantee accurate motion estimation in the Z direction due to the low through-plane resolution of SAX stacks. To address this problem, we introduce a shape regularization module which ensures that the 3D edge map of the myocardial wall is correct before and after warping by Φ_t, and thus enables an accurate estimation of Φ_t. Here, the ground truth 2D edge maps derived from the multi-view images provide weak supervision to the predicted and warped 3D edge maps.
1) Shape estimation (ShapeNet): ShapeNet is built to generate the 3D edge maps of the myocardial wall in the ED frame (Ê_0) and the t-th frame (Ê_t) from F_S = {F_{S_0}, F_{S_t}},

Ê_0 = h_1(F_{S_0}), Ê_t = h_2(F_{S_t}), (7)

where h_1 and h_2 are the two branches in ShapeNet which contain shared 2D decoders and 3D convolutional layers in order to learn 3D edge maps from 2D features for all frames (Fig. 4). The dimensions of Ê_0 and Ê_t are H × W × D. With the spatial transform in Sec. III-A3, Ê_0 is warped to the t-th frame by Φ_t, which generates the transformed 3D edge map Ê_0→t. Then Ê_0, Ê_t and Ê_0→t are weakly supervised by the ground truth 2D edge maps.
2) Slicing: To compare the 3D edge maps with the 2D ground truth, we use 3D masks {M_sa, M_2ch, M_4ch} to extract the SAX, 2CH and 4CH view planes from Ê_0, Ê_t and Ê_0→t with

Ê_0^i = M_i ⊙ Ê_0, Ê_t^i = M_i ⊙ Ê_t, Ê_0→t^i = M_i ⊙ Ê_0→t, (8)

where i = {sa, 2ch, 4ch} represents the anatomical views and ⊙ refers to element-wise multiplication. These 3D masks describe the locations of the multi-view image planes in the SAX stacks and are generated from the input during image preprocessing.

3) Shape loss: For each component in L_shape = L_{S_0} + L_{S_t} + L_{S_0→t} (9), we utilize the cross-entropy loss CE(·, ·) to measure the similarity of edge maps, e.g.,

L_{S_0} = Σ_i CE(Ê_0^i, E_0^i), (10)

where E_0^i is the ground truth 2D edge map of view i at the ED frame. Similar to Eq. 10, L_{S_t} and L_{S_0→t} are computed by comparing Ê_t^i and Ê_0→t^i with the ground truth edge maps of the t-th frame.
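The slicing step reduces to an element-wise product with binary plane masks. A minimal sketch, with hypothetical plane locations and a random edge map purely for illustration:

```python
import numpy as np

H = W = D = 16  # small illustrative volume

# Predicted 3D edge map (probabilities) of the myocardial wall.
E_hat = np.random.rand(H, W, D).astype(np.float32)

# Binary 3D masks marking where each imaging plane lies in the volume.
# The plane positions here are assumptions: SAX planes as every 4th
# through-plane slice, the 2CH plane as one vertical plane.
M_sa = np.zeros((H, W, D), np.float32); M_sa[:, :, ::4] = 1
M_2ch = np.zeros((H, W, D), np.float32); M_2ch[:, W // 2, :] = 1

def slice_edge_map(e_hat, mask):
    """Extract a view plane from a 3D edge map via element-wise
    multiplication with the plane mask, so it can be compared with
    the 2D ground truth edge map of that view."""
    return e_hat * mask

E_sa = slice_edge_map(E_hat, M_sa)
# Voxels outside the SAX planes are zeroed out.
print(bool(np.all(E_sa[:, :, 1] == 0)))  # True
```

Because only these sparse planes are supervised, the loss constrains the 3D edge maps exactly where 2D ground truth exists and leaves the rest of the volume to be inferred by the network.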

C. Optimization
Our model is an end-to-end trainable framework and the overall objective is a linear combination of all loss functions,

L = L_sim + λ L_smooth + β L_shape, (11)

where λ and β are hyper-parameters chosen experimentally depending on the dataset. We use the Adam optimizer (learning rate = 10^−4) to update the parameters of MulViMotion. Our model is implemented in PyTorch and is trained on an NVIDIA Tesla T4 GPU with 16 GB of memory.

IV. EXPERIMENTS
We demonstrate our method on the task of 3D myocardial motion tracking. We evaluate the proposed method using quantitative metrics including the Dice score, Hausdorff distance, volume difference and Jacobian determinant. Geometric meshes are used to provide qualitative results with 3D visualization. We compare the proposed method with other state-of-the-art motion estimation methods and perform an extensive ablation study. In addition, we show the effectiveness of the proposed method on subjects with severe slice misalignment. We further explore the applicability of the proposed method for wall thickening measurement. We show the key results in the main paper; more results (e.g., dynamic videos) are shown in the Appendix.
A. Experiment setups

1) Data: We performed experiments on 580 randomly selected subjects from the UK Biobank study 3. All participants gave written informed consent [9]. The participant characteristics are shown in Table I. The images were preprocessed as follows: …, (3) to cover the whole LV as the ROI, the resampled SAX stacks were cropped to a size of 128 × 128 × 64 based on the center of the LV in the middle slice (the LV center was computed from the LV myocardium segmentation of the middle slice of the SAX stack), (4) the 2CH and 4CH view images were cropped to 128 × 128 based on the center of the intersecting line between the middle slice of the cropped SAX stack and the 2CH/4CH view image, (5) each frame was independently normalized to zero mean and unit standard deviation, and (6) the 3D masks (Eq. 8) were computed by a coordinate transformation using the DICOM image header information of the SAX, 2CH and 4CH view images. Note that the 2D SAX slices used in the shape regularization module were unified to 9 adjacent slices for all subjects, i.e., the middle slice and the 4 slices above and below it. With this image preprocessing, the input SAX, 2CH and 4CH view images cover the whole LV in the center. 3D high-resolution segmentations of these subjects were automatically generated using the 4Dsegment tool [22] based on the resampled SAX stacks, followed by manual quality control. The obtained segmentations have been shown to be useful in clinical applications (e.g., [7]), and thus we use them to generate the ground truth 2D edge maps (Fig. 1) in this work. In detail, we utilize the obtained 3D masks to extract the SAX, 2CH and 4CH view planes from these 3D segmentations and then use contour extraction to obtain {E_0^i, E_t^i | i = {sa, 2ch, 4ch}} used in Sec. III-B2. Note that we use "3D segmentation(s)" to refer to the 3D segmentations obtained by [22] in this section.
2) Evaluation metrics: We use segmentations to provide a quantitative evaluation of the estimated 3D motion fields. This is the same evaluation performed in other cardiac motion tracking literature [39,56,58]. Here, the 3D segmentations obtained by [22] are used in the evaluation metrics. The framework in [22] performs learning-based segmentation, followed by an atlas-based refinement step to ensure robustness to potential imaging artifacts. The generated segmentations are anatomically meaningful and spatially consistent. As our work aims to estimate the real 3D motion of the heart from the acquired CMR images, such segmentations that approximate the real shape of the heart can provide a reasonable evaluation. Specifically, on test data, we estimate the 3D motion field Φ_ES from the ED frame to the ES frame, which exhibits large deformation. Then we warp the 3D segmentation of the ED frame (S_ED) to the ES frame by Φ_ES. Finally, we compare the transformed 3D segmentation (S_ED→ES) with the ground truth 3D segmentation of the ES frame (S_ES) using the following metrics. Note that the ES frame is identified as the frame with the least image intensity similarity to the ED frame.
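Identifying ES as the frame least similar to ED can be sketched as follows; mean squared intensity difference is used here as one plausible (dis)similarity measure, since the paper does not specify the exact one:

```python
import numpy as np

def find_es_frame(sequence):
    """Pick the frame with the largest mean squared intensity
    difference to the ED frame (frame 0), i.e., the least similar."""
    ed = sequence[0]
    mse = [np.mean((frame - ed) ** 2) for frame in sequence]
    return int(np.argmax(mse))

# Toy sequence: intensities move away from ED and back, so frame 5
# is the most dissimilar, mimicking end-systole.
T = 10
seq = [np.full((4, 4), float(5 - abs(t - 5))) for t in range(T)]
print(find_es_frame(seq))  # 5
```

On real cine CMR one would apply this over the myocardial region of the SAX stack rather than a toy array, but the selection rule is the same.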
Dice score and Hausdorff distance (HD) are utilized to quantify the volume overlap and the contour distance between S_ES and S_ED→ES, respectively. A high Dice value and a low HD value indicate an accurate 3D motion estimation.
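A minimal sketch of these two metrics; the distance-transform-based Hausdorff computation and the unit voxel spacing are assumptions:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice(a, b):
    """Volume overlap between two binary segmentations."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between two binary segmentations,
    computed from Euclidean distance transforms (in mm if `spacing`
    gives the voxel size)."""
    da = distance_transform_edt(~a, sampling=spacing)  # distance to a
    db = distance_transform_edt(~b, sampling=spacing)  # distance to b
    return max(da[b].max(), db[a].max())

# Two cubes shifted by 2 voxels: partial overlap, HD equal to the shift.
a = np.zeros((16, 16, 16), bool); a[4:10, 4:10, 4:10] = True
b = np.zeros((16, 16, 16), bool); b[6:12, 4:10, 4:10] = True
print(round(dice(a, b), 3), hausdorff(a, b))  # 0.667 2.0
```

In the paper's evaluation, `a` and `b` would be S_ED→ES and S_ES, with `spacing` set to the resampled voxel size so that HD is reported in millimeters.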
Volume difference (VD) is computed to evaluate volume preservation, as incompressible motion is desired within the myocardium [30,32,40,45],

VD = |V(S_ED→ES) − V(S_ED)| / V(S_ED),

where V(·) computes the number of voxels in the segmentation volume. A low value of VD means a good volume preservation ability of Φ_ES.
The Jacobian determinant det(J_Φ_ES), with J_Φ_ES = ∇Φ_ES, is employed to evaluate the local behavior of Φ_ES: a negative Jacobian determinant det(J_Φ_ES(p)) < 0 indicates that the motion field at position p results in folding and leads to a non-diffeomorphic transformation. Therefore, a low number of points with det(J_Φ_ES(p)) < 0 corresponds to an anatomically plausible deformation from the ED frame to the ES frame and thus indicates a good Φ_ES. We report the percentage of voxels in the myocardial wall with det(J_Φ_ES(p)) < 0.
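Counting folding voxels can be sketched as follows. The sketch treats Φ_ES as a displacement field, so the Jacobian of the mapping p ↦ p + Φ_ES(p) is I + ∇Φ_ES(p); the use of central differences via `np.gradient` is an assumption:

```python
import numpy as np

def folding_fraction(phi, mask):
    """Fraction of masked voxels where det(J) < 0, with
    J(p) = I + grad(phi)(p) the Jacobian of p -> p + phi(p).
    `phi` has shape (3, Z, Y, X); `mask` is a boolean (Z, Y, X)
    myocardium mask."""
    Z, Y, X = phi.shape[1:]
    J = np.zeros((Z, Y, X, 3, 3))
    for i in range(3):                      # displacement component
        for j in range(3):                  # spatial axis
            J[..., i, j] = np.gradient(phi[i], axis=j)
        J[..., i, i] += 1.0                 # add the identity
    det = np.linalg.det(J)
    return float(np.mean(det[mask] < 0))

# A zero displacement field has det(J) = 1 everywhere: no folding.
phi = np.zeros((3, 8, 8, 8))
mask = np.ones((8, 8, 8), bool)
print(folding_fraction(phi, mask))  # 0.0
```

For a learned Φ_ES, this fraction being close to zero is the quantitative counterpart of the "topology-preserving" behavior discussed above.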
3) Baseline methods: We compared the proposed method with three cardiac motion tracking methods, including two conventional methods and one deep learning method. The first baseline is the B-spline free-form deformation (FFD) algorithm [43], which has been used in many recent cardiac motion tracking works [5,7,37,38,50]. We use the FFD approach implemented in the MIRTK toolkit 4. The second baseline is the diffeomorphic Demons (dDemons) algorithm [52], which has been used in [40] for cardiac motion tracking. We use the SimpleITK software package as the dDemons implementation 5. In addition, the U-Net architecture has been used in many recent works for image registration [6,48,54], and thus our third baseline is a deep learning method with a 3D-UNet [16]. The input of the 3D-UNet baseline is the paired frames (I_0^sa, I_t^sa) and the output is a 3D motion field. Eq. 4 is used as the loss function for this baseline. We implemented the 3D-UNet based on its online code 6. For the baseline methods with hyper-parameters, we evaluated several sets of parameter values and selected the hyper-parameters that achieve the best Dice score on the validation set.
B. 3D myocardial motion tracking

1) Motion tracking performance: For each test subject, MulViMotion is utilized to estimate 3D motion fields across the full cardiac cycle. With the obtained {Φ_t | t = 0, ..., 49}, we warp the 3D segmentation of the ED frame (t = 0) to the t-th frame. Fig. 5 (a) shows that the estimated Φ_t enables the warped 3D segmentation to match the myocardial area in images from different anatomical views. In addition, we warp the SAX stack of the ED frame (I_0^sa) to the t-th frame. Fig. 5 (b) shows the effectiveness of Φ_t by comparing the warped and the ground truth SAX view images. By utilizing the warped 3D segmentations, we further compute established clinical biomarkers. Fig. 6 shows the derived ventricular volume over time. The shape of the curve is consistent with reported results in the literature [18,39].
We quantitatively compared MulViMotion with baseline methods in Table II. With the 3D motion fields generated by different methods, the 3D segmentations of ED frame are warped to ES frame and compared with the ground truth 3D segmentations of ES frame by using metrics introduced in Sec. IV-A2. From this table, we observe that MulViMotion outperforms all baseline methods for Dice and Hausdorff distance, demonstrating the effectiveness of the proposed method on estimating 3D motion fields. MulViMotion achieves the lowest volume difference, indicating that the proposed method is more capable of preserving the volume of the myocardial wall during cardiac motion tracking. Compared to a diffeomorphic motion tracking method (dDemons [52]), the proposed method has a similar number of voxels with a negative Jacobian determinant. This illustrates that the learned motion field is smooth and preserves topology.
We further qualitatively compared MulViMotion with the baseline methods in Fig. 7. Geometric meshes are used to provide 3D visualization of the myocardial wall. Specifically, the 3D segmentations of the ED frame are warped to the t-th frame in the cardiac cycle and geometric meshes are reconstructed from these warped 3D segmentations. The red meshes in Fig. 7 demonstrate that, in contrast to all baseline methods, which only show motion within the SAX plane (i.e., along the X and Y directions), MulViMotion is able to estimate through-plane motion along the longitudinal direction (i.e., the Z direction) across the cardiac cycle, e.g., the reconstructed mesh of the t = 20 frame is deformed in the X, Y and Z directions compared to the t = 0 and t = 40 frames. In addition, the white meshes in Fig. 7 illustrate that, compared to all baseline methods, the 3D motion field generated by MulViMotion performs best in warping the ED frame to the ES frame, producing the reconstructed ES frame mesh that is most similar to the ground truth (GT) ES frame meshes (blue). These results demonstrate the effectiveness of MulViMotion for 3D motion tracking, especially for estimating through-plane motion.
2) Runtime: Table II shows the runtime of MulViMotion and the baseline methods using an Intel Xeon E5-2643 CPU and an NVIDIA Tesla T4 GPU. The average inference time for a single subject is reported. FFD [43] and dDemons [52] are only available on CPU, while 3D-UNet [16] and MulViMotion are available on both CPU and GPU. The results show that our method achieves a similar runtime to 3D-UNet [16] on GPU and is at least 5 times faster than the baseline methods on CPU.

Fig. 7: 3D visualization of motion tracking results using the baseline methods (dDemons [52], 3D-UNet [16]) and MulViMotion. Column 1 (blue) shows the ground truth (GT) meshes of the ED frame. Columns 2-6 (red) show 3D motion tracking results across the cardiac cycle (t = 0, 10, 20, 30, 40); these meshes are reconstructed from the warped 3D segmentations (warped from the ED frame to different time frames). Column 7 (white) additionally shows the reconstructed meshes of the ES frame from the motion tracking results and Column 8 (blue) shows the ground truth meshes of the ES frame.
3) Ablation study: For the proposed method, we explore the effect of using different anatomical views and the importance of the shape regularization module. We use the evaluation metrics in Sec. IV-A2 to show quantitative results. Table III shows the motion tracking results using different anatomical views. In particular, M1 only uses images and 2D edge maps from the SAX view to train the proposed method, M2 uses those from both SAX and 2CH views and M3 uses those from both SAX and 4CH views. M2 and M3 outperform M1, illustrating the importance of LAX view images. In addition, MulViMotion (M) outperforms the other variant models. This might be because more LAX views introduce more high-resolution 3D anatomical information for 3D motion tracking. In Table IV, the proposed method is trained using all three anatomical views but optimized with different combinations of losses. A1 optimizes the proposed method without shape regularization (i.e., without L_shape in Eq. 11). A2 introduces basic shape regularization on top of A1, adding only L_{S_0} and L_{S_0→t} to L_shape. MulViMotion (M) outperforms A1, illustrating the importance of shape regularization. MulViMotion also outperforms A2. This is likely because L_{S_0} and L_{S_t} are both needed to guarantee the generation of distinct and correct 3D edge maps for all frames in the cardiac cycle. These results show the effectiveness of all proposed components in L_shape. Fig. 8 shows motion estimation performance using different strengths of shape regularization. In detail, the proposed method is trained with three anatomical views and all loss components, but the shape loss (L_shape) is computed on different percentages of the training subjects (20%, 40%, 60%, 80%, 100%). From Fig. 8, we observe that motion estimation performance improves with an increased percentage of subjects.
4) The influence of hyper-parameters: Fig. 9 presents Dice and Hausdorff distance (HD) on the test data for various values of the smoothness loss weight λ and the shape regularization weight β (Eq. 11). The Dice scores and HDs are computed according to Sec. IV-A2. We observe that a strong constraint on motion field smoothness may sacrifice registration accuracy (see Fig. 9 (a)). Moreover, registration performance improves as β increases from 1 to 5 and then deteriorates as β is further increased (from 5 to 9). This might be because a strong shape regularization can force motion estimation to focus mainly on the few 2D planes which contain sparse labels.
5) The performance on subjects with slice misalignment: Acquired SAX stacks may contain slice misalignment due to poor compliance with breath-holding instructions or a change of position between breath-hold acquisitions [8]. This leads to an incorrect representation of the cardiac volume and results in difficulties for accurate 3D motion tracking. Fig. 10 compares the motion tracking results of 3D-UNet [16], MulViMotion and MulViMotion without L_shape for a test subject with severe slice misalignment (e.g., Fig. 10 (a), middle column). Fig. 10 (b) shows that, in contrast to 3D-UNet, the motion fields generated by MulViMotion enable topology preservation of the myocardial wall (e.g., the mesh at t = 17). MulViMotion outperforms MulViMotion without L_shape, which indicates the importance of the shape regularization module for reducing the negative effect of slice misalignment. These results demonstrate the advantage of integrating shape information from multiple views and show the effectiveness of the proposed method in such special cases.

6) Wall thickening measurement: We computed regional and global myocardial wall thickness at the ED frame and the ES frame based on the ED frame segmentation and the warped ES frame segmentation 7, respectively. The global wall thickness at the ED frame is 6.6 ± 0.9 mm, which is comparable with results obtained by [5] (5.5 ± 0.8 mm). The wall thickness at the ES frame for the 16 American Heart Association segments is shown in Table V. In addition, we computed the fractional wall thickening between the ED frame and the ES frame as (ES − ED)/ED × 100%. The results in Table V show that the regional and global fractional wall thickening are comparable with results reported in the literature [21,51].
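The fractional wall thickening formula (ES − ED)/ED × 100% amounts to simple element-wise arithmetic over segment-wise thickness values; the values below are illustrative, not measured:

```python
import numpy as np

def fractional_thickening(thickness_ed, thickness_es):
    """(ES - ED) / ED * 100%, computed per AHA segment."""
    thickness_ed = np.asarray(thickness_ed, dtype=float)
    thickness_es = np.asarray(thickness_es, dtype=float)
    return (thickness_es - thickness_ed) / thickness_ed * 100.0

# Illustrative thickness values in mm for 4 hypothetical segments.
ed = [6.0, 6.5, 7.0, 6.0]
es = [9.0, 9.1, 9.8, 8.4]
print(np.round(fractional_thickening(ed, es), 1))
```

In the paper this quantity is computed for each of the 16 AHA segments and globally, from the ED segmentation and the Φ_ES-warped ES segmentation.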

V. DISCUSSION
In this paper, we propose a deep learning-based method for estimating 3D myocardial motion from 2D multi-view cine CMR images. A naïve alternative to our method would be to train a fully unsupervised motion estimation network on high-resolution 3D cine CMR images. However, such 3D images are rarely available because (1) 3D cine imaging requires long breath holds during acquisition and is not commonly used in clinical practice, and (2) recovering high-resolution 3D volumes purely from 2D multi-view images is challenging due to the sparsity of multi-view planes. Our focus has been on LV myocardial motion tracking because it is important for the clinical assessment of cardiac function. Our model can be easily adapted to 3D right ventricular myocardial motion tracking by using the corresponding 2D edge maps in the shape regularization module during training.
In shape regularization, we use edge maps to represent anatomical shape, i.e., we predict 3D edge maps of the myocardial wall and we use 2D edge maps defined on the multi-view images to provide shape information. This is because (1) the contour of the myocardial wall is more representative of anatomical shape than its interior, (2) compared to dense 3D segmentations, 3D edge maps with sparse labels are easier to estimate from images on sparse multi-view planes, and (3) using edge maps offers the potential of using automatic contour detection algorithms to obtain shape information directly from images. An automated algorithm is utilized to obtain the 2D edge maps that provide shape information in the shape regularization module, since manual data labeling is time-consuming, costly and usually unavailable. The proposed method is robust to these automatically obtained 2D edge maps since they only constrain the estimated 3D edge maps on spatially sparse planes.
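To illustrate the relation between a segmentation and its edge map, the sketch below extracts the boundary of a binary 2D mask with a simple morphological rule (a foreground pixel with at least one background 4-neighbour). This is only a stand-in for the contour detection used in the paper, and `edge_map` is a hypothetical helper:

```python
import numpy as np

def edge_map(seg):
    """Boundary of a binary 2D segmentation: foreground pixels that have
    at least one background 4-neighbour."""
    seg = seg.astype(bool)
    interior = seg.copy()
    # a pixel is interior only if all four 4-neighbours are also foreground
    interior[1:, :]  &= seg[:-1, :]
    interior[:-1, :] &= seg[1:, :]
    interior[:, 1:]  &= seg[:, :-1]
    interior[:, :-1] &= seg[:, 1:]
    return seg & ~interior

# toy segmentation: a 4x4 filled square
seg = np.zeros((8, 8), dtype=bool); seg[2:6, 2:6] = True
e = edge_map(seg)
print(int(e.sum()))  # 12: only the ring of boundary pixels remains
```

The edge map keeps the shape outline while discarding the filled interior, which is exactly the sparse supervision signal the shape regularization module relies on.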
We use the aligned 2D edge maps of SAX stacks to train MulViMotion. This is reasonable because aligned SAX ground truth edge maps introduce correct shape information of the heart, and thus can explicitly constrain the estimated 3D motion field to reflect the real motion of the heart. Nevertheless, we further test the effectiveness of the proposed method by utilizing unaligned SAX edge maps during training. Specifically, MulViMotion* uses the algorithm in [4] to predict the 2D segmentation of the myocardium for each SAX slice independently, without accounting for inter-slice misalignment. The contour of this 2D segmentation is used as the SAX ground truth edge map during training. LAX ground truth edge maps are still generated based on [22]. Table VI and Fig. 11 (e.g., t = 20) show that the proposed method is capable of estimating 3D motion even if it is trained with unaligned SAX edge maps. This indicates that the LAX 2CH and 4CH view images, which provide correct longitudinal anatomical shape information, can compensate for the slice misalignment in the SAX stacks and thus make a major contribution to the improved estimation accuracy of through-plane motion.
In the proposed method, a hybrid 2D/3D network is built to estimate 3D motion fields, where the 2D CNNs combine multi-view features and the 3D CNNs learn 3D representations from the combined features. Such a hybrid network occupies less GPU memory than a purely 3D network; in particular, the number of parameters in this hybrid network is 21.7. Moreover, this hybrid network is able to take full advantage of the 2D multi-view images because it learns 2D features from each anatomical view before learning 3D representations.
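The data flow of such a hybrid design can be illustrated with plain array operations. The sketch below shows only the tensor shapes (per-view 2D encoding, multi-view fusion, then a 3D output field); the encoder is a trivial stand-in, and the channel count and fusion rule are placeholders, not the actual FeatureNet architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy multi-view inputs: a SAX stack (9 slices) plus 2CH and 4CH LAX slices
sax = rng.standard_normal((9, 64, 64))    # D x H x W
lax_2ch = rng.standard_normal((64, 64))
lax_4ch = rng.standard_normal((64, 64))

def encode_2d(img, n_feat=8):
    """Stand-in for a per-view 2D CNN encoder: n_feat shifted copies of the input."""
    return np.stack([np.roll(img, k, axis=-1) for k in range(n_feat)])  # C x H x W

# encode each SAX slice and each LAX view with (shared) 2D encoders
f_sax = np.stack([encode_2d(s) for s in sax], axis=1)   # C x D x H x W
f_lax = encode_2d(lax_2ch) + encode_2d(lax_4ch)         # C x H x W

# fuse: broadcast LAX features across the slice dimension, then a 3D head
fused = f_sax + f_lax[:, None, :, :]                    # C x D x H x W
motion = fused.mean(axis=0, keepdims=True).repeat(3, 0) # 3 x D x H x W motion-field stub
print(motion.shape)
```

The key point is that the expensive 2D feature extraction happens per view, and only the fused feature volume is processed in 3D, which is where the memory saving over a purely 3D network comes from.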
In the experiments, we use 580 subjects for model training and evaluation. This is mainly because our work tackles 3D data and the number of training subjects is limited by the cost of model training. Specifically, we used 500 subjects to train each model for 300 epochs on an NVIDIA Tesla T4 GPU, which requires ∼60 hours of training per model. In addition, this work focuses on developing the methodology for multi-view motion tracking, and this sample size aligns with previous cardiac analysis works for method development [13,39,55,56]. A population-based clinical study on the whole UK Biobank (currently ∼50,000 subjects) still requires future investigation.
With the view planning step in standard cardiac MRI acquisition, the acquired multi-view images are aligned and thus able to describe the heart from different views [31]. In order to preserve this spatial connection between the separate anatomical views, data augmentations (e.g., rotation and scaling) that are used in some 2D motion estimation works are excluded in this multi-view 3D motion tracking task.
We use two LAX views (2CH and 4CH) in this work for 3D motion estimation, but the number of anatomical views is not a limitation of the proposed method. More LAX views (e.g., the 3-chamber view) can be integrated into MulViMotion by adding extra encoders in FeatureNet and extra views in L_shape for shape regularization. However, each additional anatomical view leads to increased GPU memory usage and an extra requirement of image annotation (i.e., 2D edge maps).
The data used in the experiments are acquired with a 1.5 Tesla (1.5T) scanner, but the proposed method can also be applied to 3T CMR images. The dark band artifacts that can occur in 3T CMR images may affect the image similarity loss (L_sim). However, the high image quality of 3T CMR and the use of high weights for the regularization terms (e.g., shape regularization and the local smoothness loss) may potentially reduce the negative effect of these artifacts. We utilize the ED frame and the t-th frame (t = 0, 1, ..., T, where T is the number of frames) as paired frames to estimate the 3D motion field. This is mainly because the motion estimated from such frame pairing is needed for downstream tasks such as strain estimation [23,37,46]. In the cardiac motion tracking task, the reference frame is commonly chosen as the ED or ES frame [56], and this frame pairing is often used in other cardiac motion tracking literature, e.g., [39,56,58].
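Given a 3D motion field estimated for such a frame pair, warping one frame (or its segmentation) toward the other is a simple resampling operation. A minimal nearest-neighbour sketch is shown below; `warp_nn` is an illustrative helper, not the paper's (differentiable) warping layer:

```python
import numpy as np

def warp_nn(vol, phi):
    """Warp a 3D volume with a dense displacement field phi (3 x D x H x W)
    using nearest-neighbour sampling: out(x) = vol(x + phi(x))."""
    D, H, W = vol.shape
    grid = np.stack(np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                                indexing="ij"))          # 3 x D x H x W voxel grid
    src = np.rint(grid + phi).astype(int)                # source voxel per output voxel
    for ax, size in enumerate((D, H, W)):                # clamp to volume bounds
        np.clip(src[ax], 0, size - 1, out=src[ax])
    return vol[src[0], src[1], src[2]]

# toy volume with a single bright voxel, displaced by +1 along the depth axis
vol = np.zeros((4, 4, 4)); vol[1, 1, 1] = 1.0
phi = np.zeros((3, 4, 4, 4)); phi[0] += 1.0
out = warp_nn(vol, phi)
print(np.argwhere(out == 1.0))  # the bright voxel now appears at (0, 1, 1)
```

In training, a linear-interpolation version of this resampling is needed so that gradients can flow through the motion field; nearest-neighbour sampling is typically reserved for warping label maps.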
In this work, apart from two typical and widely used conventional algorithms, we also compared the proposed method with a learning-based method [42] which is representative of many recent image registration works. Specifically, the architecture of [42] has been used in many recent works, e.g., [6,48,54], and many others, e.g., [6,20,29], are similar to [42] in that only single-view images are utilized for image registration. Nevertheless, we further compared the proposed method with another recent and widely used learning-based image registration method [6] (VoxelMorph). We train VoxelMorph with the optimal architecture and hyper-parameters suggested by the authors (VM), and we also train VoxelMorph with a bigger architecture (VM†). For a fair comparison, the 2D ground truth edge maps (E_0^sa, E_t^sa in Eq. 8) are used to generate the segmentation of SAX stacks for adding auxiliary information. Table VI shows that the proposed method outperforms VoxelMorph for 3D cardiac motion tracking. This is expected because the SAX segmentation used in VoxelMorph has low through-plane resolution and thus can hardly help improve 3D motion estimation. Moreover, VoxelMorph only uses single-view images, while the proposed method utilizes information from multiple views.

VI. CONCLUSION
In this paper, we propose a multi-view motion estimation network (MulViMotion) for 3D myocardial motion tracking.
The proposed method takes full advantage of routinely acquired multi-view 2D cine CMR images to accurately estimate 3D motion fields. Experiments on the UK Biobank dataset demonstrate the effectiveness and practical applicability of our method compared with other competing methods.

APPENDIX
A. Examples of 3D masks

Fig. 12 shows examples of the 3D masks used in the shape regularization module of MulViMotion. These 3D masks identify the locations of the multi-view images in the SAX stack. We generate these 3D masks in the image preprocessing step by a coordinate transformation using DICOM image header information.
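The coordinate transformation mentioned above relies on the standard DICOM pixel-to-patient mapping (ImagePositionPatient, ImageOrientationPatient and PixelSpacing). A simplified sketch of that mapping follows; `pixel_to_patient` is a hypothetical helper and the paper's exact preprocessing may differ:

```python
import numpy as np

def pixel_to_patient(ipp, iop, spacing, r, c):
    """Map pixel (row r, column c) of a 2D slice to 3D patient coordinates.
    ipp: ImagePositionPatient (3 values, mm); iop: ImageOrientationPatient
    (6 direction cosines: row direction, then column direction);
    spacing: PixelSpacing (row spacing, column spacing) in mm."""
    ipp = np.asarray(ipp, dtype=float)
    row_dir = np.asarray(iop[:3], dtype=float)   # unit vector along increasing column index
    col_dir = np.asarray(iop[3:], dtype=float)   # unit vector along increasing row index
    dr, dc = spacing
    return ipp + row_dir * dc * c + col_dir * dr * r

# toy axial slice at the origin with 1 mm isotropic in-plane spacing
ipp = [0.0, 0.0, 0.0]
iop = [1, 0, 0, 0, 1, 0]   # rows run along x, columns along y
p = pixel_to_patient(ipp, iop, (1.0, 1.0), r=3, c=5)
print(p)  # [5. 3. 0.]
```

Applying this mapping to every pixel of a LAX slice, and inverting the analogous mapping of the SAX stack, yields the voxel locations of the LAX planes inside the SAX volume, which is what the 3D masks encode.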

B. The dynamic videos of motion tracking results
The dynamic videos of the motion tracking results of the different motion estimation methods are attached as "Dynamic videos.zip" in the supplementary material. This file contains four MPEG-4 movies, where "FFD.mp4", "dDemons.mp4" and "3D-UNet.mp4" are the results of the corresponding baseline methods and "MulViMotion.mp4" is the result of the proposed method. All methods are applied to the same test subject. The codec of these videos is H.264. We have verified that the videos open on (1) Windows 10 (Movies & TV player), (2) Linux Ubuntu 20.04 (Videos player), and (3) Mac OS (QuickTime Player). If there is any difficulty opening the attached videos, the same dynamic videos can be found at https://github.com/qmeng99/dynamic videos/blob/main/README.md

C. Additional 3D motion tracking results

Fig. 13 shows additional 3D motion tracking results on a test subject with slice misalignment. This test subject is the same subject used in Fig. 10 of the main paper. These additional results further demonstrate that the proposed method is able to reduce the negative effect of slice misalignment on 3D motion tracking. In addition, we have computed more established clinical biomarkers. Fig. 14 shows the temporal ejection fraction across the cardiac cycle. Myocardial strain is typically measured along three orthogonal directions, namely radial, circumferential and longitudinal. Here, we evaluate the performance of the proposed method by estimating the three strains based on the estimated 3D motion field Φ_t. The myocardial mesh at the ED frame is warped to the t-th frame using a numeric method, and vertex-wise strain is calculated using the Lagrangian strain tensor formula [36] (implemented by https://github.com/Marjola89/3Dstrain analysis). Subsequently, global strain is computed by averaging across all vertices of the myocardial wall.
Our estimated global peak strains are broadly consistent with the value ranges reported in [11,23,28], i.e., radial peak strain is ∼20% to ∼70%, circumferential peak strain is ∼−15% to ∼−22% and longitudinal peak strain is ∼−8% to ∼−20%.
To obtain more reference strains, we have separately computed global longitudinal and circumferential strains on the 2D LAX and SAX slices according to the algorithm in [5]. On the test set, the global longitudinal peak strain is −18.55% ± 2.74% (ours is −9.72% ± 2.49%), while the global circumferential peak strain is −22.76% ± 3.31% (ours is −27.38% ± 9.63%). Our strains may differ from these reference strains because the strains in [5] are computed only on sparse 2D slices with 2D motion field estimation, whereas we compute global strains by considering the whole myocardial wall with 3D motion fields.
For strain estimation, our results are in general consistent with the value ranges reported in [11,23,28]. However, it has to be noted that we calculate strain based on 3D motion fields, whereas most existing strain analysis methods or software packages are based on 2D motion fields, i.e., they only account for in-plane motion within the SAX or LAX views. This may explain the difference between our estimated strain values and those reported in the literature. In addition, there is still a lack of agreement on strain value ranges (in particular for radial strain), even among mainstream commercial software packages [11], because strain value ranges vary with vendor, imaging modality, image quality and motion estimation technique [2,11]. Further investigation is required to establish a reference standard for strain evaluation and to carry out clinical association studies using the reported strain values. Moreover, when manual segmentation is available, it could provide a more accurate shape constraint, which may further improve 3D motion estimation and thus strain estimation.
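The Lagrangian (Green-Lagrange) strain tensor referenced above is, at each material point, E = 0.5 (FᵀF − I) with the deformation gradient F = I + ∇u, where u is the displacement. The sketch below evaluates this at a single point; it illustrates the formula only, not the mesh-based pipeline of the cited implementation:

```python
import numpy as np

def green_lagrange_strain(grad_u):
    """Green-Lagrange strain tensor E = 0.5 (F^T F - I) with F = I + grad_u,
    where grad_u is the 3x3 displacement gradient at one material point."""
    F = np.eye(3) + np.asarray(grad_u, dtype=float)
    return 0.5 * (F.T @ F - np.eye(3))

# toy uniaxial stretch of 10% along the first (e.g. radial) direction
grad_u = np.diag([0.1, 0.0, 0.0])
E = green_lagrange_strain(grad_u)
print(E[0, 0])  # 0.5 * (1.1**2 - 1) = 0.105
```

In the full pipeline, ∇u is estimated per vertex from the warped mesh, E is projected onto the local radial, circumferential and longitudinal directions, and the global strain is the average over all myocardial vertices.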