Spatio-Temporal Deep Learning-Based Undersampling Artefact Reduction for 2D Radial Cine MRI With Limited Training Data

In this work we reduce undersampling artefacts in two-dimensional (2D) golden-angle radial cine cardiac MRI by applying a modified version of the U-net. The network is trained on 2D spatio-temporal slices which are previously extracted from the image sequences. We compare our approach to two 2D and a 3D deep learning-based post processing methods, three iterative reconstruction methods and two recently proposed methods for dynamic cardiac MRI based on 2D and 3D cascaded networks. Our method outperforms the 2D spatially trained U-net and the 2D spatio-temporal U-net. Compared to the 3D spatio-temporal U-net, our method delivers comparable results, but requiring shorter training times and less training data. Compared to the compressed sensing-based methods kt-FOCUSS and a total variation regularized reconstruction approach, our method improves image quality with respect to all reported metrics. Further, it achieves competitive results when compared to the iterative reconstruction method based on adaptive regularization with dictionary learning and total variation and when compared to the methods based on cascaded networks, while only requiring a small fraction of the computational and training time. A persistent homology analysis demonstrates that the data manifold of the spatio-temporal domain has a lower complexity than the one of the spatial domain and therefore, the learning of a projection-like mapping is facilitated. Even when trained on only one single subject without data-augmentation, our approach yields results which are similar to the ones obtained on a large training dataset. This makes the method particularly suitable for training a network on limited training data. Finally, in contrast to the spatial 2D U-net, our proposed method is shown to be naturally robust with respect to image rotation in image space and almost achieves rotation-equivariance where neither data-augmentation nor a particular network design are required.


I. INTRODUCTION
M AGNETIC Resonance Imaging (MRI) is a widely used non-invasive imaging modality in clinical practice.Especially for cardiac applications, MRI does not only provide anatomical imaging with excellent soft tissue contrast but also allows for functional assessment by using 2D cine MRI.
Such images show the heart anatomy for different phases of the cardiac cycle providing valuable information of the heart function [1], [2].
However, MRI suffers from long data-acquisition which determines the achievable spatial and temporal resolution.In order to shorten scan times, ensure sufficiently large spatial coverage and high spatial and temporal resolution, a wide range of undersampling and reconstruction techniques have been proposed ranging from Parallel Imaging to Compressed Sensing (CS) and Dictionary Learning [3] - [12].Cine MRI provides a temporal sequence of images and therefore offers the possibility to exploit the temporal correlation of adjacent frames in order to reduce undersampling artefacts.The movement of the heart during the cardiac cycle is mainly smooth and continuous.Ensuring that undersampling artefacts along time are incoherent and using a sparsifying transform along time such as Fourier transform [3], Principal Component Analysis [13], [14], Wavelet transform [4] or a transform learned from data [7], [11] combined with a L 1 -norm minimization approach can strongly reduce undersampling artefacts.The main challenges of these techniques are to ensure that the sparsifying transform really leads to a sparse signal and long reconstruction times due to the iterative reconstruction.
Recently, Neural Networks (NNs) have been applied to inverse problems as image reconstruction in MRI [15], [16], [17], [18] and computed tomography (CT) [15], [19], [20].Autoencoders, and in particular the U-net [21], a convolutional NN (CNN) which was first introduced for biomedical image segmentation, and different derivations from it [22], [23], have been widely used for removing undersampling artefacts in different medical imaging modalities.In initial works, the images were most commonly reconstructed or processed frame by frame, see e.g.[15].In the case of dynamic MRI, however, the temporal correlation of 2D MRI sequences can be exploited by aligning frames along the channel axis.Thus, 2D CNNs can be trained to map whole undersampled image sequences to their corresponding fully sampled image sequences [24], [25].Further, also CNNs employing 3D-convolutions can be trained on entire image sequences [26].However, in general, due to the resulting high dimensionality of the considered problem, either a large dataset or the application of data-augmentation techniques are indispensable to obtain satisfactory results, see e.g.[24], [26].Nowadays it is common practice to learn the filters of the convolutional layers by considering the images in the spatial domain.In this work, we propose to apply CNNs to two-arXiv:1904.01574v1[eess.IV] 1 Apr 2019 dimensional slices extracted from the spatio-temporal dimension in order to remove undersampling artefacts from a 2D cine MR scan obtained with a 2D Golden radial sampling scheme [27].A persistent homology analysis shows that the manifold of the spatio-temporal slices has a lower topological complexity than the manifold of the two-dimensional spatial image frames and suggests that the learning process of the network can therefore be facilitated.We compare our proposed approach to a 2D U-net trained frame-by-frame [15], a 2D U-net trained sequence-wise [25] and a 3D U-net [26] in terms of image quality, amount of required training data and stability with respect to rotation of the images.The latter is important because 2D cine MRI is commonly obtained in oblique planes which are adapted to the patients anatomy.In addition, our spatio-temporal approach method is compared to three CS-based approaches for image reconstruction of cine MRI: kt-FOCUSS [28], a total variation minimization-based method [12] and a Dictionary Learning-and total variationbased reconstruction method [11].
The paper is organized as follows.In Section II, we shortly discuss how NNs have been integrated within the problem of image reconstruction in MRI so far.Section III introduces our proposed method by discussing an a priori performed persistent homology analysis of the data which is needed to derive the approach as well as the network's architecture.We then show results of In-Vivo experiments and compare our method to other Deep Learning-and CS-based methods in Section IV and finish with a discussion and conclusion in Section V.

II. PROBLEM FORMULATION
In dynamic MRI, the image reconstruction problem is given by finding a solution of the inverse problem where x ∈ C N denotes the complex-valued image sequence with N = N x N y N t , F denotes the Fourier encoding matrix and y corresponds to the measured data in k-space.As the data-acquisition process in MRI is slow, undersampling schemes are applied to fasten the measurement process.Therefore, the inverse problem one encounters in applications is of the form where F I = S I F and S I ∈ C M ×N denotes a binary undersampling operator with M N which sets non-measured values in k-space to zero.Thereby, I ⊂ J = {1, . . ., N } corresponds to the set of indices of the measured Fourier coefficients.Since M N , the problem in (2) is underdetermined and therefore ill-posed.Hence, a direct solution is not possible and usually regularization approaches have to be applied in order to constrain the sought solution.Two widely used regularization techniques are based on Dictionary Learning [7], [11] and total-variation (TV) minimization [29], [12].However, since the methods employ the regularization within an iterative reconstruction, solving the problem in ( 2) is time consuming and NNs have been considered as a valid and powerful alternative, see e.g.[16], [15], [17], [24], [26].
Most commonly, the networks are trained by considering the images in the spatial domain.By doing so, the network learns to distinguish between diagnostic content of the image and the artefacts by exploiting the natural correlation of neighbouring pixel values in spatial domain.Given a dynamic process, one can further make use of the correlation of temporal slices amongst each other.In [25], the work of [15] is extended in the sense that a U-net is trained to directly map whole 2D image sequences of undersampled image reconstructions to 2D image sequences of ground truth images.In [24], the temporal dimension of the sequence is taken into account in the same manner, where furthermore, a weighted datasharing and a data-consistency approach further improve the quality of the reconstruction.Thereby, frames corresponding to different cardiac phases are aligned along the channel axis.The CNNs used in these approaches employ 2D-convolutional layers.As shown in [26], CNNs employing 3D convolutional layers can also be applied for the task of removing undersampling artefacts in dynamic sequences.Note that, for a network employing 2D convolutional layers and assuming the channel's dimension to be the one along which feature maps are combined by linear combination, aligning temporal frames along the channel's axis only slightly increases the computational complexity of the CNN.In this case, the filters size only increases for the first and the last convolutional layers.Employing 3D convolutional layers, in contrast, adds further non-negligible computational cost as well as hardware requirements, increases training time, the number of trainable parameters and therefore the number of samples required to successfully train a network without experiencing overfitting.In the aforementioned methods, the resulting number of availabe training samples reduces to the number of 2D image sequences.Since NNs are well known to require a large number of training samples and as the collection of proper data can be challenging, using these approaches, one usually has to heavily rely on the use of data-augmentation techniques, see e.g.[24], have access to a large dataset [26] or both.However, data-augmentation might also be non-trivial, time consuming or not possible to be performed on the fly.In the case of image reconstruction, the dataset is obtained by a prior dataacquisition process.In a simulation-based framework, one can for example apply arbitrary transformations to a ground truth image, e.g.elastic transformations, and then simulate the dataacquisition process.Also, using different undersampling masks to obtain zero-filled reconstructions can further enrich the data, see for example [24], [25].However, assuming a fixed dataset of pairs of undersampled image reconstructions and ground truth images, transformations would have to be applied to each pair, possibly altering the structure of the undersampling pattern in the input images.The same holds true for including rotated versions of training pairs into the dataset.As CNNs are not necessarily rotation-invariant or rotation-equivariant, these properties are usually achieved by properly augmenting the dataset [30].In contrast, other approaches explicitly incorporate mathematical operations in the design of the network architectures and therewith attempt to reach rotation-invariance or -equivariance [31], [32].High quality images in cardiac MRI are usually reconstructed by applying iterative methods.Thus, obtaining realistic versions of images rotated by a nontrivial rotation, i.e. by a rotation of θ ∈ { kπ 2 : k ∈ {0, 1, 2, 3}}, is computationally demanding, as the k-space data has to be rotated and the iterative reconstruction has to be performed on the rotated data.Therefore, rotation-equivariance, in this case, can either be achieved by means of the network architecture design or by a possibly time consuming data-augmentation process.

III. PROPOSED APPROACH
In medical imaging, the number of available training samples is usually very small compared to the underlying mathematical dimension of the data, i.e. the number of pixels of an image.Therefore, we are particularly interested in the question of whether or not it is possible to train a CNN on a highly limited dataset by making best use of the given data.We propose to train a CNN employing 2D convolutional layers on 2D spatio-temporal slices which can be extracted from the cine image sequences over the cardiac cycle.Once the network is trained, the image sequences can be reconstructed by properly reassembling the spatio-temporal slices.Later, we demonstrate that with our proposed approach, already a small number of 2D cine MRI datasets suffices to successfully train a network.Furthermore, robustness with respect to rotation in the spatial domain is achieved in a natural way by the change of perspective on the given dataset and our method is therefore almost rotation-equivariant.
Consider a dataset of 2D cine MR images D of n subjects, each with N z slices of size N x × N y and N t cardiac phases.Figure 1 shows different possible Deep Learning-based methods for removing undersampling artefacts in dynamic MRI sequences.In the first case, the artefacts are removed by training a network f Θ to map frames to frames, see Figure 1 (a).Given the temporal correlation of adjacent frames, one could also align temporal frames along the channel's axis and apply a network which is trained to map whole image sequences to image sequences, see Figure 1 (b).The same approach can be extended to map image sequences to image sequences but with the network employing three-dimensional convolutional filters, see Figure 1 (c).Our approach exploits spatio-temporal correlation but employs 2D convolutional filters which are trained on the spatio-temporal slices of the image sequences, see Figure 1 (d).Table I lists the number of immediately available training samples, i.e. without data augmentation.Note that with our proposed approach, the number is by far the highest.

A. Persistent Homology Analysis
As a trained denoising autoencoder can be geometrically interpreted to perform a projection-like mapping onto a manifold [33], the study of topological features of the manifold of the input and output images might be of interest for the design of the network architecture, [19], [34].Persistent homology is a mathematical tool that can be used for analysing datasets X ⊂ R n [35].For a two-classes classification problem, singular homology has been used as a complexity measure of the positively labelled submanifold of the input space and a relation between this complexity and the depth of the networks was proven in [36].This and experimental evidence using persistent homology [19], [34], motivates that it might be beneficial to investigate the persistent homology of datasets since it might explain the superiority of specific approaches to others.For a concise introduction to persistent homology see [37], Chapter 1.In general, persistent homology H * assigns a family of persistence modules {H i (X) : i ∈ N} over some field F to a set X ⊂ R n , see [37], Chapter 2. We will only use H 0 which has a much simpler interpretation as follows, see Figure 2. Let X ⊂ R n be a finite set and let r ≥ 0.Then, we can define a graph G r (X) with vertices V r (X) = X and edges This graph is the Rips complex restricted to simplices of dimension at most 1 [35], Chapter 1.3.Let Π(G r (X)) be the set of connected components of G r (X).Then, we can define where F 2 is the field with two elements.For 0 ≤ r < r we have a map Π(G r (X)) → Π(G r (X)) which induces a map H r 0 → H r 0 .The family of these maps is called the 0-th persistent homology of X.A good visualization of persistent homology is the persistent barcode, see Figure 2.For a real number r > 0, the number of connected components of G r (X) is equal the number of intersections of the vertical line at x = r with the barcodes, see Figure 2.This is also the 0-th Betti number β 0 of G r (X) which is a measure of complexity for G r (X), see [35], Chapter 2.3.Hence, the faster the persistent barcode of a dataset X decreases, the less complex the dataset is.By x I , x and r I := x I − x we denote the vector representations of direct reconstruction from undersampled radially acquired data using a non-uniform fast Fourier transform approach (NUFFT), the ground truth reconstruction and the residual, respectively.Since our network reduces artefacts arising from the NUFFT reconstruction as a post-processing step, similar to denoising, we operate on the real-valued magnitude images.Note that, in order to keep notation as simple as possible, by abuse of notation, we do not explicitly distinguish between a spatio-temporal slice and a 2D frame, but the meaning of the symbols should easily emerge from the context.Therefore, in the spatio-temporal training scenario, x I denotes a spatio-temporal slice extracted from an undersampled NUFFT reconstruction, x its corresponding artefact-free spatio-temporal slice and r I its spatio-temporal residual.In the spatial training scenario, x I , x and r I denote 2D frames.In the following, we compare the complexity of the manifolds given by the set of the ground truth images and their residuals in the spatial as well as in the spatio-temporal domain and denote them by M img xy , M res xy and M img xt,yt , M res xt,yt .Note that, in contrast to [19], we find ourselves in the situation where spatio-temporal slices and spatial images do not have the same mathematical dimension, and therefore, in order to be able to compare the manifolds, we restrict our considerations to image patches of the same shape.We performed a persistent homology analysis of the manifold to be learned by using GUDHI [38], [39].We randomly selected 1400 patches of size 18 × 18, obtaining a set X ⊂ R 18 2 for which we computed its persistent homology.To be able to compare the persistent barcodes at the same scale, we normalized the patches by the maximal pairwise L 2 -distance of points in X.We performed the persistent homology analysis for all patches extracted from the spatio-temporal slices and from spatial image frames.We repeated the experiment ten times and averaged the obtained number of connected components for each r ≥ 0 over the experiments.The corresponding barcode diagrams in Figure 3  of the ground truth images, i.e. the connected components merge at larger scales r. Figure 3 (c) also shows that for the ground truth images, the spatial manifold is more complex than the spatio-temporal manifold which is intuitively clear, as the spatial-temporal slices exhibit the temporal correlation of the sequence.This suggests that a network should achieve the best performance when trained to learn the ground truth spatiotemporal manifold.Furthermore, we see that in the case of the spatio-temporal domain, the topological complexity tends to be independent of the number of subjects whose patches are extracted to perform the analysis, see Figure 3 (c) and (d).
In contrast, in the spatial domain, a higher number of subjects used to extract the patches slightly reduces the topological complexity of the data.Therefore, we conclude that a good representation of all possible two-dimensional spatio-temporal slices might be already contained in a small number of 2D image sequences and thus, the number of 2D image sequences needed to successfully train a network in the spatio-temporal domain should be lower than for training the network in the spatial domain.

B. Network Architecture
In the following, we always refer to Θ as the set of trainable parameters of a network and denote a U-net by u Θ .Figure 4 shows the single components of a U-net without residual connection.The network consists of five stages, where each stage is a block of four convolutional layers with 2D filters of shape 3 × 3, followed by batch-normalization [40] and a component-wise ReLU as activation function.The stages are intercepted by 2 × 1-max-pooling layers in the encoding phase and by bilinear interpolation layers followed by 3 × 3 convolutional layers with no activation function in the decoding phase.The initial number of feature maps extracted from the first convolutional layer is set to 64 and is doubled in each block in the encoding phase.The network's output is given by a 1 × 1-convolutional layer which corresponds to a linear combination of the last extracted feature maps.The replacement of the original 2×2-max-pooling by a contraction solely along the spatial dimension empirically turned out to deliver superior results.The black arrows in Figure 4 denote concatenations between the last and the first layer of the corresponding encoding and decoding phases.
Figure 4.The U-net with three encoding stages and four convolutional layers per stage, no residual connection and batch-normalization (BN).In the case we train on the spatial domain, max-pooling is performed in both spatial dimensions, whereas in our proposed approach max-pooling is solely performed along the spatial dimension without contracting the data along the temporal dimension.
Recall from Figure 3 in Subsection III-A that the manifolds of the ground truth images have a lower topological complexity compared to the manifolds of their corresponding residuals.Therefore, according to [19] and [34], one should train the network to learn the features of the artefact-free images.Note that, if the U-net employs a residual connection as in [15], the output is of the form ũΘ (x I ) = x I + u Θ (x I ).If x is used as a label, ũΘ is trained to learn the residual up to a change of sign, as u Θ is the only part of the network containing trainable parameters.Therefore, being consistent with [19], [34], [41], in order to exploit the simpler topological complexity of the ground truth images and still be able to benefit from the residual connection as in [15], we propose to train a U-net with residual connection to estimate the image residuals r I of the spatio-temporal slices.More precisely, if by ũΘ we denote a U-net with residual connection which is trained to map x I to the ground truth residuals r I , and r cnn = ũΘ (x I ) = x I + u Θ (x I ) = x I − x cnn , then the estimates of the images are obtained by Figure 5 shows different approaches for training a U-net to remove undersampling artefacts by training on spatio-temporal slices.Note that, using x as labels for a U-net with residual connection and using the residuals r I as labels for a U-net without residual connection is equivalent in the sense that the trainable parameters are fitted to learn the residuals r I .On the other hand, if we want the network to learn the artefact-free images, we can either use the x as labels and not  Learning the ground truth images x is achieved by using the residuals as labels (right).

C. Loss Function
Dependent on what we want the network to learn, we train the network architecture to minimize different loss functions.
respectively.In the spatio-temporal case, the models u res xt,yt and u img xt,yt are analogously trained by minimizing the loss functions (4)

A. Data acquisition
In the following experiments we evaluate the proposed approach on 2D Golden radial cine MRI images of 19 subjects (15 healthy volunteers + 4 patients) obtained with a bSSFP sequence on a 1.5T MR scanner (Achieva, Philips Healthcare, Best, The Netherlands) during a 10 s breathhold (TR/TE = 3.0/1.5 ms, FA 60 • ).The spatial dimensions are N x × N y = 320 × 320 with an in plane resolution of 2 mm and 8 mm slice thickness.The number of cardiac phases which were reconstructed based on ECG signal is N t = 30.Coil sensitivity information was used to combine the image data of each coil after NUFFT-reconstruction.No further normalization was applied to the image data.The reference images used as ground truth images in the data were reconstructed with kt-SENSE [3] using N θ = 3400 spokes, which already corresponds to an undersampling factor of ∼ 3 in each cine image.In addition, dynamic images with N θ = 1130 (3.4 s scan time) were reconstructed using standard gridding (NUFFT), leading to an undersampling factor of ∼ 9.For each of the 15 healthy volunteers and two patients, N z = 12 slices were acquired while for two patients, only N z = 6 slices were obtained due to limited breathhold capabilities.Note that, in contrast to the healthy volunteers, the patients data contains images where the heart movement dysfunction can be diagnosed provided that the temporal information is enough accurate.

B. Evaluation Metrics
The performance of our method was evaluated in terms of peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [42] and Haar-Wavelet based perceptual similarity index measure (HPSI) [43] as similarity measures and normalized root mean squared error (NRMSE) as error-measure.Note that HPSI has been reported to achieve higher correlation with human opinion scores on different benchmark databases than SSIM [43].The quantitative measures are reported for the two-dimensional frames as well as for the two-dimensional spatio-temporal slices after the image sequences were cropped to 160 × 160 × 30 in order to compute the statistics over the regions of interest of the images.

C. Training Set-Up
Due to our relatively small dataset, all the following experiments were performed in a four-fold cross-validation setting.We split our dataset in portions of 12/3/4 subjects for training/validation/test data, where for one of these configurations, the test data corresponds to the image data coming from patients with heart movement dysfunction.Obviously, the resulting number of training samples in the spatio-temporal domain is much higher than in the spatial case and therefore, for a fair comparison of the methods, we train the networks by keeping the number of backpropagations fixed.Dependent on the perspective on the dataset, this results in a different number of epochs the networks are trained for.For data-balance reasons, we crop the image sequences using a cut-off of 50 pixels in xand y direction.Therefore, the spatial dimensions per frame reduce to 220 × 220.Due to the relatively small number of temporal frames and the large receptive field of the U-net, we also conducted experiments evaluating the performance of the networks trained on spatio-temporal slices by mirroring the boundaries.However, as we did not experience any increase or decrease of performance in explicitly handling the boundary conditions, we conducted all experiments on spatio-temporal slices of shape 220 × 30.The convolutional layers use zeropadding in order to maintain the spatial shape of the samples constant over each stage.Given a U-net as displayed in Figure 4, we are able to use a mini-batch size of 44 when training in the spatio-temporal domain.Thus, we set the mini-batch size in the spatial training case to 6 in order to have a constant number of pixels which the networks are fed with per forward pass, i.e. 44 • 220 • 30 = 290 400 = 6 • 220 • 220.The networks are trained for 5 • 10 4 backpropagations by stochastic gradient descent (SGD) using a learning rate which was gradually decreased from 10 −5 to 10 −7 and from 10 −6 to 10 −8 for the training in the spatio-temporal domain and in the spatial domain, respectively.The learning rates were chosen in a prior parameter study on the validation set.

D. Residual Vs. Image Learning
Here we compare the performance of the spatial U-net models u res xy and u img xy and our spatio-temporal approaches u res xt,yt and u img xy .The models were trained by minimizing the loss functions defined in ( 3) and (4), respectively.Figure 6 shows qualitative results for different possibilities of training illustrated in Figure 5.We see that in both domains, consistent with the previously shown persistent homology analysis, the networks removed the artefacts at their best when they were trained to learn the artefact-free images.From Figure 6 we also already see the superiority of our approach, see (d) and (e), compared to the spatially trained U-net which slightly tends to smooth out image details and less accurately removed artefacts in spatio-temporal domain, see (b) and (c).Table II shows the results obtained for the spatial U-nets u res xy and u img xy and the spatio-temporal U-nets u res xt,yt and u img xt,yt for n = 12, which confirms the heuristics given in Subsection III-A.Note that for the experiment, no data-augmentation was used and therefore, the results differ from the ones reported in Table IV.As a result, we conclude that for the task of removing undersampling artefacts or image denoising, the relation between the topological complexity of the residuals and the fully-sampled image reconstructions can be used to determine which labels to train the network on as well as how to design the network architecture.Since the radial acquisition is designed to be incoherent along the temporal dimension, in all our following experiments we use the U-net architecture as shown in Figure 4 where we make use of the residuals as labels and employ a residual connection as shown in Figure 5 for the case of image learning.In the next Subsection, we also see how learning the manifold M img xt,yt can reduce the training time as the convergence of the training and validation errors is achieved faster.

E. Training with Limited Amount of Data
Here we demonstrate the performance of our proposed approach when we restrict the number of available training samples.For this purpose, we trained the same network on different datasets where we fixed a different number of subjects n whose images we included in the training dataset.We show that with our proposed approach we are able to obtain comparable results even with a small number of subjects.
Note how in the spatial training scenario, the given training data is naturally constrained by the fact that for a fixed slice, different time frames of the ground truth images exhibit a high similarity.Therefore, regardless of the fact that in the spatial domain the ground truth image manifold has a lower complexity than the residual manifold, a network which is trained to learn the ground truth images should be expected to suffer from the limited variability of the data.In contrast, due to the temporal incoherence of the undersampling pattern, this issue should be overcome when learning the residuals.In the spatio-temporal domain, the availability of the data is not an issue as we have n N z (N x + N y ) N z N t samples.Therefore, one would expect the performance of the network to be to some extent independent of the number of subjects n the samples are extracted from.Also, according to the performed persistent homology analysis, the training of the network should be facilitated when trained to learn the manifold of the ground truth images.
Figure 8 shows the behaviour of the loss decay for the spatial approach ((a) and (b)), the spatio-temporal training approach ((c) and (d)), and in both cases, for the situation where the residuals are learned ((a) and (c)) and where the ground truth images are learned ((b) and (d)).We see that for the spatial U-net, for the residual learning and the image learning, increasing the number of subjects n leads to a decrease of the gap between training and validation error.Further, we see that the gaps are larger in the case where the ground truth images are learned which can be related to the low variability of the dataset.In both cases, for n = 12 the gap is small enough to assume that the networks have been properly trained and generalize well.for the case n = 12.For the spatio-temporal approaches, the gaps between training and validation error are smaller compared to the ones for the spatial approaches.This holds for the residual learning as well as the image learning scenario.Further, when the network is trained to learn the ground truth images, the errors converge faster than in the residual training approach, compare Figure 8   convergence rate is highly independent on the number of subjects n.From these experiments, we first conclude that our proposed method is well suited for training a network on a limited number of subjects.Second, forcing the network to learn the manifold given by the ground truth images M img xt,yt facilitates the training, which leads to a faster convergence of the errors and therefore to lower training times.Figure 7 shows a slice of the output of an image in the test set which was obtained with our proposed method.For all n, the artefacts have been successfully removed.We also see that even for n = 1, the dataset is already rich enough in order to allow for a good depiction of cardiac contraction and expansion during the heart cycle.Table III shows the achieved average of the quantitative measures.Even if in terms of quantitative measures the network performs better the larger the training data, the differences are marginal and hardly perceivable by the human eye, see Figure 7. Therefore, we conclude that since the data has a particularly simple structure, little data is already

F. Rotation Equivariance
CNNs are well known to be able to achieve properties as translation-invariance and -equivariance [44].However, they are not naturally invariant or equivariant with respect to rotation and one of the still most used methods to achieve these properties is to appropriately augment the dataset, [30], [45].In contrast, other approaches [31], [32], [46] explicitly incorporate invariant/equivariant convolutional operations in the networks which comes at the cost of a more complex network design.As a rotation in image space, i.e. due to a rotation of the field of view in order to adapt the scan to the geometry of the patient's heart, might easily be encountered, we are interested in achieving rotation-equivariance, i.e. f Θ (ψ(x I )) = ψ(f Θ (x I )) for an already trained network f Θ and rotation ψ in the xy-plane.For the following experiment, we generated new different test sets D ψ θ xy and D ψ θ xt,yt by applying rotations ψ θ with rotation-angle θ and tested the networks which were previously trained on the non-rotated images on the different test sets.By doing so, we were able to isolate and measure the direct effect of the sole rotation in image space on the performance of the network.
We rotated the measured data in k-space and reconstructed the training set for different angles θ.Note that the process is time consuming since the images were reconstructed with kt-SENSE.Therefore, we only reconstructed rotated images for θ = ±66 • , ±33 • and for each θ we further rotated the frames by ±90 • and 180 • , obtaining an overall number of 19 rotated test sets.Figure 9 compares our approach to the 2D spatially trained U-net in terms of quantitative measures calculated over the 2D frames of the different test sets with different rotation angles.For θ = 0, the measures indicate the average measure achieved on the training set.First, we see again that the spatio-temporal training approach clearly outperforms the spatial training approach in terms of all quantitative measures.Further, while rotating the 2D frames yields a noticeable decrease of performance of the network trained in the spatial domain, the network trained on the spatio-temporal slices performs similarly well on the different rotated test sets and is therefore almost rotation-equivariant.

G. Comparison with other Deep Learning-based Methods
Here we compare our approach to other methods based on post-processing with deep NNs.Since we only have access to a limited dataset, for the following experiments, we made use of data-augmentation by using all our rotated images, flipping, shifting the images along the channel axis and adding random constant values to the whole image sequences.By doing so, we created a potentially infinite training set.Note that we did not include elastic deformations as a data-augmentation technique, as the data-acquisition process is not simulated and elastic deformations might alter the structure of the undersampling artefacts in the input data.The first method of comparison is the already discussed spatially trained U-net u img xy .It is trained to map frames to frames and corresponds to the method discussed in [15], [19].The second method of comparison is a natural extension of the first and corresponds to the 2D U-net approach shown in Figure 1 (b) which we refer to as u xy,t .The net is trained to map whole image sequences to whole image sequences by aligning the cardiac phases along the channel's axis and was presented in [25].Further, we compare our method to the 3D U-net approach u xyt presented in [26], see Figure 1 (c).While for the 2D NNs, we cropped the images to 220 × 220 and 220 × 220 × 30 in order to let the networks focus on the diagnostic content of the images, for the 3D U-net, the images used for training needed to be cropped to 128 × 128 × 20, as the network is computationally more expensive.The shape was the one used in [26].In order to obtain image sequences of 320 × 320 × 30, the outputs of the networks were treated as patches and the image sequences were reconstructed from the patches by properly averaging over regions with overlapping patches.In contrast to the models employing 2D convolutional layers, which were trained using SGD, the 3D U-net u xyt was trained in the same setting as suggested in [26] using ADAM [47]. Figure 10 and Table IV show and summarize the obtained results with the described networks.For more detailed information about the reassembling of the image sequences from the patches, see Section IV-J.The spatially trained U-net u img xy correctly removed the undersampling artefacts in the spatial domain.However, the reduction of the artefacts is less accurate than for u img xt,yt , see Figure 10 (b) and (e).Although we report a successful training in terms of consistent decrease of training as well as validation error, the model u xy,t poorly removed the artefacts.Intuitively, the temporal incoherence of the radial undersampling pattern which differs from the one in [25] hinders the learning of the residual manifold and the network is therefore not suitable for our used undersampling scheme.Further, in [25], a zero-filled reconstruction is used as input of the network and therefore, the relation between the manifolds of the residuals and the ground truth images might differ as well from our case.In contrast, learning the manifold of ground truth sequences is highly facilitated by the temporal correlation of the 2D frames.In fact, already a network with one single convolutional layer with N t channels and 64 filters accurately removed all the artefacts from the image sequence.However, temporal information is lost and we point out we were not able to obtain satisfactory results by the application of deeper networks.The 3D U-net u xyt and our proposed method u img xt,yt perform comparably well.Both correctly removed the undersampling artefacts in spatial as well in spatio-temporal domain and led to a good preservation of the heart movement.In terms of the image-error-based PSNR and NRMSE measures, our method performs slightly better than the 3D U-net u xyt which, on the other hand, yields slightly better results in terms of SSIM and HPSI.However, the differences are marginal and barely visible.Further, note how our proposed method achieves similar results as the 3D U-net u xyt even when trained on one single patient, see Table III.Figure 11

H. Experiments with Shallower Networks
Even if we used the network architecture shown in Figure 4 for all experiments, the strength of the method lies in the change of perspective on the data.To demonstrate this, we applied different network architectures following our suggested approach.More precisely, we tested different types of CNNs which can be seen as special cases of the U-net.If by E and C we denote the numbers of encoding stages and convolutional layers per stage of a U-net, E3 C4 corresponds to the network displayed in Figure 4. E1 C8, on the other hand, denotes a single-scale fully CNN with eight convolutional layers and no max-pooling.Figure 12 shows results obtained with different network architectures parametrized by E and C. We see that the networks E1 C8 and E4 C4 which differ in terms of number of trainable parameters by approximately a factor of 10, achieve similar performance.This suggests that the number of trainable parameters and consequently, also training times, could further be reduced without significantly losing performance.

I. Comparison with State-of-the-Art Methods
Here, we compare our proposed approach to established state-of-the-art iterative reconstruction methods for cine cardiac MRI.Since iterative reconstruction methods are time consuming, we only reconstructed images from the patients' data which corresponds to one training/validation/testing setting of our four-fold cross-validation set-up.For comparison, images were reconstructed with kt-FOCUSS, a CS-based approach [4], an iterative reconstruction approach using spatial and temporal total variation (TV+TVT) for regularization [12] and a method employing regularization based on learned spatiotemporal dictionaries as well as spatial and total variation minimization (DL+TV) [42].The latter method was extended by combining the approach proposed in [42] with [7] by learning the dictionaries jointly from the real and imaginary part of the image data.Further, we extended the method to applicable to multi-coil datasets.We implemented the method using the operator discretization library (ODL) [48] for all needed operators.Figure 13 shows examples of the results obtained on the patients' data for the mentioned iterative reconstruction methods and our proposed model u img xt,yt .Although our method was trained on healthy volunteers, pathological heart wall motion (septal flash in Figure 13 (a)-(e) and hypo-kinetic anterior and posterior wall strongly reduced ejection fraction in Figure 13 (f) -(j)) is clearly visible with the proposed method.Also small features, such as the chordae tendinae connecting the valves and the papillary muscles, are well preserved, see Figure 13 (i).Table V shows the obtained results with the iterative reconstruction methods as well as with our proposed network u img xt,yt .We see that our method clearly outperforms the methods kt-FOCUSS and TV+TVT with respect to all reported quantitative measures.The most significant increase of performance is achieved against kt-FOCUSS, where, on the 2D frames, our method yields an increase of approximately 6 dB, 4.9% and 0.2% in terms of PSNR, SSIM and HPSI.Further, our proposed method's NRMSE is approximately half of the one of kt-FOCUSS.TV+TVT surpasses kt-FOCUSS in terms of all reported measures.Even if DL+TV surpasses TV+TVT with respect to all reported measures but HPSI, DL+TV tends to slightly smooth image details, possibly caused by a too strong regularization as well as the smoothing effect of the average of the reconstruction from patches.Further, note that the complex-valued patches were obtained by a disjoint sparse coding of the real and imaginary part of the patches as in [7].Our method u img xt,yt outperforms DL+TV with respect to all reported measures except for SSIM on the spatio-temporal slices.Note that the reconstruction time for DL+TV is higher than for our method by several orders of magnitude, see Subsection IV-J.

J. Reconstruction Times
We report the reconstruction times needed for the reconstruction of the images with the different previously discussed methods.First, we note that the methods employing iterative reconstruction are the most demanding in terms of computational times.kt-FOCUSS, kt-SENSE and TV+TVT are in the same range, where the reconstruction times per slice vary from approximately 110 s to approximately 180 s.The DL+TV method is by far the most computationally expensive method, as the regularized inverse problem has to be solved for each coil separately.Therefore, the average overall reconstruction time per slice amounts to roughly 13 000 s, where nearly  VI summarizes the reconstruction times for a slice of size 320 × 320 × 30 for all the reported methods with the aforementioned strides.The times needed to denoise a slice obviously heavily depend on the number of patches the sequence is reconstructed from and could be easily reduced by using larger strides.For the 2D methods, one could also obtain the 320 × 320 × 30 image sequences by directly applying the networks to the 320 × 320 × 30 samples.Note that for the 3D U-net this not possible because of memory limits.V. DISCUSSION AND CONCLUSION In this work, we have presented a new approach for the task of undersampling artefacts reduction in 2D cine MRI.Even if the employed U-net is a widely used network architecture for various inverse problems, to the best of our knowledge, this is the first work in which the U-net is applied to 2D spatiotemporal slices.We have investigated and demonstrated several advantages of the approach compared to the training in the spatial domain.Consistent with [19], [34], [41], the performed persistent homology analysis confirms the motivation that the superiority of the proposed approach can be attributed to the simpler topological complexity of the two-dimensional spatio-temporal slices.Further, the analysis suggests that the architecture should be chosen such that the network is trained to learn the ground truth images rather than the residuals.Note that our analysis is consistent with the results presented in [15] and [19], where streaking artefacts from a sparse view CT acquisition are most efficiently removed when U-net learns the residual manifold, which was shown to have a lower complexity than the one of the ground truth images [19].This is related to the fact that the undersampling pattern in sparse view CT is regular.Conversely, in CS MRI, where the undersampling schemes, e.g.golden-angle radial undersampling, are designed to be incoherent with the assumed sparsifying basis [51], one would expect the residual manifolds to have a more complex topological structure and therefore, the network's architecture should be chosen appropriately.Further investigation of the relation between the topological complexity of the residuals and the artefact-free images in different imaging modalities and the performance of the trained networks will be investigated in the future.
Our approach allows to successfully train a U-net on highly limited data, overcoming the problem of unavailability of large datasets or the need to rely on data-augmentation.We demonstrated that our method already outperforms the spatially trained U-net when trained on one single healthy volunteer in terms of all quantitative measures.When trained on a small number of volunteers, our network is already able to accurately preserve the heart movement and delivers results which are similar to the ones obtained when training on 12 subjects.In contrast to the spatial training approach, the proposed method naturally almost achieves rotation-equivariance by the sole change of perspective on the data.The network does therefore neither require changes in the architecture, nor data-augmentation based on rotation to achieve this property.Clearly, the reason lies in how a rotation in image space results in a transformation similar to a translation in the spatiotemporal domain, and therefore, since the network consists of convolutional and max-pooling layers, it is stable with respect to rotation in image space.Even if the reconstruction of a single slice and all its cardiac phases requires the evaluation of a large number of samples, reconstruction is fast and can be achieved in approximately 4.4 s on a Titan Xp GPU.As discussed in [22] and [23], the U-net tends to smooth out image details when trained in the spatial domain.In the proposed approach, however, image details in the spatial domain are well preserved.Our method, on the other hand, well preserves image details and further outperforms all other tested 2D CNNs with respect to all reported measures and achieves results comparable to the 3D U-net even when trained only on two subjects.Due to the small size of the data when considered in spatio-temporal domain, training times could be shortened to 3 hours compared to 6 hours for the 3D U-net.Further, since the spatio-temporal manifold M img xt,yt has a particularly simple structure, the reducing the artefacts reduces to a simpler task than in the spatial domain and training times could be further reduced by earlier stopping the training.As for all Deep Learning-based post-processing methods, the main limitation of our proposed method is the possible lack of data-consistency.However, an extension of the method to use complex data and ensure data-consistency could be approached similar as in [52], [24], [53] and might be subject of future work.We further have compared our proposed method to several state-of-the-art methods for iterative reconstruction in dynamic MRI.Our method outperforms kt-FOCUSS and TV+TVT with respect to all reported measures and achieves similar results as the dictionary learning-and total variationbased method DL+TV.However, our method is faster than DL+TV by several orders of magnitude as it performs a onestep regularization based on an initial NUFFT reconstruction.The iterative reconstruction methods kt-FOCUSS, TV+TVT and DL+TV used for comparison require the tuning of several parameters which were kept fixed for all patients.Therefore, further patient-specific parameter tuning might further improve the image quality in Figure 13 (a), (b), (c), (f), (g) and (h).In particular, DL+TV makes specific parameter tuning difficult due to its prohibitive reconstruction times.
In this work, we used kt-SENSE to obtain the ground truth samples from a 10 s breathhold.Although this yielded high image quality, residual undersampling artefacts which might impair the trained U-net might still be visible.Also, kt-SENSE already makes assumptions about the temporal smoothness of the image data.Therefore, further improvement of our method might be achieved by increasing the duration of the breathhold scan to achieve higher ground truth-image quality.

Figure 1 .
Figure 1.Different 2D and 3D Deep Learning-based approaches for the undersampling artefacts reduction.2D network for frame-wise mapping (a), 2D network for sequence-wise mapping with cardiac phases aligned as channels (b), 3D network for sequence-wise mapping with three-dimensional convolutional kernels (c), 2D network for our proposed approach on twodimensional spatio-temporal slices (d).

Figure 2 .
Figure 2. Procedure of the persistent homology analysis.The image shows an example for six randomly extracted patches of an image in the spatial domain and its corresponding barcode.

Figure 3 .
Figure 3.The number of connected components β 0 of Gr(X) for different datasets X at different r.Pairwise comparison of the persistent barcodes for M res xt,yt and M img xt,yt (a), for M res xy and M img xy (b), for M img xy and M img xt,yt (c) and for M res xy and M res xt,yt (d).Persistent codes of M img xy and M img xt,yt for different n.For the sake of visibility, in (e) and (f), only the endpoints of the bars are displayed.
employ a residual connection or use the residuals r I as labels and employ a residual connection.This holds for training the network on two-dimensional frames as well as on twodimensional spatio-temporal slices.By u res xy and u img xy we denote spatial U-net models when trained to learn the spatial residual manifold M res xy and the spatial ground truth image manifold M img xy , respectively.Analogously, we identify u res xt,yt and u img xt,yt as spatio-temporally trained U-nets trained to learn the spatio-temporal manifolds M res xt,yt and M img xt,yt , respectively.

Figure 5 .
Figure 5. Residual and Image Learning: Learning the residuals is achieved by using the images r I as labels and employing a residual connection (left).Learning the ground truth images x is achieved by using the residuals as labels (right).
Let D res xy , D img xy and D res xt,yt , D img xt,yt denote the set of available training samples, i.e. the pairs (x I , r I ) or (x I , x), depending on the domain the data is considered in and on which are the labels used for training.By N xy and N xt,yt we denote their corresponding cardinality.Recall that we use the U-net ũΘ to estimate the residual r I = x I − x and therefore, the image estimate is given by x cnn = x I − ũΘ (x I ).Therefore, in order to define the loss function for a network with residual connection to learn the ground truth images, we use the residuals as labels and vice versa.The models u res xy and u img xy are trained by minimizing the L 2 -errors between the predicted 2D frames and their corresponding labels which are given by For n = 1 and for n = 1, 2, 4, the spatially trained U-nets u res xy and u img xy poorly generalize in both training scenarios, as the networks almost immediately start to overfit the data, see (a) and (b).Spatial training of the networks without data-augmentation is possible for n = 2, 4, 8, 13 for the residual learning and for n = 8, 13for the image learning.However, our method outperforms the spatially trained U-net as it better maintains diagnostic details in spatial and spatio-temporal domain, see Figure7

Figure 6 .
Figure 6.Comparison of different training approaches for U-nets with residual connection.NUFFT reconstruction with N θ = 1130 radial lines (a), spatially trained U-nets u res xy (b) and u img xy (c), proposed spatio-temporal approaches u res xt,yt (d) and u img xt,yt (e), ground truth (f).The point-wise error images are magnified by a factor of ×3.All images are displayed on the same scale.

Figure 7 .
Figure 7. Results on the test set for N θ = 1130 radial lines when the number of subjects whose spatio-temporal slices are extracted was varied.Note that no data-augmentation was used.Proposed method for n = 1 (a), n = 2 (b), n = 8 (c), n = 12 (d), the spatial U-net for n = 12 (e) and the kt-SENSE reconstruction with 3400 radial lines (f).The point-wise error images are magnified by a factor of ×3.All images are displayed on the same scale.

Figure 8 .
Figure 8. Loss behaviour during training with N θ = 1130 for different number of volunteers n contained in the dataset.Training loss (solid) and validation loss (dashed) for the spatial and spatio-temporal U-nets.Spatial residual learning (a), spatial image learning (b), spatio-temporal residual learning (c), spatio-temporal image learning (d).Note that the scales differ due to the different losses and the different domains in which the networks are trained.

Figure 9 .
Figure 9. Performance of the networks when tested on rotated copies of the images contained in the training set.While the network trained in the spatio-temporal domain is robust with respect to rotation, the network trained on images in the spatial domain loses generalization power when tested on rotated copies of the images it was trained on.The dashed lines correspond to the corresponding measure achieved on the training set.

Figure 10 .
Figure 10.Comparison with different Deep Learning-based methods.NUFFT reconstruction with N θ = 1130 radial lines (a), u img xy (b), uxy,t (c), uxyt (d), proposed approach u img xy (e), ground truth kt-SENSE reconstruction (f).The point-wise error images are magnified by a factor of ×3.All images are displayed on the same scale.
shows the statistics calculated on the 2D frames for all different discussed Deep Learning-based post-processing approaches where the number of subjects n contained in the training dataset was varied.The case n = ∞ corresponds to n = 12 with all previously mentioned data-augmentation techniques.Clearly, our proposed method of training on the 2D spatiotemporal slices is the most suitable for obtaining satisfactory results when training a network on a highly limited dataset.The models u img xt,yt and u res xt,yt are the only ones to allow the successful training of a network on data extracted from one single subject.For u img xy and u res xy , the results obtained for n = 2 and n = 4 were obtained by early stopping due to early overfitting.The models u xy,t and u xyt are properly trainable starting from n = 8.The 3D U-net u xyt and our method u img xt,yt achieve comparable performance in terms of the reported measures for n = ∞.

Figure 11 .
Figure 11.Quantitative measures for all discussed Deep Learning-based postprocessing methods.Missing values for some n denote that the network was not properly trainable with the corresponding number of subjects.

Figure 12 .
Figure 12. Results obtained with different CNNs following our proposed aproach u img xt,yt .E1 C8 (a), E4 C4 (b) and E5 C2 (c), kt-SENSE reconstruction with N θ = 3400 radial lines (d).Our approach therefore offers the possibility to further reduce the network complexity as well as training times.

Figure 13 .
Figure 13.Comparison with different state-of-the-art iterative reconstruction methods.kt-FOCUSS (a) and (f), TV+TVT (b) and (g), DL+TV (c) and (h), proposed method (d) and (i), kt-SENSE reconstruction with N θ = 3400 radial lines.The point-wise error are magnified by a factor of ×3.All images are displayed on the same scale.

Table II PERFORMANCE
FOR THE SPATIAL AND OUR SPATIO-TEMPORAL APPROACHES DEPENDENT ON THE USED ARCHITECTURES.

Table III RESULTS
ON THE TEST WHEN THE NUMBER OF SUBJECTS WHOSE IMAGES WERE INCLUDED IN THE TRAINING SET IS VARIED.

Table IV COMPARISON
OF DIFFERENT DEEP LEARNING-BASED APPROACHES.

Table V COMPARISON
WITH DIFFERENT ITERATIVE RECONSTRUCTION METHODS.