DuDoCFNet: Dual-Domain Coarse-to-Fine Progressive Network for Simultaneous Denoising, Limited-View Reconstruction, and Attenuation Correction of Cardiac SPECT

Single-Photon Emission Computed Tomography (SPECT) is widely applied for the diagnosis of coronary artery diseases. Low-dose (LD) SPECT aims to minimize radiation exposure but leads to increased image noise. Limited-view (LV) SPECT, such as the latest GE MyoSPECT ES system, enables accelerated scanning and reduces hardware expenses but degrades reconstruction accuracy. Additionally, Computed Tomography (CT) is commonly used to derive attenuation maps (µ-maps) for attenuation correction (AC) of cardiac SPECT, but it introduces additional radiation exposure and SPECT-CT misalignments. Although various methods have been developed to solely focus on LD denoising, LV reconstruction, or CT-free AC in SPECT, simultaneously addressing these tasks remains challenging and under-explored. Furthermore, it is essential to explore the potential of fusing cross-domain and cross-modality information across these interrelated tasks to further enhance the accuracy of each task. Thus, we propose a Dual-Domain Coarse-to-Fine Progressive Network (DuDoCFNet), a multi-task learning method for simultaneous LD denoising, LV reconstruction, and CT-free µ-map generation of cardiac SPECT. Paired dual-domain networks in DuDoCFNet are cascaded using a multi-layer fusion mechanism for cross-domain and cross-modality feature fusion. Two-stage progressive learning strategies are applied in both the projection and image domains to achieve coarse-to-fine estimations of SPECT projections and CT-derived µ-maps.
Our experiments demonstrate DuDoCFNet's superior accuracy in estimating projections, generating µ-maps, and reconstructing AC images compared to existing single- or multi-task learning methods, under various iterations and LD levels. The source code of this work is available at https://github.com/XiongchaoChen/DuDoCFNet-MultiTask.


I. INTRODUCTION
Myocardial perfusion imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT) is the most widely performed nuclear medicine exam for the diagnosis of coronary artery diseases. Reducing the injected dose can lower the potential risk of radiation to patients, but it increases the image noise [1]. Acquiring limited-view (LV) projections using fewer solid-state detectors enables accelerated scanning and reduces hardware costs. However, it can lead to lower reconstruction accuracy due to reduced angular sampling [2]. Additionally, Computed Tomography (CT)-derived attenuation maps (µ-maps) are commonly employed for attenuation correction (AC) to improve the diagnostic accuracy of cardiac SPECT [3]. However, the extra CT scans cause additional radiation exposure and introduce SPECT-CT misalignments [4]. Moreover, stand-alone SPECT scanners without CT assistance dominate the SPECT market, where the conventional CT-based AC approach is not available.
Deep learning techniques have been developed to address LD denoising, LV reconstruction, or CT-free AC in nuclear medicine. Existing deep learning approaches for LD denoising in nuclear medicine are categorized into projection-domain and image-domain approaches. For the projection-domain approaches, Shiri et al. [5] applied a 2D ResNet to estimate full-dose (FD) projections from LD projections in cardiac SPECT. Similarly, Olia et al. [6] used a 2D conditional generative adversarial network (GAN) to denoise LD projections under multiple noise levels. Sun et al. [7] further used a 3D conditional GAN to denoise LD projections in cardiac SPECT. In contrast, the image-domain approaches were developed by inputting LD images into neural networks to estimate FD images [8], [9], [10], [11]. Previous studies proved that the projection-domain approaches consistently outperformed the image-domain approaches in both simulation and clinical studies [7], [12].
Previous deep learning techniques for LV reconstruction in nuclear medicine can be summarized into projection-domain, image-domain, and dual-domain methods. In the projection-domain methods, Whiteley and Gregor [13] applied a U-Net [14] to estimate full-view (FV) projections from LV projections of whole-body Positron Emission Tomography (PET). Similarly, Shiri et al. [5] used a ResNet to predict FV projections from LV projections in cardiac SPECT. In contrast, the image-domain methods were developed by feeding reconstructed LV images into neural networks to estimate FV images [15]. Amirrashedi et al. [15] further reported that the projection-domain methods outperformed the image-domain methods due to the richer information in the projection representation. Furthermore, Chen et al. [16] proposed a dual-domain method named Dual-Domain Sinogram Synthesis (DuDoSS), which utilized the image-domain output as prior information to estimate the FV projection in the projection domain. DuDoSS outperformed both projection-domain and image-domain methods [16]. Moreover, Li et al. [17] proposed a dual-domain end-to-end training method (End2End-Net) for accurate SPECT reconstruction, which incorporated the image reconstruction step into the dual-domain framework so that supervision could be applied to the reconstructed images. In addition, deep learning techniques have also been employed for the LV reconstruction of other related medical imaging modalities such as CT [18], [19], [20], [21], [22]. Specifically, a pivotal study by Wang [19] proposed to incorporate deep learning algorithms into CT image reconstruction, which demonstrated the potential of neural networks to enhance image quality based on LV CT data. Another study by Chen et al. [20] combined a deep convolutional neural network (CNN) with total variation minimization, proving that this hybrid approach can effectively improve image quality and reduce noise in LV CT reconstruction. Additionally, in a study by Yi and Babyn [21], a GAN was trained to generate realistic CT images from LV data, thereby improving the structural integrity of the reconstructed CT images. Wu et al. [22] proposed a dual-domain residual-based optimization network (DRONE) for LV CT reconstruction, which consisted of three modules for embedding, refinement, and awareness, respectively. It also integrates deep learning with compressed sensing for enhanced reconstruction accuracy.
Deep learning-based CT-free AC in nuclear medicine can be generally classified into indirect and direct strategies [23]. For the indirect strategy, Shi et al. [24] employed both U-Net and GAN to generate synthetic µ-maps from SPECT images, which were then utilized for the AC reconstruction of cardiac SPECT. In contrast, Yang et al. [25] applied a direct strategy in which the non-AC SPECT image was input to neural networks to estimate the AC SPECT image without the intermediate step of generating µ-maps. Furthermore, the patient non-imaging clinical information was embedded to predict more accurate AC SPECT images in a recent study [26]. Previous works proved that the indirect approaches outperformed the direct approaches in multiple clinical scanners [27], [28], [29]. However, the inaccurate estimation of the µ-map boundaries remains a major limitation of the indirect approaches, as shown in [27].
While the aforementioned methods have been developed to individually tackle LD denoising, LV reconstruction, or CT-free AC in nuclear medicine, the solution for simultaneously addressing these tasks remains challenging and under-explored. Recently, multi-task simultaneous learning has been investigated in various medical imaging applications, which leverages domain-specific information across interrelated tasks to further enhance accuracy [30]. For instance, existing strategies for simultaneous registration and segmentation involved utilizing a shared encoder for multi-task feature extraction [31], [32] or employing segmentation results to assist the registration [33], [34], [35], [36], [37]. Specifically, Qin et al. [31] proposed a multi-scale network for the registration and segmentation of 2D cardiac Magnetic Resonance (MR) images. A shared encoder was applied by the segmentation and registration branches for feature extraction, leading to improved accuracy in both tasks. In contrast, Xu and Niethammer [33] trained two convolutional modules for the registration and segmentation of 3D brain MR images, respectively. The registration module was supervised by an anatomy similarity loss based on the predicted masks from the segmentation module. Another important multi-task learning application is simultaneous segmentation and classification, where the common approach is utilizing segmentation results to improve the classification accuracy [38], [39], [40], [41]. Specifically, Wu et al. [38] applied a U-Net for the segmentation and classification of lung nodules in CT. The predicted segmentation masks and bottleneck features were fused to improve the subsequent classification. In addition, Xu et al. [41] applied two cascaded convolutional modules, in which the output of the segmentation module was fed into the classification module to improve the classification of tongue images. Multi-task learning has also been explored in nuclear medicine imaging. Li et al. [42] proposed a recurrent framework for joint motion estimation and reconstruction in PET. In this study, a learned registration network was incorporated into a regularized PET image reconstruction module for simultaneous learning. Moreover, Zhou et al. [43] incorporated bidirectional LSTM layers into a Siamese pyramid network for simultaneous motion estimation and denoising in LD PET.
Although showing promising results, the above multi-task learning studies primarily focused on image-based features within single-domain frameworks. Dual-domain methods have exhibited superior performance to single-domain methods in various studies, due to the augmented information constraints in both domains [16], [44], [45], [46]. Therefore, recent studies further explored fusing dual-domain features for better performance in multi-task learning [47], [48]. However, these studies were conducted based on a single imaging modality. Fusing multi-modality image features has proved to be more effective than using a single modality in deep learning applications [49], since the complementary information from multiple modalities enables more comprehensive feature representations [50]. Multi-modality information fusion has been explored in a wide range of deep learning studies based on CNNs [51], [52], [53], [54], [55], [56]. Liu et al. [51] presented a Siamese convolutional network to obtain a weighted map that fuses the pixel-wise information of the two input images. Hou et al. [52] introduced a multi-modality fusion method based on a dual-channel spiking cortical model (DCSCM), which generated and combined both low- and high-frequency coefficients of images to achieve enhanced feature fusion. Xia et al. [54] proposed a fusion scheme for multi-modality medical images that utilizes and combines the image features from both multi-scale transformation and convolutional modules. Song et al. [55] presented a cross-attention block based on non-local attention for the cross-modality fusion and registration of ultrasound and CT images. Chen et al. [56] further developed a dual-branch squeeze-fusion-excitation (DuSFE) co-attention module for the cross-modality fusion and registration of nuclear medicine images and CT. Building on these ideas, a recent study proposed a cross-domain and cross-modality network (CDI-Net) for multi-task learning [57]. However, the simple channel concatenation in CDI-Net might not effectively fuse the cross-domain or cross-modality image features. Furthermore, the basic U-Net models within CDI-Net might not be optimal for estimating projections or µ-maps, as they do not consider the characteristics of the specific images and tasks.
Therefore, we propose a Dual-Domain Coarse-to-Fine Progressive Network (DuDoCFNet) for simultaneous LD denoising, LV reconstruction, and µ-map generation of cardiac SPECT. Paired projection-domain and image-domain networks are cascaded using a multi-layer fusion (MLF) mechanism for cross-domain and cross-modality feature fusion. In the projection domain, Two-Stage Progressive Networks (TSP-Net) are utilized for LD denoising and LV restoration. The U-Net-like downsampling-upsampling framework [14] can restore general structures but might not preserve finer image details. Thus, TSP-Net utilizes a U-Net-like framework in Stage 1 to restore the coarse LV structures, followed by a non-downsampling module in Stage 2 to recover finer details of the LD projection. In the image domain, Boundary-Aware Networks (BDA-Net) enhance the spatial attention on image boundaries and thus improve the boundary accuracy of the predicted µ-maps. BDA-Net first employs a shared encoder in Stage 1 to predict a coarse µ-map and its boundary image, which are then adaptively fused in Stage 2 to generate a refined µ-map. DuDoCFNet was trained end-to-end using both projection and image losses. Experiments showed that DuDoCFNet achieved superior accuracy under various iterations and LD levels.

A. Problem Formulation
The goal of this multi-task learning study is to generate the predicted FD and FV projection ($\hat{P}_{FDFV}$) and µ-map ($\hat{\mu}$) with the LD and LV projection ($P_{LDLV}$) as input, formulated as:

$$\{\hat{P}_{FDFV},\ \hat{\mu}\} = H(P_{LDLV}), \tag{1}$$

where $H(\cdot)$ is the DuDoCFNet operator. The output labels are the ground-truth FD and FV projection ($P_{FDFV}$) and the CT-derived µ-map ($\mu$). Then, $\hat{P}_{FDFV}$ and $\hat{\mu}$ are input into a reconstruction module to output the predicted AC SPECT images. Thus, predicting $\hat{P}_{FDFV}$ achieves the LD denoising and LV restoration, and predicting $\hat{\mu}$ enables the CT-free AC.

B. Data Preparation
A total of 600 anonymized clinical hybrid one-day SPECT-CT stress/rest MPI studies were included in this work. Each study, along with the raw list-mode data, was acquired on a GE NM/CT 570c dedicated SPECT-CT scanner following the injection of $^{99m}$Tc-tetrofosmin.
The GE NM/CT 570c scanner features a cylindrical surface array of 19 detectors, organized into three columns. Specifically, there are 5, 9, and 5 detectors placed on the top, central, and bottom columns, respectively [58]. These detectors collectively focus on the central cardiac regions to gather photon projection data from various spatial angles. The GE MyoSPECT ES scanner is the latest cost-effective variant of the GE NM/CT 570c scanner, comprising only the 9 central-column detectors. By including only the central 9 detectors, we simulated the configuration of the GE MyoSPECT ES scanner, as presented in Fig. 1.
The LV projections were generated by zero-padding the top and bottom columns of the FV projections while retaining the central-column projections unchanged. Replicated padding was utilized to expand all 19-layer projections to 20 layers, facilitating the convolutional and downsampling computations. The LD projections were generated by randomly decimating the FD list-mode data with a default 10% downsampling rate. $P_{LDLV} \in \mathbb{R}^{32\times32\times20}$ was produced by conducting both the LV and LD downsampling. $P_{FDFV} \in \mathbb{R}^{32\times32\times20}$ was the original FD and FV projection. The CT-derived µ-maps ($\mu \in \mathbb{R}^{72\times72\times40}$) with a voxel size of $4\times4\times4$ mm³ were well registered and resolution-matched to the SPECT images. 250, 100, and 250 cases were utilized for training, validation, and testing, respectively.
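To make the simulation concrete, the LV masking, LD decimation, and layer padding described above can be sketched in NumPy. The detector ordering along the last axis, the binomial thinning as a proxy for list-mode decimation, and all array names are assumptions for this toy example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical FD&FV projection: 32x32 bins per detector, 19 detector layers.
p_fdfv = rng.poisson(20.0, size=(32, 32, 19)).astype(np.float64)

def simulate_lv(proj, top=5, central=9):
    """Zero out the top- and bottom-column detectors, keeping the central 9
    (the detector ordering along the last axis is an assumption)."""
    lv = np.zeros_like(proj)
    lv[..., top:top + central] = proj[..., top:top + central]
    return lv

def simulate_ld(proj, rate=0.10, rng=rng):
    """Binomial thinning of the counts approximates decimating the
    list-mode data at a 10% downsampling rate."""
    return rng.binomial(proj.astype(np.int64), rate).astype(np.float64)

def pad_layers(proj):
    """Replicate the last detector layer to expand 19 layers to 20,
    easing the convolutional and downsampling computations."""
    return np.pad(proj, ((0, 0), (0, 0), (0, 1)), mode="edge")

p_ldlv = pad_layers(simulate_ld(simulate_lv(p_fdfv)))  # shape (32, 32, 20)
```

Binomial thinning keeps each recorded count with probability 0.10, which statistically mimics the 10% list-mode downsampling.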

C. DuDoCFNet Overview
The framework of DuDoCFNet is presented in Fig. 2. $P_{LDLV}$ is first reconstructed by a Maximum-Likelihood Expectation-Maximization module (ML-EM, 30 iterations), which generates the LD and LV SPECT image $S_{LDLV} \in \mathbb{R}^{72\times72\times40}$. Then, $P_{LDLV}$ and $S_{LDLV}$ are fed into DuDoCFNet to simultaneously estimate $P_{FDFV}$ and $\mu$. The projection-domain TSP-Nets (Fig. 3) and image-domain BDA-Nets (Fig. 4) are cascaded for cross-domain and cross-modality feature fusion. The details of TSP-Net and BDA-Net are described in the subsequent subsections.
In the 1st iteration, $P_{LDLV}$ is first input into TSP-Net$_1$, which outputs the predicted FD and FV projection $\hat{P}^1_{FDFV}$. After Back-Projection (BP), $\hat{P}^1_{FDFV} \in \mathbb{R}^{32\times32\times20}$ is input to BDA-Net$_1$ and turned into an image-domain feature map with a matrix size of $72\times72\times40$, leveraging the auxiliary emission information to enhance the µ-map estimation. $S_{LDLV}$ is also input into BDA-Net$_1$ to provide the image-domain information. An MLF mechanism (to be described later) is employed in BDA-Net$_1$ to adaptively fuse the cross-domain features and produce a more accurate $\hat{\mu}^1$.
Then, in the $i$th ($i \geq 2$) iteration, the image-domain output of the previous iteration, $\hat{\mu}^{i-1}$, is input to TSP-Net$_i$ after Forward-Projection (FP). The FP operator turns $\hat{\mu}^{i-1} \in \mathbb{R}^{72\times72\times40}$ into a projection-domain feature map with a matrix size of $32\times32\times20$, employing the auxiliary anatomical information of $\hat{\mu}^{i-1}$ to assist the projection estimation. Additionally, the predicted projections from the previous $(i-1)$ iterations are concatenated and input into TSP-Net$_i$, preserving the previously estimated emission features to gradually enhance the prediction accuracy. TSP-Net$_i$ also utilizes the MLF mechanism to adaptively fuse the cross-domain and cross-modality features. The output of TSP-Net$_i$ is formulated as:

$$\hat{P}^i_{FDFV} = T_i\big(\{P_{LDLV},\ f(\hat{\mu}^{i-1}),\ \hat{P}^1_{FDFV}, \ldots, \hat{P}^{i-1}_{FDFV}\}\big), \tag{2}$$

where $T_i$ is the TSP-Net$_i$ and $f$ is the FP operator. $\{\cdot\}$ refers to channel-wise concatenation. Similarly, $\hat{P}^i_{FDFV}$ after BP is input to BDA-Net$_i$, employing the auxiliary emission information of $\hat{P}^i_{FDFV}$ to assist the µ-map estimation. The predicted µ-maps from the previous $(i-1)$ iterations are also concatenated and input into BDA-Net$_i$, generating $\hat{\mu}^i$ as:

$$\hat{\mu}^i = D_i\big(\{S_{LDLV},\ b(\hat{P}^i_{FDFV}),\ \hat{\mu}^1, \ldots, \hat{\mu}^{i-1}\}\big), \tag{3}$$

where $D_i$ is the BDA-Net$_i$ and $b$ is the BP operator. The predictions of the $N$th iteration ($N$ represents the total number of iterations, default 5), $\hat{P}^N_{FDFV}$ and $\hat{\mu}^N$, are the prediction outputs of DuDoCFNet as described in Eq. 1. The predicted projection $\hat{P}^N_{FDFV} \in \mathbb{R}^{32\times32\times20}$ and µ-map $\hat{\mu}^N \in \mathbb{R}^{72\times72\times40}$ are then input to an offline ML-EM reconstruction module (30 iterations) to generate the final reconstructed AC SPECT image $\hat{S}_{FDFV} \in \mathbb{R}^{72\times72\times40}$.
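The iterative cascade described above can be sketched with placeholder operators; `tsp_net`, `bda_net`, `f`, and `b` below are hypothetical stand-ins (simple resizing and averaging) for the trained networks and the forward-/back-projection pair:

```python
import numpy as np

# Shapes from the paper; all operators below are hypothetical stand-ins
# for TSP-Net, BDA-Net, forward-projection f, and back-projection b.
PROJ, IMG = (32, 32, 20), (72, 72, 40)

def f(mu):            # forward projection: image domain -> projection domain
    return np.resize(mu, PROJ)

def b(proj):          # back projection: projection domain -> image domain
    return np.resize(proj, IMG)

def tsp_net(inputs):  # stand-in: average the concatenated inputs
    return np.mean([np.resize(x, PROJ) for x in inputs], axis=0)

def bda_net(inputs):  # stand-in: average the concatenated inputs
    return np.mean([np.resize(x, IMG) for x in inputs], axis=0)

def dudocfnet(p_ldlv, s_ldlv, n_iter=5):
    projs, mus = [], []
    for i in range(n_iter):
        # TSP-Net_i: fuse P_LDLV, FP of the previous mu, and earlier projections.
        tsp_in = [p_ldlv] + ([f(mus[-1])] if mus else []) + projs
        projs.append(tsp_net(tsp_in))
        # BDA-Net_i: fuse S_LDLV, BP of the current projection, and earlier mu-maps.
        bda_in = [s_ldlv, b(projs[-1])] + mus
        mus.append(bda_net(bda_in))
    return projs[-1], mus[-1]

p_hat, mu_hat = dudocfnet(np.ones(PROJ), np.ones(IMG))
```

Each iteration consumes all previous predictions, mirroring how the cascade gradually refines both the projection and µ-map estimates.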

D. TSP-Net in Projection Domain
The architecture of TSP-Net is illustrated in Fig. 3. As mentioned above, the downsampling-upsampling framework can recover general structures but not finer image details. Thus, TSP-Net uses a U-Net-like framework for coarse LV restoration in Stage 1 and a non-downsampling module for finer LD denoising in Stage 2. As described in Eq. 2, the inputs to TSP-Net include the projection-domain emission features and the auxiliary image-domain anatomical features after FP.
In Stage 1, residual dense blocks with attention (AttenRDB, depicted at the bottom left of Fig. 3) are employed for image feature extraction in TSP-Net. The emission and the auxiliary anatomical features are connected at multiple downsampling layers to enable effective feature fusion across spatial dimensions, which constitutes the MLF mechanism. Due to the discrepancy between the two imaging modalities, we propose a Cross-Domain Feature Fusion (CDF) module to calibrate the channel-wise weights before fusing the two-modality features. As shown at the bottom right of Fig. 3, the two inputs of CDF, $X_1$ and $X_2 \in \mathbb{R}^{h\times w\times d\times c}$ ($h$, $w$, $d$, and $c$ refer to height, width, depth, and number of channels), are first downsampled into $w_1$ and $w_2 \in \mathbb{R}^c$ to encode the channel-wise weight features. Then, $w_1$ and $w_2$ are fused into $w_f \in \mathbb{R}^{2c}$ using concatenation and a fully-connected layer, described as:

$$w_f = F_0(\{D(X_1),\ D(X_2)\}), \tag{4}$$

where $D(\cdot)$ is downsampling by average pooling and $F_0$ is the fully-connected layer. Then, $w_f$ is utilized to generate the recalibration weights $\hat{w}_1$ and $\hat{w}_2 \in \mathbb{R}^c$, which are applied back to $X_1$ and $X_2$ using element-wise multiplication. The recalibrated features are then concatenated to generate the CDF output $\hat{X}_{cdf} \in \mathbb{R}^{h\times w\times d\times 2c}$, formulated as:

$$\hat{X}_{cdf} = \{\sigma(F_1(w_f)) \otimes X_1,\ \sigma(F_2(w_f)) \otimes X_2\}, \tag{5}$$

where $\sigma(\cdot)$ is the sigmoid activation function, $F_1$ and $F_2$ are fully-connected layers, and $\otimes$ refers to the element-wise multiplication. Therefore, CDF effectively fuses the channel information and optimizes the fusion weights of the anatomical and emission features. Next, the output of the last decoding layer, $\hat{P}_0$, goes through a $1\times1\times1$ convolution layer to generate the predicted LD and FV projection $\hat{P}_{LDFV}$. In Stage 2, $\hat{P}_{LDFV}$ is input to a non-downsampling module for denoising. Meanwhile, inspired by [61], we apply a self-attention block that adds an attention-weighted residual to $\hat{P}_{LDFV}$, refining it before it is fed into Stage 2. The training loss of TSP-Net$_i$ is computed on the predictions of both stages, formulated as:

$$\mathcal{L}^i_{Proj} = \|\hat{P}^i_{LDFV} - P_{LDFV}\|_1 + \|\hat{P}^i_{FDFV} - P_{FDFV}\|_1, \tag{7}$$

where $\hat{P}^i_{LDFV}$ and $\hat{P}^i_{FDFV}$ indicate the predictions in the $i$th iteration, and $P_{LDFV}$ is the ground-truth LD and FV projection.
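A minimal NumPy sketch of the CDF recalibration described above, with random matrices standing in for the learned fully-connected layers $F_0$, $F_1$, and $F_2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cdf_fuse(x1, x2, w_fc0, w_fc1, w_fc2):
    """Cross-Domain Feature Fusion sketch: recalibrate the channel weights
    of two feature maps before concatenating them. The weight matrices are
    hypothetical stand-ins for the learned layers F0, F1, F2."""
    c = x1.shape[-1]
    # Global average pooling encodes channel-wise weight features w1, w2 in R^c.
    w1 = x1.mean(axis=(0, 1, 2))
    w2 = x2.mean(axis=(0, 1, 2))
    # Concatenation and the fully-connected layer F0 give w_f in R^{2c}.
    w_f = np.concatenate([w1, w2]) @ w_fc0
    # Recalibration weights in (0, 1)^c, applied back channel-wise.
    x1_hat = sigmoid(w_f @ w_fc1) * x1
    x2_hat = sigmoid(w_f @ w_fc2) * x2
    return np.concatenate([x1_hat, x2_hat], axis=-1)  # R^{h x w x d x 2c}

h, w, d, c = 8, 8, 5, 4
x1 = rng.normal(size=(h, w, d, c))
x2 = rng.normal(size=(h, w, d, c))
out = cdf_fuse(x1, x2,
               rng.normal(size=(2 * c, 2 * c)),
               rng.normal(size=(2 * c, c)),
               rng.normal(size=(2 * c, c)))
```

Because the sigmoid weights lie in (0, 1), the module can only attenuate channels, which is how the fusion weights between the anatomical and emission features are balanced.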

E. BDA-Net in Image Domain
The structure of BDA-Net is shown in Fig. 4. As mentioned above, inaccurate boundary estimation poses a significant limitation in predicting µ-maps [27]. To address this, BDA-Net first uses a shared encoder in Stage 1 to predict a coarse µ-map and a boundary image as the preliminary results. Then, the predicted coarse µ-map and boundary image are fused in Stage 2 to enhance the spatial attention on the image boundaries, thus generating a refined µ-map. As described in Eq. 3, the inputs to BDA-Net include the image-domain anatomical features and the auxiliary projection-domain emission features after BP.
In the shared encoder of Stage 1, the MLF mechanism is utilized to connect and fuse the anatomical and the auxiliary emission features at multiple downsampling layers. Then, inspired by [32], two task-specific decoders are used to predict a coarse image $\hat{\mu}_0$ and a boundary profile $\hat{\beta}$. Next, a Spatial Boundary Enhancement (SBE) module utilizes $\hat{\mu}_0$ and $\hat{\beta}$ as inputs to fuse the spatial features and enhance the spatial attention on the image boundaries. As shown at the bottom of Fig. 4, the inputs $U_1$ and $U_2 \in \mathbb{R}^{72\times72\times40\times c}$ are squeezed into $s_1$ and $s_2 \in \mathbb{R}^{72\times72\times40\times1}$ using $3\times3\times3$ convolutions. Then, $s_1$ and $s_2$ are fused by concatenation and a convolutional layer, generating the combined spatial feature $s_f \in \mathbb{R}^{72\times72\times40\times2}$ as:

$$s_f = G_0(\{s_1,\ s_2\}), \tag{8}$$

where $G_0(\cdot)$ is the convolutional layer. Then, the recalibration weights $\hat{s}_1$ and $\hat{s}_2$ are generated from $s_f$ and applied back to $U_1$ and $U_2$ using element-wise multiplication. The recalibrated spatial features are concatenated to generate the SBE output $\hat{U}_{sbe} \in \mathbb{R}^{72\times72\times40\times2c}$, formulated as:

$$\hat{U}_{sbe} = \{\sigma(N_3(G_3(s_f))) \otimes U_1,\ \sigma(N_4(G_4(s_f))) \otimes U_2\}, \tag{9}$$

where $G_3(\cdot)$ and $G_4(\cdot)$ refer to convolutions, and $N_3(\cdot)$ and $N_4(\cdot)$ are the batch normalization layers. Then, $\hat{U}_{sbe}$ is input to the subsequent convolutional module in Stage 2 to extract the boundary information. Finally, a refined µ-map with more accurate boundaries, $\hat{\mu}$, is generated as the output of BDA-Net.
The training loss of BDA-Net$_i$ is calculated based on its preliminary and refined outputs in both stages, formulated as:

$$\mathcal{L}^i_{Img} = \|\hat{\mu}^i_0 - \mu\|_1 + \|\hat{\beta}^i - \beta\|_1 + \|\hat{\mu}^i - \mu\|_1, \tag{10}$$

where $\hat{\beta}^i$, $\hat{\mu}^i_0$, and $\hat{\mu}^i$ are the predicted boundary and µ-maps in the $i$th iteration. $\beta$ represents the boundary of $\mu$ and is determined by calculating the 3D gradients of $\mu$, formulated as:

$$\beta = \sqrt{(\nabla_x \mu)^2 + (\nabla_y \mu)^2 + (\nabla_z \mu)^2}. \tag{11}$$
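A sketch of the boundary-label computation on a toy µ-map; the use of central differences (`np.gradient`) and the gradient-magnitude form are assumptions, as the exact gradient operator is not specified here:

```python
import numpy as np

def boundary_image(mu):
    """Boundary label beta as the magnitude of the 3D image gradients of mu.
    np.gradient (central differences) is an assumed stand-in for the paper's
    gradient operator."""
    gx, gy, gz = np.gradient(mu.astype(np.float64))
    return np.sqrt(gx**2 + gy**2 + gz**2)

# Toy mu-map: a uniform block inside an empty 72x72x40 volume.
mu = np.zeros((72, 72, 40))
mu[20:52, 20:52, 10:30] = 0.15  # soft-tissue-like attenuation value
beta = boundary_image(mu)
```

The resulting `beta` is zero in homogeneous regions and nonzero only where the attenuation value changes, i.e., exactly along the object boundaries that BDA-Net is trained to emphasize.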

F. Implementation Details and Ablation Studies
DuDoCFNet is trained end-to-end using both projection and image losses in all iterations. The overall loss function is:

$$\mathcal{L} = \sum_{i=1}^{N} \big(\alpha_P \mathcal{L}^i_{Proj} + \alpha_I \mathcal{L}^i_{Img}\big), \tag{12}$$

where $N$ is the total number of iterations (default: 5). $\mathcal{L}^i_{Proj}$ and $\mathcal{L}^i_{Img}$ are the projection and image losses described in Eqs. 7 and 10, and $\alpha_P$ and $\alpha_I$ are loss weights. DuDoCFNet achieves the optimal performance when $\alpha_P = 1$ and $\alpha_I = 0.2$. In this study, DuDoCFNet was compared to various single- and multi-task learning methods. For estimating projections, U-Net [14], Densely-Connected U-Net (DenseUNet) [59], and Attention U-Net (AttenUNet) [60] with the squeeze-excitation mechanism [62] were employed to estimate $P_{FDFV}$ using $P_{LDLV}$ as the input. These networks were applied twice in sequence to first perform denoising and then LV reconstruction, generating the final predicted $P_{FDFV}$ as the output. DuDoSS [16], a dual-domain data-consistency prediction method, was included as the previous state-of-the-art method for limited-view reconstruction and denoising in SPECT imaging. For the µ-map estimation, U-Net, DenseUNet, and AttenUNet were utilized to predict $\mu$ using $S_{LDLV}$ as the input. The multi-energy-window image generation approach using AttenUNet with the squeeze-excitation attention mechanism [62] was the previous state-of-the-art method for µ-map generation in dedicated cardiac SPECT [27]. Moreover, the multi-task learning method CDI-Net [57] was applied to simultaneously estimate $P_{FDFV}$ and $\mu$. CDI-Net was set to run for 5 iterations as in [57].
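The overall objective described above, per-iteration projection and image losses weighted by $\alpha_P = 1$ and $\alpha_I = 0.2$, can be sketched as follows, with hypothetical per-iteration loss values:

```python
def total_loss(proj_losses, img_losses, alpha_p=1.0, alpha_i=0.2):
    """Overall training objective: weighted sum of the per-iteration
    projection and image losses over all N iterations."""
    return sum(alpha_p * lp + alpha_i * li
               for lp, li in zip(proj_losses, img_losses))

# e.g., N = 5 iterations with hypothetical per-iteration loss values
loss = total_loss([0.5, 0.4, 0.3, 0.25, 0.2], [1.0, 0.8, 0.7, 0.6, 0.5])
```

Weighting the image loss at 0.2 keeps the projection estimation dominant while still supervising the µ-map branch at every iteration.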
Ablation studies of DuDoCFNet were also conducted for further evaluation. Either TSP-Net or BDA-Net was used alone to predict $P_{FDFV}$ or $\mu$, to assess the impact of the multi-task learning framework on the prediction accuracy. Furthermore, to assess the effect of the progressive learning strategy, Stage 2 of TSP-Net or BDA-Net in DuDoCFNet was removed to produce DuDoCFNet w/o TSP and DuDoCFNet w/o BDA. Moreover, the MLF mechanism was replaced by input-channel concatenations as in [57] to produce DuDoCFNet w/o MLF.
All the aforementioned frameworks were developed using PyTorch [63] with Adam optimizers ($\beta_1 = 0.5$, $\beta_2 = 0.999$). The projection-domain networks for predicting FDFV projections, including the projection-domain U-Net, DenseUNet, AttenUNet, and TSP-Net, were trained with a learning rate of $1\times10^{-3}$ and a batch size of 2. The image-domain networks for predicting µ-maps, including the image-domain U-Net, AttenUNet, and BDA-Net, were trained with a learning rate of $1\times10^{-4}$ and a batch size of 2. For the dual-domain methods, including DuDoSS, End2End-Net, CDI-Net, and DuDoCFNet, the projection- and image-domain sections were trained with learning rates of $1\times10^{-3}$ and $1\times10^{-4}$, respectively, and a batch size of 2. Batch normalization was implemented to stabilize the learning process by re-centering and re-scaling the inputs from the previous layer [64]. A learning rate decay policy with a step size of 1 and a decay rate of 0.99 was employed to avoid overfitting [65]. The single-task learning networks were trained for 200 epochs, while the multi-task simultaneous learning networks, including CDI-Net and DuDoCFNet, were trained for 50 epochs to reach convergence, as shown in Fig. 5. The training parameters were tuned by inspecting the performance on the validation datasets with repeated and comparative experiments. DuDoCFNet is designed to be lightweight: the number of convolutional channels is reduced from 32 to 16 and the number of convolutional layers in each densely-connected module from 6 to 4, minimizing the computational costs of the iterative framework.
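The learning rate decay policy (step size 1, decay rate 0.99) amounts to the following schedule; the function name is illustrative:

```python
def decayed_lr(lr0, epoch, step_size=1, gamma=0.99):
    """Step decay: multiply the initial learning rate by 0.99 once per
    step_size epochs, matching the schedule described above."""
    return lr0 * gamma ** (epoch // step_size)

# Projection-domain learning rate after 50 epochs of training.
lr = decayed_lr(1e-3, 50)
```

With a step size of 1 the rate shrinks by about 1% every epoch, so after 50 epochs it has decayed to roughly 60% of its initial value.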

A. Predicted Projections
Normalized Mean Square Error (NMSE), Structural Similarity (SSIM), and Peak Signal-to-Noise Ratio (PSNR) are used for voxel-wise quantitative evaluations of the predicted projections, µ-maps, and reconstructed AC images. Clinical 17-segment polar maps are generated from the AC images and quantified using Absolute Percent Error (APE) [26].
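Sketches of two of the voxel-wise metrics; the normalization of NMSE by the ground-truth energy and the choice of the ground-truth maximum as the PSNR peak are assumptions:

```python
import numpy as np

def nmse(pred, gt):
    """Normalized mean square error (reported as a percentage in the tables).
    Normalizing by the ground-truth energy is an assumption."""
    return 100.0 * np.sum((pred - gt) ** 2) / np.sum(gt ** 2)

def psnr(pred, gt):
    """Peak signal-to-noise ratio in dB, with the peak taken as the
    ground-truth maximum (an assumption)."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(gt.max() ** 2 / mse)

# Toy example: a constant offset of 0.2 on a uniform ground truth of 2.0.
gt = np.ones((4, 4)) * 2.0
pred = gt + 0.2
```

On this toy pair, the squared error is 0.04 per voxel against a ground-truth energy of 4 per voxel, giving an NMSE of 1% and a PSNR of 20 dB.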
Fig. 6 shows the predicted FD and FV projections. We can observe that the single-task learning methods, including U-Net, DenseUNet, AttenUNet, DuDoSS, and End2End-Net, significantly underestimate the projection intensities in the cardiac regions. In contrast, TSP-Net outputs more accurate projections than AttenUNet, showing the effectiveness of the progressive learning framework (Fig. 3) in recovering the FD and FV projections. The multi-task learning method, CDI-Net, produces more accurate projections than the single-task learning methods. Moreover, DuDoCFNet outperforms CDI-Net and the ablation groups, confirming the efficacy of our proposed progressive learning strategies and the MLF mechanism in enhancing estimation accuracy. Table I lists the quantitative evaluations of the predicted projections. TSP-Net shows higher accuracy than the other single-task learning methods while using fewer parameters. DuDoCFNet exhibits higher accuracy than the ablation groups and CDI-Net (NMSE, 1.83% vs 2.28%, p<0.001) with fewer parameters than CDI-Net.

B. Predicted Attenuation Maps
Fig. 7 shows the predicted µ-maps. The indirect approaches using U-Net, DenseUNet, and AttenUNet exhibit inaccurate estimations of the µ-map boundaries. BDA-Net estimates a more accurate µ-map than AttenUNet, proving that the boundary enhancement mechanism (Fig. 4) improves the prediction accuracy. The predicted µ-map by DuDoCFNet is more consistent with the ground truth compared to CDI-Net and the ablation groups. This demonstrates that the progressive strategies and the MLF mechanism in DuDoCFNet improve prediction accuracy. Table II illustrates the quantitative evaluations of the predicted µ-maps. BDA-Net generates more accurate µ-maps than the other indirect methods while using fewer parameters. The lightweight DuDoCFNet predicts the most accurate µ-maps among all the testing groups and has fewer parameters than CDI-Net (NMSE, 11.43% vs 12.19%, p<0.001).

C. Reconstructed AC SPECT Images
The predicted projections and µ-maps are then incorporated into the Maximum-Likelihood Expectation-Maximization reconstruction (ML-EM, 30 iterations) to produce AC SPECT images. The reconstruction group DuDoSS refers to images reconstructed using the denoised full-view projections sequentially predicted by DuDoSS [16] and the µ-map predicted by AttenUNet [27], as an integration of the previous state-of-the-art methods.
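The ML-EM update used for reconstruction can be illustrated on a toy linear system; the random matrix `A` is a stand-in for the actual SPECT system model, which would encode the scanner geometry and the predicted attenuation map:

```python
import numpy as np

rng = np.random.default_rng(2)

def ml_em(A, y, n_iter=30):
    """ML-EM reconstruction sketch: x <- x * A^T(y / Ax) / (A^T 1).
    A is a toy system matrix; y holds the measured projections."""
    x = np.ones(A.shape[1])                   # uniform initial image
    sens = A.T @ np.ones(A.shape[0])          # sensitivity image A^T 1
    for _ in range(n_iter):
        ratio = y / np.maximum(A @ x, 1e-12)  # measured / estimated projections
        x *= (A.T @ ratio) / np.maximum(sens, 1e-12)
    return x

A = rng.uniform(0.0, 1.0, size=(60, 30))      # toy 60-bin x 30-voxel system
x_true = rng.uniform(0.5, 1.5, size=30)
y = A @ x_true                                # noise-free projections
x_hat = ml_em(A, y)
```

The multiplicative update keeps the estimate non-negative by construction and monotonically increases the Poisson likelihood, which is why it is the standard reconstruction choice for emission tomography.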
TSP&BDA-Net refers to the AC images reconstructed using the predicted projections from TSP-Net and the predicted µ-maps from BDA-Net. Fig. 8 shows the AC SPECT images. TSP&BDA-Net outputs more accurate AC images than the existing single-task learning methods. The AC images generated by DuDoCFNet are more consistent with the ground truth, compared to the single-task learning methods, CDI-Net, and the ablation groups. Table III lists the voxel-wise quantitative evaluations of the AC images. TSP&BDA-Net outputs more accurate results than the existing single-task learning methods. In addition, DuDoCFNet produces more accurate AC images than the other testing methods and the ablation groups (NMSE, 4.34% vs 5.26%, p<0.001). This proves that DuDoCFNet significantly improves the final AC reconstruction accuracy.
The clinical standard 17-segment polar maps generated from the AC SPECT images are shown in Fig. 9. TSP&BDA-Net generates more accurate polar maps than the existing single-task learning methods. In addition, the polar map by DuDoCFNet is the most consistent with the ground truth.

D. Impact of Iterations and Low-Dose Levels
We further evaluated the performance of DuDoCFNet with different numbers of iterations, as shown in Fig. 11. It can be observed that DuDoCFNet's performance improves as the number of iterations increases from 1 to 5. However, after the number of iterations exceeds 5, the prediction accuracy gradually decreases due to overfitting. A possible reason is that, as shown by Kipf et al. [66], adding layers to a deep neural network can degrade performance due to overfitting, especially when the network's depth exceeds what is necessary for the specific task and dataset. The optimal number of iterations is 5 for the dataset size employed in our study. However, a higher optimal number of iterations and improved performance are expected with larger datasets, which can be validated in future downstream clinical studies.
In addition, we generated 7 additional datasets with varying LD levels ranging from 1% to 80%. These datasets were used to evaluate DuDoCFNet and the existing methods under different LD levels, as shown in Fig. 12. It can be observed that DuDoCFNet exhibits consistently superior accuracy in estimating projections and µ-maps under the various dose settings.

E. Impact of Image Padding
We further investigated how the padding mechanism of the LV projections affects DuDoCFNet's prediction accuracy. We generated two additional datasets in which the LV projections are processed using replicated padding or linear padding. Then, we tested the performance of DuDoCFNet using the two newly generated datasets and compared the performance with that using the default zero-padding dataset. Table V presents the quantitative evaluations of DuDoCFNet in predicting projections and µ-maps using datasets with replicated, linear, or zero padding. It can be observed that the Zero Padding group generates quantitatively more accurate FDFV projections (1.83% vs 2.00% or 2.02%, NMSE, p < 0.001) and µ-maps (11.43% vs 11.82% or 12.13%, p < 0.001) compared to the Replicated Padding and Linear Padding groups. Fig. 13 shows the predicted projections and µ-maps by DuDoCFNet using datasets with replicated, linear, or zero padding. It can be observed that the Zero Padding group predicts FDFV projections and µ-maps that are more consistent with the ground truths, compared to the Replicated Padding and Linear Padding groups. Thus, according to the experimental results, the default zero-padding mechanism shows better performance in predicting projections and µ-maps than replicated or linear padding.
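The three padding modes compared here can be illustrated on a simplified 1-D angular axis (a toy numpy sketch, not the actual preprocessing code; the real projections are multi-dimensional, and we assume "linear padding" ramps from the edge value toward zero):

```python
import numpy as np

# Toy central-column projection values along the angular axis;
# the missing top/bottom detector angles must be filled by padding.
central = np.array([4.0, 6.0, 8.0, 6.0, 4.0])
pad = (5, 5)  # five missing angles on each side

# Zero padding (the default in this study): missing angles set to 0
zero_pad = np.pad(central, pad, mode="constant", constant_values=0.0)

# Replicated padding: repeat the edge value outward
replicated_pad = np.pad(central, pad, mode="edge")

# Linear padding, sketched as a linear ramp from the edge value to 0
linear_pad = np.pad(central, pad, mode="linear_ramp", end_values=0.0)
```

Replicated and linear padding introduce nonzero values where no measured data exist, which, as discussed above, do not match the true top/bottom-column acquisitions of the 570c geometry.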

F. Investigations of Multiple Limited-View Configurations
In order to test DuDoCFNet's performance on different LV arrangements, we generated another two datasets with different LV angle arrangement modes: (1) including the central 9 detectors and the top 5 detectors; (2) including the central 9 detectors and the bottom 5 detectors. Using the two additional LV datasets, we tested DuDoCFNet's performance and compared it with that of the previous state-of-the-art simultaneous learning technique in cardiac SPECT, CDI-Net. Table VI presents the quantitative evaluations of the predicted FDFV projections and µ-maps by DuDoCFNet and CDI-Net using the two additional datasets. It can be observed that DuDoCFNet predicts quantitatively more accurate FDFV projections (1.29% vs 1.52%, 1.63% vs 1.84%, NMSE, p < 0.001) and µ-maps (11.13% vs 11.82%, 11.35% vs 11.85%, NMSE, p < 0.001) compared to CDI-Net on both LV datasets. Fig. 14 shows the predicted FDFV projections and µ-maps on the two additional datasets. It can be observed that DuDoCFNet generates FDFV projections and µ-maps that are more consistent with the ground truths compared to CDI-Net on both additional LV datasets. Thus, according to the experimental results, DuDoCFNet shows consistently superior performance on different LV arrangement modes.

IV. DISCUSSION
In this work, we propose DuDoCFNet, a multi-task learning method, for simultaneous LD denoising, LV reconstruction, and CT-free µ-map generation of cardiac SPECT. Specifically, DuDoCFNet employs a dual-domain cascaded framework that enables cross-domain feature fusion. In DuDoCFNet, projection-domain and image-domain networks are concatenated through forward and backward projections to enable data flow across the two distinct domains. This framework effectively fuses the cross-domain and cross-modality information to enhance the prediction accuracy of each individual task. The MLF mechanism effectively connects and fuses the cross-modality image features at different spatial dimensions. The embedded CDF modules adaptively adjust the channel-wise weights to enhance the cross-modality feature fusion. The TSP-Net and BDA-Net within DuDoCFNet share similar design motivations, utilizing two-stage architectures for progressive coarse-to-fine estimations of projections or µ-maps. Specifically, TSP-Net employs a U-Net-like structure to restore general structures and a non-downsampling module to recover finer details of the LV and LD projections. BDA-Net employs a shared encoder to predict a coarse µ-map and its boundary, followed by a spatial co-attention module to generate a refined µ-map with enhanced boundary accuracy. The overall dual-domain structure of DuDoCFNet is designed to generate SPECT projections in the projection domain and µ-maps in the image domain, respectively. The FP and BP operators linearly transform the image matrix, thus guaranteeing consistent image dimensions after cross-domain connections. Due to the discrepancy between the cross-domain and cross-modality image features, the MLF mechanism is employed as an adaptive data fusion approach for dynamic integration of these diverse image features. The first stage of TSP-Net, a U-Net-like structure, is intended to recover the general structure of the input LV projection through the feature extraction capabilities of the downsampling-upsampling backbones. The second stage of TSP-Net applies a non-downsampling convolutional module, intended to restore finer image details through its identical-dimension convolution structures. In contrast, the first stage of BDA-Net is designed to estimate the overall coarse structure and, more importantly, the boundary of the µ-maps, which are prerequisites for the boundary restoration in the next stage. The second stage of BDA-Net first applies a CDF module to adaptively fuse the predicted coarse µ-maps and boundaries and to enhance the network's spatial attention on the µ-map boundaries. A subsequent CNN module is then employed to encode the fused general and boundary information, thus generating a refined µ-map with improved boundary consistency.
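The cascaded dual-domain data flow described above can be sketched with the FP/BP operators reduced to a toy linear system matrix and its transpose, and the networks replaced by placeholders (all names, shapes, and operations here are illustrative, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((12, 8))          # toy system matrix: image (8) -> projection (12)
fp = lambda img: A @ img         # forward projection (linear operator)
bp = lambda proj: A.T @ proj     # backward projection (its transpose)

def tsp_net(proj, anatomy_feat):
    # Placeholder for the projection-domain network (denoising + LV restoration,
    # fusing anatomical features forward-projected from the image domain).
    return proj + 0.1 * anatomy_feat

def bda_net(img, emission_feat):
    # Placeholder for the image-domain network (coarse-to-fine mu-map estimation,
    # fusing emission features back-projected from the projection domain).
    return img + 0.1 * emission_feat

def dudocfnet(proj_ldlv, n_iters=3):
    """Cascaded dual-domain iterations: FP/BP keep dimensions consistent
    across the two domains at every cross-domain connection."""
    proj, mu = proj_ldlv, np.zeros(8)
    for _ in range(n_iters):
        proj = tsp_net(proj, fp(mu))   # cross-modality fusion via FP
        mu = bda_net(mu, bp(proj))     # cross-modality fusion via BP
    return proj, mu                    # last-iteration outputs are final
```

The point of the sketch is structural: because `fp` and `bp` are fixed linear maps, each network always receives inputs of its own domain's dimensions, which is what the paper means by guaranteeing consistent image dimensions after cross-domain connections.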
The advantage of the two-stage progressive learning strategies is that they divide and distribute an otherwise challenging task between two end-to-end connected specialized modules. Each module is specifically designed to concentrate on one aspect of the task, thereby enhancing not only the predictive accuracy of each individual component but also the overall performance of the system. The main disadvantage of the two-stage progressive learning strategies is the relatively specialized and complicated network structures. The module of each stage must be customized to align with the specific context and demands of each subtask in order to reach the optimal overall performance. Thus, this approach might lack a "one-size-fits-all" framework suitable for all complex tasks. Moreover, the complicated structure of the two-stage network typically requires additional computational resources. To minimize the computational costs, DuDoCFNet is designed to be lightweight by reducing the number of channels in its convolutional layers. To summarize, this work features the first dual-domain approach for multi-task simultaneous learning in nuclear medicine. The methodology demands a considerable amount of technical innovation to realize the novel multi-task combination and improvement. Specifically, our contributions in this work include: 1) we proposed the cross-domain and cross-modality data fusion framework, which can effectively fuse the image features across multiple domains and imaging modalities and thus improve the network performance in each individual domain; 2) we developed an MLF mechanism and a CDF module to enable more accurate data fusion across multiple domains and in different spatial dimensions; 3) we designed the TSP-Net, which employs the "two-stage progressive estimation" principle for simultaneous denoising and limited-view restoration of SPECT projections; 4) we designed the BDA-Net, which utilizes the "coarse-to-fine boundary enhancement" principle for estimating more accurate µ-maps with enhanced boundary consistency.
Our experiments based on clinical data demonstrated that DuDoCFNet predicts more accurate projections and µ-maps than existing single-task learning methods. It can be noticed in Fig. 6 that previous deep learning methods under-estimate the projection intensities, especially in the cardiac regions, which was also noted in a previous study [16]. As a possible explanation, the under-estimations might be attributed to the initial input values in the areas with missing projection information, which are zeros and significantly lower than the surrounding areas. Consequently, neural networks tend to output relatively lower values in these specific areas with significantly lower input values. This phenomenon could be affected by imaging modalities, data properties, network architectures, etc. Our proposed DuDoCFNet can largely reduce the under-estimations, as shown in Fig. 6. Further studies can be conducted to explore and address the projection under-estimation issues in LV reconstruction applications. Additionally, DuDoCFNet has fewer network parameters but performs better than the previous multi-task learning framework CDI-Net, as indicated in Tables I and II. Ablation studies of DuDoCFNet further validate the impact of the proposed two-stage progressive coarse-to-fine estimation strategies and MLF mechanism on improving network performance. Moreover, we demonstrated that DuDoCFNet's performance, based on the dataset in this study, improves as the number of iterations increases and converges at 5 iterations due to overfitting, as illustrated in Fig. 11. A higher optimal number of iterations and improved performance could be obtained with larger datasets, which will be validated in our future studies. Notably, even with only 1 iteration, DuDoCFNet still outperforms TSP-Net in predicting projections (2.23% vs 2.66%, p < 0.001) and BDA-Net in predicting µ-maps (12.07% vs 12.34%, p < 0.001). This further proves that the simultaneous learning framework of DuDoCFNet improves the prediction accuracy of each interrelated task. In addition, we demonstrated that DuDoCFNet consistently exhibits superior performance under varying LD levels from 1% to 80%, as presented in Fig. 12. Then, we evaluated the reconstruction accuracy of DuDoCFNet in terms of the reconstructed AC SPECT images and the clinical standard 17-segment polar maps. The polar map patterns by DuDoCFNet are highly consistent with the ground truth, which could largely enhance SPECT imaging accuracy and thus improve the clinical diagnostic capabilities for coronary artery diseases. Furthermore, we investigated the impact of the image padding mechanism on DuDoCFNet's prediction accuracy and showed that the zero padding we used generated the optimal results. A possible reason for this result is associated with the unique configurations of the GE NM/CT 570c scanners involved in our study, as presented in Fig. 1. The GE 570c scanner includes 19 detectors in total with pinhole collimators, with 5, 9, and 5 detectors placed on the top, central, and bottom columns, respectively. The input LV projections include the central 9 detectors in our study. However, due to the irregular shape of the scanner and pinhole collimation, the detectors on the top and bottom columns are not spatially aligned with the detectors on the central column. Specifically, the angular arrangements of the top/bottom columns differ from that of the central column, in terms of angular intervals and the detectors' positions on the z-axis. Therefore, using replicated or linear padding will introduce projection information that does not match the true data of the top and bottom detectors, leading to a deviation from the ground-truth FV projections and potentially reducing prediction accuracy. Finally, we investigated DuDoCFNet's performance under multiple LV configurations and demonstrated that DuDoCFNet generated consistently accurate results on different LV arrangement modes.
Our current work also has some potential limitations. First, DuDoCFNet employs an iterative framework that consists of cascaded TSP-Nets and BDA-Nets. Consequently, the computational costs associated with DuDoCFNet are relatively high compared to single-task learning methods such as Dense-UNet [59] and AttenUNet [60], particularly when multiple iterations are employed. However, even with just 1 iteration, DuDoCFNet still demonstrates promising performance. The inference time of DuDoCFNet is < 1 s even with high numbers of iterations, which is a reasonable speed for clinical practice. Second, although DuDoCFNet can generate voxel-wise accurate µ-maps with enhanced boundary consistency, finer details of spine bone structures still cannot be well estimated, as shown in Fig. 7. Due to the limited field-of-view of the dedicated cardiac SPECT scanner in this study, the input SPECT information is inaccurate outside the cardiac regions, for example, in the spine regions. Thus, predicting accurate spine bone structures with incomplete emission information as input remains quite challenging [27]. The investigation of advanced deep learning techniques for estimating µ-maps with finer details in dedicated SPECT scanners is also an interesting research topic. Third, in this study, DuDoCFNet is only tested using clinical data from cardiac SPECT-CT scanners. More validations of DuDoCFNet with diverse datasets from various tracers, scanners, organs, and imaging modalities could be conducted to comprehensively assess DuDoCFNet's capabilities under different clinical scenarios. Fourth, the primary emphasis of this study lies in the methodology development and validation of DuDoCFNet. While we have generated and analyzed the standard 17-segment polar maps using clinical analysis tools, the clinical validation of DuDoCFNet remains insufficient and warrants further investigation and evaluation before being employed in clinical practice. Last, the applications of current deep learning-based reconstruction models, including our proposed DuDoCFNet, are typically limited to specific hardware configurations. As a result, the well-trained models cannot be easily deployed into different clinics due to the distinct machine hardware architectures. Thus, integrating machine parameters into the model as another input channel would facilitate broader and more universal clinical applications of current deep learning algorithms.
There are additional factors worth attention in cardiac SPECT, including motion correction and scatter correction. Motion correction methods for the GE 530c/570c dedicated scanners were investigated by several studies [67], [68], [69], [70]. The post-reconstruction method can also be employed for reducing respiratory and cardiac motions [71]. These motion correction methods can be investigated and directly integrated into our DuDoCFNet framework in future clinical studies to further improve the clinical reconstruction accuracy. For dedicated cardiac scanners with cadmium zinc telluride (CZT) detectors, conventional scatter correction methods such as the triple energy window (TEW) or dual energy window (DEW) [72] cannot be directly applied due to the CZT tailing effect. Approaches have been developed to account for the tailing effect in corrections for scatter and crosstalk [73], [74], [75]. Such methods can be directly integrated into our DuDoCFNet framework in future studies.
Moreover, the ML-EM algorithm is derived based on Poisson-distributed data [76], [77], but the deep-learning-based synthetic projections are not strictly Poisson distributed. The ML-EM algorithm has previously been successfully employed even when the data are not Poisson distributed [16], [78], [79]. In addition, our proposed DuDoCFNet framework is "reconstruction algorithm neutral" and not specific to ML-EM. Alternative algorithms such as Maximum a Posteriori (MAP) [80] and Penalized Weighted Least-Squares (PWLS) [81] could also be applied in place of ML-EM in our study, depending on users' specific requirements and preferences.
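For reference, the standard ML-EM update referred to here is x ← x / (Aᵀ1) ⊙ Aᵀ(y / Ax); a minimal numpy sketch on a toy system matrix (illustrative only, not the scanner geometry or the clinical reconstruction code):

```python
import numpy as np

def mlem(A, y, n_iters=50, eps=1e-12):
    """ML-EM reconstruction: x <- x / (A^T 1) * A^T (y / (A x))."""
    x = np.ones(A.shape[1])                   # nonnegative initial image
    sens = A.T @ np.ones(A.shape[0])          # sensitivity image A^T 1
    for _ in range(n_iters):
        ratio = y / np.maximum(A @ x, eps)    # measured / estimated projections
        x *= (A.T @ ratio) / np.maximum(sens, eps)
    return x

# Toy check: reconstruct a known image from noiseless projections
rng = np.random.default_rng(1)
A = rng.random((20, 5))
x_true = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
y = A @ x_true
x_rec = mlem(A, y, n_iters=500)
```

The multiplicative update preserves nonnegativity, which is one reason it remains usable even when the synthetic projections deviate from a strict Poisson model.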
Our work suggests some promising directions for future studies. First, DuDoCFNet indicates the significance of cross-modality feature fusion in enhancing prediction accuracy. As indicated in this study, the complementary information from emission and anatomical images enables more comprehensive feature representations and thus higher prediction accuracy. Hence, leveraging multi-modality features can potentially improve the performance of many medical imaging applications such as segmentation, registration, etc. Second, the multi-stage progressive learning strategy offers a viable solution for many complicated applications. As shown in our study, separating a complex learning task into several sections and utilizing specialized network modules for each individual section can enhance the overall performance. Third, the ML-EM algorithm is employed in our study for post reconstruction to stay aligned with clinical practice. However, additional image reconstruction algorithms, such as maximum a posteriori reconstruction with TV regularization [82], can be explored in our future studies to further enhance the reconstruction accuracy. Last, this study showed that simultaneous learning of interrelated tasks can largely enhance the performance of each task. Therefore, further investigation is warranted for interrelated tasks such as denoising [10] and registration [56], [83], [84], [85], which can be learned simultaneously.

V. CONCLUSION
In conclusion, we propose DuDoCFNet for simultaneous LD denoising, LV reconstruction, and CT-free AC of cardiac SPECT. DuDoCFNet enables accurate and accelerated AC SPECT imaging while reducing hardware expenses and minimizing radiation exposure. The cascaded framework fuses the cross-domain and cross-modality image features for simultaneous learning. The two-stage progressive learning strategies improve estimation accuracy in both the projection and image domains. Experiments with clinical data demonstrate the superior performance of DuDoCFNet in predicting projections, generating µ-maps, and AC reconstructions, compared to existing single- or multi-task learning methods. The clinical segment-wise evaluations using standard polar maps demonstrate that the AC reconstructions by DuDoCFNet are highly consistent with the clinical ground truth. This could largely enhance SPECT MPI imaging accuracy and further improve the clinical diagnostic capabilities for coronary artery diseases.

Fig. 1. Illustration of the configurations and limited-view arrangements of the systems in this study. The original GE NM/CT 570c scanner has 19 detectors in total, with 5, 9, and 5 placed on the top, central, and bottom columns, respectively. All 19 detectors collect the cardiac SPECT projection information at different spatial angles. By only including the central 9 detectors, we simulated a cost-effective single-column scanner, such as the GE MyoSPECT ES system, as denoted in the blue solid box.

Fig. 2. Overview of the Dual-Domain Coarse-To-Fine Progressive Network (DuDoCFNet). In each iteration, DuDoCFNet employs a Two-Stage Progressive Network (TSP-Net) in the projection domain for denoising and restoration of the LD and LV projections, and a Boundary-Aware Network (BDA-Net) in the image domain for predicting µ-maps. All the TSP-Nets and BDA-Nets are cascaded to enable cross-domain and cross-modality feature fusion. The predicted projection and µ-map of the last iteration are employed as the final prediction outputs of DuDoCFNet.

Fig. 3. Two-Stage Progressive Network (TSP-Net). In Stage 1, a U-Net-like structure is utilized to achieve the LV restoration. The auxiliary anatomical features are fed into multiple downsampling layers as the multi-layer fusion (MLF) mechanism. Cross-Domain Feature Fusion (CDF) modules recalibrate the channel weights for adaptive feature fusion. A non-downsampling module is employed in Stage 2 for the LD denoising.

Fig. 4. Boundary-Aware Network (BDA-Net). A shared encoder and two task-specific decoders are utilized to estimate a coarse µ-map and its boundary image. Cross-domain features are embedded in multiple downsampling layers as the multi-layer fusion. The estimated µ-map and boundary image are jointly fed into a Spatial Boundary Enhancement (SBE) module to enhance the boundary accuracy of the final refined µ-map.

Fig. 5. The training and validation losses of a single-task learning group (AttenUNet) and a multi-task simultaneous learning group (DuDoCFNet). The left figure shows the losses of the predicted FDFV projections by AttenUNet. The middle and right figures show the losses of the simultaneously predicted FDFV projections and µ-maps by DuDoCFNet. The single-task learning network reaches overfitting and convergence at about 50 epochs. In contrast, the multi-task learning network reaches overfitting and convergence at about 200 epochs.

Fig. 6. Predicted FD and FV projections displayed in the central-column angle, bottom-column angle, and side view. White arrows denote the regions with over- or under-estimated projection intensities. NMSE and SSIM between the predicted and ground-truth projections are annotated.

Fig. 8. Reconstructed AC SPECT images using the predicted projections and µ-maps, presented in horizontal long axis (HLA) and short axis (SA) views. White arrows denote the image regions with inaccurate reconstructions. DuDoCFNet outputs the most accurate AC images. NMSE and SSIM between the predicted and ground-truth images are annotated.

Fig. 9. Standard 17-segment polar maps of the AC SPECT images. White arrows denote the segment regions with over- or under-estimated intensities. The polar map by DuDoCFNet is the most consistent with the ground truth, compared to the single-task, CDI-Net, and ablation study groups.

Fig. 10. Correlation maps of segment values between the polar maps of the ground-truth and estimated AC SPECT images. Correlation Coefficients (Corr. Coef.) and Coefficients of Determination (R²) are annotated.

Fig. 11. Quantitative evaluations of the predicted projections, µ-maps, and SPECT images by DuDoCFNet with the number of iterations ranging from 1 to 8. DuDoCFNet's performance improves as the number of iterations increases from 1 to 5. The network shows convergence beyond 5 iterations, and the prediction accuracy gradually decreases due to overfitting.

As shown in Fig. 10, DuDoCFNet shows the most concentrated point distributions and the highest Corr. Coef., which further proves that DuDoCFNet produces the most accurate AC reconstruction results. The polar map patterns by DuDoCFNet are highly consistent with the clinical ground truth, which could largely enhance SPECT imaging accuracy and thus improve the clinical diagnosis of coronary artery diseases.

Fig. 13. Visualizations of the predicted projections and µ-maps by DuDoCFNet using datasets with replicated padding, linear padding, or zero padding. White arrows denote the prediction inconsistencies in the predicted projections and µ-maps.

Fig. 14. Visualizations of DuDoCFNet's performance on the two additional limited-view (LV) angle arrangements. The predicted projections and µ-maps using the Central + Top LV dataset are shown in the top red dashed box. The predicted projections and µ-maps using the Central + Bottom LV dataset are shown in the bottom blue dashed box. The white arrows denote the inconsistencies in the predicted images.

TABLE I
QUANTITATIVE EVALUATIONS OF PREDICTED PROJECTIONS ON 250 TESTING CASES USING NMSE, SSIM, AND PSNR. THE NUMBERS OF NETWORK PARAMETERS (PARAM, UNIT: MILLION) ARE INDICATED IN THE LAST COLUMN. THE BEST RESULTS ARE MARKED IN BOLD

Table IV presents the segment-wise quantification of the polar maps. The polar maps by DuDoCFNet demonstrate the lowest segment-wise errors (APE, 7.16% vs 9.03%, p < 0.001)

TABLE III
QUANTITATIVE EVALUATIONS OF RECONSTRUCTED AC SPECT IMAGES USING PREDICTED PROJECTIONS AND µ-MAPS ON 250 TESTING CASES. THE BEST RESULTS ARE MARKED IN BOLD

TABLE IV
SEGMENT-WISE QUANTITATIVE EVALUATIONS OF POLAR MAPS ON 100 TESTING CASES USING APE, CORRELATION COEFFICIENT (CORR. COEF.), AND COEFFICIENT OF DETERMINATION (R²). THE BEST RESULTS ARE MARKED IN BOLD

and the highest Correlation Coefficient (Corr. Coef.) among the testing groups. The correlation maps of the segment-wise values are shown in Fig. 10.

TABLE V
QUANTITATIVE EVALUATIONS OF DUDOCFNET IN PREDICTING PROJECTIONS AND µ-MAPS USING DATASETS WITH REPLICATED PADDING, LINEAR PADDING, OR ZERO PADDING. THE BEST PERFORMING RESULTS ARE MARKED IN BOLD