Super-resolution Reconstruction of 3T-like Images from 0.35T MRI Using A Hybrid Attention Residual Network

Magnetic resonance (MR) images from low-field scanners have poorer signal-to-noise ratios (SNRs) than those from high-field scanners at the same spatial resolution. To obtain a clinically acceptable SNR, radiologists operating low-field scanners use a much smaller acquisition matrix than is used on high-field scanners. The resulting image quality motivates further research on improving low-field systems. Strategies based on super-resolution (SR) techniques can be alternatives for image reconstruction. However, the predetermined degradation methods embedded in these techniques, such as bicubic downsampling, can impose a performance drop when the actual degradation differs from the pre-defined assumption. To address this problem, we collected a unique dataset by scanning 70 participants. The anatomical locations of the scanned image slices were the same for the 0.35T and 3T data. Low-resolution (LR) images (0.35T) and high-resolution (HR) images (3T) formed the image pairs used for training. Herein, we introduce a novel CNN-based network with hybrid attention mechanisms (HybridAttentionResNet, HARN) to adaptively capture diverse information and reconstruct super-resolved 0.35T MR images (3T-like MR images). Specifically, the proposed dense attention block combines variant dense blocks and attention blocks to extract abundant features from LR images. The experimental results demonstrate that the proposed residual network efficiently recovers significant textures while achieving a high peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Moreover, an extensive subjective mean opinion score (SMOS) test indicates that HARN is promising for clinical application.

improve the images' quality for better disease diagnosis and image-guided intervention. Image post-processing methods can be an alternative solution for reconstructing 3T-like MR images from 0.35T MR images. Notably, spatial resolution and SNR are not the only differences between images acquired from low-field and high-field MRI systems (tissue contrast, for example, also differs). Nevertheless, using super-resolution (SR) methods to increase the image resolution without degrading the SNR is still a prominent approach to improving low-field image quality and making it comparable to that of high-field images.
SR techniques can reconstruct HR images from one or multiple LR images without changing the MRI hardware system. Based on the number of input LR images, SR methods can be categorized into single image super-resolution (SISR) [3] and multi-image super-resolution (MISR) [4]. Compared with MISR, SISR has a much higher efficiency [5] and lower graphics memory demands. Thus, we focus only on the SISR technique in this study.
Existing SR techniques in MRI can be classified as interpolation-based, reconstruction-based, and learning-based approaches. Interpolation is often considered the most straightforward and intuitive SR method [6], [7]. Although the interpolation-based methods are computationally simple, the processed images may be over-smoothed and usually have visual artifacts such as ringing and fuzzy edges. The reconstruction-based methods apply a degradation model and exploit prior information through regularization. Bahrami et al. [8] used regression random forests and proposed a novel sparse representation method that predicted 7T-like images from 3T MR images. Although their method has high accuracy for brain MRIs, it has high input requirements, and the sample size limits generalization.
The learning-based methods are the most widely used algorithms because they can generate novel details that do not appear in LR images. SR methods based on convolutional neural networks (CNNs) have attracted broad interest, with Dong et al. [9] developing the first CNN-based SR method (a simple three-layer architecture called SRCNN), which performed well on super-resolving photographic images. Later, they proposed a faster network (FSRCNN) that achieves better performance with fewer parameters [10]. Subsequently, researchers focused on the architecture's feature extraction ability, with Kim et al. [11] proposing a very deep SR network (VDSR). To accelerate VDSR's convergence, residual learning and gradient clipping were put forward. Lim et al. [12] developed an enhanced deep SR network (EDSR) by removing VDSR's unnecessary modules and expanding the model size. Although VDSR and its variants solve the gradient problems in deep networks and achieve good performance, a deeper network is harder to train and struggles to preserve hierarchical information. To handle this problem, Tong et al. [13] leveraged dense skip connections and created a novel super-resolution dense network (SRDenseNet). For MR images, Zheng et al. [14] employed variants of dense blocks to enrich the features extracted from the MR slices.
Moreover, Pham et al. [15] developed a three-dimensional (3D) version of SRCNN for brain MRI. Similarly, Wang et al. [16] proposed a 3D feature attention SR network (FASR), which utilized channel and sparse attention operations in parallel.
However, the disadvantage of the above-mentioned learning-based SR algorithms is that they assume the degradation from HR to LR is fixed and known. Thus, the LR images could be generated using bicubic or other average-type methods for the models to learn the mapping relationship from the fixed LR and HR images and estimate the weights. The weighted model is then exploited to create the desired HR image. Nevertheless, for a large distribution gap between the LR and HR images, the reconstruction performance of these methods may be unsatisfactory.
To adapt to the degradation uncertainty, in this paper we created a dataset by scanning 70 volunteers with both 0.35T and 3T machines (refer to Section II-A) and utilized Advanced Normalization Tools (ANTs) [17] to register the image pairs. This work assumes that the high-frequency information obtained in high-field MR images can be directly predicted from the low-field MR images. Consequently, a low-field 0.35T image can be reconstructed into a 3T-like image by learning the mapping relationship between 0.35T and 3T MR images.
In addition, learning-based SR networks can extract rich frequency information in the channels and spatial regions. To extract abundant features from input LR images efficiently, and motivated by recent advances [18], we utilize a dense attention block (DAB) comprising parallel variant dense blocks and a hybrid attention block. The dense structure assists gradient backpropagation in the deeper network, while the attention blocks fully utilize the channel and spatial information. Hence, we propose a novel hybrid attention residual network, entitled HybridAttentionResNet (HARN), to generate 3T-like MR images by incorporating the mapping relationship between 0.35T and 3T MR images.
The major contributions of this work are:
• Scanning 70 volunteers using 0.35T and 3T machines to collect a particular dataset (Dataset I) for learning the real-world association between LR and HR imagery.
• Introducing a new feature extraction module, the dense attention block (DAB), based on dense connections with an attention mechanism that focuses more on the channel and spatial information.
• Dataset II) to validate its robustness and accuracy.
This paper is organized as follows. In Section II, we introduce the data source and propose our 3T-like image reconstruction network. The ablation experiments and visual results are given in Section III. Sections IV and V provide the discussion and conclusion of the paper.

A. DATA PREPARATION
For this work, 70 participants were enlisted, equally divided by gender, and were scanned by 0.35T and 3T MRI scanning systems. Permission was obtained from the Institutional Review Board, and all subjects provided written informed consent before the scans. A total of 2100 axial slices/images were acquired from the 0.35T scanner (CLIMBER035 designed by Anhui Fuqing Medical Technology Co., Ltd.) with a 2D T1-weighted spin-echo (SE) sequence with parameters: TR=400ms, TE=16ms, FOV=24cm×24cm, and a matrix of 128×128. Regarding the 3T system (GE Medical Systems Discovery MR750), 13160 images were acquired with the 3D T1-BRAVO sequence adopting the following parameters: TR=8.2ms, TE=3.2ms, TI=1.0ms, FOV=24cm×24cm, and a matrix of 256×256×188. The 3T scan was performed after the 0.35T scan, with the FOV of the 3T scan covering that of the 0.35T scan for better alignment. As noted above, the scanning parameters of the two MRI systems differ, and thus the images of the two systems are intrinsically different in image resolution and contrast. Nevertheless, as this study aims to make low-field images resemble high-field ones, image contrast differences are handled by our proposed network. Despite the contrast difference, for convenience, the 0.35T images are considered the LR dataset, and the 3T images the HR dataset.
Patient movement between the two subsequent scans and the magnetic field inhomogeneity of each system cause image distortion and misalignment. We therefore align the LR and HR datasets by applying a medical image analysis toolkit named Advanced Normalization Tools (ANTs) [17]. In addition, we use the software suite Analysis of Functional NeuroImages (AFNI) [19] to minimize the possible distribution differences between the two sub-datasets at different resolutions. We aligned all 3T images to the 0.35T images; this choice of target/reference image was made because the 3T images have higher resolution with a smaller slice thickness (1 mm versus 5 mm). After registration, the aligned 3T images were reconstructed into a volume, and the new HR slices were obtained by re-slicing this volume at the slice location of each slice in the 0.35T dataset. This ensured that the corresponding slices of both datasets depict the same axial physical slice of the brain and have the same anatomical structures. We marked this unique dataset as Dataset I. Moreover, to evaluate the robustness of the proposed method, we employed the IXI open-source dataset provided by BrainWeb (https://brainweb.bic.mni.mcgill.ca/brainweb/). Specifically, we chose 50 different T1 axial plane images from the 3T IXI dataset as an unseen test dataset and marked them as Dataset II. To generate the input LR images, we blurred the original 3T images (HR) using a Gaussian kernel with α=4 and then downsampled them by averaging every four voxels. In this way, the input LR images have half the resolution of the HR images.
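The synthetic degradation used for Dataset II (Gaussian blur followed by averaging every four voxels) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names are hypothetical, "every four voxels" is interpreted as non-overlapping 2×2 blocks, and the blur width `sigma` is an illustrative parameter, since the text does not state how the kernel parameter α=4 maps to a standard deviation.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius=None):
    """Return a normalized 1D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur with reflect padding; output keeps the input shape."""
    kernel = gaussian_kernel_1d(sigma)
    pad = len(kernel) // 2
    padded = np.pad(image.astype(np.float64), pad, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def average_downsample(image, factor=2):
    """Average non-overlapping factor x factor blocks (2x2 = four voxels per output voxel)."""
    h, w = image.shape
    h, w = h - h % factor, w - w % factor  # crop so the shape divides evenly
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def simulate_lr(hr, sigma=1.0, factor=2):
    """Blur then downsample an HR slice; sigma is an assumed, illustrative value."""
    return average_downsample(gaussian_blur(hr, sigma), factor)
```

For a 256×256 HR slice, `simulate_lr` returns a 128×128 LR slice, which matches the halved resolution described above.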

B. NETWORK STRUCTURE
This section introduces HARN, with its overview presented in Figure 2. The HARN network comprises feature extraction, outer feature fusion, and up-sampling modules. The critical phases of HARN are as follows: Initially, a shallow convolution layer with a ReLU function extracts the initial features from the input LR images. Then, the feature extraction module recovers the important hierarchical features from the previously constructed feature maps. After that, we simplify the calculations utilizing the outer feature fusion layer (OFFL) scheme that decreases the merged feature maps to a specific size. Finally, the up-sampling module transfers the fused features into the desired 3T MR images.

1) Dense attention blocks
MR images are inherently self-similar and redundant. To fully exploit these properties and capture the delicate local texture information on a small receptive field, we focus more on the feature capturing module and propose a novel architecture named dense attention block (DAB), which can be regarded as a delicate feature encoder. The DAB module is depicted in Figure 3 and contains several parallel variant dense blocks (VDBs), an inner feature fusion layer (IFFL), and a hybrid attention block (HAB).
Variant dense block: We employ convolution layers with variable kernel sizes to capture enhanced multiscale information, combined in a dense structure [20] at the same level. As seen in Figure 3 (a), the blocks adopt two distinct kernel sizes and arrange them in various sequences. For instance, in VDB 1 the kernel sizes of the convolution layers are 1×1 and 3×3, arranged alternately. We employ small kernel sizes (1×1 and 3×3), as such sizes require fewer parameters and less memory, speeding up processing.
Each VDB has four layers, each of which implements a composite operation function $F_l$, where $l$ is the layer index. Consequently, in the $p$-th VDB path, the $l$-th layer receives the feature maps of all previous layers, and its output is:

$$f^{p}_{l} = \sigma\left(F_l\left([f^{p}_{0}, f^{p}_{1}, \ldots, f^{p}_{(l-1)}]\right)\right) \quad (1)$$

where $[f^{p}_{0}, f^{p}_{1}, \ldots, f^{p}_{(l-1)}]$ represents the concatenated feature maps, and $\sigma(x) = \max(x, 0)$ refers to the ReLU activation function. Equation (1) indicates that a particular $f^{p}_{l}$ depends on the kernel size of each layer and can extract feature maps at various scales. The VDB's hyperparameter growth rate ($G$) refers to the number of channels of each layer's output feature maps. Thus, the number of output channels equals the number of input channels plus four times $G$, where four is the number of convolution layers inside a VDB.
Inner feature fusion: We utilize the inner feature fusion layer to concatenate the features and reduce their dimension, preventing an excessive increase in model parameters as the number of VDB paths ($P$) increases. In the definition of the IFFL output, $[\ldots, f^{p}_{l}]$ refers to the concatenation of the $p$ VDB outputs, $F_{i-1}$ is the DAB input, and $\mathrm{conv}_{1\times1}$ indicates the convolution operation with a 1×1 kernel size.
VOLUME 4, 2016
Hybrid attention block: After the IFFL, the scale of the $F_{i-1}$ features grows massively and includes much redundant information. Simultaneously, as demonstrated in [16], both channel and spatial information help restore the MRI features during the SR task. Based on these two considerations, we introduce an attention mechanism [18] to augment the network's representation capacity. As illustrated in Figure 3 (b), the hybrid attention block (HAB) comprises two components: spatial attention (SA) and channel attention (CA). Owing to the HAB's mechanics, the network can be more attentive to informative spatial regions and meaningful cross-channel information. Finally, the HAB's attention weights are multiplied by the input feature maps for adaptive feature refinement.
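The channel bookkeeping implied by the dense structure can be checked with a short sketch. The helper names and the input width of 64 channels are hypothetical; the growth rate of 16 is the value selected in the ablation study.

```python
def vdb_layer_input_channels(in_channels, growth_rate, num_layers=4):
    """Channels seen by each layer: the block input plus all earlier layer outputs."""
    return [in_channels + layer * growth_rate for layer in range(num_layers)]


def vdb_output_channels(in_channels, growth_rate, num_layers=4):
    """Channels after concatenating the input with every layer's G-channel output."""
    return in_channels + num_layers * growth_rate


# Illustrative values: 64 input channels (assumed), G = 16.
layer_inputs = vdb_layer_input_channels(64, 16)
total_channels = vdb_output_channels(64, 16)
```

With these values the four layers receive 64, 80, 96, and 112 channels, and the concatenated output has 64 + 4 × 16 = 128 channels.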
According to Zeiler et al. [21], each channel in a feature map can act as a feature detector. Thus, CA extracts the global feature information and generates channel weights utilizing inter-channel interaction features. To this end, we utilize a global max pooling and a global average pooling operation in parallel to capture the global spatial details, generating two different channel information descriptors, $\mathrm{AvgPool}(F)$ and $\mathrm{MaxPool}(F)$. The global average pooling function can be expressed as:

$$\mathrm{AvgPool}(F) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_c(i, j) \quad (3)$$

where $F_c(i, j)$ is the value at position $(i, j)$ in the $c$-th channel feature map. The resulting channel statistic of size $1 \times 1 \times C$ is generated by shrinking the input feature map $F$ across its spatial dimensions $W \times H$. Moreover, the max-pooling operation is determined as:

$$\mathrm{MaxPool}(F) = \max_{i, j} F_c(i, j) \quad (4)$$

After the pooling procedure, we employ a multi-layer perceptron (MLP) to reduce the parameter overhead and set the reduction ratio to 0.5. Then, we use element-wise summation to merge the two resulting feature vectors and pass the sum through a sigmoid gating mechanism. As illustrated in Figure 4 (a), $F_{i-1}$ indicates the feature maps of size W×H×C, with the channel attention output $F_{CA}$ computed as:

$$F_{CA} = \mathrm{sigmoid}\big(\mathrm{MLP}(\mathrm{AvgPool}(F_{i-1})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{i-1}))\big) \quad (5)$$

The MLP comprises two convolution layers with a ReLU activation function in between, aiming to reduce the network's parameters. Finally, the output is obtained by multiplying the input $F_{i-1}$ by the channel attention weights $F_{CA}$.
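The channel attention path described above (parallel global average and max pooling, an MLP applied to both descriptors, element-wise summation, and sigmoid gating) can be sketched in NumPy as follows. This is an illustration rather than the authors' implementation: the function name and the weight matrices `w1` and `w2` are hypothetical, and the two convolution layers of the MLP are written as matrix products, which is equivalent for 1×1 convolutions applied to a 1×1×C descriptor.

```python
import numpy as np


def channel_attention(feat, w1, w2):
    """Channel attention on a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    channel reduction factor of the bottleneck (hypothetical parameter names).
    Returns the refined feature map and the per-channel weights.
    """
    avg_desc = feat.mean(axis=(1, 2))  # global average pooling -> (C,)
    max_desc = feat.max(axis=(1, 2))   # global max pooling -> (C,)

    def mlp(v):
        # Two layers with a ReLU in between; the same weights serve both descriptors.
        return w2 @ np.maximum(w1 @ v, 0.0)

    weights = 1.0 / (1.0 + np.exp(-(mlp(avg_desc) + mlp(max_desc))))  # sigmoid gate
    return feat * weights[:, None, None], weights


# Usage with random weights and a bottleneck that halves the channel count.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((4, 8)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1
refined, weights = channel_attention(feat, w1, w2)
```

Each entry of `weights` lies strictly between 0 and 1, so every channel of the input is rescaled rather than discarded.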
We supplement CA with the SA module. SA restores more position-specific information by generating spatial weights from the features' inter-spatial relationships, enabling HARN to focus on critical but often neglected spatial areas. The entire procedure, presented in Figure 4 (b), is as follows. Initially, we apply an average-pooling operation $\mathrm{AvgPool}(F)$ and a max-pooling operation $\mathrm{MaxPool}(F)$ along the channel axis, which effectively emphasizes informative regions by reducing the channel dimension [22]. After pooling, the resulting maps can be regarded as efficient descriptors of the input feature map $F_{i-1}$. The two feature maps are concatenated and then sent to a convolution layer to create a spatial attention map, which encodes the weights of the regions to be emphasized or suppressed. Mathematically, the complete process is as follows:

$$F_{SA} = \mathrm{conv}\big([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)]\big) \quad (6)$$

where $[\,\cdot\,;\,\cdot\,]$ is the concatenation operation involving the pooled feature maps.
2) Outer feature fusion
Section II.B.1 indicates that the DAB can provide various additional features assisting HR reconstruction. Indeed, we arrange several DABs with various parameter setups to fully exploit the hierarchical features they provide. However, the gradient vanishes as the network depth increases, and the loss fails to converge. To solve this problem, we apply a fusion layer that merges the feature maps $F_1, F_2, \ldots, F_{i-1}, F_i$ output by the different DABs, where $i$ denotes the series number of the DAB. The result, $F_{OFFL}$, is the output of the outer feature fusion layer, which is then fed to the up-sampling stage.
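The spatial attention steps (pooling along the channel axis, concatenation, and a convolution that yields one weight per position) can be sketched as below. Several details are assumptions rather than statements from the text: the kernel size of 3, zero padding, and the sigmoid that maps the convolution output to weights in (0, 1), which follows common attention designs. The function name and `kernel` argument are hypothetical.

```python
import numpy as np


def spatial_attention(feat, kernel):
    """Spatial attention on a (C, H, W) feature map.

    kernel has shape (2, k, k) with odd k and acts on the stacked
    [average-pooled; max-pooled] maps (hypothetical parameter).
    Returns the refined feature map and the (H, W) attention map.
    """
    avg_map = feat.mean(axis=0)           # pooling along the channel axis -> (H, W)
    max_map = feat.max(axis=0)            # -> (H, W)
    stacked = np.stack([avg_map, max_map])  # concatenation -> (2, H, W)

    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    height, width = avg_map.shape
    response = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            # Plain sliding-window (cross-correlation style) convolution.
            response[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)

    attention = 1.0 / (1.0 + np.exp(-response))  # assumed sigmoid gating
    return feat * attention[None, :, :], attention


# Usage with a random 3x3 kernel (illustrative values only).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 6))
refined, attention = spatial_attention(feat, rng.standard_normal((2, 3, 3)) * 0.1)
```

The explicit loops keep the example dependency-free; a practical implementation would use a framework convolution instead.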

C. LOSS FUNCTION
Several SR techniques utilize a mean square error (MSE)-based loss function to reduce the difference between the reconstructed and reference images. Nonetheless, minimizing the MSE alone typically reduces the perceptual quality of the reconstructed images due to over-smoothing. To overcome this problem, we utilize a hybrid loss function comprising an image-domain MSE loss at the pixel level and a VGG loss at the perceptual level:

$$L_{MSE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(I^{HR}(i, j) - I^{SR}(i, j)\right)^2 \quad (8)$$

where $H$ and $W$ define the image's height and width, and $I^{SR}$ and $I^{HR}$ denote the 3T-like MRI generated by the model and the 3T image acquired from the 3T scanner, respectively. Inspired by the content loss [23], we derive the VGG loss $L_{VGG}$ from the ReLU activation layers of the pre-trained 19-layer VGG network.
The VGG loss compares $S_{VGG}(I^{SR})$ with $S_{VGG}(I^{HR})$, where $S_{VGG}$ indicates the feature map supplied by the VGG19 network. The total loss function combines $L_{MSE}$ and $L_{VGG}$, with a constant coefficient $\alpha$ balancing the two losses, heuristically set to 1e-1.
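The structure of the hybrid loss can be illustrated with a small NumPy sketch. It rests on two assumptions that the text does not confirm: the total loss is taken as the MSE term plus α times the perceptual term, and the perceptual term is a mean squared difference between feature maps. The pre-trained VGG19 extractor is replaced by `toy_features`, a hypothetical stand-in based on image gradients, so the example stays self-contained.

```python
import numpy as np


def mse_loss(sr, hr):
    """Pixel-level mean squared error between the reconstruction and the reference."""
    return float(np.mean((sr - hr) ** 2))


def perceptual_loss(sr, hr, feature_fn):
    """Mean squared difference between feature maps (assumed form of the VGG loss)."""
    return float(np.mean((feature_fn(sr) - feature_fn(hr)) ** 2))


def hybrid_loss(sr, hr, feature_fn, alpha=0.1):
    """Assumed combination: L_MSE + alpha * L_VGG, with alpha = 1e-1 as in the text."""
    return mse_loss(sr, hr) + alpha * perceptual_loss(sr, hr, feature_fn)


def toy_features(image):
    """Hypothetical stand-in for VGG19 features: horizontal and vertical gradients."""
    grad_x = np.diff(image, axis=1)[:-1, :]
    grad_y = np.diff(image, axis=0)[:, :-1]
    return np.stack([grad_x, grad_y])


hr = np.zeros((4, 4))
sr = np.full((4, 4), 0.1)
loss_value = hybrid_loss(sr, hr, toy_features)
```

In this toy case the two images differ only by a constant offset, so the gradient-based feature term vanishes and the loss reduces to the pixel-level MSE of about 0.01.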

D. EVALUATION METRICS
We evaluate the image quality utilizing objective and subjective metrics. Considering the objective metrics, we deploy the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [24]:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right) \quad (11)$$

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \quad (12)$$

When an image is normalized using the linear normalization approach, we set $L = 1$ for PSNR. In Equation (12), $C_1$ and $C_2$ are the stability constants, $\mu_x$ and $\mu_y$ are the average values of $x$ and $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$.

[25]. A cross-validation strategy divides our dataset into training, validation, and test sets with a ratio of 7:2:1. Moreover, we employ the Adam optimizer [26] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a learning rate of 1e-4. The proposed model is trained using PyTorch 1.9.0 on an NVIDIA RTX 3090 GPU with 24 GB of memory.
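The two objective metrics can be computed as in the following sketch. It makes simplifying assumptions: the SSIM is evaluated once over the whole image using global statistics, whereas common implementations average it over local windows, and the stability constants use the widely adopted defaults $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$, which the text does not specify. The function names are hypothetical.

```python
import numpy as np


def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (assumed simplification)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


reference = np.zeros((8, 8))
shifted = np.full((8, 8), 0.1)
psnr_value = psnr(reference, shifted)
```

A uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is a convenient sanity check for the implementation.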

B. SUBJECTIVE MEAN OPINION SCORE(SMOS) TESTING
We conducted a subjective mean opinion score (SMOS) test to quantify the reconstruction ability of various approaches. Specifically, we selected ten versions of each image from the test dataset: input LR image, Bicubic, SRCNN [9], FSRCNN [10], VDSR [11], EDSR [12], SRDenseNet [13], HybridNet [14], ours, and the ground truth HR image. Two radiologists with 5 and 7 years of experience, blinded to the acquisition details, were asked to score each image from 1 to 5 (a higher score indicates better performance). Thus, each rater scored 1350 images (10 versions of 135 images) presented in random order.
In this experiment, we found that the SMOS has a high degree of reliability because there is no significant discrepancy between the ratings of identical images. At the start of testing, we collected 20 pairs of different LR and HR images (score 5) for the radiologists to calibrate the rating criteria. We added the HR and LR images to the test set twice to confirm the raters' reliability. The two radiologists' ratings for the same image category showed high similarity. Table 2 and Figure 5 describe the experimental results of the SMOS test.

C. ABLATION STUDY
We conduct the following extensive ablation experiments, based on PSNR and SSIM, to explore the best parameter values for HARN's various components.

1) Study of G, P, and I
Since the growth rate (G) is a hyper-parameter of the dense connections, we performed several ablation studies to explore its influence on HARN's performance. As visualized in Figure 6 (a) and (d), the PSNR first increases and then decreases as G increases. Thus, we choose 16 as the final growth rate to balance the computational complexity and network performance.
To demonstrate the effect of the multipath structure in the VDB, we perform several comparison experiments, with Figure 6 (b) displaying HARN's training convergence curves for different path numbers. Limited by the sample size, the PSNR decreases as the path number increases. Moreover, Figure 6 (e) shows the detailed convergence behavior in the last 15 epochs, illustrating that increasing the path number may not increase the PSNR. Finally, after balancing complexity and reconstruction capability, we set the path number of the VDB to four.
The number of HABs affects the entire network depth and complexity. To investigate the effect of the number of HABs on the performance and computational cost, we study the parameter I under different HAB numbers. Figure 6 (c) and (f) display five training convergence curves of HARN. As the number of HABs increases, HARN converges faster, but the PSNR becomes lower. To preserve a better balance between computational efficiency and performance, we set the number of HABs to four. Figure 7 shows the ablation experiments on G, P, and I based on SSIM. Notably, the training convergence curves of SSIM and PSNR are highly similar. Thus, we adopt the same parameters as those selected based on PSNR.

2) Study of attention mechanism and learning parameters α
To further validate the HAB's effectiveness, we consider a network without the HAB as the baseline and investigate the impact of SA and CA at a reduction ratio equal to two. Figure 8 (a) illustrates the convergence curves of several networks, and Figure 8 (c) reveals that the network with CA or SA presents an improved PSNR compared to the baseline. Notably, the network with cascaded CA and SA outperforms the networks using CA or SA alone. Given that CA and SA generate the weight of each feature map in channel and space, respectively, cascading the two mechanisms combines the channel and spatial information to further enhance the high-frequency features. Furthermore, in this trial, we also verify the effect of the order of CA and SA in the HAB. Figure 9 (a) and (c) show the training convergence of SSIM; the convergence curves of PSNR and SSIM are almost identical.
The model with $L_{MSE}$ focuses on the loss of each pixel, potentially over-smoothing the image, whereas the model with $L_{VGG}$ produces distorted details. To balance the hybrid loss, we test several values for the balancing factor α, with the corresponding results illustrated in Figure 8 (b) and Figure 9 (b), which show the ablation experiments based on PSNR and SSIM, respectively. The two figures show that the gap between the two losses becomes wider as α decreases, and the reconstruction performance degrades accordingly. According to the results depicted in Figure 8 (d) and Figure 9 (d), we finally set α=1e-1.

D. COMPARISONS AGAINST STATE-OF-THE-ART METHODS
To further evaluate the proposed network's performance, we compare HARN against bicubic interpolation and six learning-based methods [9]-[14]. Moreover, to analyze the results more precisely, we calculate the mean and variance of PSNR and SSIM. Figure 10 depicts a qualitative comparison of the evaluated methods, including two close-up views of selected regions below every reconstructed image: the left image shows the zoomed view of the chosen gyrus region, and the right image shows the edge information of that region. Figure 10 reveals that the competing algorithms tend to reconstruct fuzzy and over-smoothed details, which hinders identification of the depicted structures. By comparison, the proposed HARN effectively recovers more contours and minor textures. The zoomed grayscale images show that our algorithm has lower noise and more precise edge information. HARN's SMOS ratings are presented in Table 2 and are closer to the scores of the original HR images than those of the competing methods. Figure 5 shows the distribution of all SMOS ratings.
Furthermore, we employ an additional open-source dataset (the IXI dataset) to extend our experiments. The aim is to verify whether the algorithm can produce realistic images and generalize well to other datasets. As mentioned above, we selected 50 axial images as a new test dataset and marked them as Dataset II. During testing, we use the model trained on Dataset I. The right side of Table 2 shows that HARN does not achieve the best PSNR/SSIM, which we attribute to the difference in loss functions. However, our reconstructed images are more photo-realistic than the competing ones. The visual comparison is performed on the chosen 3T axial plane (Figure 11), highlighting that HARN's reconstructed image has more precise details than the input LR image and is more comparable to the HR image than the competing algorithms' reconstruction outputs. Consequently, the proposed HARN network achieves a good generalization ability and can be applied to other datasets.

IV. DISCUSSION
This work demonstrates through SMOS testing that learning-based methods achieve superior clinical performance in generating 3T-like MR images from low-field 0.35T images. Furthermore, we demonstrate that high-frequency information can be predicted from LR images. Thus, we generate reliable SR images by proposing a CNN-based algorithm named HybridAttentionResNet (HARN), which incorporates dense blocks and attention mechanisms for better feature extraction. We collected a unique dataset by scanning 70 subjects on both 0.35T and 3T MRI systems and aligned the paired images before training to explore the mapping relationship between the LR and HR images. Additionally, we conducted several ablation experiments to determine the best parameters of HARN and employed two datasets for evaluation. The experimental results suggest that HARN performs better than current state-of-the-art SR algorithms and has an appealing generalization ability and accuracy.
In contrast to SRDenseNet [13], our dense blocks exhibit sufficient sensitivity for SR tasks. We speculate that this is because our dense attention block combines the multipath structure of the convolution layers to extract more diverse features for reconstruction. In contrast to Zheng et al. [14], our model is optimized with attention mechanisms and a content loss, with the proposed attention mechanism having a substantial impact on the network's performance. Specifically, the CA module generates global features, whereas the SA module assists the network in focusing more on the local regions.
We only scanned the axial brain slices of 70 healthy volunteers in this work. However, learning-based SR methods usually require massive and diverse data for training to achieve enhanced robustness, and we speculate that the relatively small datasets employed in this work are responsible for the low PSNR and SSIM values. Future work could address this problem with a GAN-based [27] or Transformer-based [28] method, or with a more extensive database. Limited by hardware, the input LR images have some noise and artifacts that are difficult to eradicate. The content loss function is an effective way to characterize spatial content, and tailoring it to minimize Rician noise might further enhance the clinical SR results. Reconstruction with less noise is challenging and is part of future work. Finally, although we evaluated HARN on two brain datasets, applying the same method to other organs is still an open question that will be examined in future work.

V. CONCLUSION
In this study, we collected a unique dataset by scanning 70 subjects with both 0.35T and 3T MR systems to produce LR and HR images. Instead of utilizing predetermined known degradations, we use real paired training data to learn the mapping relationship between high-field and low-field images. Moreover, we proposed a residual network (HARN) with a hybrid attention mechanism based on the convolutional neural network. After extensive ablation experiments, we set the best parameters for HARN. The experimental results demonstrate that HARN achieves good performance on the PSNR and SSIM metrics with more photo-realistic results. In addition, the extensive SMOS testing indicates that HARN is more reliable in reconstructing HR images at scale ×2 than current state-of-the-art reconstruction methods. We also evaluated HARN on an open-source dataset (the IXI dataset), with the experimental results indicating that our network is robust and accurate. Overall, HARN is shown to be an effective approach to improving the image quality of 0.35T MR images. In the future, HARN could be applied in clinical applications and other image processing tasks, such as image-guided experiments and lesion segmentation, as it can reconstruct high-resolution images with decent quality and accuracy.