Attention Guided Policy Optimization for 3D Medical Image Registration

Learning-based image registration approaches typically learn to map input images to a transformation matrix, and current deep-learning-based rigid registration methods estimate this matrix in a one-shot manner. Our purpose is to present a deep reinforcement learning (DRL) based method for image registration that explicitly models the step-wise nature of the human registration process. We cast the image registration process as a Markov Decision Process (MDP) in which actions are defined as global image adjustment operations, and we train a proxy to learn the optimal action sequences that achieve a good registration. More specifically, we propose a DRL proxy incorporating an attention mechanism to address the challenge of large appearance differences between images from different modalities. Registration experiments on 3D CT-MR image pairs of patients with nasopharyngeal carcinoma and on publicly available 3D PET-MR image pairs show that our approach significantly outperforms other methods and achieves state-of-the-art performance in multi-modal medical image registration.


I. INTRODUCTION
Image registration is the process of mapping images into the same coordinate system by finding the spatial correspondence between them. It is an essential step in analyzing a pair of images acquired from different viewpoints, at different times, or with different sensors/modalities [1]. In practice, image registration has been applied in several settings, such as disease monitoring and prediction, computer-assisted surgery, and medical information fusion. For example, in image-guided surgery (IGS), registration of pre-operative images with intra-operative real-time images reduces the risk of tissue damage and enhances the accuracy of lesion targeting [2], and in the field of 3D conformal radiation therapy (3DCRT), 3D CT-MR alignment of the head and neck plays a significant role in the preservation of the optic chiasm [3]. Although image alignment has been investigated for more than thirty years, it remains an active field of research. Image registration problems can be categorized into three classes depending on the number of spatial dimensions involved: 2D/2D, 2D/3D and 3D/3D registration.
In this paper, we focus on 3D multi-modal medical image registration, where the input images are generated by different modalities (e.g., CT and MRI). This task is quite challenging, as the appearance of body structures can differ greatly across image modalities. Most existing multi-modal image registration methods are built on the premise that images from different modalities share similar latent physical features. Therefore, discriminative image features and a generic similarity metric are the two main focuses of traditional image registration methods. Typically, the registration is performed by iteratively updating the transformation parameters until the similarity metric is optimized [4], where popular similarity metrics are usually computed on intensities and include mean-square differences, the correlation coefficient, the sum of squared differences, and mutual information [5]. Although these similarity metrics are efficient, they are not sufficiently robust for multi-modal images. While learning-based methods are capable of automatically capturing image features, they face a similar problem: manually designed anatomical features are often unable to represent tissue or organ appearance across multi-modal medical images.
Motivated by the successful applications of convolutional neural networks (CNNs) in computer vision, many advanced works on cross-modal image registration have proposed deep-learning-based methods. These deep-learning-based registration methods achieve higher registration success rates than conventional methods [6], [7]. More specifically, this kind of registration method often generates generalized image features or similarity metrics using a deep neural network, and hence abandons human-engineered image features or intensity-based similarity measures. For instance, Wu et al. [8] used a convolutional stacked auto-encoder to learn highly discriminative features of the image pairs for registration. However, this method is not end-to-end and hence still relies on other conventional registration methods to find the transformation matrix between two images. Miao et al. [9] applied CNN regression to the estimation of transformation parameters, but these parameters were trained by six regressors in a hierarchical manner instead of being estimated simultaneously.
Recently, another type of deep-learning-based image registration method has also emerged, in which registration parameters are predicted directly by neural networks [10]. Many of these approaches are regression-based and generally require multi-layered feed-forward networks that take unaligned image pairs as input and output registration parameters. Several other approaches [11], [12], [13] adopt a radically different pattern by treating the registration task as a sequential decision-making problem. These methods explicitly imitate the way human experts perform image registration via temporal action selection. In this kind of approach, an artificial proxy is designed to learn registration patterns by interacting with the environment: the proxy first analyzes the underlying structures of the images to be registered and then decides, among a set of predefined actions, in which direction the image should move. Yet there is significant variance in the appearance of cross-modal images, which makes extracting the underlying features non-trivial. As a result, these decision-making-based image registration methods have mainly focused on 2D registration tasks.
In this paper, we propose a novel approach for end-to-end cross-modal image registration with three distinctive aspects. First, we present a deep reinforcement learning (DRL) framework for cross-modal image registration (see Fig. 1), which is trained by asynchronous advantage actor-critic (A3C) [14]. Second, since the cross-modal registration task presents a demanding challenge in terms of computational complexity, the proposed method couples reinforcement learning with an attention-perception mechanism that probes image areas with more reliable visual clues to guide the registration process in the right direction. Third, we focus on 3D cross-modal rigid image registration. To alleviate the high-dimensional curse of 3D registration, we use compact features extracted from the massive voxels of the 3D volumes by 3D convolutions combined with an attention mechanism. Furthermore, we propose a new set of landmarks, made up of random and diagonal points, to replace DoG (Difference of Gaussians) keypoints and improve training efficiency. We conclude from thorough experiments and detailed analysis that our approach significantly outperforms the baselines and achieves state-of-the-art performance on 3D cross-modal rigid image registration tasks.
Our major contributions are:
• We propose to use contextual information for MR-CT registration. Compared with conventional methods that compute surface similarities, our algorithm learns to exploit the relevant contextual information for optimal registration.
• To extract image features accurately and quickly from complex cross-modal information, we design a network architecture that incorporates an attention mechanism.
• We obtain a robust reward in 3D volumes using new landmarks, consisting of 1000 random voxel points plus the diagonal points, enabling the model to handle a wider range of perturbations and missing image content.
The remainder of this paper is organized as follows. Section II discusses related work, focusing on reinforcement learning and its application to cross-modal image registration. Section III details our proposed method and the training procedure. Section IV verifies the performance of the proposed method on real MR-CT and PET-MR datasets. Section V provides a thorough discussion of the proposed method. Section VI concludes the paper and outlines future work.

II. RELATED WORK
A. REINFORCEMENT LEARNING
Decision-making strategy is an efficient model for several problems, including computer gaming [15], image processing [16], robotic control [17], path planning [18] and medical diagnosis [19]. Reinforcement learning (RL) is well suited for decision-making, and it has made tremendous progress since the seminal work of Mnih et al. [20] on Deep Q-Networks. Many RL methods have achieved human-like levels of performance in a variety of fields. A typical RL model includes a proxy and an environment, and it is defined as a computational approach to learning an optimal policy from proxy-environment interaction. The policy function π guides the proxy to select a specific action on the basis of the current state. Following the Markov Decision Process (MDP) formalism [21], we use S to denote the state set of the proxy, A the action set, r the reward the proxy receives when taking a certain action in a particular state, and γ the discount factor that controls the weight of future rewards (usually 0.9 in relevant experiments). Specifically, we use a multilayer network to approximate the policy π. The proxy receives a predefined reward evaluating the merit of the action a_t at time t by interacting with the environment state s_t at that moment.

FIGURE 1. Workflow of the proposed method. The intelligent proxy maps the input state to a state value and a specific action, which is executed to transition the state to the next time step. Subsequently, the environment returns the corresponding reward.
RL algorithms can be categorized into two classes: value-based [20] and policy-based [22]. Value-based methods aim to maximize a value function, while policy-based methods directly optimize a policy. In this paper, the actor-critic framework [14] is employed, a special case of the policy-based method that leverages both a value function and a policy function. More specifically, in the actor-critic framework, the actor resembles the policy function that maps the current state to a particular action, while the critic is a value function that assesses the merit of a specified action by returning the state value for the current state. Asynchronous advantage actor-critic (A3C) [14] and its synchronous variant (A2C) [23] are good examples of such techniques.

B. REINFORCEMENT LEARNING-BASED IMAGE REGISTRATION
Classic image registration requires hand-extracted features to align different images into the same coordinate frame. Conventional registration processes iteratively update the transformation parameters until a predefined metric, measuring the similarity of two images (or image features), is optimized. Although these conventional methods have achieved decent performance, handcrafted similarity metrics and image features fail to yield a general rule for cross-modal image registration. To overcome this issue, many recent works propose to learn discriminative image features or similarity metrics by taking advantage of deep learning. These learned image features and similarity metrics improve registration performance greatly, but the interpretability of deep-learning-based methods remains their Achilles' heel.
Due to their multilayer nonlinear structure, deep neural networks are often criticized for being non-transparent, and their predictions are not traceable by humans. To overcome this, Liao et al. [24] pioneered the application of deep RL to the 3D CT volume registration problem using the framework of deep Q-learning. Liao's method is able to visualize the registration process, since it chooses a registration action at each time step, mimicking the registration procedure of a human expert. However, to mitigate the high-dimensional registration parameter space of 3D registration, Liao's method trains the proxy via greedy deep supervised learning, where the action selected at each time step is the one nearest to the ground truth. Since this greedy search strategy may fall into a local optimum, Ma et al. [25] perform a free search in the 2D alignment parameter space with a dueling network. However, Ma's approach requires a large and costly memory space to store state-action pairs during training; a long training time is thus required, taking four days even on a GPU. Using CNN-LSTM networks trained with a multi-threaded actor-critic method (A3C), the work in [12] reduced the training time to 13 hours on a CPU.

III. METHODOLOGY
Generally speaking, the proposed approach follows the classical reinforcement learning paradigm and possesses its three important components, namely state, action and reward. The intelligent proxy samples a suitable action based on the present state; executing this action yields the corresponding feedback reward and triggers the state transition. Repeating this process, our approach proceeds step-wise, which is intuitively closer to the alignment logic of human experts (as shown in Fig. 1). Details such as the problem formulation, the definitions of the states, actions and reward function, and the RL model incorporating an attention mechanism are provided subsequently.

A. PROBLEM FORMULATION
Given a cross-modal image pair as input to the model, i.e., a fixed MR image I_f and a moving CT image I_m, the objective of cross-modal image alignment is to estimate the optimal spatial transformation T such that the images of different modalities are well aligned pixel-wise in space. In other words, the transformed moving image T ∘ I_m is aligned with I_f. For 3D rigid transformation, T is parameterized by 3 translations [t_x, t_y, t_z] and 3 rotations [θ_x, θ_y, θ_z]. For the translations, we write

$$T_1 = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{1}$$

where T_1 denotes the translation matrix. T_2 denotes the corresponding rotation matrix, composed of the rotations θ_x, θ_y and θ_z about the three axes, and the final transformation matrix is

$$T = T_2 T_1. \tag{2}$$

An artificial proxy is defined by formulating the registration problem as finding T_t under an RL framework, learning to perform a series of decisions that warp the moving image onto the fixed image, where T_t represents the transformation at each time point t. In the following sections, we illustrate the main components of the RL formulation, namely the state set S, the action set A and the reward function R, and then describe our deep actor-critic network with an attention mechanism.
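As a minimal sketch (ours, not the authors' code), the 4 × 4 homogeneous matrix of Eq. (2) can be composed from the six rigid parameters as follows; the x-y-z rotation order inside T_2 is an assumption.

```python
import numpy as np

def rigid_transform(tx, ty, tz, rx, ry, rz):
    """Build a 4x4 rigid transform T = T2 @ T1 from translations (pixels)
    and rotations (radians) about the x, y and z axes."""
    T1 = np.eye(4)
    T1[:3, 3] = [tx, ty, tz]                  # translation matrix T1, Eq. (1)

    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T2 = np.eye(4)
    T2[:3, :3] = Rz @ Ry @ Rx                 # rotation matrix T2 (x-y-z order assumed)

    return T2 @ T1                            # final transformation matrix, Eq. (2)
```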

B. STATES AND ACTIONS
Our state s_t ∈ S at time point t is represented by an image pair consisting of a fixed image and a warped image: s_t = (T_t ∘ I_m, I_f), where ∘ indicates the warping operation. At the initial time step, all translation and rotation parameters in T are zero, which means that the state at this time consists of the fixed and moving image, while subsequent states contain the fixed and warped moving image. Note that to facilitate training and reduce memory requirements, we resize the images to 64 × 64 × 64 in all experiments. The action set A is composed of 12 actions: a1 and a2 denote translations of +1 and −1 pixel along the vertical axis, a3 and a4 along the sagittal axis, and a5 and a6 along the coronal axis; a7 and a8 represent rotations of +1° and −1° in the axial plane, a9 and a10 in the coronal plane, and a11 and a12 in the sagittal plane. In other words, if the proxy performs action a1 at s_t, the whole warped image shifts one pixel in the positive direction of the vertical axis.
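Under our reading of this action set, each action can be encoded as a ±1 increment on exactly one of the six rigid parameters; the sketch below uses an illustrative index ordering rather than the exact a1-a12 labels.

```python
import numpy as np

# Each of the 12 actions perturbs one of [tx, ty, tz, rx, ry, rz] by +1 or -1
# (pixels for translations, degrees for rotations); indexing is illustrative.
ACTIONS = []
for param in range(6):
    for step in (+1.0, -1.0):
        delta = np.zeros(6)
        delta[param] = step
        ACTIONS.append(delta)

def apply_action(params, action_idx):
    """Return the rigid parameters after executing the chosen action."""
    return params + ACTIONS[action_idx]

# e.g. starting from the identity, one action shifts the warped volume by +1 pixel
params = apply_action(np.zeros(6), 0)
```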

C. TARGET REGISTRATION ERROR (TRE)-BASED REWARD
Intuitively, the reward is tied to the improvement of the registration. Designing a good RL reward function R is usually intractable, since RL proxies can quite easily overfit a particular reward and thus produce undesirable or unexplainable consequences [26]. For the registration case, the challenge is to engineer a reward function that incentivizes the artificial proxy to warp the moving image toward the fixed image. To this end, we leverage the Target Registration Error (TRE)-based reward [11] in our RL model, which measures the displacements between the transformed landmarks from the warped image and the corresponding landmarks from the ground truth.
The landmarks change with the specific task. For instance, in 3D registration we leverage the diagonal voxels lying at (0,0,0), (1,1,1), …, (63,63,63) in the ground truth of the moving image (of size 64 × 64 × 64), together with 1000 voxels randomly selected from the remaining positions. In this way, we avoid the high space and time complexity of computing 3D DoG features and ease the high-dimensional curse of 3D image registration in the RL model.
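A minimal sketch of this landmark construction for a 64³ volume is given below; for brevity, the random draw does not explicitly exclude the diagonal voxels, whereas the paper samples from the remaining voxels only.

```python
import numpy as np

def build_landmarks(size=64, n_random=1000, seed=0):
    """Landmark reference set p_G: the 64 diagonal voxels plus 1000 random
    voxels (this sketch does not exclude the diagonal from the random draw)."""
    rng = np.random.default_rng(seed)
    diag = np.stack([np.arange(size)] * 3, axis=1)                # (64, 3) diagonal
    flat = rng.choice(size ** 3, size=n_random, replace=False)
    rand = np.stack(np.unravel_index(flat, (size,) * 3), axis=1)  # (1000, 3)
    return np.concatenate([diag, rand], axis=0)                   # landmark set p_G
```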
These points (1000 random voxels plus the diagonal voxels) form the landmark reference set p_G. They are then distorted by the perturbation transformation matrix to form the warped landmark set p̂_G. After the action selection of the proxy, the warped landmark points are transformed by T_{t+1}. The reward for performing a particular action is derived from the Euclidean distance D between the landmarks in the current warped image and their ground-truth positions:

$$D = \frac{1}{|p_G|} \sum_{i=1}^{|p_G|} \left\| p_i - \hat{p}_i \right\|_2, \tag{3}$$

where p_i ∈ p_G and p̂_i ∈ p̂_G are corresponding landmark points and |p_G| is the cardinality of p_G. A terminal reward of 10 is triggered when D falls below an assumed threshold ε, which in this paper is set to 1.
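For concreteness, a sketch of the reward computation under these definitions follows; taking the per-step reward to be the decrease in D between consecutive steps is our assumption, in line with the TRE-based reward of [11], while the terminal bonus and threshold are as stated above.

```python
import numpy as np

def mean_landmark_distance(p_gt, p_warped):
    """D of Eq. (3): mean Euclidean distance over all landmark pairs."""
    return np.linalg.norm(p_gt - p_warped, axis=1).mean()

def step_reward(p_gt, p_prev, p_curr, eps=1.0, terminal_bonus=10.0):
    """Reward for one action: the decrease in D (assumed), or the terminal
    bonus of 10 once D drops below the threshold eps = 1."""
    d_prev = mean_landmark_distance(p_gt, p_prev)
    d_curr = mean_landmark_distance(p_gt, p_curr)
    if d_curr < eps:                      # terminal state reached
        return terminal_bonus, True
    return d_prev - d_curr, False         # positive if the action improved alignment
```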

D. ATTENTION-AWARE DEEP REINFORCEMENT LEARNING MODEL
For an RL-based image registration method, feature extraction has proved to be an essential component [11]. Some methods [4], [12], [24] propose to use deeper CNNs to extract features, since deeper networks increase the receptive field, so that more contextual information can be used to infer high-frequency features for registration [11]. Others [12], [27] found that combining a CNN with a recurrent neural network (RNN) enables the model to extract spatio-temporal features.
Despite their proven effectiveness, in the 3D scenario deeper networks pose a more severe challenge for RL training. Our motivation for incorporating the attention mechanism is to form a lightweight attention-aware RL model that overcomes the difficulties posed by cross-modal data from the perspective of automatic feature extraction, and mitigates the obstacle that high dimensionality places on the training of the RL framework.
In recent years, attention mechanisms have been used increasingly in computer vision [28], [29]. More specifically, according to Woo et al. [30], channel attention concentrates on what is significant in the input feature map, whereas spatial attention focuses on where the informative portion lies. Since using a deeper neural network would inevitably increase computational complexity, this work instead leverages attention mechanisms to allow the proxy to focus on more critical areas. Inspired by the convolutional block attention module (CBAM) [30], we combine CNN layers with attention blocks to form a lightweight and easy-to-embed network structure, which reduces the computational overhead and improves the feature-awareness of the model (see Fig. 2).

In our attention-aware deep RL model, a CNN-RNN with an attention mechanism represents both the actor and the critic. As shown in Fig. 1, a unified neural network calculates the policy function and the value function: the output of the final fully connected layer is divided into two parts, one giving the probability of selecting actions a1 to a12, and the other giving the state value V, which evaluates whether the current action is beneficial for image registration. In our experiments, a shallow network with 6 CNN layers is used; the first layer has 5 × 5 × 5 kernels to increase the receptive field, the remaining layers use 3 × 3 × 3 kernels, and pooling layers with a stride of 2 follow layers 2, 4 and 6. The network thus extracts high-level hierarchical features that encode the contextual information of the original input. This allows the proposed approach to focus not only on surface features but also on high-level abstract features automatically acquired by the network, and makes it robust to variations in image appearance and noise [25].

To minimize the impact of CBAM on training speed, we do not attach a full CBAM block behind each CNN layer; instead, we experimentally split CBAM into channel attention and spatial attention, so that each CNN layer is followed by one type of attention layer. In particular, the first 2 CNN layers are followed by channel attention and the next 4 by spatial attention. The attention map obtained from the channel attention modules is then passed on to the spatial attention modules to learn precisely where the area of interest is, which helps to discriminate the subtle texture differences between moving and fixed images. In other words, this allows the proxy to focus more on the subject of the image during the registration process. Note that we use the long short-term memory (LSTM) model as the RNN in our experiments, since it can discover changes between different states over long-term features. Considering that the RL proxy does not have any prior knowledge about the input images, we initialize the hidden state of the LSTM with zeros. Overall, adaptive feature refinement is achieved by integrating spatial and channel attention into a lightweight CNN architecture, which yields better alignment accuracy.
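The following condensed PyTorch sketch illustrates the described architecture; the attention modules follow the CBAM formulation of Woo et al. [30], while the channel widths, reduction ratio, and LSTM size are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: 'what' to emphasize in the feature map."""
    def __init__(self, c, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, max(c // r, 1)), nn.ReLU(),
                                 nn.Linear(max(c // r, 1), c))
    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3, 4)))
        mx = self.mlp(x.amax(dim=(2, 3, 4)))
        return x * torch.sigmoid(avg + mx)[:, :, None, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: 'where' the informative regions are."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class AttentionActorCritic(nn.Module):
    def __init__(self, n_actions=12, hidden=256):
        super().__init__()
        chs = [2, 16, 32, 32, 64, 64, 64]         # input: fixed + warped volume
        layers = []
        for i in range(6):
            k, p = (5, 2) if i == 0 else (3, 1)   # 5x5x5 first layer, 3x3x3 after
            layers += [nn.Conv3d(chs[i], chs[i + 1], k, padding=p), nn.ReLU()]
            # channel attention after the first 2 conv layers, spatial after the rest
            layers.append(ChannelAttention(chs[i + 1]) if i < 2 else SpatialAttention())
            if i in (1, 3, 5):                    # stride-2 pooling at layers 2, 4, 6
                layers.append(nn.MaxPool3d(2))
        self.features = nn.Sequential(*layers)
        self.lstm = nn.LSTMCell(64 * 8 * 8 * 8, hidden)
        self.policy = nn.Linear(hidden, n_actions)   # actor head (a1..a12)
        self.value = nn.Linear(hidden, 1)            # critic head V(s)
    def forward(self, x, hc):
        h, c = self.lstm(self.features(x).flatten(1), hc)
        return torch.softmax(self.policy(h), dim=-1), self.value(h), (h, c)

net = AttentionActorCritic()
state = torch.randn(1, 2, 64, 64, 64)              # (fixed, warped) 64^3 pair
hc = (torch.zeros(1, 256), torch.zeros(1, 256))    # all-zero LSTM initialization
probs, value, hc = net(state, hc)
```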

E. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC
In this paper, we utilize asynchronous advantage actor-critic (A3C) [14] to train the policy network and the value network of our image registration framework. As shown in Fig. 3, A3C contains a global network and multiple worker proxies. Each worker proxy consists of a value network V and a policy network π and interacts with its own copy of the environment to update the network parameters. Specifically, at the beginning of each episode, the worker proxy pulls parameters from the global network; it then interacts with its own copied environment and pushes each updated worker network parameter to the global network. The updated network parameters are therefore shared globally. According to [14], the loss of the value network is defined as

$$L_v = \left( R_t - V(s_t; \theta_v) \right)^2, \tag{4}$$

where R_t is the discounted sum of rewards up to the terminal time step T, with a discount factor γ ∈ (0, 1]:

$$R_t = \sum_{i=0}^{T-t-1} \gamma^i r_{t+i} + \gamma^{T-t} V(s_T; \theta_v).$$

The loss of the policy network is defined as

$$L_\pi = -\log \pi(a_t \mid s_t; \theta_\pi)\left(R_t - V(s_t; \theta_v)\right) - \beta H\!\left(\pi(\cdot \mid s_t; \theta_\pi)\right), \tag{5}$$

where H(·) is the entropy and β is a regularization factor. π(·|s_t; θ_π) and V(s_t; θ_v) are represented by a unified neural network in our approach, with parameters θ_π and θ_v, respectively. Thus, the final loss of the combined network is

$$L = L_\pi + c_v L_v, \tag{6}$$

where c_v is a preset constant coefficient that controls the balance between the two losses.

Algorithm 1 Asynchronous DRL-Based Image Registration
Input: cross-modal image pairs {I_m, I_f}
Output: I_m aligned with I_f
Initialization: global episode e ← 0, maximum episode e_max; time step t_i ← 0, maximum step t_max; coefficient c_v; globally shared network parameters θ_π and θ_v; thread-specific network parameters
The complete training procedure for the proposed method is given in Algorithm 1. In practice, the value network and the policy network share the same network structure. We use stochastic gradient descent (SGD) to update the model parameters.
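For illustration, the combined loss of Eqs. (4)-(6) over one rollout segment can be sketched as follows; the default value of c_v is an assumption.

```python
import torch

def a3c_loss(rewards, log_probs, values, entropies, bootstrap_value,
             gamma=0.99, beta=0.01, c_v=0.5):
    """Combined loss L = L_pi + c_v * L_v over one rollout segment.
    bootstrap_value is V(s_T; theta_v), or 0 at a terminal state."""
    R = bootstrap_value
    policy_loss, value_loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R                       # discounted return R_t
        advantage = R - values[t]
        value_loss = value_loss + advantage.pow(2)       # L_v, Eq. (4)
        policy_loss = (policy_loss - log_probs[t] * advantage.detach()
                       - beta * entropies[t])            # L_pi, Eq. (5)
    return policy_loss + c_v * value_loss                # Eq. (6)
```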

F. MONTE CARLO ROLLOUT IN THE TESTING PHASE
In the training procedure, the terminal state is reached when the Euclidean distance between the registered landmarks and the target landmarks (D in Eq. (3)) is no greater than 1. Determining the terminal state during testing is, however, arduous, owing to the lack of true landmark locations. A feasible solution is to define a new stop action, performed when the termination state is observed, to end the registration sequence; however, this would inevitably enlarge the action space, and the testing time grows whenever the terminal action fails to trigger. Therefore, several RL-based methods simply run the same number of steps in the testing phase as in the training phase. However, we have observed that once the proxy has learned the optimal policy, the task terminates within a few steps, far fewer than the pre-defined maximum. In this work, we therefore use a Monte Carlo rollout (see Eq. (7)) to overcome the unpredictable terminal state problem of the testing phase. Specifically, instead of setting a unique termination condition, we launch 20 simulated alignment paths to proceed with the search when the observed state value of s_T reaches a pre-defined threshold (9 in our experiments). Each path treats s_T as its initial state, performs 10 actions according to the policy π(·|s_T; θ_π), and obtains the associated state values. The transformation matrix at the end state of each path and the cumulative state value of the corresponding path can thus be computed by the proxy. The final transformation matrix is obtained as a weighted average of the transformation matrices of all paths, weighted by their cumulative state values:

$$T_{\text{final}} = \frac{\sum_{k=1}^{20} V_k \, T_k}{\sum_{k=1}^{20} V_k}, \tag{7}$$

where T_k = [t_{xk}, t_{yk}, t_{zk}, θ_{xk}, θ_{yk}, θ_{zk}] is the transformation obtained along the k-th simulated path, and V_k = Σ_{l=T}^{T+10} v_l is the cumulative state value of this path, with v_l the state value of s_l.
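A sketch of this test-time procedure is given below; env.clone(), policy_step, and the 6-vector transform parameterization are hypothetical helpers standing in for the environment and the trained policy.

```python
import numpy as np

def mc_rollout(env, policy_step, n_paths=20, n_steps=10):
    """Value-weighted average of Eq. (7): each path starts from the candidate
    terminal state s_T and follows the policy for n_steps actions."""
    params, weights = [], []
    for _ in range(n_paths):
        sim = env.clone()                        # hypothetical: copy of state s_T
        cum_value = 0.0
        for _ in range(n_steps):
            cum_value += policy_step(sim)        # acts in-place, returns V(s_l)
        params.append(sim.transform_params())    # [tx, ty, tz, rx, ry, rz]
        weights.append(cum_value)
    w = np.asarray(weights)
    return (w[:, None] * np.asarray(params)).sum(axis=0) / w.sum()
```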

IV. EXPERIMENTS
A. DATASETS
Our experimental dataset on head registration in patients with nasopharyngeal carcinoma includes both CT and MR modalities, acquired from 98 patients at West China Hospital. The high-resolution T1-weighted images were acquired using a 3D MPRAGE sequence (0.61 × 0.61 × 0.8 mm³ nominal resolution, TR/TE = 3000/2.14 ms, where TR denotes repetition time and TE echo time, and flip angle = 8°), and the resolution of the CT images varies from 0.88 × 0.88 × 3.0 mm³ to 0.97 × 0.97 × 3.0 mm³. All patients who participated in the research were informed about the study procedures and agreed to contribute data. The Department of Radiology at West China Hospital approved the research protocol.
We first cropped all raw CT and MR images by retaining the area from the eyebrows to the chin, and then resampled the cropped images to an isotropic resolution of 1 mm. Supervised rigid registration requires a ground truth. Elastix [31] is an off-the-shelf registration package that has shown notable performance on many datasets [12], [32]. We therefore register the 3D CT images to the 3D MR images using Elastix and manually verify all the pre-registered images visually to ensure their reliability; these pre-registered CT images are then used as the ground truth. All images are normalized by Min-Max scaling.
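For completeness, the Min-Max scaling step can be sketched as follows; the cropping and resampling steps would be performed with a standard medical-imaging toolkit and are omitted here.

```python
import numpy as np

def min_max_normalize(volume):
    """Scale voxel intensities to [0, 1]; the epsilon guards constant volumes."""
    vmin, vmax = volume.min(), volume.max()
    return (volume - vmin) / (vmax - vmin + 1e-8)
```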
We randomly select 80 of the 98 image pairs as training data; the remainder are testing pairs. Before training starts, the ground truth of each moving image is randomly perturbed to generate the moving images used for training. To better evaluate the robustness of the model, we use images with a larger range of perturbations in the test set. Specifically, for the training phase, the random perturbation range of the rigid transformation is within [±20 pixels, ±20 pixels, ±20 pixels, ±20°, ±20°, ±20°]. For the testing phase, we generated two testing datasets: E1 has the same range of perturbations as the training dataset, while E2 uses the wider range [±30 pixels, ±30 pixels, ±30 pixels, ±30°, ±30°, ±30°].
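A sketch of how such perturbed pairs can be drawn, assuming uniform sampling within the stated ranges, is given below.

```python
import numpy as np

def random_perturbation(max_trans=20.0, max_rot=20.0, rng=np.random):
    """[tx, ty, tz, rx, ry, rz]: translations in pixels, rotations in degrees,
    each drawn uniformly from a symmetric range."""
    t = rng.uniform(-max_trans, max_trans, size=3)
    r = rng.uniform(-max_rot, max_rot, size=3)
    return np.concatenate([t, r])

train_params = random_perturbation()                          # training / E1 range
test_params = random_perturbation(max_trans=30, max_rot=30)   # harder E2 range
```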
Additionally, to assess the clinical feasibility of the proposed approach, we requested and were granted access to the CERMEP-IDB-MRXFDG database [33], which contains 37 high-quality images of the human brain from different modalities (CT, FDG PET, T1 and FLAIR MRI) in subjects aged 23 to 65 years. Note that all images in the database are aligned to the standard Montreal Neurological Institute (MNI) space and hence provide ready-to-use ground truth. In our experiments, the FDG PET images serve as fixed images and the T1 MR images as moving images. The T1 MR images were acquired using a 3D MPRAGE sequence (1.2 × 1.2 × 1.2 mm³, TR/TE = 2400/3.55 ms, inversion time = 1000 ms, flip angle = 8°), and the resolution of the PET images is 2.04 × 2.24 × 2.03 mm³. 30 of the image pairs are selected for training and 7 for testing. The other pre-processing steps are consistent with the 3D CT-MR dataset above, and two test datasets, E1-PET and E2-PET, are used, corresponding to E1 and E2 above.

B. COMPARISON METHODS AND EVALUATION METRIC
In this paper, we compare our method with several state-of-the-art algorithms, including Elastix [31], AIRNet [4], Att-Reg [34], Liao's method [24], and CNN-CLSTM [11]. Elastix is widely used registration software that accomplishes alignment by maximizing mutual information. Liao's method pioneered the application of reinforcement learning to registration and achieves high-accuracy alignment using a DQN with a greedy strategy. AIRNet is a self-supervised image registration network; it uses an encoder to enhance the feature-extraction capability of the network and directly predicts the transformation matrix parameters of the input image pairs. Att-Reg is a deep-learning method designed for cross-modal rigid registration, which incorporates a cross-modal attention mechanism into the CNN layers and achieves very good results. The CNN-CLSTM method uses a multi-proxy actor-critic framework and replaces the LSTM layer with a ConvLSTM layer to accomplish 2D rigid registration.
Among these five algorithms, AIRNet, Att-Reg, and Liao's method are designed for 3D registration. In addition, for the sake of comparison credibility, we do not use the hierarchical registration strategy of Liao's method. Target registration error (TRE) is used as the similarity metric in all experiments; it is calculated as the root-mean-square distance between all landmark points extracted from the warped image and their ground-truth counterparts:

$$\mathrm{TRE} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left\| T \circ p_i - p_i^{GT} \right\|_2^{2}}, \tag{8}$$

where p_i^{GT} and p_i are the i-th landmark from the ground truth and the moving image, respectively, T is the transformation matrix computed from the current state, and N_p is the total number of landmarks. In the subsequent experiments, we used the random voxel points (or pixels), including the diagonals, as landmarks.
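Under this definition, the TRE computation can be sketched as follows, with landmarks kept in homogeneous coordinates so that the 4 × 4 transform of Eq. (2) applies directly.

```python
import numpy as np

def tre(p_moving_h, p_gt, T):
    """Eq. (8): RMS distance between transformed moving-image landmarks and
    ground truth. p_moving_h: (N, 4) homogeneous coordinates; p_gt: (N, 3);
    T: 4x4 transformation matrix."""
    warped = (p_moving_h @ T.T)[:, :3]
    return np.sqrt((np.linalg.norm(warped - p_gt, axis=1) ** 2).mean())
```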

C. EXPERIMENTAL SETTINGS
For the training process, we used 8 asynchronous threads, each trained for 45000 episodes. We adopt the Adam optimizer with a learning rate of 0.0001, and set the regularization factor β to 0.01 and the reward discount factor γ to 0.99. The model receives a termination reward when D, as defined in Eq. (3), is no greater than 1. An episode ends when the number of training steps reaches 500 or the termination reward is triggered. The parameters are updated every 30 training steps. To avoid the overfitting caused by selecting duplicate images in adjacent rounds, we randomly select a pair of images and generate different transformation matrices for the different threads in each episode. Specifically, at the beginning of each episode, an MR image and a ground-truth CT image are randomly selected, with the MR image as the fixed image, and a random translation and rotation are applied to the ground-truth CT image to produce a moving image. During the testing phase, we sequentially read the pre-generated fixed-moving image pairs from the test datasets.
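For reference, the hyperparameters stated above can be gathered into a single configuration; the key names are illustrative, not the authors' code.

```python
# Training settings stated in this section; key names are illustrative.
TRAIN_CONFIG = {
    "num_workers": 8,               # asynchronous threads
    "episodes_per_worker": 45000,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "entropy_beta": 0.01,           # regularizer beta
    "gamma": 0.99,                  # reward discount factor
    "terminal_threshold": 1.0,      # termination reward when D <= 1 (Eq. (3))
    "max_steps_per_episode": 500,
    "update_interval": 30,          # parameter update every 30 training steps
}
```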

D. METHOD EVALUATION
To evaluate the proposed approach, we compared it with several advanced image registration methods; the results are presented in Table 1. As shown in Table 1, the proposed method significantly outperforms the other methods and achieves state-of-the-art performance. Moreover, it is worth noting that registered images can be considered perfectly aligned when the TRE score is close to 1, since we set TRE ≤ 1 as the termination condition during training (see Section III-C for details). 3D image registration is considerably challenging, and the proposed method achieves state-of-the-art performance: 0.85 on E1 and 1.16 on E2. Elastix is incapable of capturing the spatial relationships between features, a limitation that is magnified in 3D volumes. Att-Reg has the fastest alignment speed and achieves a considerable result on the E1 dataset, but struggles with the E2 dataset. Liao's method is unable to extract spatio-temporal features, and the DQN algorithm tends to fall into local optima, resulting in performance degradation on the E2 dataset. The CNN-CLSTM method achieves excellent performance on 2D datasets but fails on 3D images, because 3D images contain more redundant information than 2D images, which makes feature extraction arduous. Compared with the above methods, the proposed approach achieves better performance on 3D datasets. This can be attributed to the fact that the proposed method perceives features in both the spatial and channel dimensions. Besides, more cues are captured by the proposed method by removing the redundancy of 3D kernels in a topologically constrained manner, which leads to greater generalization capability when dealing with more complex image pairs. A visual comparison of the different 3D registration methods on the CT-MR dataset is given in Fig. 4 and Fig. 5. Fig. 4 shows the performance of the various methods on the E1 dataset. As can be seen in Fig. 5, E2 poses a more severe challenge due to the large range of missing image content and wider displacements; our method still achieves good performance on this challenging dataset. Additionally, we carried out a 5-fold cross-validation experiment: we divide the dataset into 5 parts, each time taking 4 parts for training and 1 part for testing, and record the average TRE of each test. The results, shown in Table 2 and Table 3, indicate that our model is reliable, whereas Att-Reg and AIRNet generate more irrational transformations when faced with images with a larger range of perturbations.
For the PET-MR registration experiments, the obscure borders of the PET images and the correspondence between brain tissue from different modalities make the mapping of coordinates extraordinarily hard. As shown in Table 4, our model still achieves state-of-the-art performance. Note that E1-PET denotes a test dataset with the same perturbation range as the training images, while E2-PET is a test dataset with the larger range [±30 pixels, ±30 pixels, ±30 pixels, ±30°, ±30°, ±30°]. A visual comparison on the E2-PET dataset is shown in Fig. 6. The merit of the proposed model can be attributed to the inclusion of the attention mechanism, where we have traded a small sacrifice in speed for a significant boost in feature-awareness. Overall, the DRL-framework-based approach outperforms both traditional image registration methods and deep-learning-based methods; in terms of execution speed the deep-learning-based approaches have a tremendous advantage, but they show poor generalization performance and cannot align images with large displacement ranges.

V. ABLATION STUDY
The experiments above make it clear that incorporating the attention mechanism into the RL model yields more effective and robust registration than the other methods. In this section, we discuss the importance of the attention mechanism, the ordering of the two attention sub-modules, and the choice of landmarks.

A. THE ATTENTION MODEL
In contrast to other RL-based image registration methods, an attention mechanism is incorporated into our method. Note that this differs from the hierarchical registration mechanism used in Liao's method, which was proposed to handle complex anatomical structures. With the attention mechanism, the proposed method can focus on information that is beneficial to registration during the learning process while suppressing distracting information.
To evaluate the importance of the attention mechanism in image registration, we compared our approach with two RL-based registration methods on 3D CT-MR images: one uses a CNN-LSTM that exploits spatial features only, and the other uses a CNN-ConvLSTM that is able to exploit spatio-temporal features [11]. To ensure a fair comparison, the CNN layers are configured identically for all methods. As shown in Table 5, the proposed method achieves the best TRE score, which indicates that leveraging the attention mechanism helps the proxy focus on more important image regions. Nevertheless, as can be observed from Fig. 7, the CNN-LSTM method converges slightly faster than the proposed method. We believe this is due to the additional pooling and convolution operations introduced by the attention blocks in each layer, which slow down the training process.
Since there are two sub-modules in our attention mechanism, we also test whether the sequential order of these two sub-modules influences the accuracy and time cost of our method. We divided the ablation studies into two groups: the first group adds the entire CBAM attention block directly after each convolutional layer, and the other group splits the attention block into a channel attention and a spatial attention model and adds only one of the two after each CNN layer. In this way, we obtained six different ways of adding the attention model: ''AllCht'' denotes using channel attention only; ''AllSpt'' denotes using spatial attention only; ''SptFst'' uses the spatial attention module prior to the channel attention module; ''CBAM'' denotes that each CNN layer is followed by both the channel and the spatial attention module; ''CBAM-rev'' reverses the order of the sub-modules in CBAM; the proposed method uses the channel attention module prior to the spatial attention module. Table 6 and Table 7 show the TRE results. Earlier work has shown that channel attention makes a remarkable contribution to the enhancement of significant features [35]; yet when working with complicated 3D volumes, neither spatial attention nor channel attention alone can effectively boost the training of the model, as the ''AllCht'' and ''AllSpt'' experiments in Table 6 demonstrate. We believe this is because, when every CNN layer uses channel attention, the channels are squeezed at each layer to extract meaningful channel information, and the shallow network used in our model loses the compressed channel information, while using spatial attention only is detrimental to the extraction of global information. As can be seen from ''CBAM'' and ''CBAM-rev'' in Tables 6 and 7, taking the two together as a unified module inevitably increases the time cost of training the model, however elegantly the module is designed. Furthermore, similar to the experiments of Woo et al. [30], prioritizing spatial attention over channel attention does not achieve the best performance. Based on the above experimental results and the statement of Woo et al., we draw the conclusion that applying both channel-level global attention and spatial-level local attention in a DRL-framework-based approach to aligning 3D volumes enhances the alignment accuracy and accelerates the training of the model.

B. THE CHOICE OF LANDMARKS
Landmark error (LME) is an important indicator of registration accuracy; it measures the error of corresponding landmarks in the warped and fixed images. Given the high complexity of an RL model, diagonal and random points were used in our method for 3D CT-MR registration. However, these points do not reflect the anatomical structures in the image.
Therefore, we carried out an ablation study to examine the importance of feature points in our method. Edge points, computed by a 3D Canny edge detector, were used as new landmarks; using edge points makes the alignment focus more on the brain contours. As shown in Table 8, using edge points as landmarks helps the RL model achieve better results on the E1 dataset, but gives inferior performance on the E2 dataset. We believe this can be attributed to the fact that the larger perturbations cause images in the E2 dataset to have missing parts; the Canny detector then has difficulty extracting complete image edges, and the obtained landmark points are incapable of representing the image structure.

VI. CONCLUSION
In this paper, we propose a new learning paradigm for multi-modal alignment based on deep reinforcement learning. Unlike other RL-based image alignment methods, our approach extracts features through spatial and channel attention mechanisms and then uses an LSTM network to exploit spatio-temporal image features in a unified deep policy and value network. The A3C algorithm is used for training. We evaluate our method on a 3D CT-MR dataset and a 3D PET-MR dataset and find that it achieves state-of-the-art performance. Our future work will extend to more complex image registration tasks such as deformable registration and unsupervised registration.