A Spatiotemporal Agent for Robust Multimodal Registration

Multimodal image registration is a crucial step in a variety of medical applications, as it provides complementary information from the combination of various data sources. Conventional image registration methods aim to find a suitable similarity metric as well as a descriptive image feature, which is quite challenging due to the high diversity of tissue appearance across modalities. In this paper, we present a novel approach that registers images automatically via an asynchronously trained reinforcement learning agent. Within this approach, a convolutional gated recurrent unit (ConvGRU) is incorporated after stacked convolutional layers to extract both spatial and temporal features of neighboring frames and implicitly learn the similarity metric. Moreover, we propose a customized reward function driven by fixed-points error (FPE) to guide the agent in the correct registration direction. A Monte Carlo rollout strategy is also leveraged to perform look-ahead inference and eliminate jitter in the test stage. Evaluation is performed on paired CT and MR images from patients diagnosed with nasopharyngeal carcinoma. The results demonstrate that our method achieves state-of-the-art performance in medical image registration.


I. INTRODUCTION
Multimodal registration is concerned with bringing two or more images from different modalities into spatial alignment. Usually, the fixed image remains stationary, while the moving image is transformed to match it. Fig. 1 shows some examples of image pairs to be registered. Image registration is a vital component of current clinical applications, such as treatment planning and minimally invasive surgery [23]. For example, combining Computed Tomography (CT) and Magnetic Resonance (MR) images is beneficial for radiotherapy treatment planning, and registration of brain images can help monitor the development of brain cancer.
Traditional image registration methods can be categorized into two classes: intensity-based [2] and feature-based [3]. Intensity-based methods operate directly on image intensities via similarity metrics such as sum-of-squared-differences, mutual information, and the correlation coefficient [24]. Feature-based methods register images by using correspondences between image features, including gradients, geometric shapes, contours, landmarks, and Gabor filter responses [23]. However, both kinds of registration methods rely on handcrafted image features, which adapt poorly to multimodal image registration. Deep learning (DL) methods, by contrast, are capable of automatically learning features from raw data, and they have changed the landscape of image registration research.
In deep-learning-based methods, image registration is regarded as a data-driven optimization problem. Registration is achieved by maximizing a similarity metric between a moving image and a fixed image, and the corresponding transformation parameters are learned directly. Convolutional Neural Networks (CNNs) are the most commonly used techniques in recent computer vision tasks such as image classification, segmentation, and object detection [34]. In registration, CNNs are employed to extract hierarchical features embedded in the raw visual input and to learn the high-level relationship between images [6]. Deep learning techniques used in registration can avoid the common pitfall of intractable non-convex optimization in traditional registration methods, which often leads to local optima, e.g., null gradients at saddle points [17]. Moreover, the highly parallel nature of CNNs makes the registration process fast and computationally efficient. Work in [18] was the first to use a CNN to directly estimate the transformation parameters of rigid registration. A hierarchical CNN regression model was proposed in that work, in which the six transformation parameters were divided into three groups to reduce the complexity of regression. The experiments demonstrated that this approach outperformed most traditional intensity-based approaches.
Meanwhile, as a breakthrough in combining image registration with deep learning techniques, deep reinforcement learning (DRL) has demonstrated its potential for mastering complicated control policies [4], [5]. The goal of RL methods is to learn a policy by trial-and-error in interaction with the environment, while deep neural networks provide powerful function approximation and representation learning capabilities. Under this framework, the registration problem can be reformulated as a sequential decision-making problem that mimics an expert performing image registration by selecting actions continually, as shown in Fig. 2. For example, Liao et al. [1] trained an agent in the RL framework to predict the optimal action that best enhances the alignment at each time step in a greedy supervised fashion. Ma et al. [6] adopted a dueling deep Q-network (DDQN) to train an artificial agent to register multimodal medical images with contextual information. However, the greedy strategy in [1] may not be globally optimal, and the DDQN architecture in [6] needs a huge amount of space to store the replay memory. Worse still, both methods ignore the temporal dependencies that may accelerate training. In light of these limitations, our previous work [16] presented an end-to-end RL model with a long short-term memory (LSTM) recurrent architecture [13] and a reward function driven by landmark error (LME), which needed no extra storage and explored the transformation parameter space freely. However, it still suffers from certain limitations. First, although spatial features are crucial for matching in the registration task, the fully-connected (FC) formulations in conventional LSTM, in both the input-to-state and state-to-state transitions, lack the flexibility to capture spatial contextual pixel connections between states. Second, the LME reward computation is rather complicated and hence aggravates the burden of RL training.
In this paper, we mainly focus on the feasibility of using DRL in 2D rigid registration and propose an asynchronous agent-based multimodal registration method that handles the aforementioned challenges. The main contributions of our work can be summarized as follows:
- Different from traditional registration algorithms, we reformulate 2D registration as a 3D sequence learning task, and we introduce a DRL framework derived from asynchronous advantage actor-critic (A3C) [9] that executes multiple agents in parallel to accelerate convergence. Moreover, using different agents to learn a global policy on multiple instances of the environment simultaneously decorrelates training into a more stable and effective process, since at any given time the parallel agents are experiencing a variety of different states.
- We propose a spatiotemporal agent that captures both spatial and temporal information of an image sequence by applying a convolutional gated recurrent unit (ConvGRU). Context information plays an important role in selecting actions in reinforcement learning, as well as in RL-based registration. The ConvGRU is used to obtain historical characteristics while preserving spatial connectivity across frames.
- We propose a novel reward function driven by fixed-points error (FPE), which is similar to the landmark error proposed in [16] but more computationally efficient. The FPE-based reward function also facilitates extension to 3D image registration.
- A Monte Carlo rollout (MC rollout) method is adopted at inference time to improve registration capability. Our network outperforms several state-of-the-art image registration methods on our dataset.
The remainder of this paper is organized as follows. Section II discusses related work on image registration and reinforcement learning. Section III describes the details of the proposed method. Experiments and discussion are presented in Section IV and Section V, respectively. The conclusion is given in Section VI.

II. RELATED WORK
A. IMAGE REGISTRATION
Much work has been done in medical image registration. Traditionally, intensity-based methods directly operate on the image intensity values, searching for a transformation that minimizes (or maximizes) a criterion (such as mutual information or the correlation coefficient) measuring the intensity resemblance of corresponding pixels in the two images [2]. However, this type of method usually needs a large number of iterations to evaluate the similarity measure, leading to a high computational cost. Feature-based methods select and match suitable features (such as geometric shapes, contours, and landmarks) between the images to be registered in order to estimate the ground-truth transformation model [25]. These traditional methods are simple to implement but rely heavily on handcrafted features, which have poor generalization ability for multimodal image registration.
In recent years, deep learning methods have attracted researchers' attention and have been widely used in medical image registration. Wu et al. [26] applied stacked auto-encoders to extract high-level image features for accurate image registration. However, this method is not end-to-end, and the extracted features still rely on other conventional feature-based registration methods. Learning similarity metrics between multimodal images through CNNs was proposed by Simonovsky et al. [27]; the learned metric outperforms mutual information by a large margin in the task of discriminating whether image patches from different modalities are aligned or misaligned. Miao et al. [18] formulated rigid registration as a regression task and used a CNN to predict the transformation matrix. Reference [19] proposed a self-supervised affine image registration network (AIRNet) that directly predicts the affine transformation parameters. AIRNet uses a densely connected CNN structure composed of two pathways that share the same parameters to capture discriminative features between different images.
There is another line of approaches to registration tasks inspired by deep reinforcement learning. For instance, Liao et al. [1] reformulated rigid registration as a greedy decision-making problem, learning to predict optimal actions in a supervised manner. However, the greedy strategy may be limited to local optima. Ma et al. [6] used a variant of the Q-learning algorithm, the dueling deep Q-network (DDQN) [32], to train an artificial agent to register multimodal medical images like a human expert. This approach is free to search the parameter space during training but is time-consuming. Our previous work [16] introduced a parallel CNN+LSTM architecture and a landmark error (LME) based reward function, which overcame the problems of Ma's method and improved the robustness of multimodal image registration. However, the conventional LSTM lacks the flexibility to capture spatial contextual information between consecutive frames, and the computation of LME is rather complicated and hence aggravates the burden of RL training.

B. REINFORCEMENT LEARNING
Reinforcement learning concerns an artificial agent interacting with an environment, trying to learn a strategy by trial-and-error over a sequence of discrete time steps. At each time step $t$, given a received state $s_t$ from the environment $E$, the agent takes an action $a_t \in A$ following a policy $\pi$, receives an immediate numerical reward $r_t \in R$ returned by the environment, and steps to the next state $s_{t+1}$ according to the state transition probabilities $P(s_{t+1}|s_t, a_t)$. The policy $\pi$ is often defined as a mapping from the current state to the probabilities of action selection. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the sum of discounted future rewards from state $s_t$ with a discounting factor $\gamma \in (0, 1]$. The goal of the agent in RL is to seek an optimal policy $\pi^*$ that maximizes $R_t$. Typically, RL methods can be divided into two types: value-based and policy-based. The action value $Q_\pi(s, a)$ represents the expected return for selecting action $a$ in state $s$ and following policy $\pi$ thereafter. Value-based methods, such as Q-learning and Sarsa [8], iteratively update the Q value to predict the expected cumulative reward following the Bellman equation $Q^*(s, a) = \mathbb{E}_{s' \sim E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$. As a breakthrough in combining RL with deep learning techniques, DQN demonstrated its potential for mastering difficult control policies for Atari games [7]. Thereafter, a variety of extensions of value-based algorithms have been proposed to improve stability and efficiency, such as Double DQN [30], prioritized experience replay [31], and Dueling DQN [32].
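As a concrete illustration, the tabular form of the Bellman update above can be sketched in a few lines of Python; the two-state chain, the action names, and the learning rate alpha below are illustrative assumptions, not part of the paper:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# Tiny illustrative run on a 2-state chain: acting "right" in state 0 yields reward 1,
# so Q(0, "right") contracts toward its fixed point r + gamma * max_a Q(1, a) = 1.
Q = defaultdict(float)
actions = ["left", "right"]
for _ in range(100):
    q_learning_update(Q, 0, "right", 1.0, 1, actions)
```

Each update halves the remaining error with alpha = 0.5, so after 100 iterations the estimate has effectively converged.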
In contrast to value-based methods, policy-based methods directly learn a parameterized policy $\pi_\theta$ by performing gradient ascent on a scalar performance measure $J(\theta)$. An example of this kind of method is the REINFORCE algorithm [8], which updates the parameters $\theta$ in the direction $\nabla_\theta \log \pi(a_t|s_t; \theta)(R_t - b_t)$, where $b_t$ denotes a baseline used to reduce the variance. One commonly used baseline is the learned value function of the state, $V(s)$. Noticing that the action value $Q(s, a)$ is an estimate of $R_t$, the approximate advantage of action $a$ in state $s$ can be defined as $A(s, a) = Q(s, a) - V(s)$. This approach is then an actor-critic architecture, where the policy $\pi$ is called the actor and the baseline $b_t$ is called the critic [8].
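The discounted return and the advantage that scales the policy-gradient direction can be sketched as follows; the episode rewards and baseline values are made up for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

def advantages(returns, values):
    """A_t = R_t - V(s_t): the scalar weighting grad log pi(a_t|s_t) in REINFORCE."""
    return [R - V for R, V in zip(returns, values)]

rewards = [0.0, 0.0, 1.0]            # illustrative 3-step episode
returns = discounted_returns(rewards)  # [0.99^2, 0.99, 1.0]
adv = advantages(returns, [0.5, 0.5, 0.5])
```

Subtracting the baseline does not change the expected gradient direction but reduces its variance, which is why the critic's value estimate is the usual choice.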

III. METHOD
The goal of our method is to automatically align medical images across modalities. The fixed image and the moving image are denoted by $I_f$ and $I_m$, respectively. The mapping from $I_f$ to $I_m$ is a rigid spatial transformation $T$ parameterized by $[t_x, t_y, s, \theta]$. $T$ maps a point $(x_1, y_1)$ of the moving image to a point $(x_2, y_2)$ of the fixed image as follows:
$$x_2 = s(x_1 \cos\theta - y_1 \sin\theta) + t_x, \qquad y_2 = s(x_1 \sin\theta + y_1 \cos\theta) + t_y,$$
where $[t_x, t_y]$ denotes translation, $s$ denotes scaling, and $\theta$ denotes rotation. We reformulate image registration as the decision task of taking a series of sequential actions to align the images; in this way, the optimal transformation $T^*$ is found. The registration environment is described in Section III-A. We train the spatiotemporal agent (see Section III-B for the agent architecture) through interactions with the environment to learn a registration policy, which selects the alignment action with the highest probability in a given state at each time step $t$, using MC rollout (see Section III-C). An overview of the framework is shown in Fig. 3. The following sections elaborate on the proposed registration method.
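A minimal sketch of this point mapping in Python, assuming the common convention of scaling and rotating about the origin before translating (the paper's parameter list $[t_x, t_y, s, \theta]$ does not spell out the composition order):

```python
import math

def rigid_transform(point, tx, ty, s, theta_deg):
    """Map (x1, y1) of the moving image to (x2, y2) of the fixed image:
    x2 = s*(x1*cos(theta) - y1*sin(theta)) + tx
    y2 = s*(x1*sin(theta) + y1*cos(theta)) + ty
    """
    x1, y1 = point
    th = math.radians(theta_deg)
    x2 = s * (x1 * math.cos(th) - y1 * math.sin(th)) + tx
    y2 = s * (x1 * math.sin(th) + y1 * math.cos(th)) + ty
    return (x2, y2)

# Identity parameters leave a point unchanged.
assert rigid_transform((3.0, 4.0), 0, 0, 1.0, 0) == (3.0, 4.0)
```

With four degrees of freedom (two translations, one rotation, one isotropic scale), this is a 2D similarity transform, which matches the rigid parameterization used throughout the paper.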

A. ENVIRONMENT
The environment consists of a fixed image, a ground-truth moving image, a moving image, and a ground-truth transformation matrix. In every episode, the environment randomly picks a pair consisting of a fixed image and its ground-truth moving image, and then randomly generates a transformation matrix $T_0$, which is applied to the ground-truth moving image to generate the moving image used as the current training data. The goal of the agent in this environment is to sequentially align the generated moving image to the fixed image.
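A hypothetical episode-reset routine along these lines might look as follows; the perturbation ranges, parameter names, and image placeholders are illustrative, not the paper's exact training settings:

```python
import random

def reset_episode(pairs, trans_range=25, rot_range=30, scale_range=0.1):
    """Start an episode: pick an aligned (fixed, ground-truth moving) pair and
    sample a random rigid perturbation T0 to generate the moving image."""
    fixed, gt_moving = random.choice(pairs)
    T0 = {
        "tx": random.uniform(-trans_range, trans_range),
        "ty": random.uniform(-trans_range, trans_range),
        "theta": random.uniform(-rot_range, rot_range),
        "s": random.uniform(1 - scale_range, 1 + scale_range),
    }
    # The moving image handed to the agent would be T0 applied to gt_moving.
    return fixed, gt_moving, T0

random.seed(0)
fixed, gt_moving, T0 = reset_episode([("mr_slice", "ct_slice")])
```

Sampling a fresh perturbation each episode effectively augments the dataset, which is why the agent can see far more distinct training pairs than there are aligned image pairs.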

1) STATE AND ACTION
The state $s_t$ is represented by a stacked tensor of $I_f$ and $I_m^t$, where $I_f$ is the fixed image to be registered over the whole trajectory and $I_m^t$ is the moving image transformed by $T_t \circ I_m$. Upon receiving a state, the agent chooses an action $a_t$ from the action space $A$, which consists of 8 candidate transformations: translations of $\pm 1$ pixel in the x and y directions, rotation by $\pm 1°$, and scaling by $\pm 0.05$. Fig. 3 illustrates the details of the state and action spaces. Specifically, after action selection, the transformation matrix $T_t$ at time $t$ is updated to $T_{t+1}$ by adjusting the corresponding parameter, and the environment steps to the next state $s_{t+1} = (I_f, T_{t+1} \circ I_m)$. In this paper, we assume that the state is fully observable, so that the RL framework can be formulated as a standard Markov Decision Process (MDP).
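The 8-action space and the parameter update $T_t \rightarrow T_{t+1}$ can be sketched as follows; the ordering of the actions is our assumption for illustration:

```python
# The 8 candidate actions each adjust one of [tx, ty, theta, s] by a fixed step.
ACTIONS = [
    ("tx", +1.0), ("tx", -1.0),        # translate x by ±1 pixel
    ("ty", +1.0), ("ty", -1.0),        # translate y by ±1 pixel
    ("theta", +1.0), ("theta", -1.0),  # rotate by ±1 degree
    ("s", +0.05), ("s", -0.05),        # scale by ±0.05
]

def apply_action(params, action_id):
    """Update the transformation parameters T_t -> T_{t+1} without mutating T_t."""
    key, delta = ACTIONS[action_id]
    new_params = dict(params)
    new_params[key] += delta
    return new_params

T_t = {"tx": 0.0, "ty": 0.0, "theta": 0.0, "s": 1.0}
T_next = apply_action(T_t, 0)   # one step of +1 pixel along x
```

Because each action changes exactly one parameter by a small fixed step, the agent explores the 4D transformation space as a sequence of discrete moves, which is what makes the MDP formulation tractable.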

2) REWARD
The reward function $R(s_t, a_t)$ reflects the expected return of action $a_t$ taken in state $s_t$, providing immediate feedback for the RL agent's decisions. The matrix-based reward function used in [6] is inversely proportional to the Euclidean distance between $T_g$ and $T_t$. However, the unit discrepancy between scaling and the other two parameters in the transformation matrix can cause erratic training. Although the landmark error (LME) reward function helps convergence [16], searching for key-points using the scale-invariant feature transform (SIFT) [22] is burdensome and makes [16] hard to extend to 3D registration. In rigid registration, the transformation $T$ is applied to the entire moving image, including points in the background. We therefore propose a fixed-points error (FPE) reward that considers only the coordinates of the points, regardless of the corresponding pixel values or whether they lie in the background area. Specifically, we pre-choose a group of fixed points shared by all training image pairs. These fixed points $P_G$ are uniformly selected across the ground-truth moving image, warped by the perturbation transformation matrix $T_0$, and transformed back by $T_t$. The FPE, denoted $D$, is the average Euclidean distance between the ground-truth and transformed fixed points:
$$D_t = \frac{1}{\#\{P_G\}} \sum_{p_i \in P_G} \| p_i - p'_i \|_2,$$
where $p_i$ and $p'_i$ are a pair of fixed points from the ground truth and the transformed moving image, respectively, and $\#\{P_G\}$ is the number of fixed points. The reward at each step is defined as the reduction in FPE, $R(s_t, a_t) = D_t - D_{t+1}$. The default number of fixed points is set to 100 (a grid of 10 rows and 10 columns), and the performance of different numbers of fixed points is discussed in Section V-C. The registration terminal is triggered when $D$ falls below a threshold.
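A sketch of the fixed-points error and a reward shaped as its per-step decrease; the grid layout follows the 10 × 10 default, while the sign convention of the reward is our reading of the text rather than a stated formula:

```python
import math

def grid_points(width, height, rows=10, cols=10):
    """A fixed 10x10 grid of coordinates, shared by all training pairs."""
    return [(width * (j + 0.5) / cols, height * (i + 0.5) / rows)
            for i in range(rows) for j in range(cols)]

def fpe(points_gt, points_warped):
    """D: average Euclidean distance between ground-truth and transformed points."""
    dists = [math.dist(p, q) for p, q in zip(points_gt, points_warped)]
    return sum(dists) / len(dists)

def fpe_reward(D_t, D_next):
    """Reward as the decrease in FPE after one action (assumed sign convention)."""
    return D_t - D_next

pts = grid_points(168, 168)                       # 100 fixed points
shifted = [(x + 3.0, y + 4.0) for x, y in pts]    # every point off by 5 pixels
```

Because the reward depends only on point coordinates, it needs no key-point detection and no pixel values, which is exactly what makes it cheaper than the LME reward.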

B. RL FRAMEWORK
1) A3C ALGORITHM
Our method is derived from A3C [9], a multi-threaded asynchronous variant of the popular actor-critic RL model. Instead of using experience replay as in [6], we asynchronously run multiple agents with the same network model on multiple instances of the registration environment. Our A3C model can run on a single machine with a multi-core CPU and performs much faster than most improved DQN methods [9]. Notably, our algorithm operates in a forward view via multi-step learning. Both the policy and the value function are updated after every $t_{max}$ steps or when the agent reaches a terminal state. The estimate of the advantage function is given by
$$A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v),$$
where $k$ is the number of learning steps, upper-bounded by $t_{max}$ in each thread. The parameters of the policy function and value function are updated by:
$$\theta \leftarrow \theta + \eta \left[ \nabla_\theta \log \pi(a_t|s_t; \theta) A(s_t, a_t; \theta, \theta_v) + \beta \nabla_\theta H(\pi(\cdot|s_t; \theta)) \right], \qquad \theta_v \leftarrow \theta_v - \eta \nabla_{\theta_v} \left( R_t - V(s_t; \theta_v) \right)^2,$$
where $\eta$ is the learning rate. $H(\cdot)$ is the entropy of the policy $\pi$, which is added to the objective function to improve the algorithm's robustness and stability, and the hyperparameter $\beta$ controls the strength of this regularization term. Typically, the policy network and value network share all layers except the output.
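The $k$-step advantage estimate can be sketched as a backward pass over one rollout segment; the reward and value numbers below are illustrative:

```python
def nstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A(s_t) = sum_i gamma^i * r_{t+i} + gamma^k * V(s_{t+k}) - V(s_t),
    computed backwards over an at-most-t_max-step rollout segment."""
    R = bootstrap_value            # V(s_{t+k}); zero if the segment ended at a terminal
    out = []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R          # accumulate the discounted k-step return
        out.append(R - v)          # advantage relative to the critic's estimate
    return list(reversed(out))

adv = nstep_advantages([1.0, 1.0], [0.0, 0.0], bootstrap_value=0.0)
```

Each worker thread would compute these advantages for its own segment and then apply the asynchronous gradient updates to the shared parameters.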

2) NETWORK ARCHITECTURE
The main network architecture of the proposed spatiotemporal agent is shown in Fig. 4. The input to our model is a 2-channel image comprising a CT and an MR slice. The network consists of a CNN part that extracts features from the received state and a ConvGRU part that encodes both spatial and temporal information from neighboring frames [11].
In this work, we mainly focus on the GRU, since it has shown performance superior to LSTM in many temporal sequence learning tasks with a lower memory requirement. However, the traditional GRU is insufficient for extracting spatial dependence. To this end, we replace the affine operations in both the input-to-state and state-to-state transitions of the GRU with convolutions, as shown in Fig. 6, and leverage the resulting ConvGRU architecture to model the image sequence. In our CNN-ConvGRU architecture, the CNN part is composed of 10 convolutional layers, each followed by an exponential linear unit (ELU) activation, as shown in Fig. 5. We perform downsampling directly with convolutional layers of stride 2. Note that layer normalization is employed after every downsampling layer; it improves both training efficiency and generalization performance for the recurrent part. Following the convolutional layers, the resulting feature maps for each recent time step are fed into the ConvGRU part to extract both temporal information and spatial correspondence. The network ends with a global average pooling layer followed by two branches of FC layers: one represents the policy function $\pi(\cdot|s_t; \theta)$ with a softmax, and the other represents the state-value function $V(s_t; \theta_v)$. We update the value network and the policy network simultaneously by combining Eq. 3 and Eq. 4 into a final loss:
$$L = L_\pi + c_v L_v,$$
where $L_\pi$ is the policy loss, $L_v$ is the value loss, and $c_v$ is a constant coefficient that balances the two losses.
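For a single time step, the combined loss $L = L_\pi + c_v L_v$ with entropy regularization can be sketched as follows; the coefficient values $c_v$ and $\beta$ here are illustrative defaults, not the paper's tuned settings:

```python
import math

def a3c_loss(action_probs, action_taken, advantage, value_pred, value_target,
             c_v=0.5, beta=0.01):
    """Combined objective L = L_pi + c_v * L_v for one time step.
    L_pi = -log pi(a|s) * A - beta * H(pi);  L_v = (R - V)^2."""
    log_p = math.log(action_probs[action_taken])
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    policy_loss = -log_p * advantage - beta * entropy
    value_loss = (value_target - value_pred) ** 2
    return policy_loss + c_v * value_loss

probs = [0.125] * 8   # uniform policy over the 8 registration actions
loss = a3c_loss(probs, 0, advantage=1.0, value_pred=0.0, value_target=1.0)
```

The entropy term keeps the policy from collapsing prematurely onto one action, which matters here because several actions can reduce the FPE by similar amounts early in an episode.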

C. MONTE CARLO ROLLOUT
After training, the registration agent applies the learned policy to the unseen test dataset. The registration process stops once the predicted state-value $v_t$ reaches the threshold $tr$, which is near the terminal reward set in the training phase. Nevertheless, in practice we have observed that the predicted transformation parameters often jitter around specific values, so that termination is hard to reach. To handle this nonstationarity, we adopt a Monte Carlo method, a simple extension of importance sampling, to simulate multiple search paths and improve the accuracy of the transformation matrix. Given a starting state $s_t$ whose state value $v_t$ is near $tr$, the agent simultaneously explores $N$ trajectories within a fixed search depth $D_{mc}$. More specifically, in each trajectory the action at each node is selected following the policy $\pi(\cdot|s_t; \theta)$, and the corresponding state-value is generated by the value network. Let $T_i$, parameterized by $[t_{xi}, t_{yi}, s_i, \theta_i]$, denote the final transformation matrix of trajectory $i$, and let $V_i = \sum_{j=t}^{t+D_{mc}} v_{ij}$ denote the total value of this trajectory, where $v_{ij}$ is the value of the $j$-th node in trajectory $i$. The final estimate of the transformation parameters is obtained by weighted averaging:
$$[t_x, t_y, s, \theta] = \sum_{i=1}^{N} \frac{V_i}{\sum_{j=1}^{N} V_j} \, [t_{xi}, t_{yi}, s_i, \theta_i].$$
By performing Monte Carlo sampling, the final predicted matrix is an approximation of the ground truth that avoids jittering around a particular value.
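The value-weighted averaging over the $N$ simulated trajectories described above can be sketched as follows; the two example trajectories are made up:

```python
def mc_rollout_estimate(trajectories):
    """Value-weighted average of the final transform parameters of N trajectories:
    each parameter = sum_i (V_i / sum_j V_j) * param_i."""
    total = sum(V for _, V in trajectories)
    keys = trajectories[0][0].keys()
    return {k: sum(params[k] * V for params, V in trajectories) / total
            for k in keys}

# Two hypothetical trajectories: (final parameters, accumulated value V_i).
trajs = [({"tx": 2.0, "ty": 0.0, "s": 1.0, "theta": 0.0}, 3.0),
         ({"tx": 4.0, "ty": 0.0, "s": 1.0, "theta": 0.0}, 1.0)]
est = mc_rollout_estimate(trajs)   # tx = (2*3 + 4*1) / (3+1)
```

Trajectories the value network rates more highly pull the average toward themselves, which is how the rollout damps the jitter of any single greedy path.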

IV. EXPERIMENTS
A. DATASET
We evaluate the effectiveness of our proposed framework on a 2D multimodal registration dataset. This experimental dataset was collected from 99 patients (male/female: 66/33; age range: 21-76 years) diagnosed with nasopharyngeal carcinoma, each with 100 paired 3D MR and CT slices. These patients underwent radiotherapy at West China Hospital, where they received a CT scan on a Siemens SOMATOM Definition AS+ system and an MRI scan on a Philips Achieva 3T scanner. The original resolutions of the CT images range from 0.88 × 0.88 × 3.0 mm³ to 0.97 × 0.97 × 3.0 mm³, and the original resolution of the MR images is 0.61 × 0.61 × 0.8 mm³. All patients were fully informed and consented to participate in the study. Our study protocol was fully approved by the radiology department of West China Hospital.
After removing invalid backgrounds, all CT and MR images were resampled to the same resolution of 1 × 1 × 1 mm³. In our experiment, the MR was treated as the fixed image and the CT as the moving image. In particular, the pre-registration of CT and MR pairs from the same patient was performed with 3D Elastix [14] using default parameters. In light of the strong correlation between adjacent slices, we selected only six slices out of the 100 for each modality to represent considerable variation between images. Since this paper mainly focuses on investigating the feasibility of using DRL for multimodal registration on 2D images, the structural smoothness of slices along the third dimension is not considered at the moment. Finally, our experiment obtained 594 aligned MR-CT pairs, with 474 image pairs from 79 patients as training data and 120 image pairs from the remaining 20 patients as testing data.

B. TRAINING AND TESTING
In each episode of the training procedure, the rigid-body perturbation of the moving image is randomly generated within a translation range of [−25, 25]; in total, the agent saw 192,000 pairs of training images. We use a discount factor of 0.99, an entropy coefficient of 0.01, and a bootstrap frequency of 32 steps to update the network weights. Our model is trained with the Adam optimizer at a learning rate of 0.00001 and adopts MSRA initialization [15] for all weights. An episode is terminated whenever the agent receives a terminal signal (FPE less than 1) or reaches the maximum number of steps (set to 500). We trained our agent for 2.5 days using 8 CPU cores and a Tesla P100 Pascal GPU.

FIGURE 7. Details of the proposed method in the testing procedure. ST agent denotes the spatiotemporal agent, tr is the threshold, and T_max is the maximum number of registration steps. The registration process is stopped when the current step T_t > T_max; otherwise, MC rollout is carried out when the state value V > tr.
We evaluate the method on two test datasets (E1 and E2), each consisting of 120 image pairs with different perturbation distribution ranges for the moving images. More specifically, the moving images of the E1 dataset are perturbed within the same scope as the training data, while the moving images of the E2 dataset are perturbed with a larger range: translation [−30, 30] for both x and y axes, rotation [−45°, 45°], and scaling [0.75, 1.25]. MC rollout is applied in all tests, with N = 20 simulated branches, search depth D_mc = 10, and stopping threshold tr = 10. The details of the proposed model in the testing procedure are shown in Fig. 7.

C. COMPARING WITH OTHER METHODS
We evaluated the efficiency of our proposed model by comparing it with several state-of-the-art methods on datasets E1 and E2, respectively. For most methods, we use the default parameter settings recommended in their corresponding papers or in the documentation of public registration toolkits. The Scale Invariant Feature Transform (SIFT) [33] is one of the most popular techniques for rigid image registration; we set a SIFT distance ratio of 0.8 to match key-points and use the Random Sample Consensus (RANSAC) algorithm to improve the results. For Elastix, Mattes mutual information with default settings is used. We use Elastix 2D as the registration gold standard, since the ground-truth moving images were generated by Elastix 3D in the preprocessing stage, as mentioned in Section IV-A. The pure-SL method rephrases image registration as a regression task, taking MR-CT pairs as input and outputting the transformation parameters $[t_x, t_y, s, \theta]$. For a fair comparison, the pure-SL model contains the same CNN architecture as our proposed RL model. Similarly, AIRNet [19] learns to directly predict the affine transformation matrix in a self-supervised manner; it uses a densely connected CNN structure composed of two pathways that share the same parameters to capture discriminative features between different images. Furthermore, we compared the proposed method with our previous work [16] and with another RL model based on the Deep Recurrent Q-Network (DRQN) [29]. Since DRQN is designed for partially observable games, we made minor changes to its code to fit our registration environment and dataset. We implemented AIRNet and Sun's work using the default settings and optimizations recommended by their authors. Note that SIFT and Elastix run on the CPU, while all DL-based and DRL-based models run on the same GPU as our proposed method.
In all experiments, we employ the target registration error (TRE) [20] as the quantitative measure, defined as the mean Euclidean distance between corresponding points of the predicted image and the ground truth:
$$\mathrm{TRE} = \frac{1}{N} \sum_{i=1}^{N} \| p_i - T \circ p'_i \|_2,$$
where $p_i$ denotes the $i$-th landmark point on the ground-truth moving image, $p'_i$ denotes the corresponding point in the moving image, $T$ denotes the predicted transformation matrix, $N$ is the number of points, and $\circ$ represents the alignment operator.
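The TRE computation can be sketched as follows; here the predicted transform is assumed to have been applied to the points already, so the function only averages the residual distances:

```python
import math

def tre(points_gt, points_pred):
    """Target registration error: mean Euclidean distance between corresponding
    landmark points on the ground-truth and predicted (already transformed) images."""
    return sum(math.dist(p, q)
               for p, q in zip(points_gt, points_pred)) / len(points_gt)

gt = [(0.0, 0.0), (10.0, 0.0)]
pred = [(3.0, 4.0), (10.0, 0.0)]   # first landmark off by 5 pixels, second exact
```

A TRE below 1 thus means the landmarks are, on average, within one pixel of their ground-truth positions.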

D. EVALUATION RESULTS
All methods were evaluated on the test datasets E1 and E2, and the quantitative accuracy of all methods is summarized in Table 1. All deep-learning-based methods are far better than SIFT, and all RL-based methods achieve higher registration accuracy than pure-SL and AIRNet. The proposed method surpasses Elastix 2D registration by a large margin and performs more robust and reliable 2D registration. With the convolutional recurrent architecture, the proposed model improves significantly over DRQN and our earlier RL model [16], which only leverage fully connected recurrent architectures. Moreover, as reflected in Table 1, our proposed method is the only approach with a TRE value smaller than one on dataset E1 and even on E2, the set with the larger perturbation range, indicating that the moving images were aligned to the fixed images with sub-pixel average error. Ours is also the only method whose performance on E2 is slightly better than on E1, which implies that the proposed model has the best generalization ability among all methods. A qualitative comparison of registration results is visualized in Fig. 8: the CT images registered by our method are more robust and accurate.

V. DISCUSSION
A. EFFECTIVENESS OF SPATIO-TEMPORAL NETWORK
Our previous work [16] incorporated a fully-connected long short-term memory (FC-LSTM) to extract correlated temporal features among consecutive frames. The FC-LSTM is a common structure in sequence learning for modeling long-term dependencies, while the GRU is a newer generation of RNN with similar performance but a simpler and lighter structure. In the context of image registration, spatial information is as important as temporal information and should be taken into account. Traditional fully connected recurrent architectures can hardly encode spatial features and are unable to model spatial dependencies. In view of this, we introduce the convolutional GRU (ConvGRU) into our model to take spatial information into consideration, which improves the accuracy of the registration results. Although a convolutional operation is used, ConvGRU has a smaller computational cost than most fully connected recurrent architectures [35]. Table 2 compares the proposed CNN-ConvGRU with CNN-GRU as well as a pure CNN architecture. Note that for a fair comparison of these networks, we use the same ten-layer CNN structure for all models, and all of them are trained with the LME reward during RL training.

B. MODEL ARCHITECTURE ANALYSIS
A CNN-ConvGRU architecture is proposed in this work, in which a deeper CNN structure is used to extract image features containing high-level contextual information, and the convolutional GRU network is leveraged to extract the hidden information among consecutive frames in our RL algorithm. It has been shown in many experiments that deeper networks lead to better results, since a larger receptive field and more semantic information are obtained. Experiments on whether a deeper architecture benefits our RL-based image registration are shown in Table 3: the registration result of the deeper CNN structure (10 layers) is significantly better than that of the shallow structure (3 layers) for all recurrent models. For further illustration, we also compared two models with an LSTM structure and a GRU structure, respectively. The results in Table 3 indicate that the model with the GRU structure is slightly better than the model with the LSTM structure in our registration scenario.

C. REWARD ANALYSIS
The reward function is an essential component in RL, and it directly affects the choice of actions. As mentioned in Section III-A.2, the commonly used reward functions intuitively rely on prior knowledge, such as the ground-truth transformation matrix. This section discusses how reward functions affect registration training and testing, and compares the proposed fixed points error (FPE) reward with the LME reward [16] as well as the matrix-based reward. The learning speeds of the three kinds of rewards are compared in Fig. 9. We notice that RL using the matrix-based reward converges slightly slower and is in fact very unstable. Training with the LME reward is a little faster than with the FPE reward. However, FPE-based training uses fixed coordinate points, eliminating the need to calculate SIFT points for all CT images; the FPE reward is thus more elegant and easier to extend to 3D registration considering the computational efficiency. Furthermore, the registration results shown in Table 4 indicate that the RL model using the FPE reward has the highest registration accuracy. We also analyzed the performance for different numbers of fixed points in FPE. The experiment shown in Table 5 compares three RL models using the FPE reward with 10, 100, and 400 fixed points, respectively. The result indicates that the model using 100 fixed points is the best choice for both E1 and E2. The reason may be that 10 fixed points are not sufficient to represent the textural information of the images to be registered, while 400 fixed points are overly dense for our images, which have a small resolution of 168 × 168 mm². Moreover, an RL model using a sparse reward is also compared: the sparse reward is set to +1 when the moving image is successfully registered (the episode is terminated) and zero otherwise. However, as reflected in Fig. 9, the sparse-reward model fails to converge in our experiment.
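A minimal sketch of the FPE-style reward for 2D rigid registration, assuming 2 × 3 transformation matrices and the convention that the reward is the reduction in fixed-points error after an action; the function names are illustrative, and the exact formulation is the one given in Section III-A.2:

```python
import numpy as np

def fpe(points, T_pred, T_gt):
    """Mean Euclidean error of the fixed points mapped by the predicted
    vs. the ground-truth 2D transform (both given as 2x3 matrices)."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    diff = pts_h @ T_pred.T - pts_h @ T_gt.T
    return np.linalg.norm(diff, axis=1).mean()

def fpe_reward(points, T_prev, T_curr, T_gt):
    """Reward = decrease in fixed-points error caused by the agent's action;
    positive when the action moves the image toward alignment."""
    return fpe(points, T_prev, T_gt) - fpe(points, T_curr, T_gt)
```

Because the points are fixed coordinates rather than detected keypooints, no SIFT extraction is needed at training time, which is the computational advantage noted above.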

D. MONTE CARLO ROLLOUT IN REGISTRATION INFERENCE
After training, the agent applies the learned strategy to register multimodal images. We consider the registration complete when the predicted state value v_t reaches a threshold tr, which is the same as the terminal reward in training. In the testing phase, however, the transformation parameters may jitter around specific values, making it hard to reach termination. To this end, we propose the MC rollout algorithm to handle this problem. In the MC process, the agent explores N trajectories simultaneously within a fixed search depth D_mc from a given state, as elaborated in Section III-C. Note that the MC rollout is only activated when the predicted state value nears tr. The comparison between RL models with and without MC is shown in Table 6; the performance of the model using MC is significantly improved. It is worth noting that MC sampling adds a computational burden but leads to more robust and accurate registration. Interestingly, we found in our experiment that the average registration time for the RL model using MC is shorter than for the one without MC. This is because the agent might not reach the terminal threshold tr on some testing images; without MC, the agent has to explore 500 steps on those images, which slows down the registration process.
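The look-ahead can be sketched as follows; `policy`, `step_fn`, and `value_fn` are placeholder callables, and a deterministic environment step is assumed, so this is an illustration of the rollout idea rather than our exact A3C inference code:

```python
import random

def mc_rollout(state, policy, step_fn, value_fn, n_traj=5, depth=3, seed=0):
    """Look-ahead inference: sample n_traj action sequences of length `depth`,
    score the resulting states with value_fn, and commit to the first action
    of the best trajectory, which suppresses jitter near termination."""
    rng = random.Random(seed)
    best_action, best_value = None, float("-inf")
    for _ in range(n_traj):
        s, first = state, None
        for _ in range(depth):
            a = policy(s, rng)       # sample an action from the policy
            if first is None:
                first = a            # remember the trajectory's first step
            s = step_fn(s, a)        # simulate the (deterministic) transition
        v = value_fn(s)
        if v > best_value:
            best_value, best_action = v, first
    return best_action
```

In our setting the rollout would only be triggered once the predicted state value approaches the threshold tr, so the extra sampling cost is paid only near the end of an episode.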

E. THE IMPORTANCE OF LAYER NORMALIZATION
Proper normalization can be a crucial factor in the effective training of deep neural networks. Batch normalization (BN) is adopted in most neural networks for computer vision problems to normalize inputs and accelerate convergence. However, the proposed RL model does not accept mini-batch inputs; it takes a single tensor composed of the stacked fixed and moving images. In this regard, we use layer normalization (LN) [21] instead, which normalizes all features in a layer for a single input. This work redefines image registration as a sequence learning task, which also makes LN a natural fit. The comparison of our RL model with and without LN is shown in Table 7. It is clear that LN plays an important role and provides a significant improvement in registration accuracy.
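For a single input, LN reduces to normalizing over all of that input's features, requiring no batch statistics at all. A minimal NumPy sketch, with scalar gain and bias for brevity (in practice these are learnable per-feature parameters):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer normalization for one input: normalize over all features of x,
    so no mini-batch is needed (unlike batch normalization)."""
    mu = x.mean()
    var = x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

This is why LN suits our setting: the statistics come from the single stacked fixed/moving tensor itself, so they are identical at training and test time.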

VI. CONCLUSION
In this paper, 2D registration is redefined as a 3D sequence learning task. We have presented a multimodal registration approach based on a spatiotemporal agent. The agent learns the registration policy using the A3C algorithm with an FPE reward. Both spatial and temporal information is taken into account in our network by leveraging both CNN and ConvGRU layers, which leads to enhanced performance in the registration task. Experimental results demonstrate that our approach outperforms several state-of-the-art methods. In addition, we analyzed the influence of the network structure, reward function, Monte Carlo rollout, and layer normalization on registration.
One direction for future work is to extend the current framework to 3D rigid registration; another is RL-based 2D deformable registration. For 3D rigid registration, the curse of dimensionality may cause over-fitting. To address this problem, we plan to incorporate residual blocks and attention mechanisms into our network to improve the learning process. For deformable registration, the large number of parameters in the deformation field is a major challenge, which makes the optimization process extremely difficult. We would like to use pre-trained networks to assist the RL learning and apply dimension reduction to the action space.
YOUBING YIN received the Ph.D. degree from The University of Iowa. He was a Senior Research Scientist of VIDA Diagnostics. He is currently the Vice President of research and development with CuraCloud Corporation. His research interests include the development and commercialization of machine learning and imaging-based solutions in healthcare. He is familiar with strict developing procedures in this highly regulated industry (FDA, CE, and CFDA). He was a Key Developer of an FDA approved quantitative lung imaging software, which received the Best Product Award from the European Respiratory Society.
KUNLIN CAO received the Ph.D. degree in electrical and computer engineering from The University of Iowa, USA. She was a Lead Scientist of the Biomedical Image Analysis Laboratory, GE Global Research Center. She has over ten years of research experience in medical image analysis, and extensively involved in image segmentation, image registration, feature extraction, functional information quantification, motion detection, and object localization/tracking for real-time image-guided applications. She was with CuraCloud Corporation, involved in a variety of research projects and product development in the field of a quantitative assessment of anatomical and functional changes toward precise disease diagnosis/prognosis/treatment planning using artificial intelligence technologies. She published over 60 articles in international journals and conferences in the medical image analysis area, which were cited more than 1000 times. Her current research interests include multidimensional image and signal processing, medical image analysis (CT, ultrasound, MR, and PET), artificial intelligence, and computer vision.