Improve the Resolution and Parallel Performance of the Three-Dimensional Refine Algorithm in RELION Using CUDA and MPI

In cryo-electron microscopy, RELION is a powerful tool for high-resolution reconstruction. Due to the complicated imaging procedure and the heterogeneity of particles, some of the selected particle images carry more disturbing information than others. However, in the current RELION, all particle images are treated equally. In our work, we extend RELION's model with one scalar parameter that scores the contribution of each particle according to the error between the experimental particle and its re-projection. This score down-weights potentially poor particles and hence accelerates convergence. In addition, RELION currently has no sophisticated memory management system, so fragmentation on the GPU grows with the iterations and eventually crashes the program. We designed a stack-based memory management system to guarantee the stability of RELION and to optimize memory usage. To further reduce memory usage, we developed a customized compressed data structure for the memory-demanding weight array. Finally, to speed up the GPU version of RELION, we propose two highly efficient parallel algorithms for weight calculation and weight selection. Experiments show that, compared with RELION, the optimized three-dimensional refine algorithm converges faster, the memory system avoids memory fragmentation, and a better speed-up ratio is obtained.

However, the imaging procedure in cryoEM is quite complicated and involves considerable randomness. In general, the noise and artifacts of particles in the same micrograph may differ greatly, so the quality of even the selected particles varies [12]. Particles with severe artifacts and noise ("bad particles") may negatively influence the reconstruction procedure, a problem that has not received enough attention. In our work, inspired by the training procedure of neural networks, we extend the model parameters of RELION with one scalar parameter and update it based on the idea of back-propagation. In our method, the scalar of "bad particles" decreases with the iterations, while that of "good particles" increases. This modification speeds up convergence and boosts the final resolution.
The expectation-maximization (EM) algorithm is an iterative method for finding the best hypothesis for the probability distribution of hidden parameters in statistical models. However, to estimate the probability distribution of the orientation parameters for each particle (in the expectation step), RELION needs to compare each particle image with every calculated projection [13], leading to a huge computational load [14] and large memory requirements. Taking the dataset used in the tutorial of RELION 2.0 [15], which contains 6,496 β-galactosidase 128*128 particles, as an example: even though this dataset is quite small, the 3D refinement stage requires more than 2 GB of memory per process and 40 CPU hours.
In addition, RELION adopts a "modified version of the expectation-maximization algorithm". In its expectation step, RELION performs a two-pass procedure in which the second pass uses a finer sampling. At the end of the first pass, RELION selects a sub-domain of all orientations for each particle image. The selection rule keeps the orientations with the highest probabilities whose sum reaches 99.9 percent of the total probability over all orientations. One naive solution for this selection step is to first sort all probability values and then sum them sequentially from largest to smallest until their sum reaches 99.9 percent of the total. This method is obviously sequential and has high data dependency, and the time spent on sorting and summation increases dramatically as the number of orientations grows. To reduce the time cost of this step, we drew inspiration from parallel top-k selection and stream compaction algorithms and designed an efficient parallel top selection algorithm.
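The naive sort-and-sum selection described above can be sketched as follows (a simplified serial model using NumPy; the function name and the default fraction argument are illustrative, not RELION's actual code):

```python
import numpy as np

def naive_top_sum_select(weights, fraction=0.999):
    """Naive sequential selection: sort all probabilities, then sum
    from largest to smallest until the running sum reaches the given
    fraction of the total. Returns the indices of the kept orientations."""
    order = np.argsort(weights)[::-1]            # descending order
    cumulative = np.cumsum(weights[order])
    threshold = fraction * weights.sum()
    # smallest prefix whose sum reaches the threshold
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return order[:k]
```

Both the full sort and the sequential prefix scan are what the parallel algorithm in Section 5 is designed to avoid.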
To reduce the running time, many cryoEM software programs [16], [17], [18], [19], [20] have been parallelized on GPUs. The reviews [21], [22], [23] have introduced the advances of this parallelized software. For example, cryoSPARC utilizes the GPU to speed up reconstruction and 3D classification. A GPU parallel version of RELION has also been proposed [24]. Also, the software THUNDER [19] accelerates image processing and the calculation of re-projections with GPUs. However, most parallel software in cryoEM overlooks the stability and robustness of memory-related operations, which depend heavily on GPU hardware characteristics and the memory access pattern. For memory-intensive applications like RELION, memory-related code dramatically influences stability, robustness and performance. One important aspect is memory fragmentation. Frequently and randomly allocating and de-allocating memory on the GPU may divide contiguous memory into several pieces of discontinuous memory; we call these small pieces "fragments". When the program tries to allocate a large piece of memory, the allocation can still fail with an "out of memory" error because of the fragments, even though the total available memory is sufficient. However, no sophisticated memory management system exists to prevent this fragmentation problem, and no stable memory management strategy has been introduced into RELION. For example, when processing the dataset EMPIAR-10028 with a particle size of 360*360 on an NVIDIA K20c GPU, an "out of memory" error occurs during later iterations. To improve the stability and robustness of RELION, especially for large datasets, we developed a stabilizing memory management system. In this system, we utilize the characteristics of iterative methods to avoid fragmentation. This memory management system can also be applied to other iterative methods.
Moreover, even with a fragment-free memory management system, an "out of memory" error will still be reported when the total memory consumption of RELION exceeds the GPU memory. Reducing the memory consumption would alleviate this issue fundamentally. As mentioned, the weight array is the data structure that demands the most memory. We analyzed the structure of the weight array and determined that its memory requirements can be reduced by carefully redesigning it.
In summary, in this work, to speed up convergence and boost the final resolution, we propose an extended 3D refine algorithm that augments the model parameters with one scalar parameter updated based on the idea of back-propagation. In addition, we designed a memory-efficient, stabilizing memory management system to guarantee the robustness of our program and the efficiency of GPU memory usage. To further reduce memory usage, we developed a novel data structure. Finally, we developed a parallel weight calculation algorithm and a top selection algorithm to speed up the computation.
Our paper is organized as follows: in Section 2, we introduce related work; in Section 3, we show the extended 3D refine algorithm; in Section 4, we introduce our stabilizing management system and memory-efficient data structure; in Section 5, we present the weight calculation parallel algorithm and the top selection algorithm. The results of the experiments are presented in Section 6. We then conclude the paper.

RELATED WORK
In this section, we introduce the related work from three aspects: statistical methods used in cryoEM, the basic workflow of the refine algorithm of RELION and efforts to accelerate it, and parallel top-k selection algorithms.

Statistical Methods
As mentioned, the structure determination method used in RELION is a Bayesian-based method and is therefore a statistical method. In general, the method of RELION is less sensitive to the starting model and permits recovery of the underlying signal from noisy data compared to traditional cross-correlation-based methods [25], [26], [27]. In general, statistical methods try to find the best hypothesis for the distribution of parameters in a statistical model given the observations. There have been numerous contributions beyond RELION applying this method based on different kinds of image formation models in cryoEM [25], [28], [29].
For example, in a previous work [28], the data model was $X_i = P_{i\phi} A_{k_i} + \sigma G_i$ (where $A_{k_i}$ indicates the underlying image, $P_{i\phi}$ is the transformation associated with the $i$th image, $\sigma$ is the standard deviation coefficient, and $G_i$ indicates independent noise). The model aims to find the most likely set of parameters $\Theta$ describing the structural data by optimizing the corresponding log-likelihood function. The reconstruction algorithm in RELION is also a typical statistical method. It is solved by a modified version of the expectation-maximization algorithm based on the work of [13], which formulates the image formation model in Fourier space as $X_{ij} = \mathrm{CTF}_{ij} \sum_{l=1}^{L} P^{\phi}_{jl} V_l + N_{ij}$, where $X_i$ is the projection image obtained experimentally, $\mathrm{CTF}_i$ is the contrast transfer function of the $i$th image, $N_{ij}$ is the noise component, and $V$ is the underlying structure. The operation $\sum_{l=1}^{L} P^{\phi}_{jl} V_l$ extracts a slice from $V$, which is equivalent to a real-space projection. In RELION, the parameter set $\Theta$ is $\{V_l, \sigma, \tau\}$, where $\sigma^2$ and $\tau^2$ are the variances of the noise and the signal components.
In this image formation model, the orientation parameters of the projection images are unknown, and we use $\Gamma^{(n)}_{i\phi}$ to denote the posterior probability of orientation assignment $\phi$ for the $i$th image in the $n$th iteration ($i \in [0, I]$). Here, the orientations $\phi \in [0, F]$, with $\phi = (rRot, rTilt, rPsi, tX, tY)$, comprise three rotation parameters ($rRot$, $rTilt$, $rPsi$) and two translation parameters ($tX$, $tY$).

Efforts to Accelerate RELION
GeRelion [30] is a parallel GPU version of RELION 1.4. To port the program to the GPU, GeRelion proposes a four-level scheduling model. In addition, the weight array is assumed to be sparse and is stored in compressed sparse row (CSR) format: one continuous vector stores the values, and an auxiliary vector records the indexes into the weight array. Because of this auxiliary vector, the memory consumption of this structure may exceed that of the original structure when the weight array is not sparse. Based on this consideration, we developed a new data structure that reduces the memory cost.
RELION [24] also proposed a parallel GPU version of its expectation-maximization algorithm. In contrast to the CPU version, the parallel version runs in single-precision mode by default to reduce memory consumption. In addition, during the expectation step, accessing the re-projection images is time consuming. RELION stores re-projection images in texture memory to make use of its on-chip cache, which reduces the memory access time. However, this strategy is not always effective: as the size of the re-projection image increases, it may exceed the size of the texture cache, causing frequent data swapping between the cache and texture memory and heavily diminishing the cache's benefit. Based on this observation, we designed a different data access strategy to speed up memory access.
Both GeRelion [30] and RELION [24] provided detailed analyses of the performance of each major step. According to their analyses, the expectation step is the most time-consuming step. The most memory-demanding data structure is the weight array G.

Parallel Top-k Selection Algorithms
In our work, we developed a parallel top selection algorithm for RELION based on the idea of parallel top-k selection algorithms; here we review their current state. The top-k problem asks for the top K elements of an unordered list [31]. One general class of algorithms is selection by sorting: sort the entire list and then query the top K elements. Although the top-k selection problem differs from the top selection problem in RELION, the current implementation of RELION follows the same basic idea.
Another kind of algorithm is to select based on partitions. Many parallel selection algorithms are based on partitions.
One advantage of such algorithms is that they avoid the sorting operation. The basic steps of "Bucket Select" are as follows [32]. First, "Bucket Select" determines the size of the buckets, assigns each element to a bucket, and then identifies the bucket containing the kth order statistic. It then projects the entries in that bucket onto a new set of buckets. After a number of iterations, when the bucket width equals 1, the kth element is found. Fig. 1 shows an example of "Bucket Select". It uses the linear projection function in Formula (2.3) to define the bucket boundaries.
"Bucket Select" has two phases. In Phase I, it reduces the workload by creating a new vector containing the candidates. In Phase II, the workload is reduced by redefining the max and min.
"Bucket Select" avoids the sorting operation. However, its efficiency decreases with the bucket width, and the method only finds the kth element; one more step is still needed to obtain the top K elements.
So in our method, we first added a sort over a small candidate set to speed up the convergence of the bucket boundaries. We then implemented one more kernel to extract the qualifying data, based on the work of [33].
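The "Bucket Select" procedure of [32] can be modeled with a short serial sketch (a hypothetical simplification for illustration; the real algorithm runs the bucketing phase in parallel on the GPU, and the bucket count here is arbitrary):

```python
import numpy as np

def bucket_select_kth_largest(data, k, n_buckets=8):
    """Sketch of 'Bucket Select': split [min, max] into uniform buckets
    via a linear projection, count elements per bucket from the top down
    to find the bucket containing the k-th largest element, then recurse
    into that bucket until one candidate remains."""
    candidates = np.asarray(data, dtype=float)
    rank = k                                   # rank among current candidates
    while len(candidates) > 1:
        lo, hi = candidates.min(), candidates.max()
        if lo == hi:
            return lo                          # all remaining values are equal
        width = (hi - lo) / n_buckets
        # bucket index via the linear projection (value - lo) / width
        idx = np.minimum(((candidates - lo) / width).astype(int), n_buckets - 1)
        counts = np.bincount(idx, minlength=n_buckets)
        # walk buckets from the largest values downward
        for b in range(n_buckets - 1, -1, -1):
            if rank <= counts[b]:
                candidates = candidates[idx == b]
                break
            rank -= counts[b]
    return candidates[0]
```

Note that the sketch only returns the kth element itself, mirroring the limitation discussed above: a further compaction pass is needed to gather the actual top K elements.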

EXTENDING THE REFINE ALGORITHM TO ESTIMATE THE RELIABILITY OF PARTICLE IMAGES
In general, we use statistical methods to find the most likely parameter set $\Theta$ for the model constructed from the observed data and prior information. In work [28], the authors proposed to extend the model parameter set $\Theta$ with a multiplicative and an additive factor for the signal in each experimental image to normalize the noise. In RELION 2.0, the variances of the signal and the noise are part of the statistical model. In RELION 3.0, per-particle CTF correction and beam-tilt correction were introduced. These parameters provide a more detailed estimation for each particle and thereby improve the reconstruction resolution. However, the data available for each particle are too limited to achieve accurate statistical estimation. Besides, in the early iterations, the estimated orientations of many particles may be far from accurate, so the corresponding estimates of the signal, noise and CTF parameters are also inaccurate. These parameters can be very helpful for high-resolution refinement, but when they are inaccurate, they slow down the convergence. Given these shortcomings, we estimate each particle in a more general way. Due to the heterogeneity of particles, random noise and other influencing factors, the quality of the particle images varies greatly. Some "bad" particles may introduce interference into the reconstruction result. If a more generalized scalar parameter can be used to estimate the reliability of each particle image and its related parameters, it can restrain the influence of "bad particles" and enhance the contribution of "good" ones. This scalar parameter would also help accelerate the convergence of the expectation-maximization algorithm, especially in the early iterations. We therefore design this scalar parameter to reflect the reliability of the particle images and their related parameters based on an understanding of the iterative procedure of the expectation-maximization algorithm.
As we know, during the iterations, the estimates of the related parameters should gradually approach the actual condition as the resolution increases, and the re-projections generated from the reconstruction result and these estimated parameters should become more and more similar to the actual particle images. For particles with low reliability, however, this does not hold: these particles do not contribute positively to the reconstruction, and the reconstruction may gradually become more and more different from them. This shows up in the error between the re-projection and the particle image.
Based on this observation, we designed a concrete formula for this scalar parameter. For each particle, we store its best orientation and the corresponding error value $|X_i - \mathrm{CTF}_i \sum_{l=1}^{L} P^{\phi'}_{jl} V_l|^2$ between the particle and its re-projection. After one iteration, the error value of a particle changes; if the error gets smaller, the particle has made a positive contribution, and vice versa. This yields Formula (3.1). The second and third formulas are the same as RELION's formulas in work [4], but the first formula introduces the scalar parameter $s_i$ to estimate the reliability of particle images. The fourth formula gives the concrete form of this scalar parameter, in which $err^{(n)}_i$ is calculated from $diff^n_i$, the difference between the re-projection and the particle image in the $n$th iteration. When $diff^n_i = diff^{n+1}_i$, $s_i$ equals 1, which corresponds to the original version. When $diff^n_i > diff^{n+1}_i$, the particle has made a positive contribution, and the corresponding scalar parameter $s_i$ is larger than 1. Since the scalar parameter $s_i$ acts as a relaxation factor for each particle, we normalize its value with the sigmoid function [34]; in our function, the range of $s_i$ is [0.75, 1.25].
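As an illustration of the properties described above ($s_i = 1$ for an unchanged error, $s_i > 1$ for a shrinking error, sigmoid squashing into [0.75, 1.25]), a plausible sketch of the update could look as follows; the gain factor and the exact form of the relative error are assumptions, not the paper's Formula (3.1):

```python
import math

def update_reliability(diff_prev, diff_curr, lo=0.75, hi=1.25, gain=1.0):
    """Hypothetical sketch of the per-particle reliability scalar s_i.
    The relative decrease of the re-projection error between iterations
    n and n+1 is squashed with a sigmoid so that s_i lies in [lo, hi]:
      - error unchanged -> s_i = 1 (same behavior as the original RELION)
      - error shrinks   -> s_i > 1 ("good" particle, up-weighted)
      - error grows     -> s_i < 1 ("bad" particle, down-weighted)"""
    # relative improvement of the error (diff^n_i vs diff^{n+1}_i)
    err = (diff_prev - diff_curr) / max(diff_prev, 1e-12)
    sigmoid = 1.0 / (1.0 + math.exp(-gain * err))  # in (0, 1), 0.5 at err = 0
    return lo + (hi - lo) * sigmoid                # maps 0.5 to exactly 1.0
```

The sigmoid bounds keep the relaxation gentle, so a single noisy iteration cannot drive a particle's weight to zero.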
The general workflow of the modified algorithm is shown in Fig. 2. In the expectation step (shown in the upper left corner of Fig. 2), computer-generated projections (labeled "re-projection") of the structure are compared with the experimental images (labeled "real particle image") to acquire the probability distribution over the orientations of the images.
In the maximization step, the 3D structure is updated by combining the prior information (lower left corner of Fig. 2). According to the updated 3D structure $V$, we can update the variance of the signal $\tau$. Then, by comparing the new re-projections with the particle images, we update the variance of the noise $\sigma_{ij}$. Also, by comparing the real images with their re-projections at the best orientations of iterations $n$ and $n+1$, we update the scalar parameter $s_i$.

MEMORY OPTIMIZATION STRATEGIES
To optimize the memory usage of this program, we first designed a new data structure to reduce the memory consumption of the weight array. Then, to ensure the stability of RELION, we proposed a stack-based memory scheduling algorithm that is in charge of the memory operations between the CPU and GPU.

Customized Compressed Data Structure
As we mentioned previously, as the number of sampling points increases, the memory consumption of the weight array grows dramatically, so reducing it is of great importance. Because RELION adopts a modified expectation-maximization algorithm to iteratively optimize the results, the weight array is used twice per iteration. In the first pass of an iteration, RELION calculates the difference between each particle and each sampled re-projection, which fills the weight array densely. In the second pass, however, only part of the sampling points are considered, making the weight array relatively sparse.
One simple method to reduce the memory consumption of the weight array is the classical Compressed Sparse Row (CSR) structure. Instead of storing every element of the weight array, CSR stores only the non-zero elements and their corresponding subscripts in each row. In general, an array is considered sparse when the number of non-zero elements (n) is less than half the total number of elements (N), i.e., n < 0.5N. However, the weight array in RELION does not always meet this criterion. For example, in the first several iterations, due to the low resolution of the estimated parameters, the weight array tends to be denser than in later iterations. Under this condition, the memory consumption of the CSR structure may exceed that of the original structure, which would damage performance.
To cope with this issue, we designed a customized compressed data structure. The sparse weight array is generated in the second pass, during which only sampling points with high probability are considered with fine sampling steps, and one sampling point would generate 32 fine sampling points in the second pass. The weight of these 32 sampling points is stored successively. Thus, in contrast to the general sparse arrays, as shown in the top half of Fig. 3, the non-zero elements of the sparse weight array are partially continuous. Making use of this trait, we can reduce the memory cost for the subscript. In the bottom half of Fig. 3, instead of recording the column subscript of each non-zero element, we only store the index of the first non-zero elements in the index array.
In Fig. 3, we list the details of this data structure, which consists of three parts. The first part is the "Value Array", which stores all of the non-zero elements of the weight array. The "Indices Array" stores the subscript of the first element of each run of 32 consecutive non-zero elements. Finally, the "Row Point Array" indicates the start position of each particle's non-zero elements in the "Value Array".
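A minimal sketch of this three-array layout, assuming runs of 32 consecutive non-zeros as described above (the function name and the NumPy encoding are illustrative, not the CUDA implementation):

```python
import numpy as np

RUN = 32  # each coarse sampling point expands to 32 fine points in pass 2

def compress_weights(dense, n_particles):
    """Encode a (n_particles, N) weight array whose non-zeros occur in
    RUN-aligned runs. Unlike CSR, which stores one column subscript per
    non-zero, only the first subscript of each run is recorded."""
    values, indices, row_ptr = [], [], [0]
    for p in range(n_particles):
        row = dense[p]
        for start in range(0, row.shape[0], RUN):
            block = row[start:start + RUN]
            if block.any():                    # a run of fine sampling points
                values.extend(block.tolist())  # "Value Array"
                indices.append(start)          # "Indices Array": first subscript only
        row_ptr.append(len(values))            # "Row Point Array"
    return np.array(values), np.array(indices), np.array(row_ptr)
```

The subscript overhead thus shrinks by a factor of 32 relative to CSR for the same set of non-zeros.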
For each particle, we can compare the compression ratios of CSR and our structure. Let the total number of elements be N and the number of non-zero elements be n. For each particle, size_original, size_compressed and size_new denote the memory cost in bytes for the original version of RELION, CSR and our method, respectively. Formula (4.1) gives their concrete memory costs, where size_value and size_subscript denote the number of bytes used for one non-zero element and one subscript. Based on these formulas, we conclude that when n > 0.67N, the memory consumption of CSR exceeds that of the original version, whereas the memory consumption of our method does not exceed the original version unless n > 0.98N.
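The three cost formulas can be written out directly. Assuming 8-byte values and 4-byte subscripts (an assumption consistent with the stated break-even points, since 8/(8+4) is approximately 0.67 and 8/(8 + 4/32) is approximately 0.98):

```python
def memory_cost(N, n, size_value=8, size_subscript=4, run=32):
    """Per-particle memory cost in bytes of the three layouts in
    Formula (4.1): dense original, CSR, and the run-compressed structure
    (one subscript per run of `run` consecutive non-zeros)."""
    original = N * size_value
    csr = n * (size_value + size_subscript)
    ours = n * size_value + (n // run) * size_subscript
    return original, csr, ours
```

For example, at n = 0.68N the CSR layout is already larger than the dense original, while the run-compressed layout is still far smaller.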

Stack-Based Stabilizing Management System
RELION adopts a modified version of the expectationmaximization method to perform reconstruction. During this procedure, numerous parameters (such as the particle images and the current resolution) need to be taken into account. These parameters become different variables used on the CPU and GPU. In addition to the large number of variables, the memory requirements of some of these variables vary during the iterations according to the sampling rate and resolution, among other factors. Consequently, the memory management system needs to handle not only the large number of variables but also the variety of memory requirements of these variables.
However, there is no fully automatic memory management system for the GPU, and native GPU memory allocation is essentially random. Repeated allocation and de-allocation introduce memory fragmentation, which splits the memory space into many small pieces. In the later iterations, these fragments cause memory allocation failures and crash the program. To manage these memory operations, RELION uses a linked list to manage its GPU global memory: each node represents one piece of consecutive memory, and adjacent free pieces are merged into one node. Thus, the linked list can reduce fragmentation to a degree. However, it cannot prevent the generation of fragments: when available memory is separated by memory still in use, the fragmentation issue persists.
Importantly, the GPU memory allocation operation and the kernel running on the GPU are asynchronous. This property leaves us more freedom to manage the GPU memory. One important property of the expectation-maximization algorithm used by RELION is that it is an iterative method. In contrast to direct methods, iterative methods update the estimated parameters based on the estimation generated from previous iterations. Consequently, the algorithm is processed in one round after another. Many of the memory operations within the iteration are predictable and repeatable. For example, the allocation of projection images is only needed at the beginning of each iteration, and de-allocation is performed at the end of each iteration.
In addition, within one iteration, hundreds of thousands of particles are processed, but the basic processing procedure of each particle is similar. For example, each of the particle images needs to be transformed from real space to Fourier space first. Therefore, at the beginning of processing one particle, we need to allocate the memories for the particles in Fourier space and free them at the end of processing one particle.
Based on the introduction above, we can conclude that the memory operation can be managed in groups based on their life cycle: global scope, iteration scope, and per particle scope. One example is shown in Fig. 4. At the beginning, variables needed by the whole program will be allocated. Then, at the beginning of each iteration, variables in the iteration scope will be allocated. Variables such as the particle images in Fourier space will then be allocated for each particle. Finally, at the end of their life cycle, the particles will be freed.
Another consideration is that, under some conditions, the memory requirement exceeds the total available memory on the GPU. In some cases, we can reduce the requirement by reducing the number of particles processed at one time. Our program also considers the extreme condition: within the processing of one particle, the particle image needs to be compared with many re-projection images from different orientations, and when the size of one re-projection grows, the total memory requirement grows rapidly. Therefore, we add one more group for the orientations. When GPU memory is limited, we can process all the re-projections in several batches, with the corresponding variables allocated at the beginning of each batch and de-allocated at its end. This group is shown in Fig. 4. Based on these rules, we can build a stack-based memory management system that thoroughly avoids memory fragments. This memory management system can also be used in other iterative methods.
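The scope-based LIFO discipline described above can be sketched as a minimal stack allocator (offsets stand in for GPU addresses; a real implementation would wrap cudaMalloc and hand out slices of one pre-allocated arena):

```python
class StackAllocator:
    """Sketch of the stack-based memory management idea: allocations are
    grouped by life cycle (global, iteration, per-particle, orientation
    batch) and freed in strict LIFO order by popping a scope, so no holes
    (fragments) can form between live allocations."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0          # next free offset
        self.scopes = []      # stack of saved tops, one per open scope

    def push_scope(self):
        self.scopes.append(self.top)

    def alloc(self, nbytes):
        if self.top + nbytes > self.capacity:
            raise MemoryError("out of memory")
        offset = self.top
        self.top += nbytes
        return offset         # contiguous, immediately after the last block

    def pop_scope(self):
        # frees everything allocated since the matching push_scope at once
        self.top = self.scopes.pop()
```

Because a pop returns the stack top to exactly where the scope began, the freed region is one contiguous block that the next scope reuses from its start, which is why fragmentation cannot accumulate across iterations.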

PARALLEL OPTIMIZATION METHODS
In this section, we speed up the GPU parallel algorithm of RELION by proposing two highly efficient parallel algorithms: the in-kernel top-sum selection parallel algorithm and the weight calculation parallel algorithm.

In-kernel Top-Sum Selection Parallel Algorithm
As introduced in Section 2.3, in the field of computing technology, there is a traditional problem called "top-k", which involves selecting the top K (number K is constant) large or small elements from an unordered list. Considerable efforts have focused on the classic "top-k" problem in parallel.
In RELION, a similar but more complicated problem occurs. Because RELION implements a modified version of the adaptive expectation-maximization algorithm, it first adopts a coarse grid to estimate the contribution of the orientations. Then, a sub-domain of orientations $\phi$ whose total calculated probability exceeds a fraction (99.9 percent) of the sum of all probabilities is retained. At the end of the first pass, we need to find the stopping point at which the sum of the values larger than the stopping point barely exceeds 99.9 percent of the total probability. Instead of selecting the top k largest elements, we need to select the largest elements whose sum reaches 99.9 percent of the total. We define this problem as "top-sum" selection. To the best of our knowledge, no effective parallel solution has been proposed. To acquire the correct stopping point, the sequential version of RELION first allocates a block of memory to store the weight array of one particle and eliminates the zero elements from the array. RELION then sorts the array. Finally, the elements are summed in descending order one by one until the sum exceeds 99.9 percent of the total probability. The basic idea of the existing parallel version is similar: the data are sorted, and the elements are then selected sequentially. This approach has two weaknesses. First, it requires a lot of memory to store a complete copy of the weight array on the GPU; as discussed above, the size of the weight array may increase significantly in some iterations, while GPU memory is quite limited. Second, the sorting operation is time-consuming and unnecessary: sorting on the GPU requires many synchronization and write operations, and since we only need the stopping point, sorting can be avoided.

Algorithm 1. In-kernel Top-Sum Algorithm
Based on the considerations above, we propose our own parallel method guided by the following observations: 1. We should not sort all of the data. 2. To obtain the partial 99.9 percent of the total probability, the data must be partially ordered. 3. During the process of summation, the information about partial sums will be lost. With these considerations, we propose a parallel top-sum algorithm. In general, this algorithm has four basic steps (1. bucket select, 2. locate bucket, 3. select data, 4. use bucket sort to obtain the stopping point). We show the basic workflow of our method in Fig. 5 and summarize the in-kernel top-sum algorithm in pseudo code in Algorithm 1. In Algorithm 1, Lines 5 to 10 show Step 1, and Lines 11 to 13 show Step 2. Step 4 is shown in Lines 18 to 19. Algorithm 2 shows the details of Step 3. First, in the "bucket select" step, we choose some values as pivots $(P_1, P_2, P_3, \ldots, P_{m+1})$ to split the data into buckets $(B_1, B_2, \ldots, B_m)$. For example, if $w_i$ is larger than pivot $P_3$ and smaller than pivot $P_4$, then $w_i$ belongs to bucket $B_3$. We select these pivots uniformly: $P_j = j \cdot (\max(w_i) - \min(w_i)) / m + \min(w_i)$. In practice, we use one thread per $w_i$ in parallel, as shown in the left part of Fig. 5. First, we locate the bucket that $w_i$ belongs to. Then, we record the sum ($S[m]$) and count ($C[m]$) of each bucket. Finally, we convert the sums of the buckets into a prefix sum ($Ps[m]$).

Algorithm 2. Select Candidates within One Bucket
With the prefix sum array ($Ps[m]$), we can determine which bucket ($B[t]$) contains the stopping point. If the number of elements in that bucket is less than the total number of threads $tNum$, we proceed to Step 3. Otherwise, we repeat Step 1 and Step 2.
In Step 3, we select the candidates ($w_i$s) in bucket $B[t]$. In contrast to traditional parallel algorithms, because the number of candidates is less than $tNum$, the shared memory array $S$ can store all candidates. We use a hash algorithm to load these candidates into shared memory. In Step 4, we sort these candidates with a bitonic sort so that we can locate the precise stopping point.
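The four steps can be modeled serially as follows (a sketch of the logic only; the bucket count, the candidate limit and the NumPy formulation are illustrative, and the GPU version performs the bucketing, prefix sums and bitonic sort in-kernel):

```python
import numpy as np

def top_sum_threshold(weights, fraction=0.999, n_buckets=16, max_candidates=64):
    """Serial model of the in-kernel top-sum algorithm: (1) bucket the
    weights, (2) locate the bucket holding the stopping point using
    per-bucket sums accumulated from the top, (3) keep only the
    candidates in that bucket, (4) sort the small remainder to find the
    exact stopping value. Returns the stopping value: every weight >= it
    is kept."""
    weights = np.asarray(weights, dtype=float)
    target = fraction * weights.sum()
    candidates = weights[weights > 0]
    covered = 0.0                        # probability mass above the current bucket
    while len(candidates) > max_candidates:
        lo, hi = candidates.min(), candidates.max()
        if lo == hi:
            break
        width = (hi - lo) / n_buckets
        idx = np.minimum(((candidates - lo) / width).astype(int), n_buckets - 1)
        sums = np.bincount(idx, weights=candidates, minlength=n_buckets)
        acc = covered
        for b in range(n_buckets - 1, -1, -1):   # walk buckets top-down
            if acc + sums[b] >= target:
                covered = acc                    # mass strictly above bucket b
                candidates = candidates[idx == b]
                break
            acc += sums[b]
    ordered = np.sort(candidates)[::-1]          # step 4: sort only the remainder
    cum = covered + np.cumsum(ordered)
    k = int(np.searchsorted(cum, target))
    return ordered[min(k, len(ordered) - 1)]
```

Keeping the running mass `covered` of the discarded upper buckets is what preserves the partial-sum information mentioned in observation 3 above.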

Weight Calculation Parallel Algorithm
As we mentioned, RELION compares each particle image with the re-projection images and then estimates the particle orientations based on the differences. One important function used for this comparison is getAllSquaredDifferences(). In this function, we calculate the squared difference of every component of the particle image and the re-projection image and reduce them to a weight value indicating the difference. Fig. 6a illustrates this comparison, where $w_{i\phi}$ indicates that the $i$th particle image is compared with the re-projection at orientation $\phi$. Equation (5.1) describes this procedure: $w_{i\phi} = \sum_k \big( (\mathrm{Re}(p_{\phi k}) - \mathrm{Re}(X_{ik}))^2 + (\mathrm{Im}(p_{\phi k}) - \mathrm{Im}(X_{ik}))^2 \big)$, where $X_{ik}$ is the $k$th component of the $i$th particle image and $p_{\phi k}$ is the $k$th component of the re-projection at orientation $\phi$. One detail worth noting is that the first iteration differs from Equation (5.1) in that the difference is computed with normalized cross-correlation. However, both cross-correlation and squared differences consist of two parts: comparing each component and reducing to one weight result.
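Equation (5.1) amounts to the squared distance between the two complex Fourier-space images, which can be sketched as:

```python
import numpy as np

def squared_difference_weight(proj, particle):
    """Weight w_{i,phi} from Equation (5.1): the sum over components k of
    (Re(p_k) - Re(X_k))^2 + (Im(p_k) - Im(X_k))^2 for one particle image
    X and one re-projection p, both complex Fourier-space arrays."""
    diff = proj - particle
    return float(np.sum(diff.real ** 2 + diff.imag ** 2))
```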
This procedure is shown from one particle's perspective. As illustrated in Fig. 6b, we need to calculate the weight for each re-projection. In this whole procedure, as shown in Fig. 6c, each re-projection needs to be compared with each particle image.
To calculate one weight value w_if in parallel on the GPU, the most straightforward method is to load one whole particle image and one re-projection in each thread. If we use size_image to indicate the size of the image, calculating one weight value requires size_image * size_image global memory accesses, which makes the performance of this algorithm rely heavily on memory access speed. On the GPU, global memory access is roughly 100 times slower than shared memory access [35]. Thus, if we can replace these global memory accesses with shared memory accesses, the running time can be greatly reduced.
However, one limitation of shared memory is that its size is quite limited. Based on these observations, we designed a new weight calculation parallel algorithm; Fig. 7 shows the details. First, we load only a part of several particle and re-projection images. In the upper left corner of Fig. 7, the white region marks the part loaded from global memory to shared memory. Using the data marked in white, we can calculate only part of each w_if; the benefit, however, is that we can calculate these partial values for many different w_if at once. In the example shown in Fig. 7, the first part of the weight values is S_1 = {w_11-1, w_12-1, ..., w_21-1, w_22-1, ..., w_31-1, w_32-1, ..., w_44-1}. After loading the second part of the data (in light gray) into shared memory, we can acquire the second part S_2 = {w_11-2, ..., w_44-2}. Similarly, the third part (in dark gray) is loaded into shared memory to calculate S_3 = {w_11-3, ..., w_44-3}. After that, we can assemble the complete weight values (e.g., w_11 = w_11-1 + w_11-2 + w_11-3), and at this point several weight values w_11, w_12, ..., w_44 have been acquired.
The most important advantage of this method is that the time spent accessing global memory is reduced. Take the example shown in Fig. 7: compared with the original method, it reduces the number of global memory accesses by size_image * (size_image - 2). Because shared memory is visible to all threads of one block, we can reuse the loaded data and reduce the access time further; taking a 32*32 block as an example, the number of global memory accesses can be reduced by a factor of 32. The "Get Squared Differences (GSD)" step accounts for the majority of the expectation step's running time, so reducing its global memory accesses efficiently reduces the processing time of this step.
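A serial sketch of this tiling scheme, assuming complex images flattened to rows, may clarify the data flow. NumPy broadcasting plays the role of the thread block, the slice plays the role of the shared-memory tile, and all names are illustrative:

```python
import numpy as np

def tiled_weights(particles, projections, tile=256):
    """Sketch of the tiling idea from Fig. 7: instead of streaming whole
    images for each weight, load one slice (tile) of every particle and
    every re-projection at a time -- on the GPU this slice would sit in
    shared memory -- and accumulate the partial sums w_if-1, w_if-2, ...
    into the full w_if for all (i, f) pairs before the next slice."""
    n_p, size = particles.shape
    n_f = projections.shape[0]
    w = np.zeros((n_p, n_f))
    for start in range(0, size, tile):           # one "shared memory" tile
        ps = particles[:, start:start + tile]
        rs = projections[:, start:start + tile]
        d = ps[:, None, :] - rs[None, :, :]      # broadcast over all (i, f) pairs
        w += np.sum(d.real ** 2 + d.imag ** 2, axis=2)   # partial sums
    return w
```

The key property mirrored here is reuse: each loaded tile contributes to every (particle, re-projection) pair in the block, so each element is fetched from global memory once rather than once per pair.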

Environment and Test Dataset
In our experiments, we adopted three different datasets from EMPIAR [36]. The first dataset consists of 6,496 β-galactosidase particles (EMPIAR-10017 [15]); the size of each particle is 128 × 128. The second dataset is "Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine", which contains 100,000 particles with particle size 360 × 360 (EMPIAR-10028 [9], [1]). The third dataset is also β-galactosidase (EMPIAR-10204), with 5,000 particles of size 100 × 100. The experiments were carried out on a machine running 64-bit Ubuntu 14.04 with an Intel(R) Xeon(R) CPU E5-2680 v3 at 2.50 GHz. The GPU card is an NVIDIA GeForce GTX Titan X with 3072 stream processors and 12288 MB of global memory.

Performance of the Extended Refine Algorithm of RELION
In this section, we compare our algorithm with the original version of RELION. To illustrate the performance, we recorded the resolution after each iteration for the three datasets mentioned above; the corresponding results are shown in Fig. 8. From these three plots, we can see that, in general, our method converges faster than the original version. Especially in the middle iterations, our method reaches better resolution in fewer iterations. Take the dataset EMPIAR-10204 as an example: in Fig. 8c, from the seventh iteration onward, our method starts to achieve better resolution than the original version, and in the middle iterations this advantage becomes increasingly obvious. This behavior is caused by the scalar parameter we introduced. During the first iterations, the projection errors of all particles are quite similar, so the corresponding scalar parameter values are also close to each other. However, the estimation grows more accurate with the iterations, and the differences between particles become evident. In particular, in the middle iterations, the scalar parameter values of different particles diverge, and this is the stage where the scalar parameters make the most significant contribution to the refinement procedure.
Around the fifteenth iteration, the difference between the resolution of our method and that of the original method becomes quite small, as the resolution generally converges to its final value. As for the final resolution, on dataset EMPIAR-10017 our method and the original version reach 7.25 Å and 7.69 Å, respectively. For the Plasmodium falciparum 80S ribosome, the final resolutions are 3.25 Å (our method) and 3.45 Å (the original version). For dataset EMPIAR-10204, the final resolutions are 3.80 Å (our method) and 3.93 Å (the original version). Based on these final resolutions, we can see that our method improves the resolution to a certain degree.

Performance of Stack-Based Memory Management System
We proposed a stack-based memory management system for RELION; here we use the dataset EMPIAR-10028 to show the running behavior of this system. Unlike the original version of RELION, which allocates GPU memory at arbitrary addresses, our system allocates memory without fragmentation, so the top pointer of the "stack" indicates the amount of GPU global memory in use: the memory below the top pointer is in use, and the memory at larger addresses is one continuous free region. In Fig. 9, the horizontal axis indicates each allocation operation during the run of RELION, and the vertical axis indicates the memory address. As we can see, the memory consumption of RELION shows a generally cyclical pattern due to its iterative nature, and our stack-based memory management system groups these allocations together.
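The idea can be sketched with a toy allocator. This is a simplified illustration of the scheme rather than the actual RELION code:

```python
class StackAllocator:
    """Minimal sketch of a stack-based GPU memory manager: allocations
    advance a top pointer through one contiguous region, so free memory
    is always the single contiguous block above the top pointer and
    fragmentation cannot occur. Frees must come in reverse (LIFO) order,
    which matches the cyclical allocation pattern of RELION's iterations."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0            # everything below `top` is in use
        self.offsets = []       # allocation records, to enforce LIFO frees

    def alloc(self, nbytes):
        if self.top + nbytes > self.capacity:
            raise MemoryError("out of device memory")
        offset = self.top
        self.top += nbytes
        self.offsets.append(offset)
        return offset

    def free(self, offset):
        # Only the most recent allocation may be released.
        assert self.offsets and self.offsets[-1] == offset, "non-LIFO free"
        self.offsets.pop()
        self.top = offset
```

After the last allocation of an iteration is released, the top pointer returns to its starting position, which is the cyclical pattern visible in Fig. 9.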

Memory Consumption Optimization
We modified the data structure of the weight array, which saves a substantial amount of memory. To show the performance of our customized data structure, we use dataset EMPIAR-10017 as the benchmark. In Fig. 10, we record the maximum memory consumption of the weight array of one particle in each iteration. As shown in Fig. 10, the most memory-demanding arrays appear in the middle iterations, and in these iterations our structure significantly reduces the memory requirements. The upper bound on the size of the weight array is determined mainly by the sampling number of the first pass, in both float mode and double mode. The memory consumption of the weight array is generally 1/32 of that of the original version.
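The 1/32 figure is consistent with keeping roughly one bit of bookkeeping per 32-bit weight. As an illustration of that kind of layout (our own sketch; the paper's exact structure may differ), a bitmask-plus-values representation looks like:

```python
import numpy as np

class SparseWeights:
    """Illustrative compressed weight array: after the significant-set
    selection only a small fraction of weights survive, so we keep a
    one-bit-per-slot occupancy mask plus a dense array of just the
    surviving values. The mask costs 1/32 of the original float32
    storage, and the value array shrinks with the significant set."""

    def __init__(self, weights, threshold):
        self.n = weights.size
        keep = weights >= threshold
        self.mask = np.packbits(keep)        # 1 bit per original slot
        self.values = weights[keep]          # surviving weights only

    def to_dense(self):
        out = np.zeros(self.n, dtype=self.values.dtype)
        keep = np.unpackbits(self.mask)[: self.n].astype(bool)
        out[keep] = self.values
        return out
```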

The Speedup of Top-Sum Parallel Solution
To test the performance of our parallel top-sum algorithm, we compared it with the original "sort and select" method. To show the performance, we recorded the average running time for unsorted weight arrays of different sizes. From Fig. 11, we can see that for weight arrays with a small number of elements (N = 44,640; 62,400; 104,160; 479,232), the running times of both methods are quite similar. However, for weight arrays with a larger number of elements, the running time of the naive "sort and select" method increases significantly, while that of our parallel method does not, which means our method achieves better performance when processing larger weight arrays. Besides, as our method can process multiple particles at the same time, the overall performance is much better than processing only one particle at a time.

The Speedup Ratio of Weight Calculation Parallel Algorithm
In this section, we test the performance of our weight calculation parallel algorithm. In contrast to the texture-memory approach, the performance of our method is not restricted by the total size of the re-projections. We compared our method with the original GPU version of RELION on dataset EMPIAR-10028; Fig. 12 shows the speedup ratio of our method over the original RELION.
Fig. 9. The top pointer of used memory. Fig. 10. Maximum memory consumption of the weight array for one particle of β-galactosidase in double mode. Fig. 11. The average processing time of different weight arrays.
As the number of sampling points for orientations and translations varies, the size of the weight array changes accordingly. In the first few iterations, the number of sampling points is quite small; around the fifteenth iteration, it increases significantly, and the amount of computation in the function rises correspondingly. This lets us make better use of the computing resources of GPUs in these iterations, which leads to a better speedup ratio. As shown in Fig. 12, in the first few iterations the performance of the two methods is similar; however, in the iterations with larger workloads (around the fifteenth iteration), the speedup ratio increases and the advantage of reusing shared memory becomes apparent.

The General Speedup Ratio
We summarize the total speedup ratio for each iteration and record the results in Fig. 13. As our program is based on the MPI version of RELION, we do not modify the master-slave MPI parallel model. In this experiment, we compare the CUDA+MPI version with the MPI version using five processes and one thread per process. Different from the MPI version, each slave process utilizes one GPU for its computation.
For the first sixteen iterations, the general speedup ratio is around 25x, and between iterations 20 and 23 the speedup ratio reaches about 75x. This higher speedup ratio is caused by the larger number of sampling points.
To compare the performance of RELION's parallel version and our method, we recorded the general speedup ratio using datasets EMPIAR-10017 and EMPIAR-10204. In Table 1, we show the speedup ratio with different numbers of processes (in the second column) and particles per job (in the first column) in the columns "EMPIAR-10204 Speedup Ratio" and "EMPIAR-10017 Speedup Ratio". In general, we find that the speedup ratio is quite stable as the number of processes varies. This is because we did not change the architecture of the MPI processes; the speedup is mainly due to the optimization of the weight calculation and selection steps. The speedup ratio of dataset EMPIAR-10017 is about 1.19, and that of EMPIAR-10204 is about 1.17. Comparing the first three rows with the rest, we find that the speedup ratios are larger with more particles in each job, because more particles provide a larger amount of computation to be processed in parallel.

CONCLUSION
In this work, we optimized the 3D refine algorithm from three aspects: the reconstruction algorithm, memory usage, and time performance. 1. We optimized the 3D refine algorithm by extending the model parameters and updating them based on the idea of back-propagation. 2. To optimize memory usage, we designed a memory-efficient and memory-stabilizing management system to guarantee the robustness of our program and the efficiency of GPU memory usage, and we developed a novel data structure for RELION to reduce memory consumption. 3. To speed up the GPU parallel algorithm of RELION, we proposed two highly efficient parallel algorithms: an in-kernel top-sum parallel algorithm and a weight calculation parallel algorithm. Experiments showed that the optimized 3D refine algorithm speeds up the convergence procedure and increases the final resolution, and that the memory system avoids memory fragmentation. We achieved a better speedup ratio compared to RELION.

Zhiyong Liu received the PhD degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He is a chair professor at ICT, CAS. His current research interests include high-performance algorithms and architecture, parallel processing, and bioinformatics.
Fa Zhang received the PhD degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He is a professor at ICT, CAS. His current research interests include bioinformatics, biomedical image processing, and high-performance computing.