Two-Stage Column Block Parallel LU Factorization Algorithm

Parallel computing is increasingly important in computer architectures, parallel architecture has become ubiquitous in our everyday lives. Novel architectures and programming models pose new challenges to algorithm design and system software development. This paper presents a two-stage column block parallel LU factorization algorithm for multiple-processor architectures. Any given matrix is first partitioned into large blocks, and then, every large block is partitioned into a number of small blocks according to the number of processors. Finally, the small column blocks are allocated to processors in an orderly “serpentine arrangement.” Each iteration of the column block parallel LU factorization is separated into two stages of operation. In the first stage, the first-step factorization operation is processed in advance and nonblocking communication is used to reduce the processor idle and waiting time and improve parallelism. In the second stage, the large blocks are used to satisfy more powerful processors, such as GPUs, which require more data to exploit their computing capabilities. Experiments are conducted on a multicore system and multi-GPU system with different configurations to test the algorithm’s performance. Compared with other related column block parallel LU factorizations, the two-stage algorithm exhibits better load balancing and parallel execution time performance.


I. INTRODUCTION
Parallelism appears to be the future of computing. Innovations in hardware architecture, such as multi-GPU and hyperthreading processors, have made parallel computing resources available for universal desktop computers, laptop computers, and even embedded devices [1]- [3]. Parallel computing will be increasingly important in the modern economy and businesses, but conventional programing techniques may not work well in new architectures [4]- [7]. Novel architectures and programming models pose new challenges to algorithm design and system software development. Exploitation of these innovations requires the extension of parallel programming techniques to all areas of software The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Souravlas . design and development [8]- [10]. Many research studies have focused on parallel programing technology for multicore and multi-GPU architectures [11]- [14]. Gaussian elimination is a classic parallel programming algorithm used to solve dense linear algebra systemsthat is widely employed in scientific and engineering models [15], [16]. Cholesky, LU, and QR factorization routines are included in almost all popular linear algebra libraries, such as Linear Algebra Package (LAPACK) [17], Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) [18], and Matrix Algebra on GPU and Multicore Architectures (MAGMA) [19], [20]. MAGMA is a collection of next-generation linear algebra libraries for heterogeneous architectures. A highlevel parallel programming model is used in [21] to implement an algorithm for dense linear algebra on multicore and GPU architectures. Kim et al. [22] present a task-parallel VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ algorithm to deal with the data dependencies for sparse incomplete Cholesky factorization. A right-looking algorithm for LU, Cholesky, and QR for GPUs is described by Volkov and Demmel [23] to optimize the performance of matrix factorization. A row block cyclic distribution strategy is used in [24] to improve load balancing and reduce the communication cost for Cholesky factorization. Endo et al. [25] describe a load balancing method for making the best use of the computational capability of all processors in a system to update a trail matrix. A static distribution policy is used in [26] for matrix factorization on heterogeneous architectures. Choi et al. [27] also use a static block cycle distribution strategy to develop a parallel block Cholesky factorization algorithm for multiple-processor systems. A blockcyclic redistribution scheme is presented in [28] to derive the optimal total cost for some particular instances of the redistribution problem. Park et al. [29] present an efficient algorithm for array redistribution to reduce the overall time of communication based on a generalized circulant matrix formalism. A dynamic scheduling policy is presented in [30] on the basis of activity on edge network for block parallel Cholesky factorization to improve parallelism and load balancing. Deisher et al. [31] also describe a dynamic load balancing strategy for matrix factorization in a multicore system. A column block uniform allocation policy is used in [32] for parallel LU factorization on heterogeneous platforms to improve load balancing and communication.
Agullo et al. [33] perform an extensive analysis and comparison of static and dynamic Cholesky factorization algorithms for multiple-processor platforms. A pipeline technique is used in [34] to redistribute data on a multiprocessor grid during runtime. This pipeline technique can significantly reduce the amount of time required to complete a dynamic data transfer task. Dong et al. [35] design a heterogeneous algorithm for Cholesky factorization and QR factorization using a static data distribution to realize load balancing. However, they do not consider the parallelism in the first step of each iteration of the LU factorization. In this paper, we present a two-stage block parallel LU factorization algorithm to develop parallelism of LU factorization and further improve load balancing on multiple processors. The remainder of this paper is organized as follows: Section II discusses the column block parallel LU factorization algorithm. Section III presents the two-stage column block parallel LU factorization algorithm for multiple-processor architectures. Section IV presents the experimental results, and a brief conclusion is presented in Section V.

II. COLUMN BLOCK PARALLEL LU FACTORIZATION ALGORITHM
The LU factorization of any given matrix A uses a series of Gaussian eliminations to form π A = LU , where π is a permutation matrix, L is a unit lower triangular matrix, and U is an upper triangular matrix. Because the LU factorization algorithm for an N × N square matrix can be easily modified to fit an M × N rectangular matrix, we only discuss the LU factorization algorithm for a square matrix in this paper. The computational complexity of the LU factorization is O(2N 3 /3) floating-point additions and multiplications, and partial pivoting increases a further O((N 2 + N )/2) floatingpoint additions and multiplications for an N × N matrix.
Given an N ×N matrix A to be partitioned into n×n blocks, the column block parallel LU factorization can be implemented in n iterations. In fact, no extra storage is required for the factorized results L ij and U ij because they will overwrite the corresponding blocks A ij of the original matrix A.
Step 2: Execute row swapping on the column blocks located in the left or right of the k-th column block based on π k , i.e., Step 3: Update the blocks (A (k) k(k+1) · · · A (k) k(n−1) ) located on the right side of block A kk into (U k(k+1) · · · U k(n−1) ) according to the following equation: Step 4: Update the blocks located in the lower right of block (trailing submatrix) A kk , i.e., For convenience, we also use vector form to denote any partial column blocks in the later discussion, e.g., Typically, these four steps can be implemented by calling four BLAS and LAPACK subroutines, i.e., dgetrf(), dlaswp(), dtrsm(), and dgemm(), for double-precision data. We use the LAPACK subroutine dgetrf( − → A kk , − → L kk , U kk , π k ) to represent LU factorization of the column block corresponding to Eq. (3), the LAPACK subroutine dlaswp(RestColumn(A, − → A kk ), π k ) to represent the row exchange on the column blocks of A except for − → A kk corresponding to Eq. (4), the BLAS subroutine dtrsm(L kk , RightBlock(A kk )) to update the column blocks located on the right side of block A kk (or U kk ) corresponding to Eq. (5), and the BLAS subroutine dgemm( − → L (k+1)k , LowerRight(A kk )) to update the blocks located in the lower right of block A kk corresponding to Eq. (6). Generally, n column blocks of the matrix A will be distributed to p processors P 0 , · · · , P p−1 cyclically for a system with p processors, and then any processor P i (i = 0, · · · , p−1) contains n/p column blocks i, i + p, · · · i + (n/p − 1)p, which form a submatrix S. Suppose that the k-th column block of A is located in the k -th column block of the submatrix S of P y (y = k mod p); then, − → A 0k and − → A kk become − → S 0k and − → S kk for P y , respectively. The column block parallel LU factorization for multiprocessor systems can be described as follows in Algorithm 1.
The algorithm can easily be modified to support multi-GPU systems by using corresponding routines such as cublasDswap(), cublasDtrsm(), and cublasDgemm(). However, Algorithm 1 does not consider the parallelism on the first step of each iteration. In fact, in any iteration k, every processor needs a broadcasting and synchronization operation, i.e., MPI_Bcast() in Algorithm 1, to wait for the current processor P y (y = k mod p) to run the routine dgetrf() and broadcast the results − → L kk and π k , which will lead to processor idle and degenerate the parallelism. Hence, we introduce a two-stage column block parallel LU factorization algorithm for multiple-processor architectures.

Algorithm 1 Column Block Parallel LU Factorization Algorithm for Multiprocessor Architectures
; / * Broadcasting and synchronization. * / dlaswp(RestColumn(S, − → A kk ), π k ); / * Each processor swaps its column blocks originally located on the left or right of − → A kk . * / dtrsm(L kk , RightBlock(A kk )); / * Each processor updates its blocks originally located on the right of ; / * Each processor updates its blocks originally located in the lower right of A kk . * / }

III. TWO-STAGE COLUMN BLOCK PARALLEL LU FACTORIZATION ALGORITHM
The two-stage column block parallel LU factorization algorithm is still implemented in n iterations. However, each iteration will be separated into two stages of operation. In the first stage, the matrix's small column blocks are allocated to processors in an orderly ''serpentine arrangement'' to obtain better load balancing, and then nonblocking communication is used to reduce processor idle and waiting time and improve the parallelism. The second stage will run on the large matrix blocks to satisfy more powerful processors, such as GPUs, which require more data to exploit their computing capabilities.

A. MATRIX PARTITION AND DATA DISTRIBUTION
Suppose that an N × N matrix A is partitioned into n × n large blocks, which are then each partitioned into 2p × 2p small blocks again according to the number of processors p. In other words, matrix A is partitioned into 2pn × 2pn small blocks, which can be denoted as For any given large column block will be distributed to p processors in a VOLUME 8, 2020 ''serpentine arrangement.'' Specifically, the 2p small column blocks − → A 0(2pi) , · · · , − → A 0(2pi+2p−1) will be allocated to processors P 0 , P 1 , · · · , P p−1 , P p−1 , · · · , P 1 , P 0 in an orderly manner; that is, any processor P x (0 ≤ x ≤ P − 1) will be allocated two small blocks − → A 0(2pi+x) and − → A 0(2pi+2p−1−x) for a given large column block − → A 0i . Consequently, the processor P x contains 2n small column blocks ), which form a submatrix and will be denoted as S or S x . Moreover, we assume that the (2pi+j)-th column block of A is allocated to one processor P and is the (2pi + j)'-th column block of the submatrix S of P; − → A 0(2pi+j) then becomes − → S 0(2pi+j) for P. Hence, we use the different symbols − → S (2pi+j)(2pi+j) or − → A (2pi+j)(2pi+j) to denote the same column block in different situations. Figure 1 shows the matrix partition and data distribution in a four-processor system. For example, P 2 is allocated 2n small column blocks , which form a submatrix denoted as S or S 2 .

B. TWO-STAGE COLUMN BLOCK PARALLEL LU FACTORIZATION ALGORITHM DESIGN
For a given matrix A as shown in Formula (7), two-stage column block parallel LU factorization is implemented in n iterations on a p-processor system. Initially, processor P 0 factorizes − → A 00 (or − → S 00 ) to obtain − → L 00 , U 00 , and π 0 , as in Eq. (1). Then, each iteration will be separated into two stages of operation to obtain better parallelism and load balancing. The first stage is aimed at computing the k large column blocks, i.e., 2p × k small column blocks (the value of k will be discussed later). For any given iteration i (i = 0, . . . , n − 1), the first stage consists of 2p steps to process the 2p × k small column blocks 2pi, · · · , 2pi + (2pk − 1), i.e., k large column blocks i, · · · , i + k − 1. Any processor P z (0 ≤ z < p) will process its 2k small column blocks ( , which form a submatrix and will be denoted as S (i) or S (i) z . Figure 2 shows the relationship of A, S, and S (i) at the start of the first stage of the i-th iteration for a four-processor system. In this case, the first i row blocks 0, · · · , i − 1 of matrix A have been processed completely, and their is no operation to be carried out in the later steps, as shown in Figure 2 (a). The four processors are allocated different submatrices S partitioned from A, as shown in Figure 2 The operations of the j-th step of the first stage at the i-th iteration of the two-stage algorithm are described as follows: (1) Every processor P z receives − → L (2pi+j)(2pi+j) and π 2pi+j , which are the factorization results of the (2pi+j)-th small column blocks − → A (2pj+j)(2pi+j) computed by processor P x in the previous step, where (2) If the processor number is y, where then P y processes the (2pi+j+1)-th small column block in advance, i.e., 1) Run column swapping on − → A (2pj+j)(2pi+j+1) according to π 2pi+j .
(4) According to π 2pi+j , every processor P z runs row swapping on its small column blocks in the k large columns, as in Eq. (4), but not including the small column blocks 2pi + j and 2pi + j + 1.
The second stage of the i-th iteration will process other column blocks and can be described as follows: (1) All processors gather the i-th large column block and obtain: (2) Every processor P z runs row swapping on its column blocks on the left or right of ( − → A ii , · · · , − → A i(i+k−1) ) according to i as in Eq. (4), where i is composed of π 2pi+j (j = 0, · · · , 2p − 1).
(3) Every processor P z updates its corresponding blocks in A i(i+k) , · · · , A i(n−1) and located on the right of A i(i+k−1) by using L ii , as in Eq. (5).
Obviously, the communication in the first stage is nonblocking, which will improve load balancing and parallelism because the task is approximately equal, and the processors hardly have idle and waiting time.
When the BLAS and LAPACK routines dgetrf(), dlaswp(), dtrsm(), and dgemm() are used correspondingly, the twostage column block parallel LU factorization algorithm for multiple-processor architectures can be described as follows in Algorithm 2.
For a shared-memory multicore system, at the j-th step of the first stage of any iteration i, the broadcast operation MPI_Ibcast() is not necessary because all CPU cores can directly access matrix A in shared memory, but the operation corresponding to the wait operation MPI_Wait(req[j]) is needed to verify the previous step's factorization results − → L (2pi+j)(2pi+j) and π 2pi+j . Alternatively, Algorithm 2 can be easily modified to run on multi-GPU systems by using related routines, such as cublasDswap(), cublasDtrsm(), and cublasDgemm().

C. PERFORMANCE ANALYSIS
According to the two-stage column block parallel algorithm, because the small column blocks are allocated to processors in an orderly ''serpentine arrangement,'' all processors are allocated identical matrix data and have almost identical computing tasks. Therefore, all processors hardly have idle and waiting time only if all processors can receive the factorization results − → L (2pi+j)(2pi+j) and π 2pi+j in the j-th step at the first stage of every iteration i in a timely manner because a nonblocking communication operation is used. In fact, we draw the following conclusion: If k ≥ 1 2 (p + 3), then all processors can receive the factorization results in a timely manner and hardly have idle and waiting time in the first stage.
3) When 0 < j 0 ≤ p − 1, the task load of P x before dgetrf( − → A (2pi+j 0 )(2pi+j 0 ) ) has been completed is and the task load of P 0 before starting step j 0 is 2650 VOLUME 8, 2020 According to T 1 ≤ T 2 , we obtain Because t 2 < t 1 , so long as 2kt 1 ≥ (j 0 + 1)t 1 + t 1 , that is, and 4) When p ≤ j 0 ≤ 2p − 2, the task load of P x before dgetrf( − → A (2pi+j 0 )(2pi+j 0 ) ) has been completed is and the task load of P 0 before starting step j 0 is According to T 1 ≤ T 2 , we obtain Because t 2 < t 1 and j 0 ≥ p, we have Hence, only if (p + 3 − 2k)t 1 ≤ 0, that is To summarize 1), 2), 3), and 4), when k ≥ 1 2 (p + 3), all processors can receive the factorization results in a timely manner and hardly have idle and waiting time in the first stage. All processors have almost identical task loads, and the algorithm exhibits better parallelism. Figure 3 shows the execution sequence and dependency of all basic computing tasks in the first stage of the first iteration in a three-processor system. In this case, the value of k is 3, which satisfies Eq. (25). Then, the total task load of every processor includes 29 dTrGe() tasks and 2 dgetrf() tasks, i.e., 29t 1 +2t 2 , which is consistent with Eqs. (13) and (14). Every red arrow indicates the dependency of the two tasks belonging to different processors. It is observed from these arrows that no processors wait idly for other processor to process their predecessor tasks. For example, considering the computing task dTrGe( − → A 9,11 − → A 9,12 − → A 9,17 − → A 9,18 − → A 9,23 ), the total task load that needs to be processed by P 0 before it is 15t 1 , and the total task load that needs to be processed by P 2 before it is 13t 1 +2t 2 . Hence, P 0 can receive the results of dgetrf( − → A 9,9 ) in a timely manner and has no idle and waiting time.

IV. EXPERIMENTAL RESULTS
The experiment is performed on an AMAX XG-48201GK system consisting of two Intel Xeon E5-2620 v4 CPUs and eight NVIDIA Titan XP GPUs. Every CPU contains eight cores, and the frequency is 2.1 GHz. Table 1 shows the main hardware resources in our experiments.

A. LOAD BALANCING
For the traditional algorithm described as Algorithm 1, the factorization operation on − → A kk in every iteration is only run by a single processor; then, the operation is the broadcast of related data and synchronization, so the waiting time would be considerable for a system with more processors. The twostage algorithm presented in the paper improves the load balancing and reduces processors' idle and waiting time by distributing data in a ''serpentine arrangement'' style and using a nonblocking communication mode in the first stage in every iteration. To evaluate the processor idle time and load balancing of the algorithm, the metric Average idle ratio is The smaller Average idle ratio is, the lower the waiting time and the better the load balancing performance are for a given algorithm. The best is when the Average idle ratio is equal to 0, which means no synchronization waiting time and the best load balancing; on the contrary, the worst is when the Average idle ratio approaches 1.
In our experiment, we test the Average idle ratio for the traditional algorithm described in Algorithm 1 and the twostage algorithm presented in this paper. The experiment is performed on the multicore system configured with 8 cores and 16 cores, respectively. Figure 4 shows the Average idle ratio comparison of the traditional algorithm and the two-stage algorithm for different multicore configurations and different matrix sizes N . The two algorithms run with larger Average idle ratios on a 16-core system than on an 8-core system. The reason could be that, when the CPU cores increase from 8 to 16, the parallel execution time decreases, but the communication time does not decrease. The Average idle ratios of the two algorithms in the different systems decrease when the matrix sizes increase from 8,192 to 36,384. Generally, when the matrix is larger, the parallel execution time will be larger and the synchronization waiting time will be relatively smaller. Most of the Average idle ratios of the traditional algorithm are between 2% and 10%. However, the Average idle ratios of the two-stage algorithm are less than 1% for the different systems. Compared with the traditional algorithm, the two-stage algorithm exhibits better parallelism and load balancing.

B. PARALLEL EXECUTION TIME
The parallel execution time of an algorithm is related to the load balancing strategy, storage mode, communication and synchronization method, etc. for a specific system. Parallel execution time is the most important metric for a parallel algorithm. Parallel execution time is tested in our experiments to measure the performance of the algorithm. We compare  the two-stage algorithm with the traditional algorithm on a multicore system and then compare it with the two representative LU factorization algorithms ''magma_dgetrf'' and ''magma_dgetrf_mgpu'' in MAGMA on a multi-GPU architecture. MAGMA is the state of the art in dense linear algebra library development for multicore and multi-GPU hybrid architectures. The magma_dgetrf and magma_dgetrf_mgpu algorithms are representative LU factorization algorithms for multicore and multi-GPU hybrid systems in MAGMA. These two magma_dgetrf and magma_dgetrf_mgpu algorithms are right-looking Level-3 BLAS versions of the algorithm for computing the LU factorization of a general M × N matrix A using partial pivoting with row interchanges. The former is a hybrid CPU/GPU routine where the matrix is initially in the CPU host memory, and the latter is a hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories. Figure 5 shows a performance comparison of the two-stage algorithm and traditional algorithm for double-precision matrices on different multicore systems configured with 8 cores and 16 cores. Figure 6 gives a performance comparison of the two-stage algorithm, magma_dgetrf, and magma_dgetrf_mgpu for double-precision matrices on different multi-GPU systems configured with four GPUs and eight GPUs. In Figures 5 and 6, the x-axis represents the different algorithms and different system configurations, the y-axis indicates the size N of the double-precision matrices, and the z-axis shows the parallel execution time of the different algorithms in seconds. When the size of matrices N is increased from 8,192 to 20,480 for the multicore system or from 16,384 to 40,960 for the multi-GPU system, the parallel execution time increases gradually. In most cases, having more processors available in the system leads to lower parallel execution time for a given algorithm and matrix. However, when the matrix size is smaller, such as N =16,384 and 20,480, the parallel execution time of magma_dgetrf on four GPUs is less than that on eight GPUs. The reason is that the smaller matrix cannot provide sufficient computation for more GPUs, but the greater number of GPUs means more data transfer between the CPU and GPU, and the communication and synchronization cost accounts for a considerable proportion of the parallel execution time for magma_dgetrf. The performance of the two-stage algorithm is better than that of the traditional algorithm on the multicore system and that of magma_dgetrf and magma_dgetrf_mgpu on the multi-GPU system.
In multicore and multi-GPU heterogeneous system, the panel is factorized on the CPU core and the trailing submatrix is updated on the GPU for the algorithms magma_dgetrf and magma_dgetrf_mgpu, while the panel factorization and the trailing submatrix updating are run on the GPU for the two-stage algorithm. The former overlaps the left panel factorization and trailing submatrix computed by using the right-looking ahead technique to decrease the waiting time of GPUs, while the latter uses a data distribution strategy and partial asynchronous operation among GPUs to decrease the waiting time of GPUs. Since the panel is factorized on the CPU core and the trailing submatrix is updated on the GPU, the algorithm magma_dgetrf or magma_dgetrf_mgpu has to copy the column block data from the GPU to the corresponding CPU core and then broadcast the factorization results from this CPU core to other CPU cores and to corresponding GPUs. By contrast, the two-stage algorithm only needs to broadcast the data from the GPU to other GPUs directly because the panel factorization and trailing submatrix updating are run on the GPU. Obviously, the communication cost of the algorithm magma_dgetrf or magma_dgetrf_mgpu is larger than that of the two-stage algorithm. Increasing the CPU cores and GPUs in the system will lead to a rapid increase in the communication costs. In fact, the time of panel factorization run by the GPU is much less than that run by the CPU core or the communication cost. The communication cost of the algorithm magma_dgetrf or magma_dgetrf_mgpu will exceed the total costs of the communication time and the panel factorization time of the two-stage algorithm when we increase CPU cores and GPUs in the system. As shown in Figure 6, the performance of the two-stage algorithm is better than that of magma_dgetrf and magma_dgetrf_mgpu on the multi-GPU system when the number of GPUs is increased to four in our experiment. Furthermore, the communication cost is related to matrix size, column block size, hardware architecture, bandwidth, etc. Figure 7 shows the comparison  Scalability is often used to evaluate the acceleration of an algorithm when solving a definite problem and the capability of an algorithm to solve potentially larger problems when more computing resources are available. An experiment is conducted on the multicore system to measure how the parallel execution time accelerates when the number of CPU cores is increased gradually for a fixed-size matrix. Then, the performance variation trend is tested when the size of the matrix and the number of CPU cores increase simultaneously. Figure 8 shows the parallel execution times of the two-stage algorithm and the traditional algorithm for a  26,880×26,880 double-precision matrix. The two algorithms are run on the multicore system, and the size of small column blocks is 4. The algorithms exhibit similar scalability. Their parallel execution times decrease gradually with as the number of cores increases from 2 to 16. However, the twostage algorithm has better performance than the traditional algorithm in a system configured with various cores. Figure 9 shows the performance variation trend of the two-stage algorithm when the number of CPU cores changes simultaneously. The x-axis presents the number of cores from 1 to 16 and the size of the matrix from 2,048 to 32,768. The y-axis indicates the parallel execution of the two-stage algorithm. The curve approximately follows a straight line in a logarithmic scale, which agrees with the theoretical results.

V. CONCLUSION
Parallel computing has undergone great development in the past few years and will be increasingly important in future computing architectures. Novel architectures and programming models pose new challenges to algorithm design and system software development. Traditionally, column block parallel LU factorization algorithms consist of n iterations. However, they do not consider the parallelism of the first-step factorization operation in each iteration. This paper presents a two-stage column block parallel LU factorization algorithm for multiple-processor architectures. Any given matrix is first partitioned into large blocks to satisfy more powerful processors, such as GPUs, which require more data to exploit their computing capabilities. Then, these large blocks are partitioned into a number of small blocks according to the number of processors, and the small column blocks are allocated to processors in an orderly ''serpentine arrangement'' to improve load balancing and parallelism. Each iteration of the column block parallel LU factorization algorithm is separated into two stages of operations. In the first stage, the firststep factorization operation is processed in advance, and nonblocking communication is used to reduce processor idle and waiting time and improve parallelism. In the second stage, the large column blocks are used to provide sufficient data to satisfy more powerful processors, such as GPUs. Experiments compare the load balancing and parallel execution time of the two-stage algorithm and the traditional algorithm on a multicore system and then compare the parallel execution time of the two-stage algorithm and two representative algorithms on a multi-GPU system. The two-stage algorithm exhibits better load balancing and parallel execution time performance. Furthermore, experiments on scalability also show good acceleration performance of the two-stage algorithm when increasing the number of processors for larger problems.