Performance Comparison of GPU-Based Jacobi Solvers Using CUDA Provided Synchronization Methods

In this manuscript, variants of Jacobi solver implementation on general purpose graphical processing units (GPGPU) have been purposed and compared. During this work, parallel implementation of finite element method (FEM) using Poisson’s equation on shared memory architecture as well as on GPGPUs has been observed to identify computationally most expensive part of FEM software, which is linear algebra Jacobi solver. Sparse matrices were used for system of linear equations. Nine implementations of Jacobi solver have been developed and compared using various synchronization and computation methods like atomicAdd, atomicAdd_block, butterfly communication, grid synchronization, hybrid and whole GPU based computation methods, respectively. Experiments have showed that Jacobi implementations based on our implemented Butterfly communication method have outperformed CUDA 10.0 provided critical execution methods like atomicAdd, atomicAdd_block and grid methods. The GPU has achieved a max speedup of 46 times using GTX 1060 and 60 times using Quadro P4000 with double precision computations when compared with sequential implementation on Core-i7 8750H. All the developments were performed using C/C++ GNU compiler 7.3.0 on Ubuntu 18.04 and CUDA 10.0.


I. INTRODUCTION
High Performance Computing is referred as a branch of computer science used for solution of large and highly complex problems in domain of science, engineering and business.
In High Performance Computing, many-core processors have gained more popularity than multi-core CPUs. It consists of hundreds and even thousands of processing units on which thousands or millions of threads can be executed concurrently. Generally, GPUs cannot be stand-alone device that acts as coprocessor or an accelerator for CPU. In the beginning, GPUs were utilized to drive computations needed to show graphics on computers. The most important factor that drives rapid development of GPU hardware is need for realtime 3D graphics in games [1]. Until 2006, it was very challenging for programmers to write programs for early graphics chips in higher level programming interface, as underlying code must fit into APIs that intended to paint The associate editor coordinating the review of this manuscript and approving it for publication was Weipeng Jing . each pixel. For that purpose, one must has complete knowledge of Direct3D and OpenGL techniques to program these chips. However, it was not inspiring for researchers until GPUs were considered as special purpose processors. With the release of CUDA, every thing changed dramatically. NVIDIA facilitated ease of parallel programming by devoting Silicon areas on chips. As GPUs have been evolved into massively parallel many-threaded multi-core units that supports highly efficient computation of large blocks of data in parallel and high memory bandwidth. The algorithms that include large blocks of data are done more efficiently in GPGPUs rather than on CPUs. The GPU programming strategies are not similar to traditional CPU programming because of dramatical differences in hardware which are discussed below. CUDA facilitated ease of parallel programming on NVIDIA GPUs by providing multi-threaded Single Instruction Multiple Data (SIMD) architecture. In programming context, the set of GPU instructions are referred as kernel, a function written in CUDA C. When host thread invokes kernel, the Compute Workload Distribution (CWD) unit enumerates blocks of grids and maps them to unoccupied multiprocessors according to execution capacity of that GPU. On each SM block of threads execute concurrently. As soon as blocks finish, CWD unit initiates new blocks on available free multiprocessor. Each MP is designed in a way to execute hundreds of threads in parallel. For management of such a large amount of threads Single Instruction Multiple Thread (SIMT ) unit is employed. Each MP's SIMT unit schedule, manage and execute threads in group of 32 threads, known as Warp. Threads of warp run together at same program address but they have unique instruction address counter and register state. When one or more blocks are given to MP, it first partitioned them into warps that are scheduled and managed by warp scheduler. The SIMT unit selects a warp on time of instruction issue, that is ready to execute and issue next instruction to the active warp. At a time, one common instruction is executed by a warp. Therefore, maximum efficiency can be attained if all threads of warp agree on same execution path.
The GPU execution model is based on launching the kernel on grid of blocks. Each block comprises of group of threads and threads of same block can cooperate and synchronize with each other via fast shared memory. While threads of different blocks can't synchronize with each other. The grid and block dimensions can be one, two or three and are accessible within kernel by using built in identifiers gridDim and blockDim, respectively and they determine total number of threads [2]. Each thread has unique local id within a block that can be one, two or three dimensional and each block has unique global id. These are combined to get global unique id per thread.
There are three memory spaces in GPUs as Figure 1 shows, discussed in descending order by speed: registers, shared and global memory.
• The fastest available memory space on GPU is Registers. Each thread of MP has Registers of private scope. If threads use more number of registers than are available physically, registers will also expand to L1 cache and global memory. Thus, when there are high number of threads, the number of registers available per thread is restricted, which is one of the major reason of why high occupancy may actually affect performance of GPU • The shared memory is second fastest memory space on GPU. This memory space can be as fast as registers if addressed properly. In GPU computing, shared memory is most powerful tool. The main difference between shared memory and registers is, its ability to share data among several threads. The shared memory is visible to all threads within one block, thus enables cooperation.
The limited size of on-chip memory which is 49 KB per block in device of compute capability 6.x or more, is the main hurdle in utilizing registers or shared memory. This memory is organized into 32 banks, that serves 32 threads of one wrap concurrently.
• The global memory is the third and slowest memory, which is main memory for GPU. However, it has impressive bandwidth and high latency. Thread-level parallelism is used to hide these memory latencies. The GPUs transfers full cache line across bus for coalesced read, just as CPUs. The cost of transferring full cache line on bandwidth is similar as transferring a single element. The execution of multiple threads requires some basic means of coordination. We must present mechanism for coordination of threads at block-level and at grid-level. The NVIDIA introduces a barrier synchronization method __syncthreads() for block-level coordination. When this method is called in kernel, all threads of a block have to wait at this point until all threads in the block reach at that point. This guarantees that all threads of block have completed a particular phase before any of them can start next phase of execution. Generally, a barrier synchronization is one of the most popular method of coordinating activities of parallel threads. However, there is also a need of certain mechanism for inter-block synchronization. The package of Cooperative groups describes both block-level and grid-level synchronization strategies. For grid-level synchronization, cudaLaunch-CooperativeKernel has to be implemented instead of normal kernel launch <<< >>>. The device must have support of cudaLaunchCooperativeKernel otherwise it fails to implement grid-level synchronization. When grid.sync() method is called in kernel all blocks of grid have to wait at barrier synchronization point until all of them reach at this point. The basic pattern of execution of barrier statement in grid-level synchronization is similar as in block-level, the only difference is instead of block of N threads the grid of N blocks is taken.
What are atomic operations? The operation in which thread can perform memory transaction without interference from other threads, is known as Atomic Operation. They are beneficial in multi-threaded application for synchronization of memory accesses to prevent race condition.

A. OVERVIEW OF FINITE ELEMENTS METHOD
Majority of engineers and scientists do simulation and modelling of physical phenomenon by complex differential equations. Nowadays, most of the problems are approximated by using numerical computation method. Typically, numerical computation method transforms complex partial differential equations into set of discrete equations that are to be approximate using computers.
The most suitable engineering analysis method to approximate solutions of partial differential equation is Finite Element Method (FEM). The partitioning of domain into mesh of discrete points (DOFs) is done to solve differential equations under some appropriate boundary conditions. The principle steps involved in FEM are five, as shown in Figure 4. The accuracy of numerical approximation depends on size and shape of mesh. The line elements reside at the boundary of and will represent how a force will affect boundaries. While triangle elements will reside inside and will be used to calculate discrete problem functions. In this work, meshes are generated by a third party software called GMSH [3], which is used for discretization. The boundary elements are used to implement domain constraints like Dirichlet or Neumann etc. The mesh has to be read at initial stage in Finite Element Solver(FES) and is considered as critical part for parallel implementation.
• Construction of mesh elements: FEM is a technique which can be implemented to solve any given problem. During this work a third party mesh generator GMSH is used to generate 2D and 3D discrete domains called meshes. These meshes are refined into smaller elements to get better approximation. Meshes will be read by software at its initialization stage. A separate class Gmesh is implemented and its object is responsible for creating element objects like node and elements based on information from mesh file. The node object stores coordinates while element stores node ids, equation constants, force applied and formulation for stiffness matrix and load vector.
• Construction of linear system of equations: There are two type of elements irrespective to their shape which are domain element and boundary element. Domain elements contribute in development of stiffness matrix A and load vector b as can be seen in Figure 2. While boundary elements only contribute in load vector. For a 2D square domain all the boundary elements which 31794 VOLUME 8, 2020 have either y-coordinate equal to zero or x-coordinate equal to zero are dirichlet boundary D elements with a constant force zero. The system of linear equation requires a sparse matrix and load vector which are implemented as nested binary trees and array, respectively. There are many compact sparse matrix formats available like CSR (compressed sparse rows), CSC (compressed sparse columns)and COO (Coordinate list) etc, but these formats required structure of sparse matrix be known at construction time. Using object for each element in mesh and allowing each element to populate global stiffness matrix and load vector only allows sparse matrix to populate without knowing possible structure. Secondly, the matrix structure has to be modified while applying Dirichlet constraints. It was better to use dynamic matrix data structures like two level nested binary trees where there will be approximately constant time of data insertion, deletion and searching as well. VOLUME 8, 2020 • Application of Dirichlet constraints: In Finite Element Solver, the dirichlet constraints are applied after assembling A and b. The application of dirichlet boundary includes removal of row and column entries of dirichlet DOFs from sparse matrix A and modification of load vector b. The dirichlet constraints are applied by setting the rows and columns of dirichlet DOFs of A to zero except the diagonal entry which is set to 1. While the DOF entry in load vector is set to appropriate force. The example of dirichlet application for mesh sq is shown in Figure 3. For a spare matrix A it is implemented by removing i th row, i th column from each remaining row and finally setting diagonal element of i th row to 1.
• Solve the linear system of equations using Jacobi solver: In the last stage of this work iterative linear algebra solver Jacobi is implemented to solve linear system of equations. Initially, it is implemented for a multi-core shared memory processor and than into GPGPU.

II. BACKGROUND
The emergence of parallel computing in scientific areas and improvements of computer hardware have introduced high performance and throughput. Most of the work in the field of parallelization of FEM on GPU was performed after 2000, and from last few years the number of publications on this work has increased readily. Majority of computations in FEM solver are of single instruction multiple data flavor, that's why these are well suited for many core architecture. GPU based FEM solver implementations have been proposed in different areas of engineering, like in electromagnetic [4] and in solid mechanics [5]. Several strategies are proposed to improve performance on many-core GPUs as in [6] for optimization of memory transfers between device and host, appropriate thread utilization, optimization of global memory latency, occpancy gain, increased coalesced memory access, reduced thread divergence and increase global memory store or load efficiencty. A review of GPU base FEM implementation in the domain of magnetics was presented by Hoole et al. [7].
A collection of studies have been performed in last decade, with definite aim of offloading part or all of computationally expensive parts of finite element method on massively parallel architecture. There were two components in finite element pipeline, that were computationally expensive and challenging: global assembly phase and solution phase of linear system.
Early works on assembly phase like Bolz et al. [8] and Rodríguez-Navarro and Susín Sánchez [9] presented simplest assembly approaches that were designed on the basis of specific application. They solved each non zero entry in parallel on global linear matrix independently, which was best fit for many core architecture. However, these approaches were developed for specific applications which enabled them to solve only simple expressions, and not worked for general finite element applications. Few more complicated, but more general many-core assembly approaches have recently been introduced.
First on device assembly process was developed by Cecka et al. [10]. In global matrix, common DOFs between elements were map at same location. Therefore, these common DOFs share same memory location in global memory. The race condition can occur, when multiple threads access same share memory location concurrently. For global matrix parallel updation, atomic operation was required to avoid the race condition, which resulted in performance bottleneck. Atomic operations were more expensive as they made global memory accesses serialized. Moreover, most of the GPUs do not have support of atomic operations of double precision.
As an alternative of atomic operation, different techniques like, coloring [11], [12] and task dependent graph [13] have been published to avoid race condition. These techniques optimized overheads of synchronization but couldn't alleviate them completely. The main drawback of both of these approaches was pre-processing which was difficult to parallelize. However, it was done only once, so its computational time effect can be neglected.
Fu et al. [14] introduced a hierarchical assembling approach that overcome the overhead of synchronization efficiently. They developed batch-based strategy for assembling, that guaranteed the coalesced access to global memory and synchronization. The contribution of every DOF was stored in shared memory and final results were accumulated in global memory in chunks. The coalescing of memory access was attained by storing coordinates of node in different arrays rather than single.
Cecka et al. [2] suggested variety of optimization strategies like, optimization of latency of global memory, effective workload distribution, proper shared memory usage and computational power and optimization of transfer of data between host and device. The data containers selection for storage of elemental data was effective for coalesced memory access. The best selection of data container that was utilized for storage of elemental data, coordinates of nodes etc based upon target architecture specifications. The coalesced memory access can be used to improve efficiency of global memory.
Markall et al. [15] suggested node-wise storage pattern for CPU and element-wise storage for GPU to achieve good performance. This on-device storage pattern increases coalesced memory access of global memory. They proposed two approaches for assembly of global matrix: local matrix and addTo approach.
The addTO approach map entries of local matrix into global directly using mapping function, it required expensive atomic and coloring operations to avoid conflicts and to guaranteed validity of results. While, in second approach exclude global matrix generation, solution vector was assembled after computation of elemental matrix by vector multiplication. Thus, this approach resulted in efficient memory bandwidth usage and improved coalesced memory access patterns. They concluded that second approach was well suited for many core GPUs while fist approach was well suited for multi-core CPU.
Sanfui and Sharma [16] developed an advance version of addTo GPU based two kernel approach. In descertization of 3D domain, each DOF was further connected with three DOFs. Therefore, non-zero entries of every DOF will be in three rows of global stiffness matrix. The number of neighboring DOFs determined number of non-zero entries of global stiffness matrix. Determination of column and row indexes associative with each DOF was done by first kernel, and stored in 1D array. Second kernel assembled contribution on the basis of indexes at thier particular DOFs. Then, entries of local mtrix were mapped to entries of global matrix. The aforementioned approach was modification of addTo apporach presented by Markall et al. [17].
Cecka et al. [10], [18] published that there were three available approaches for assembly of global matrix on GPU.
• Assembly by non-zero entry in global matrix row • Assembly by row of matrix • Assembly by element The choice among the above mentioned approaches was dependent on multiple factors like application, order of shape functions and elemental matrix size. They tested the performance of these approaches on many-core GPUs. The authors finally developed two ways for assembly: Assembly by global non-zero entries and assembly by shared non-zero entries. For first approach kernel comprises of two phases. Each thread performed computation for each element and accumulate results in global memory at non-conflict spaces in first part of kernel. In second part, accumulation of each element's contribution was done. For 2nd approach, shared memory was used for storage of data to avoid global memory access latency. This approach was limitized due to shared memory limited size. The approach which include element-wise assembling, each thread performed computations of an element required large number of atomic operation to avoid race condition.
Meng et al. [19] introduced Edge-wise assembly approach to optimize atomic operations. For efficient and faster memory access, each thread assembled row in shared memory. Computation of each thread was independent, thus it resulted in minimization of overheads due to serialization of global access. The number of non-zero entries of global matrix in each row were not guaranteed to be constant. Therefore, if there was a huge difference then it may lead to unbalanced workload distribution among threads. Furthermore, the parallelization pattern for solver phase of FEM was determined by storage pattern of global matrix.
Few more approaches for assembly of matrix were investigated by Reguly and Giles [20], which were assembly by local matrix and matrix free. The first approach directly implemented assembly in sparse matrix storage, while second approach did not include any assembling process. Matrix free approach performed computation in each iteration of solver to compute local matrix. This strategy resulted in optimization of memory requirements for local or global matrix storage, by including some extra calculations in each iteration which was not issue for smaller matrices in many core GPUs. Another author Dziekonski et al. [21]- [23] introduced multiple GPU based assembly for direct sparse matrix. In this approach, first assembling was done in coordinate (COO) storage pattern and finally conversion of COO into CSR was done on many-core GPUs. The limited GPU memory size resulted in limitization of scalability of assembling and numerical integration process for larger matrices. To mitigate this issue, the author in [21] has developed an iterative strategy, which included partitioning of memory in a way to avoid overflow of GPU memory. Chunk of data was provided to each GPU and they did calculation independently. After completing their calculations, final results were accumulated on host memory.
In literature, lot of work has been done by focusing sparse matrix by vector multiplication like, Bolz et al. [8] published that matrix by vector product is most important part of FEM codes. Bell and Garland in [24], [25] also performed parallel sparse matrix by vector multiplication in parallel on many-core GPUs.
The stuides on optimization of sparse matrix by vector multiplication on GPU first performed in 2003 by Bolz et al. [8]. Bell and Garland [24] introduced GPU based parallel implementations for various storage formats like CSR, CCS, COO, HYBrid and ELL etc. Baskaran and Bordawekar [26] proposed four various approaches for optimization of Sparse matrix vector multiplications for different GPUs.
• without synchronization parallelism • optimzation of mapping of thread on basis of connection with optimized memory access • optimize global memory access • data reuse The algorithms for SpMV using the above mentioned sparse matrix storage fromats were already implemented in NVIDIA's library, Cusparse [27]. All optimization VOLUME 8, 2020 approaches that were proposed recently were compared with implementations of Cusparse based SpMV. It was important to have basic knowledge of these simple storage approaches for understanding of recently proposed SpMV optimization approaches. Li and Saad [28] reviewed implementations of GPU based SpMV multiplication associated with iterative solvers in an efficient way and also discussed preprocessing approaches.
Reguly and Giles [20] elaborated COO, CSR and ELLpack etc. The Coordinate (COO) storage pattern was simplest one for construction point of view, as it kept record of row and column index of each non-zero entry of sparse matrix. For parallel multiplication on many core GPU, each thread performed multiplication of one non-zero entry of matrix by a vector entry. This format resulted in excessive global memory accesses by each thread for 3 different arrays in arbitrary fashion. The GPU based SpMV implementations using COO required a barrier to avoid conflicts while updating resultant vector in parallel. That's why, COO was not used in practice, sometimes it was implemented as moderator between FEM assembly phase and construction of other storage formatts that were required for iterative solvers [29]. The CSR storge pattern optimzed overheads of global memory access by minimizing row vector access. Bell and Garland [24] proposed two CSR versions: scalar and vector CSR. First version utilized one thread for multiplication of each row of matrix by vector. While second version used wrap of threads to perform this operation. Scalar CSR results in non-coalesced memory access of vector, whereas vector CSR overcome this issue.
Margaris et al. [30] considered two approaches: row-wise and column-wise for distribution of workload among threads in parallel jacobi solver implementation. They concluded that row-wise distribution performed better than column-wise distribution. Majority of previous work have been focused on parallelizing Jacobi iterative solver on GPU for solution of dense linear system of equations. Zhang et al. [31] proposed single on-device algorithm. They launched 256 threads per block as shared memory is limited in size, so total 30 × 256 = 7, 680 threads can be launched at a time and made X shared in block. They considered 2 CPU intel Xeon X5482 that consists of 4 cores and 3.2 GHz speed, and GPU NIVIDIA Tesla C1060 that contains 30 SMs for experimentation. They concluded that with this algorithm they achieved 59 times speedup for floating point and 19 times speed up for double precision computations than single threaded CPU. Wang et al. [32] proposed CUDA based algorithm. They stored x k and x k+1 in shared memory to optimize global memory access latency. They considered maximum threads per block 512 in GPU. The matrix can be divided into several rows when dimension is less than 512 and into 2D blocks when dimension is more than 512. They considered CPU pentium Dual core E7200 @2.66 GHz frequency, 2GB memory and 30 GFlops computing power as well as GPU NIVIDIA GTX280 that contains 240 cores for experimentation. They concluded that with this algorithm they achieved 1.5-3 times speedup as compared to [33].
Lin and Chen [34] proposed two hybrid algorithms and named them as primary and improved algorithm. The first algorithm was comprised of typical hybrid implementation of jacobi while second algorithm was proposed to optimize data movement from GPU to CPU in each iteration and achieved maximum speedup of factor 10.2 in later one. They concluded that when matrix size increases speedup also increases which shows that this particular algorithm has best scalability.
Torun et al. [35] proposed GPU based jacobi algorithm for eigen analysis problem. They implemented matrix by matrix multiplication instead of matrix by vector in jacobi solver and considered coalesced memory access for global memory transaction. They proposed four techniques: Traditional Access Method, Symmetric Access Method, Maximum Coalesced Access Method and One Step Parallel Jacobi (OSPJ) method on the basis of memory access patterns and concluded that OSPJ is best than other three and achieved good speedups. For calculation of reachability probability, in [36] sequential and parallel GPU based jacobi solver were compared. By this comparison it is found that Jacobi solver is best for sparse matrices and BiCGStab is best for denser ones.
Ahamed and Magoulès [37] proposed first GPU based algorithm of jacobi solver for solution of sparse linear system of equations. they performed experimentations on 3D problems like 3D laplace equation and 3D heat equation etc. and achieved 23 times speedups with NVIDIA GPU Tesla K20c.

III. GPU FINITE ELEMENT SOLVER
In most of the documented previous work, it is observed that they had focused: • on solving dense matrix. • with a linear consecutive data structures. • only on mathematics of jacobi solver as they do not add algorithms that consider on-device synchronization and communication methods.
In this work, sparse linear system is considered while the current data structure is not compatible. A is converted into CSR where three vectors one for row pointer, column id and values, as shown in Figure 6. We have proposed multiple implementations of GPU based jacobi solver by considering on-device communication and synchronization methods and compared their results.

IV. GPGPU BASED PARALLEL JACOBI SOLVER FORMULATION
In Jacobi Iterative Solver, the computation of solution for linear system of equation is an embarrassingly parallel part. Therefore, GPU is used to offload this stage of FES, while remaining stages are executed on shared memory architecture in parallel. The main objective is to compute solution in minimum amount of time that reduces overall application time.   • Initialization of data in memory allocated in CPU.
• Data has to be copied from host to device using built-in function CudaMemCpy() with parameter CudaMem-CpyHostToDevice to specify direction for copy.
• Host thread invokes kernel, which transfer control from CPU to GPU and also specify grid size and block size required for program.
• Data has to be copied back from device to host using built in method CudaMemCpy() with parameter to specify direction of copy CudaMemCpyDeviceToHost.
• The device memory has to be released using built-in method CudaFree(). Writing programs in CUDA and build up environment is relatively simple task. But it requires deep knowledge of architecture details and knowledge of writing codes in parallel. The key part of CUDA programming is kernel calls, where programmers must determine data parallelism required by code. First, we must decide how workload has to be distributed among blocks of threads. In the beginning, we opted block-partition for distribution of tasks among threads but it results in performance loss. As the memory accesses to global memory must be coalesced and to shared memory must be conflict free. Otherwise, the performance of memory will bounds kernel. Therefore, cyclic partition is preferred over block in GPU as it gives coalesced memory access. Threads with consecutive thread ids access contiguous memory locations as can be seen in Figure 8. The partitioning of given workload into an appropriate number of threads is the point which defines successful parallel program.
Second, we must decide how to use hardware resources available to us which gives better performance. There are three levels of memory in latest GPUs: local, shared and global detail of which is discussed above. We optimized shared memory accesses due to its limited size and bank conflicts. We preferred to store and access data from global memory as it is accessible to all threads and has large size like 6 GB in GTX 1060 and 8 GB in Quadro P4000.
Example of basic CUDA program is shown in Figure 8. The HybridThree() is the entry point of kernel, a function written in CUDA C represented by __global__ identifier. The HybridThree <<< BPG, TPB >>> () is kernel invocation by host, which will enumerates BPG number of blocks with TPB threads each. These values can be set for each kernel call. An explicit barrier is required between kernel launches, to guarantee that all threads of first kernel have finish their work before second kernel launch. When kernel is return c 10: end procedure launched on GPU, the hardware scheduler enumerates blocks of threads and maps them to available multiprocessors, and continues to do it until required resources are available.
While implementing Jacobi solver in parallel on CPU there were two problems needs to be addressed. The data validity issue of solution vector and the race condition may occur while calculating normals of residual and solution vector. The major bottleneck in GPU based implementation is reduction of norms of residual and solution vector. As on GPU block-level as well as grid level communication methods are needed. Table 1 shows synchronization and communication characteristics of hybrid implementations of Jacobi solver. The details of these implementations are provided in later sections. There are two algorithms used to distribute number of rows among the blocks and among the grid.

A. HYBRID SOLVERS
In all algorithms device code, host code, CUDA provided methods and synchronization points are written in blue, black, green and red color respectively.

1) SOLVER ONE
The Hybrid Solver one is proposed for device of compute capability 2.1 or more which has compatibility with CUDA v 6.5. This CUDA version was lacked of any on-device global synchronization method. This solver is proposed without any synchronization and communication method, as can be seen in first row of end for 12: end procedure 13: /* -Jacobi Solver for Host -*/ 14: CudaMemCpy (A, b, c) host to device 15: while (k<totaliterations) && (error > 10E-6) do 16: SpMV <<< BPG, TPB >>> (js) 17: DeviceBarrier() 18 end while synchronization point was required, the computations were switched from device to host.
• Figure 9 demonstrates execution flow of solver on CPU as well as on GPU, detail of which is given below.
• Data validity issue: We proposed two separate kernels, first for matrix by vector multiplication as can be seen in Algorithm 1 and second for computation of residual vector, solution vector and two vectors of size DOFs for normals of residual and solution vector as shown in Algorithm 2.
• Race condition issue: To avoid race condition which can occur while computing scalar values of residual and solution vector, a global communication method was needed. As this solver has no support of global communication method. Thus, computations were switched from device to host with vectors of size total number of unknowns, which are named as DOFs.
• Finally, error is calculated for decision of next iteration.
• Device synchronization issue: The device synchronization points are needed after each kernel invocation to guaranteed that all threads have finished their work before another kernel launch. These synchronization points are represented in brick red color in Figure 9 as well as on line 17 and 20 of Algorithm 2.
• The main issue in this particular solver is large number of memory copying from device to host in each iteration.

2) SOLVER TWO
The Hybrid Solver 2 has optimized memory copying issue which is discussed above. The approach opted in this solver has optimized number of memory copying from DOFs to Total blocks count as can be seen in Algorithm 3 and 4.
• Second row of Table 1 shows that block-level communication and synchronization methods are opted in this particular solver.
• Figure 10 demonstrates execution flow of solver on CPU as well as on GPU, detail of which is given below.
• Data validity issue: is resolved as discussed in Section IV-A1.

end while
The implementation detail of butterfly communication method can be seen on page number 107 of [38]. * Synchronization: After block-level reduction, syncthread() method is used for block-level synchronization point, where all threads of each block have to wait, to guaranteed that all blocks have updated shared value. These synchronization points can be seen in Figure 10 and on line 13 and 19 of Algorithm 3 in brick red color.  As this solver has no support of global communication method, thus, computations were switched from device to host with vectors of size total number of blocks.
• Finally, error is calculated for decision of next iteration.. • Device synchronization: is done as discussed in Section IV-A1.

3) SOLVER THREE
The Hybrid Solver 3 is proposed for device of compute capability 6.1 or more as well as for CUDA v9.1 on windows. As NVIDIA introduced block-level communication method in this CUDA version. The basic implementation pattern of Hybrid Solver 2 and 3 is similar, the only difference is in calculation of reduced values of normals of residual and solution vector.
• Third row of Table 1 comprises implementation details of solver three, which shows that both block-level communication and synchronization methods are opted in this particular solver.
• Figure 11 demonstrates execution flow of solver on CPU as well as on GPU, detail of which is given below. As this solver has no support of global communication method, thus, computations were switched from device to host with vectors of size total number of blocks.
• Finally, error is calculated for decision of next iteration.. • Device synchronization: is done as discussed in Section IV-A1.

CudaMemCpy(A, b, c)
• Fourth row of Table 1 comprises implementation details of solver four, which shows that gird-level communication method as well as block-level communication and synchronization methods are opted in this particular solver.
• Figure 12 demonstrates execution flow of solver on CPU as well as on GPU.
• Data validity issue: is resolved as discussed in Section IV-A1.

CudaMemCpy(A, b, c)
-Grid-level * Reduction: CUDA provided method atomic_Add() is used by thread 0 of each block, which can be seen on line 23 and 24 in green color. VOLUME 8, 2020 In this way, scalar values of norms of residual and solution vector are computed. * Synchronization: After grid level reduction, a global synchronization point is needed where all blocks of grid have to wait, to guaranteed that all blocks have updated global value. As the current CUDA version has no such synchronization point, that's why scalar values are copied back to host. Then, square root of scalar values of norms of residual and solution vector is taken by host thread for calculation of error and for decision of next iteration.
• Device synchronization: is done as discussed in Section IV-A1.

5) SOLVER FIVE
The basic implementation pattern of Hybrid Solver 3 and 5 is similar, as in both Jacobi solver is implemented by invoking two separate kernels as discussed above and shown in Algorithm 6. The only difference is in computation of scalar values of norms of residual and solution vector.
• Fifth row of Table 1 comprises implementation details of solver five, which shows that gird-level communication method as well as block-level communication and synchronization methods are opted in this particular solver.
• Figure 13 demonstrates execution flow of solver on CPU as well as on GPU.
• Data validity issue: is resolved as discussed in Section IV-A1.
• Race condition issue: -Block-level • Device synchronization: is done as discussed in Section IV-A1.

6) SOLVER SIX
The Hybrid Solver 6 is also proposed for device of compute capability 6.1 or more and for CUDA v9.1 or more.
• Sixth row of Table 1 comprises implementation details of solver six, which shows that only gird-level communication method is opted in this particular solver.

end while
Then, square root of scalar values of norms of residual and solution vector is taken by host thread for calculation of error and for decision of next iteration. VOLUME 8, 2020 • Device synchronization: is done as discussed in Section IV-A1.

B. FULL GPU SOLVERS
The Hybrid versions are proposed for devices which are lack of cudaLaunchCooperativeKernel support. As again and again switching from device to host and vice versa is not good practice. Thus, the implementation of single Jacobi Iterative solver method on device required grid-level synchronization points. We proposed global lock-based synchronization points as discussed in [39], [40]. In lock-based synchronization point global variable mutex is used to count number of blocks that reach the synchronization point, when block finish its computation one of its thread will add 1 to mutex through CUDA provided atomicADD(). Then all threads will check value of mutex with goal value, if mutex is equal to goal value then all threads can continue further computations. We found after performing several experiments with variety of test data, that it is not efficiently works for large meshes. Therefore, we decided to opt CUDA provided global synchronization points which are available only when cudaLaunchCooperativeKernel support is enabled. In windows, the default working mode is WDDM, TCC mode is required for enabled cudaLaunchCooperativeKernel support. In ubuntu, the working mode of device is TCC, which enabled cudaLaunchCooperativeKernel support.
It is necessary for using runtime cudaLaunchCoopera-tiveKernel API to ensure the co-residency of blocks of threads on GPU, grid-size needs to be consider carefully.
Multiple Full GPU versions are proposed for devices which has support of cudaLaunchCooperativeKernel. This support enables the implementation of Jacobi solver in a way that once CPU thread invokes kernel it will exits from device only when convergence criterion is satisfied. Table 2 shows implementation detail of Full GPU Solvers by checking those columns that contain synchronization and communication methods, which are opted by particular solver.

1) SOLVER SEVEN
The Solver seven is modification of solver version 4. The only difference is instead of invoking two separate kernels, it invokes single kernel, which is cudaLaunchCooperativeKernel.
• First row of Table 2 comprises implementation details of solver seven, which shows that gird-level and block-level communication as well as synchronization methods are opted in this particular solver.
• Figure 15 demonstrates execution flow of solver on CPU as well as on GPU.
• Data validity issue: To solve issue of data validity, instead of separate kernel invocation, this particular solver has used temporary global scoped vector to store updated values of solution vector. At the end of each iteration the solver has copied temporary vector into solution vector which will be used in matrix by vector multiplication of next iteration to ensure validity of solution vector in each iteration.   24: syncThreads() 25: s ← s * 2 26: end for 27: if tid == 0 then 28: atomicAdd(normRes, blkRes[tid]) 29: atomicAdd(normDes, blkDist[tid]) 30: end if 31: syncBlocks() 32: error ← normRes normDist 33: end while 34: end procedure

2) SOLVER EIGHT
The Solver eight is modification of Hybrid version five, the only difference is instead of invoking two separate kernels with continuous switching between host and device it invokes single kernel, which is cudaLaunchCooperativeKernel.
• Second row of Table 2 comprises implementation details of solver eight, which shows that gird-level and block-level communication as well as synchronization methods are opted in this particular solver.
• Figure 16 demonstrates execution flow of solver on CPU as well as on GPU.
• Data validity issue: is resolved as discussed in Section IV-B1.
• Third row of Table 2 comprises implementation details of solver nine, which shows that only gird-level communication as well as synchronization methods are opted in this particular solver.
• Figure 17 demonstrates execution flow of solver on CPU as well as on GPU.
• Data validity issue: is resolved as discussed in Section IV-B1.

V. EXPERIMENTS AND RESULTS
Experiments were performed for evaluation of performance by using data set sq6 to sq9, detail of which can be seen on left side of Table 4. The sequential timing was taken on CPU Intel Core-i7 8750h and timing of parallel implementation was taken on two NVIDIA GPUs, hardware detail of which is given on right side of Table 4. Red color is used to highlight best speedup obtained for each mesh using different Jacobi implementations in case of GTX 1060 as well as in Quadro P4000 in Table 3 and 5. Table 3 demonstrates the speedup of on-device Jacobi solvers with multiple meshes on GPUs. The block workload distribution approach was used for larger meshes. These results showed that block partition was not suitable on GPU architecture, the main reason of poor speedups was non-coalesced memory accesses while fetching data from global memory space. For this reason we preferred to use cyclic partition over block in GPU implementation. Table 5 demonstrates the speedup of on-device Jacobi solvers with multiple meshes on both GPUs. The cyclic workload distribution approach is used for larger meshes. Figure 18 demonstrates the speedup of all solvers for meshes sq6, sq7, sq8 and sq9 on two GPUs named as GeForce GTX 1060 and Quadro P4000. The approach for workload distribution in all solvers is block partitioning. As we can see these solvers perform poor with larger meshes as compared to small. The overall speedup of solver 2 and 4 is equal and better as compared to others. Figure 19 demonstrates the speedups of all solvers for meshes sq6, sq7 and sq8 on two GPUs named as GeForce GTX 1060 and Quadro P4000. The approach for workload distribution in all solvers is Thread-level parallelism i.e number of threads are equal to total DOFs. The overall speedups of solver 2 and 4 are better as compared to others. On the   contrary, solver 3 and 5 did not perform good as these implemented block-level CUDA provided communication method which results in performance loss.

A. CONCLUSION
The study of FEM for shared memory has identified that Jacobi iterative solver is most time consuming part of software. Thus, we offload this expansive part on GPGPU, as it is well suited for massively parallel codes. We implemented multiple designs for GPU based Jacobi iterative solver to meet synchronization and communication requirements. For this purpose, we proposed nine solvers that varies from each other on the basis of opted synchronization and communication approaches.
The synchronization and communication approaches were opted for block and grid-level thread-safe reduction. The comparison of grid and block-level synchronization and communication methods opted by solvers can be seen in Table 2. After comparing each solver on the basis of speedup we came up at the point that Solver 2 and 4 were best as these achieved speedup of 46 on GTX 1060 and 60 on Quadro P4000 as compared to others.
We have compared CUDA synchronization methods with our CUDA implementation of Butterfly synchronization method. Where solvers 2, 4 and 7 were implemented using Butterfly algorithm, while solvers 3,5 and 8 were implemented using CUDA provided synchronization method like atomicAdd_block, atomicAdd and grid synchronization. All the experiments had shown that our Butterfly synchronization technique had outperformed CUDA provided synchronization method as can be seen in Table 5.
Even though, solver 2 and 4 were differ in terms of grid-level synchronization methods. As in solver 2, where grid-level synchronization was required we switched from device with vector of size blocks and did remaining computation on host to guaranteed thread-safe reduction. On the contrary, solver 4 had implemented on-device grid-level synchronization methods for thread safe reduction on device and switched from device with only two scalar values in each iteration. We observed that this implementation difference had no significant impact on performance of these solvers.

B. FUTURE WORK
The above mentioned work has more dimensions like study of dynamic thread safe sparse matrix data structures. During this work, due to lack of time we have not considered global memory access latencies and texture memories etc. The current work has to be studied for more dimensional meshes, as the dimension of mesh and the type of problem affects the structure of sparse matrix. Furthermore, a work need to be done for advance iterative solvers such as conjugate gradient. In this implementation, the solver part is implemented on GPUs but it will be better if whole application can be ported on many-core GPU.
GPUs are growing very rapidly, as new GPUs are being introduced every few months or so. The physical design of these GPUs are changing, so in future work more advanced GPUs will be considered.