Performance Portable Batched Sparse Linear Solvers

Solving a large number of small linear systems is increasingly becoming a bottleneck in computational science applications. While dense linear solvers for such systems have been studied before, batched sparse linear solvers are only starting to emerge. In this paper, we discuss algorithms for solving batched sparse linear systems and their implementation in the Kokkos Kernels library. The new algorithms are performance portable and map well to the hierarchical parallelism available in modern accelerator architectures. The sparse matrix-vector product (SPMV) kernel is the main performance bottleneck of the Krylov solvers we implement in this work; the implementation of the batched SPMV and its performance are therefore discussed thoroughly in this paper. The implemented kernels are tested on different Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures. We also develop batched Conjugate Gradient (CG) and batched Generalized Minimum Residual (GMRES) solvers using the batched SPMV. Our proposed solver was able to solve 20,000 sparse linear systems on V100 GPUs with mean speedups of 76x over a parallel sparse solver applied to a block diagonal system assembled from all the small linear systems, and 924x over solving the small systems one at a time. We see a mean speedup of 0.51x compared to the dense batched solver of cuSOLVER on the V100, while using far less memory. A thorough performance evaluation on three different architectures and an analysis of the performance are presented.


I. INTRODUCTION
Solving a large number of small or medium size, independent, sparse linear systems at the same time is starting to emerge as an algorithmic need in several domains, such as multilevel finite element methods, modeling collisions in plasma [1], and reentry simulations [2]. For example, a multilevel finite element method (FE 2 ) requires a microscopic finite element computation for each integration point of the macroscopic scale mesh [3]. Those finite element computations usually share the same mesh and lead to sparse matrices that share the same sparsity pattern and can be solved independently. Other examples of batched systems can be found in multiphysics or chemical simulations where several degrees of freedom (such as species mass fractions) are associated with each node of the mesh, as in combustion simulation [4]. The introduction of these large numbers of small and independent linear systems into computational codes is due to the reformulation of algorithms to exploit the hierarchical parallelism available in modern Graphics Processing Units (GPUs). There are some problem domains where the many small linear systems are fully dense [2]. One can solve these dense linear systems effectively using modern manycore CPUs and GPUs [5]; in fact, this area of research has been the focus of the community for the past several years. Libraries such as Kokkos Kernels [6], MAGMA [7], and cuSOLVER provide implementations of such solvers. However, several recent formulations result in these small systems themselves being sparse. This is the focus of this paper.
When solving a large number of small or medium size, independent, sparse linear systems at the same time, applications have four choices: (i) solve the systems with a parallel sparse linear solver one at a time; (ii) convert the small sparse systems to dense linear systems and use available batched dense solvers; (iii) assemble a very large block linear system where the small linear systems are placed on the diagonal, and use a parallel sparse linear solver; (iv) develop a batched sparse linear solver. Fig. 1 compares options (i)-(iv) for 20,000 linear systems of varying sizes. Option (i) uses no additional memory but is extremely inefficient on a GPU-like architecture, as many kernels are launched sequentially and each kernel is too small to utilize the available performance of the GPU. While option (ii) (cuSOLVER batched dense in this example) is fast, it is inefficient in terms of memory usage; as the size of the linear systems grows, it runs out of memory. Option (iii), on the other hand, uses a constant factor more memory (one copy of the small systems into a large system) but cannot perfectly exploit the natural parallelism available in this problem. This is represented by the cuSOLVER sparse block QR in Fig. 1. Finally, option (iv) has a small memory footprint and can use the hierarchical parallelism available on GPUs to solve independent systems. We used two codes, cuSOLVER sparse QR and batched GMRES, to evaluate this. The application of the sparse QR solver to the block diagonal system starts to fail for more than 270 rows per block with 20,000 blocks, i.e. for more than 5.4 million rows in total. The batched GMRES was able to solve these with no additional memory overhead. The performance difference between options (i) and (iv) can be up to 540x. These results motivate us to develop a set of batched sparse linear solvers that can be used on CPUs and GPUs.
In addition to developing batched algorithms to solve independent sparse linear systems, we also require our implementation to be performance portable. We would like to develop a hierarchical parallel algorithm, implement it once using C++, and evaluate it on several GPUs and CPUs. Our interest in performance portability arises from the diversity of computer architectures that are of interest to our target users. At the least, we would like to support Intel CPUs, and AMD and NVIDIA GPUs.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Fig. 1. Wall-clock time of the cuSOLVER sparse QR solver applied to all the systems sequentially (i) and to the blocked system formed by gathering the batched system (iii), of the cuSOLVER batched dense solver (ii), and of the Kokkos Kernels batched GMRES with a left Jacobi preconditioner and a classical Gram-Schmidt orthogonalization process (iv), used to solve a batched system of 20,000 randomly generated, symmetric positive definite matrices on a V100 NVIDIA GPU. The matrices are symmetric positive definite as this is a requirement for the cusolverDnDpotrfBatched batched dense solver. The sparsity pattern is that of a 2D Laplacian discretized using FEM. The batched dense solver cannot be used for 190 rows or more because it was impossible to allocate enough memory to store the batched matrix in a dense format. The application of the sparse QR to the block diagonal system starts to fail for 270 rows or more. For larger numbers of rows such as 300, the batched GMRES has a speedup of about 540x compared to the only other option that works (the cuSOLVER sparse QR). Our proposed solver (iv) was on average 76x and 924x faster than approach (iii) and approach (i), respectively. We see a mean speedup of 0.51x compared to approach (ii) while using far less memory.
We achieve this by using the Kokkos library [8] to implement our hierarchical parallel algorithms. The performance of our batched sparse linear solvers depends primarily on the performance of our batched sparse matrix-vector multiply kernel. As a result, we devote considerable effort to improving the performance of the batched sparse matrix-vector multiply, including choosing the right parameters for the kernel implementation using auto-tuning approaches.
To summarize, the primary contributions of this work are:
- Performance comparisons on CPUs and GPUs that demonstrate that the solver achieves state-of-the-art performance.
We are aware of parallel work to develop batched sparse linear solvers [9], [10]. We have developed these methods as part of the same project targeting similar applications. The solvers developed here are already integrated into the Trilinos and PETSc libraries. The latter uses these solvers for solving the Landau collision operator [1]. We will not focus on these application use cases in this paper and instead focus on the algorithmic and performance aspects.
The paper is organized as follows: we introduce the notation used in the rest of the paper in Section II. In Section III, we recall some key concepts of the Kokkos programming model which will be heavily used in the rest of the paper. In Section IV, we introduce the batched linear system and review the three existing strategies to solve it using Krylov methods. In Section V, we discuss and analyze the implementation of a batched SPMV. In Sections VI and VII, we discuss the use of the batched SPMV within batched Krylov methods to solve batched sparse linear systems. Finally, in Section VIII, we provide some conclusions and future work.

II. NOTATIONS
In the rest of this paper, we use the notation of Kolda et al. [11], which we restate concisely here. The number of dimensions of a tensor is called its order. Vectors (tensors of order one) are denoted by boldface lowercase letters such as y, matrices (tensors of order two) by boldface uppercase letters such as Y, and higher-order tensors by boldface Euler script letters such as Y. The i-th entry of y is denoted by y_i, the (i, j)-th entry of Y by y_ij, and the (i, j, k)-th entry of a third-order tensor Y by y_ijk. A colon is used to indicate all elements of a dimension. A colon within a parenthesis is used to indicate all elements within a range of one dimension; for example, the i-th to j-th entries of y are denoted by y_(i:j), which is still a tensor of order one and therefore takes a boldface lowercase letter. A fiber is obtained by fixing every index but one. The j-th column of Y is the fiber denoted by y_:j, and the i-th row of Y is the fiber denoted by y_i:. Third-order tensors have tube, column, and row fibers, denoted for Y by y_:jk, y_i:k, and y_ij:, respectively. Slices are obtained by fixing all but two indices. Third-order tensors have frontal, horizontal, and lateral slices, denoted for Y by Y_i::, Y_:j:, and Y_::k, respectively. Finally, with each third-order tensor A made of N frontal slices of size n × n, we associate a linear operator A that applies each frontal slice to a corresponding input vector. For the rest of the paper, the number of frontal slices N will be denoted as the batch size, and the number of column and row fibers per frontal slice will be denoted as the number of rows/columns.

III. KOKKOS
We will now recall the key concepts of Kokkos [8] which will be extensively used in this paper. Kokkos is a C++ performance portability library which enables single source performance portable codes by providing a portable programming model for shared-memory parallelism.
The first key concept is the Kokkos::View, a data abstraction for arrays of zero or more dimensions. This View is templated on the type of the entries, the layout of the underlying memory, and the device type, which is a combination of the native programming model (OpenMP, CUDA) and where the View is to be allocated (CPU, GPU). For example, on an architecture that has both a CPU and a GPU, users can choose whether to allocate a view in GPU memory or in host memory. If the user allocates a view on the GPU and wants to set every entry to zero, the execution space will be used to deduce the backend used to loop over those entries, such as CUDA, HIP, SYCL, or OpenMP. Kokkos provides tools to copy a Kokkos::View between host and device. The layout of the underlying memory can be chosen by the user: Kokkos::LayoutLeft as in Fortran (column-major), Kokkos::LayoutRight as in C++ (row-major), or Kokkos::LayoutStride.
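The practical difference between the two main layouts can be sketched outside of Kokkos as the linear index each layout computes for an entry (i, j) of a 2D array; the helper names below are ours, not part of the Kokkos API:

```cpp
#include <cassert>
#include <cstddef>

// Linearize a 2D index (i, j) of an n_rows x n_cols array in each layout.
// LayoutRight: entries of a row are contiguous (row-major, as in C++).
std::size_t layout_right(std::size_t i, std::size_t j,
                         std::size_t /*n_rows*/, std::size_t n_cols) {
  return i * n_cols + j;
}
// LayoutLeft: entries of a column are contiguous (column-major, as in Fortran).
std::size_t layout_left(std::size_t i, std::size_t j,
                        std::size_t n_rows, std::size_t /*n_cols*/) {
  return j * n_rows + i;
}
```

With LayoutRight, stepping j by one moves to the adjacent memory address; with LayoutLeft, stepping i by one does. This is why the layout choice determines which loop dimension produces coalesced accesses in the kernels discussed later.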
In addition to the abstractions for running a code on multiple architectures such as CPUs and GPUs, Kokkos also provides abstractions for three levels of hierarchical parallelism: the team level, the thread level, and the vector level. A Kokkos thread team is a collection of threads that can synchronize among themselves and that share a common memory level within the memory hierarchy. Kokkos uses an abstraction called scratch_memory_space to allocate temporary views accessible to all threads inside this common memory level. The actual mapping of scratch_memory_space to a specific memory is hardware specific and determined by Kokkos. The maximal number of teams is not architecture dependent; it is only limited by the integer size type. However, the maximal team size, i.e. the number of threads within a team, is architecture dependent. Finally, the vector level requires the code to be vectorizable. These three levels allow applications to express a hierarchical parallel algorithm within a CPU/GPU in a portable way.
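As a rough serial analogue (without Kokkos; all names are ours), the three levels can be pictured as three nested loops, where on a GPU the outer loop would map to teams (thread blocks), the middle loop to threads within a team, and the inner loop to vector lanes:

```cpp
#include <cassert>
#include <vector>

// Serial sketch of the three parallelism levels. Each level of nesting in
// the data corresponds to one level of parallelism in the real kernel.
double hierarchical_sum(
    const std::vector<std::vector<std::vector<double>>>& data) {
  double total = 0.0;
  for (const auto& team_chunk : data)            // team level: independent work items
    for (const auto& thread_chunk : team_chunk)  // thread level: e.g. rows
      for (double v : thread_chunk)              // vector level: contiguous entries
        total += v;
  return total;
}
```

In Kokkos the three loops would be expressed with a TeamPolicy, a TeamThreadRange, and a ThreadVectorRange, respectively, and the reduction would be performed with parallel_reduce rather than a serial accumulation.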

IV. BATCHED KRYLOV METHODS
A batched linear system is a collection of N independent linear systems, which may or may not share some properties, where N is the batch size, i.e. the number of matrices A_1::, ..., A_N::. In this paper, we assume that the matrices A_1::, ..., A_N:: are sparse square matrices, all of which have the same number of rows n and the same sparsity pattern, as illustrated in Fig. 2. However, as discussed later, the implemented method is general enough to handle cases where the sparsity pattern and the number of rows are the same only for slices associated with a given subset and can differ between subsets. Alternatively, using (1), the batched linear system can be written in the form (3). There are currently three known strategies to solve the batched linear system (3) using Krylov methods.

Algorithm 1: Parallelization Over Individual Problems.
The first approach is to parallelize over individual problems, as illustrated in Algorithm 1. A team parallel for is used to loop over the N slices. Every iteration of the loop solves one system using the chosen Krylov method. The team-based approach can use a second level of parallelism within the Krylov method (not shown in Algorithm 1). Every system converges independently; there is no coupling between two different systems of the batched linear system. However, both vectorization and coalesced memory accesses in the SPMV kernel are graph or sparsity pattern dependent. This is the approach followed by Ginkgo [9]. It relies on the assumption that the number of non-zero entries is small enough that the common data associated with the graph of the sparse matrices can be stored in cache and reused between the different solves in the batch.
We consider both of the remaining two strategies as subset strategies. Instead of iterating over one system at a time, the outermost parallel for loop iterates over subsets of m systems at a time, as illustrated in Fig. 3 and in Algorithm 2; every iteration of the outermost parallel for loop has to solve a subset of systems. The two subset strategies differ in how they solve the subset of systems at the team level. These approaches allow the reuse of common variables and improve both data parallelism and memory access patterns. These strategies are deeply connected to the method of ensemble propagation in sampling-based uncertainty quantification [12]. These approaches do not require that all the systems of the batched linear system have the same number of rows and the same sparsity pattern; instead, they require that all the systems within the same subset share the same number of rows and the same sparsity pattern.
If users want to gather systems that have slightly different numbers of rows or sparsity patterns within a subset, it is still possible to compute a new sparsity pattern that is the union of the sparsity patterns of all problems within the subset and use that pattern. This increases the size of the allocated memory and the number of required operations, but it could be beneficial in some cases. We do not cover this use case in this paper.

Algorithm 2: Parallelization Over Subset Problems.
The first subset approach is to solve a coupled problem at every iteration of the outermost parallel for loop. Mathematically, the coupled problem can be seen as the gathering of the matrices associated with the current subset into a block diagonal matrix, followed by the application of the Krylov method to the gathered block diagonal system. In the field of ensemble propagation, this approach is known as ensemble reduction [13]. The convergence of this approach depends on the union of the spectra of all the matrices; in the best-case scenario, the union of the spectra is no worse than the worst spectrum among the current subset.
The second subset approach is to solve the problems inside a given subset independently. The systems are kept mathematically independent, and the Krylov method is applied to all those systems at the same time. For a given subset, the Krylov method iterates until the convergence of the last system of the subset. In the field of ensemble propagation, this approach is known as ensemble-typed dot product or without ensemble reduction [13]. This approach requires every kernel (e.g., SPMV, dot product) to be a team-level kernel that supports subsets of values instead of a single value, so that the Krylov solver can be implemented using them. In CUDA terminology, every kernel needs to be callable from the device. The main drawback of this approach is code divergence: the Krylov methods might require different numbers of iterations for different systems to converge within the same subset. This can lead to issues such as overflow if not treated carefully, as discussed in [13].
In the rest of this paper, we will restrict ourselves to the last approach. In particular, we will restrict ourselves to the implementation of the team-level kernels that support subsets of values and to how to combine them to implement a batched Krylov method at the subset level. This batched Krylov method can be called from a parallel for so that multiple teams can solve independent subsets, as shown in Algorithm 2. For more details on the code divergence in the GMRES, we refer to [13]. Moreover, in the rest of this paper, if not stated otherwise, we only discuss the implementation at the team level and therefore, to reduce the verbosity of the formulas, we use the following notation shortcut: A := A_(ℓ1:ℓ2)::, where ℓ1 and ℓ2 are as in Algorithm 2.

V. BATCHED SPARSE MATRIX VECTOR PRODUCT
Now that the batched linear system has been introduced and the strategies for solving it have been discussed, we discuss the batched SPMV and its implementation in support of the subset strategy. The batched team-level SPMV is illustrated in Fig. 4. Note the use of m systems, as we are focused on the team-level SPMV. Note also that Fig. 4 only shows the SPMV portion of (4). This can easily be generalized to an SPMV that handles the entire batch of size N by allocating m systems to each of N/m teams.
In this paper, we restrict ourselves to the case where the matrices in the frontal slices A_1::, ..., A_m:: are represented using the Compressed Row Storage (CRS) format. Assuming that these frontal slices share a common sparsity pattern, such as one inherited from a common mesh, the row-offset and column-index arrays of the CRS matrices can be shared. We represent A with the following data structure, which we denote as a batched CRS matrix:
- a column-index array c per team that collects the column indices of the entries present in the sparsity pattern, listed row by row;
- a row-offset array r per team that collects the positions of the beginning of each row in c; r_1 = 1, the length of r is n + 1, and r_(n+1) is set equal to the number of entries present in the sparsity pattern plus one;
- a value array D, a second-order tensor, that gathers in its row fibers d_1:, ..., d_m: the values of the represented entries of the frontal slices A_1::, ..., A_m::, listed row by row.
An example of a batched CRS matrix is illustrated in Fig. 5. Using the CRS format, the sparse matrix-vector product for the m matrices together can be written as in (5), in which * is the Hadamard product of matrices, that is, the element-wise product of matrices; or, equivalently, as in (6).

Fig. 5. Illustration of the r, c, and D arrays for a batched CRS matrix. The red arrow represents the coalesced memory access direction using a right layout; the blue arrow represents the coalesced memory access direction using a left layout. The blue entries correspond to the current team whereas the gray entries represent entries associated with a different team. Our algorithms are row-based algorithms.
In the case of the right layout, one team accesses one contiguous chunk of memory. In the case of the left layout, one team makes constant-stride jumps in memory addresses when going from d_mi to d_1(i+1).

Algorithm 3: Serial Batched SPMV.
The operation is illustrated by the serial pseudo-code of Algorithm 3, where neither threading nor vectorization is used.
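A plain C++ sketch of Algorithm 3 (without Kokkos, and using 0-based row offsets rather than the paper's 1-based convention) could look as follows; the row-offset and column-index arrays are shared by the m systems, while D holds one value row fiber per system:

```cpp
#include <cassert>
#include <vector>

// Serial batched SPMV: Y[l] = A[l] * X[l] for l = 0..m-1, where all m CRS
// matrices share the row-offset array r and column-index array c.
void batched_spmv(int m, int n,
                  const std::vector<int>& r,                  // row offsets, size n + 1
                  const std::vector<int>& c,                  // column indices, size nnz
                  const std::vector<std::vector<double>>& D,  // m x nnz values
                  const std::vector<std::vector<double>>& X,  // m x n inputs
                  std::vector<std::vector<double>>& Y) {      // m x n outputs
  for (int l = 0; l < m; ++l)               // loop over the m slices
    for (int i = 0; i < n; ++i) {           // loop over the rows
      double sum = 0.0;
      for (int jj = r[i]; jj < r[i + 1]; ++jj)  // non-zeros of row i
        sum += D[l][jj] * X[l][c[jj]];
      Y[l][i] = sum;
    }
}
```

Note that only D, X, and Y depend on the slice index l; this is exactly what the shared-pattern assumption buys in memory traffic.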

A. Performance Bounds for Sparse Matrix Vector Multiply
It is important to recall three concepts associated with the memory bandwidth to clarify the bounds derived in this section: the requested global memory load throughput, the required global memory load throughput, and the global load efficiency metric.
The requested global memory load throughput is the throughput associated to the load of requested data from the main memory; it corresponds to the amount of useful data loaded from the main memory per second.
The required global memory load throughput is the throughput associated with the load of required data from the main memory. When loading data from the main memory, depending on the memory access pattern, some unused data might have to be loaded as well. The global load efficiency metric is the ratio between the requested and required global memory load throughputs, and is smaller than or equal to one. The global load efficiency depends on several factors such as the memory access pattern, the alignment of data, the usage of data within a cache line, and the coalescing of memory loads on GPUs. Fig. 6 illustrates one example. If X is in right layout (C++ style memory layout) with double as the data type, on an architecture such as an Intel CPU the cache line, i.e. the smallest amount of data transferred through the memory hierarchy, is 64 bytes long and can store 8 double values. If the entries of X are aligned, for the highlighted row fiber ℓ = 1 and k = 10, the SPMV will first load the entries x_1(1:8) even though only x_1(5:7) are needed, and then x_1(9:16) even though x_1,12 and x_1,16 are not needed. In this example, the global load efficiency of the row is 9/16, as only 9 of the 16 loaded entries of X are useful, and therefore the best expected throughput is reduced accordingly.
The computation of (5) is limited by the memory bandwidth, as its arithmetic intensity, 2 flops per loaded d_ℓi, is significantly smaller than the machine flops/byte ratio on most commonly available systems. Therefore, in order to deduce bounds on the throughput, we first need to measure the memory bandwidth on the considered architectures. To do so, we used a Kokkos implementation of the STREAM Triad memory bandwidth benchmark [14], which measures the effective memory bandwidth. The benchmark measures the wall-clock time associated with the computation of a = b + q c (7), where a, b, and c are vectors of the same size and q is a scalar. Given the wall-clock time, the size of the vectors, and the data type of the vector entries, the benchmark can deduce the effective memory bandwidth. The measured memory bandwidths are listed in Table I. These measurements are optimistic in the sense that the computation associated with (7) has the best memory access pattern for the most common architectures and can be vectorized perfectly; the associated global load efficiency is 100%. A worse memory access pattern, such as accessing a vector based on the sparsity pattern of a matrix, can deteriorate the associated global load efficiency. Moreover, the vector instruction sets can influence the measured memory bandwidth, as illustrated in Hammond et al. [15] for Intel Skylake CPUs. We can deduce four bounds on the throughput of the computation of (5) based on the measured memory bandwidth, as done in Phipps et al. [12], assuming a global load efficiency of 100%. Assuming that the entries of the tensors and the indices are stored using 8 bytes and 4 bytes respectively, and using the measured memory bandwidths of Table I, we deduce the bounds on the throughput listed in Table II, again assuming a global load efficiency of 100%.

TABLE II: Maximal bounds on the throughput in GFLOPS for the different scenarios and architectures, assuming a global load efficiency of 100%, computed using the measured memory bandwidth.
We will use these bounds to interpret the results of our implementations. The bounds depend linearly on the measured memory bandwidth. Note that the optimistic batched bounds are very favorable for solving sparse linear systems on the different architectures.
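The four bounds are given in full in the original derivation of Phipps et al. [12]; a roofline-style sketch of their general form (the byte counts below are our assumption, illustrating the reasoning rather than reproducing the paper's exact formulas) is:

```latex
% Throughput bound for the batched SPMV of (5), for a subset of m systems.
% Per non-zero and per system, the main memory traffic is roughly:
%   8 bytes for the matrix value d_{\ell i},
%   4/m bytes for the column index, amortized over the m systems that share it,
%   8\sigma bytes for the input-vector entry, where \sigma = 1 in the
%     pessimistic case (the entry is reloaded for every non-zero) and
%     \sigma \approx 0 in the optimistic case (the entry stays in cache).
\mathrm{GFLOPS} \;\le\; \mathrm{BW} \times \frac{2}{\,8 + 4/m + 8\sigma\,}
```

Under this sketch, letting m grow gives optimistic and pessimistic batched arithmetic intensities of 2/8 = 0.25 and 2/16 = 0.125 flops per byte, respectively, while the non-batched bounds follow by setting m = 1; this is consistent with the batched bounds improving on the non-batched ones because the index data are amortized.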

B. Base Team-Level SPMV Algorithm
The base team-level SPMV algorithm has a for loop over the m slices, each of which uses a standard SPMV at the team level: a parallel for loop using threads over the rows, followed by a parallel reduce using vector lanes over the non-zeros of a given row. The implementation is illustrated in Algorithm 4.
The two main drawbacks of this implementation are associated with the use of the vector lanes and with the memory access pattern. The vector lanes are used inside a parallel reduce, which implies some synchronization, and the memory access pattern is graph dependent, as the column indices c_i dictate the access pattern of the right-hand side vector.

Algorithm 4: Base Implementation of the Batched SPMV.

Fig. 7. Illustration of the computation of Fig. 6 using a block diagonal matrix. The highlighted entries of A and X are the entries that need to be loaded to compute the inner product of the fiber of indices (ℓ, k) = (1, 10). The mn products between row fibers of A and column fibers of X can be computed independently.

C. Team Batched Algorithm
As an alternative, we can rethink the m SPMVs as mn independent products between row fibers of A and column fibers of X, as illustrated in Figs. 6 and 7. Each product can then be assigned to a unique vector lane, which sequentially loops over the non-zero entries and computes the reduction.
There are multiple ways to associate these products with vector lanes. We could use a parallel for loop at the thread level over the rows and a vectorized loop over the m slices. This choice strongly impacts performance, as left layout views X will result in coalesced memory accesses whereas right layout views will not. Instead of writing a specialization of the kernel for each layout, we decided to fuse the two levels of parallelism (using a Kokkos construct called TeamVectorRange) over the mn indices and then use a function, getIndices, to map the index j ∈ {1, 2, . . . , mn} to a pair of indices (ℓ, k) depending on the layout, as illustrated in Algorithm 6. The SPMV algorithm is illustrated in Algorithm 5. For the batched SPMV with the data structures we use, the best layout is the left layout, as the vectorization implies better memory usage and fewer branches.
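The index mapping can be sketched in plain C++ as follows (0-based indices, unlike the paper's 1-based notation; the function names echo the paper's getIndices but are otherwise ours):

```cpp
#include <cassert>
#include <utility>

// Map a fused index j in {0, ..., m*n - 1} to a (slice, row) pair (l, k).
// Left layout: consecutive j map to consecutive slices l, so adjacent vector
// lanes touch adjacent entries of the value array (coalesced on GPUs).
std::pair<int, int> get_indices_left(int j, int m) {
  return {j % m, j / m};
}
// Right layout: consecutive j map to consecutive rows k instead.
std::pair<int, int> get_indices_right(int j, int n) {
  return {j / n, j % n};
}
```

The kernel body is identical for both layouts; only this mapping changes, which is what makes fusing the two parallelism levels attractive compared to maintaining one kernel specialization per layout.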

D. Implementations of Batched SPMV
We highlight the implementation details of the batched SPMV next. The implementation of the base algorithm, Algorithm 4, using the Kokkos programming model is shown in Listing 1.
The implementations of Algorithm 5 and of a specialization of Algorithm 6 for the left layout using the Kokkos programming model are shown in Listings 2 and 3, respectively. On current CPU architectures, the best performance of team-based Kokkos kernels is always observed when the team size equals one; i.e., we did not observe any performance gain when hyper-threads are used. When using the TeamVectorRange in Listing 2, we observed that the compiler failed to auto-vectorize the code even though it is vectorizable. To overcome this issue, we wrote a custom implementation of the SPMV kernel for CPU architectures and the left layout using SIMD data types [16], as illustrated in Listing 4.

E. Performance Evaluation of SPMV
As discussed previously, two performance aspects drive the performance of the kernel: the memory load efficiency and the throughput at which data are loaded from memory. We start by showing the advantage of Algorithm 5 over the base implementation of Algorithm 4. To do so, we compare both the global load efficiency and the device memory read throughput as a function of the parameter m on the V100 architecture, for default values of the team size and vector length and for the left and right layouts, using the gri30 matrices from the PeleLM application code [17], [18]. These metrics were measured using NVIDIA's nvprof profiler. The memory load efficiency and the throughput are illustrated in Figs. 8 and 9, respectively.
The memory load efficiency of Algorithm 4 is independent of the layout and is about 20%. For Algorithm 5 and the right layout, the efficiency is 33% for m = 1 and tends to 25% as m increases. Finally, for Algorithm 5 and the left layout, the efficiency is 33% for m = 1 and increases up to 100% as m increases, when m is a multiple of 4. Regardless of the layout, the memory access pattern of Algorithm 5 is better than that of Algorithm 4. Moreover, if the layout is left and m is chosen correctly, the efficiency of Algorithm 5 is perfect; this can lead to a 4x improvement compared to Algorithm 4.
We can observe that the device memory read throughput shown in Fig. 9 increases with m for Algorithm 5 up to a plateau at around 650 GB/s and 550 GB/s for the left and right layouts, respectively. As with the global memory load efficiency, Algorithm 5 is better than Algorithm 4 in terms of device memory read throughput.
The throughput of the batched SPMV is illustrated in Fig. 10. As previously described, this throughput depends on both the global memory load efficiency and the device memory read throughput. As Algorithm 5 has the best memory load efficiency and device memory read throughput, this implementation has the best throughput as shown in Fig. 10.
a) Choosing Parameters With Autotuning: For a given batched matrix and a given layout, Algorithm 5 requires values for three hyperparameters: the number of matrices per team m, the team size, and the vector length. In this paper, we used an autotuning strategy to select the values of the hyperparameters in order to get an idea of the best achievable performance; the deduced values are therefore highly architecture and problem dependent. The autotuning strategy used is Bayesian optimization with Gaussian processes, using scikit-optimize, which minimizes the wall-clock time (and therefore maximizes the throughput) of the batched SPMV. The computed parameters for two different real-world application problems are listed in Table III for 3 different architectures, where the left and right layouts correspond to Fortran and C++ orderings, respectively. Finally, we tested Algorithm 5 with the left and right layouts for two matrices from the PeleLM application, called gri30 and isooctane in the experiments. Fig. 11 shows the performance on three different architectures using the numerical parameters of Table III. The performance is represented as a percentage of the maximal achievable throughput, i.e. the optimistic batched bound of Table II, which assumes a 100% global memory load efficiency. Any percentage between 50% and 100% corresponds to very good performance, as it implies that the performance is above the pessimistic batched bound, which assumes a 100% global memory load efficiency as well. Any percentage below 50% implies that the global memory load efficiency is below 100%; for example, percentages of 30% and 40% imply that the global memory load efficiency is at most 60% and 80%, respectively. Therefore, the performance for the left layout is good on all architectures, as it is above 50% everywhere except for the isooctane matrices on the MI50.
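The Bayesian optimization itself is performed with scikit-optimize and is outside the scope of a short sketch; as a stand-in, a naive exhaustive tuner over the same three hyperparameters could look like the following (all names are ours; in practice the cost callback would time the batched SPMV for each candidate configuration):

```cpp
#include <cassert>
#include <limits>
#include <vector>

// One candidate configuration of the three hyperparameters of Algorithm 5.
struct Config { int m, team_size, vector_length; };

// Exhaustive search: evaluate a user-supplied cost (e.g. measured wall-clock
// time of the kernel) for every candidate and return the cheapest one.
template <class Cost>
Config tune(const std::vector<Config>& candidates, Cost cost) {
  Config best = candidates.front();
  double best_cost = std::numeric_limits<double>::infinity();
  for (const auto& c : candidates) {
    double t = cost(c);
    if (t < best_cost) { best_cost = t; best = c; }
  }
  return best;
}
```

Bayesian optimization replaces the exhaustive sweep with a model-guided search, which matters because each cost evaluation here is a real kernel timing and the product space of m, team size, and vector length can be large.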
As expected, the performance for the right layout is worse, as the memory load efficiency is now suboptimal and the global memory load efficiency cannot reach 100%. However, the performance is still around 25%, which is good given that the memory load efficiency of Algorithm 5 is between 25% and 33%.

VI. BATCHED CONJUGATE GRADIENT SOLVER
Now that the batched SPMV has been discussed, we describe solving batched linear systems using Krylov methods. The first method is the batched unpreconditioned conjugate gradient (CG) method, which requires the slices A_{1::}, ..., A_{m::} to be symmetric positive definite. We develop batched vector operations in addition to the batched SPMV, which allows us to implement a batched CG solver as shown in Algorithm 7.
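The batched vector operations mentioned above apply one independent operation per system. A minimal NumPy sketch, with each row of a 2-D array holding one system's vector (the function names `batched_axpy` and `batched_dot` are ours, not the Kokkos Kernels API):

```python
import numpy as np

# X and Y have shape (N, n): one row per linear system.
def batched_axpy(alpha, X, Y):
    """Y + alpha * X with one scalar step alpha per system (alpha has shape (N,))."""
    return Y + alpha[:, None] * X

def batched_dot(X, Y):
    """One inner product per system, returned as a length-N vector."""
    return np.einsum("ij,ij->i", X, Y)
```

The per-system scalars (alpha here, and the CG steps below) are what make it possible to freeze individual systems while the rest of the batch keeps iterating.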
The developed strategy is a batched CG where m systems are solved independently but concurrently. The iterative process of the Krylov method continues until every one of the m systems has converged. While waiting for the slowest system to converge, the previously converged solutions are no longer updated: their corresponding steps α and β are set to zero. Freezing the updates this way ensures that no overflow, underflow, or NaN is generated when dealing with non-zero vectors whose norm is zero due to rounding errors. The algorithm is illustrated in Algorithm 7, where spmv refers to Algorithm 5. The implementation can be found in [19].
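The freezing strategy can be sketched in NumPy, with a dense per-system matvec standing in for the batched sparse SPMV of Algorithm 5. This is an illustrative sketch of the strategy described above, not the Kokkos Kernels implementation: converged systems simply get α = β = 0, so their solution and residual stop changing.

```python
import numpy as np

def batched_cg(A, B, tol=1e-12, max_iter=200):
    """Unpreconditioned batched CG; A has shape (N, n, n), B has shape (N, n).

    A dense matvec (np.einsum) stands in for the batched sparse SPMV.
    Converged systems have their steps alpha and beta zeroed so their
    solutions are frozen while the remaining systems keep iterating.
    """
    N, n = B.shape
    X = np.zeros_like(B)
    R = B.copy()                       # residuals for the initial guess X = 0
    P = R.copy()
    rs = np.einsum("ij,ij->i", R, R)   # squared residual norm per system
    active = rs > tol * tol
    for _ in range(max_iter):
        if not active.any():
            break
        AP = np.einsum("nij,nj->ni", A, P)
        denom = np.einsum("ij,ij->i", P, AP)
        alpha = np.zeros(N)
        alpha[active] = rs[active] / denom[active]   # alpha = 0 for frozen systems
        X += alpha[:, None] * P
        R -= alpha[:, None] * AP
        rs_new = np.einsum("ij,ij->i", R, R)
        beta = np.zeros(N)
        beta[active] = rs_new[active] / rs[active]   # beta = 0 for frozen systems
        P = R + beta[:, None] * P
        rs = rs_new
        active = rs > tol * tol
    return X
```

Because α and β are zero for frozen systems, no division by a vanishing residual norm is ever performed on them, which is how overflow and NaN generation are avoided.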
The batched CG has been tested using randomly generated SPD matrices with a Laplacian-like sparsity pattern. The entries are randomly generated but symmetric, and the main diagonal is modified such that the matrices are diagonally dominant and therefore SPD (by the Gershgorin circle theorem). The right-hand sides were generated randomly too. The default tolerance was used (the machine epsilon associated with the working precision), and all the systems converged in 47 iterations on average, with a standard deviation of 1.86. Once again, we used a Bayesian optimization strategy to deduce the values of the hyperparameters; those values have been computed for N = 20 000 and n = 200. The computed parameters are listed in Table IV for 3 different architectures. Finally, the wall-clock times are shown in Fig. 12 for the 2 layouts and the 3 architectures as a function of the batch size. The wall-clock times using the default parameter values, i.e., m = 8 and the default team and vector sizes chosen by Kokkos, are shown in Fig. 12 with reduced opacity and diamond marks. The auto-tuned hyperparameters result in up to a 2.2x speedup.
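The Gershgorin-based construction above can be sketched as follows. This dense sketch ignores the Laplacian-like sparsity pattern used in the paper and only illustrates the diagonal modification that guarantees positive definiteness; the function name is ours.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric matrix made strictly diagonally dominant, hence SPD.

    By the Gershgorin circle theorem, a symmetric matrix whose diagonal
    entries are positive and strictly exceed the absolute row sums of the
    off-diagonal entries has only positive eigenvalues.
    """
    A = rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)                   # symmetrize the random entries
    np.fill_diagonal(A, 0.0)
    off_diag_sums = np.abs(A).sum(axis=1)
    np.fill_diagonal(A, off_diag_sums + 1.0)  # strict diagonal dominance
    return A
```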

VII. BATCHED GENERALIZED MINIMUM RESIDUAL SOLVER
The strategy used for the batched GMRES is the same as the one for the batched CG. Within a team, the GMRES iterations continue until every system of the subset has converged; once the solution of a system has converged, it is no longer updated while the GMRES continues for the remaining systems. On top of that, a left batched Jacobi preconditioner has been implemented. The results presented in this section use the left batched Jacobi preconditioner and the classical Gram-Schmidt orthogonalization. As this approach waits for the slowest solution to converge among the m systems within a given team, the grouping of the systems into subsets influences the performance of the algorithm. It is best to group together systems that require about the same number of iterations, to reduce the overall wall-clock time. However, the number of required iterations of the Krylov method is not known a priori and, even if it were, grouping systems together would require some reordering of the memory associated with the systems. This memory reordering can be expensive, and the associated overhead might not be compensated by the reduced wall-clock time of the batched Krylov method. In the following results we tried two different orders: the natural order, i.e., the order in which the data set is provided, and the sorted order, i.e., an order computed by sorting the batched indices such that the number of iterations of the Krylov method decreases with the index. The goal here is not to motivate the use of a grouping strategy but rather to illustrate the variability in the wall-clock time with respect to the variability in the input data within the same team.
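The sorted order described above is a plain index sort. A sketch, assuming the per-system iteration counts are known (e.g., from a previous solve of the same data set); teams then take consecutive slices of m indices, so systems with similar costs end up together:

```python
def sorted_order(iteration_counts):
    """Batch indices ordered so iteration counts decrease with position.

    Grouping consecutive slices of m indices per team then places systems
    with similar convergence behavior in the same team, reducing the time
    spent waiting for the slowest system of each team.
    """
    return sorted(range(len(iteration_counts)),
                  key=lambda i: -iteration_counts[i])
```

In practice, applying this order requires permuting the batched data in memory, which is the reordering overhead discussed above.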
One of the main differences with the batched CG is that the batched GMRES requires more memory to store temporary data such as the previously computed Arnoldi vectors. The amount of allocated memory depends on the maximum number of GMRES iterations, the number of rows per system, and the number of systems per team. There are two ways to allocate this memory: 1) use the scratch pad memory space to allocate one view per team that stores the temporary variables associated with the m systems of the team; 2) allocate a view large enough to store the temporary variables of all the teams, with each team using only a subview of the large view. The first approach is more efficient on GPUs for smaller problems such as the gri30 test case. However, for larger problems such as the isooctane test case, the second approach is more efficient. For smaller systems, using the scratch pad enables better memory access patterns by not reading temporary data from main memory. For larger systems, not relying on the scratch pad allows a larger number of systems m per team and avoids the overhead associated with scratch pad usage; only one allocation call is made, and it can be reused across multiple batched GMRES calls.
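The storage requirement above can be made concrete with a back-of-the-envelope estimate. This sketch assumes the dominant cost is the Arnoldi basis, i.e., (max_iter + 1) vectors of n_rows entries per system; the function name and signature are illustrative only.

```python
def gmres_arnoldi_bytes(n_rows, max_iter, systems_per_team, n_teams,
                        bytes_per_scalar=8):
    """Rough Arnoldi-basis storage for the batched GMRES.

    Returns (per-team bytes, total bytes). With the scratch pad approach,
    only the per-team amount must fit in scratch memory at a time; with a
    single large view, the total amount is allocated once up front and can
    be reused across batched GMRES calls.
    """
    per_team = systems_per_team * (max_iter + 1) * n_rows * bytes_per_scalar
    return per_team, n_teams * per_team
```

The per-team figure explains why the scratch pad approach stops paying off for larger problems: scratch capacity bounds the product of m, the restart length, and the system size.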
Once again, we used a Bayesian optimization strategy to deduce the values of the hyperparameters. Those values have been computed for N = 20 160 and N = 16 128 for the gri30 and isooctane matrices respectively. The computed parameters are listed in Table V. The best performance is reached with the Fortran layout, as in the batched CG. Both the Ginkgo and the Kokkos Kernels GMRES used a relative stopping criterion with a tolerance of 10^{-8}. The gri30 and isooctane systems converged in up to 7 and 17 iterations respectively. The impact of the sorting is stronger on the isooctane matrices than on the gri30 matrices; this is explained by the fact that GMRES converged about 2 times slower for the isooctane matrices, which increases the effect of code divergence and the sensitivity of the performance to the grouping.

VIII. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented performance portable batched sparse linear solvers and one of their most important kernels: the batched SPMV. We have discussed their implementation and their performance on different architectures with the two most common data layouts.
As future work, we would like to investigate more autotuning strategies. For now, the autotuning has been done offline to deduce hyperparameter values that lead to good performance for specific problems; those values are therefore likely sub-optimal for other use cases. As an alternative, we would like to investigate new strategies that optimize the hyperparameter values for ranges of problems instead of specific ones, so that the hyperparameters are efficient on average over the considered range. This would require two steps: the automatic choice of ranges/domains and the optimization for each of them. We also plan to implement other sparse batched solvers and other batched preconditioners, such as polynomial preconditioners.