Fast Iterative Solution of Multiple Right-Hand Sides MoM Linear Systems on CPUs and GPUs Computers

The method of moments discretization of boundary integral equations typically leads to dense linear systems. When a single operator is used for multiple excitations, these systems have the same coefficient matrix and multiple right-hand sides (RHSs). Direct solution methods are computationally attractive for this problem because they require one-round forward/backward substitution for each RHS, whereas an iterative solution necessitates a full-round iteration for each separate vector. However, direct methods become impractical to use even on modern parallel computers when the dimension $n$ is large, due to their $\mathcal{O}(n^{2})$ memory and $\mathcal{O}(n^{3})$ computational complexity. In this article, we present experiments with a block iterative Krylov subspace method that solves the entire set of systems at once. The proposed method performs a block size reduction during the construction of the Krylov subspace and combines spectral preconditioning and eigenvalue recycling techniques to mitigate memory costs and reduce the number of iterations of conventional block Krylov solvers. Our experiments demonstrate the potential of our method for efficiently solving linear systems with many RHSs arising from integral equation-based engineering applications on both CPUs and graphics processing units (GPUs).


Bruno Carpentieri, Maurizio Tavelli, Dong-Lin Sun, Ting-Zhu Huang, and Yan-Fei Jing
Ting-Zhu Huang and Yan-Fei Jing are with the School of Mathematical Sciences/Institute of Computational Science, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: tingzhuhuang@126.com; yanfeijing@uestc.edu.cn).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TMTT.2023.3345478.

I. INTRODUCTION
The method of moments (MoM) discretization of integral equation models where a single operator is used for multiple excitations generally leads to the solution of linear systems with multiple right-hand sides (RHSs) of the form
$$ AX = B \qquad (1) $$
where the coefficient matrix $A \in \mathbb{C}^{n\times n}$ is typically dense, $B = [b_1, b_2, \ldots, b_p]$ is the $n \times p$ matrix of the $p$ RHS vectors $b_i$, and $X = [x_1, x_2, \ldots, x_p]$ is the unknown matrix of the $p$ solution vectors $x_i$ that need to be determined. In electromagnetic scattering and radiation analysis, the typical multiple-RHS problem arises from the calculation of the monostatic radar cross section (RCS) of electrically large bodies resulting from a single incident wave with various incident angles [1]. Other examples include the synthesis of large antenna arrays [2] and multifrequency acoustic scattering applications [3]. Some low-complexity direct solvers have been proposed for the fast numerical solution of (1) in this context, see [4], [5]. After performing the $LU$ or $LL^{\top}$ decomposition, computing the solution requires only a one-round forward/backward substitution for each RHS, in contrast with conventional iterative methods that necessitate a full-round iteration for each separate vector. This enables fast direct solvers to handle multiple-RHS systems very efficiently. Direct methods, however, tend to become impractical or sometimes impossible to use when the dimension $n$ is large (e.g., $\mathcal{O}(10^{6})$ or more), even on modern parallel computers, due to the very large scale of these systems in realistic engineering simulations. In this case, iterative methods may be the only viable option. The solution to a linear system $Ax = b$ belongs to the so-called Krylov subspace
$$ \mathcal{K}_m(A; r_0) = \operatorname{span}\{r_0, Ar_0, A^{2}r_0, \ldots, A^{m-1}r_0\} $$
where $r_0 = b - Ax_0$ is the initial residual and $m$ is the degree of the minimal polynomial of the matrix $A$. If one can ensure that the minimal polynomial has a moderate degree, an iterative Krylov method can find the solution of $Ax = b$ in a few steps.
In particular, when the coefficient matrix $A$ is diagonalizable and nonsingular, the dimension of $\mathcal{K}_m(A; r_0)$ is given by the number of distinct eigenvalues of $A$. A Krylov subspace method solves an $n \times n$ matrix problem in $N_{\mathrm{iter}} \cdot \mathcal{O}(\text{M-V})$ flops per RHS, where $\mathcal{O}(\text{M-V})$ denotes the computational complexity of the matrix-vector (M-V) product operation and $N_{\mathrm{iter}}$ is the number of iterations necessary to achieve a certain required accuracy, which depends on various factors, including the characteristics of the integral operator and of the geometry. The coupling of conjugate gradient (CG) or generalized minimum residual (GMRES) algorithms [6] with the multilevel fast multipole algorithm (MLFMA), which performs a dense M-V product with MoM matrices in $\mathcal{O}(n \log n)$ flops, is a common solution technique. Therefore, iterative solvers have the potential to overcome the memory issues of direct methods for solving boundary integral equations.
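To make the connection between the number of distinct eigenvalues and the Krylov iteration count concrete, the following small NumPy sketch (ours, not part of the original article) builds a diagonalizable matrix with only $k$ distinct eigenvalues and verifies that the best approximation taken from $\mathcal{K}_m(A; r_0)$ reaches machine precision at $m = k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 6                          # problem size, number of distinct eigenvalues

# Diagonalizable test matrix whose spectrum contains only k distinct values.
eigvals = rng.uniform(1.0, 2.0, size=k)
d = np.concatenate([eigvals, eigvals[rng.integers(0, k, size=n - k)]])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(d) @ Q.T

b = rng.standard_normal(n)             # single RHS; with x0 = 0 the residual r0 = b

# Grow the Krylov basis [r0, A r0, ..., A^{m-1} r0] and monitor the best
# attainable residual min_y || b - A K_m y ||.
K, v = np.empty((n, 0)), b
for m in range(1, k + 2):
    K = np.column_stack([K, v])
    y = np.linalg.lstsq(A @ K, b, rcond=None)[0]
    print(f"m = {m}: residual = {np.linalg.norm(b - A @ (K @ y)):.2e}")
    v = A @ v
```

The residual decreases and then collapses to machine precision once $m$ reaches the number of distinct eigenvalues, which is the behavior that preconditioning tries to enforce by clustering the spectrum.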
Extending preliminary analysis and experiments presented at the IEEE MTT-S Numerical Electromagnetic and Multiphysics Modeling and Optimization 2023 conference (Winnipeg, MB, Canada) [7], we consider the simultaneous solution of sequences of MoM systems with multiple RHSs of the form (1) using block variants of Krylov methods. Our iterative solvers belong to the family of block GMRES algorithms outlined in Section II. They incorporate and combine an initial residual deflation strategy, described in Section III, spectral preconditioning techniques discussed in Section IV, and an eigenvalue recycling procedure presented in Section V, to mitigate costs and accelerate the convergence of conventional Krylov methods for MoM linear systems. The results of our experiments reported in Section VII demonstrate the good performance of the new method in realistic simulation scenarios, on both CPU and graphics processing unit (GPU) computer systems.

II. BLOCK KRYLOV SUBSPACE METHODS
A block Krylov subspace method searches for the solution of the $p$ linear systems (1) in the so-called block Krylov subspace, defined as
$$ \mathcal{K}_m(A, R_0) = \operatorname{span}\{R_0, AR_0, A^{2}R_0, \ldots, A^{m-1}R_0\} $$
where $R_0 = B - AX_0$ and $X_0$ are the $n \times p$ initial block residual matrix and the initial approximation, respectively. After $m$ steps, the user-supplied initial guess $X_0$ is updated in the form
$$ X_m = X_0 + \mathcal{V}_m Y_m \qquad (3) $$
where $\mathcal{V}_m$ is an orthonormal basis of $\mathcal{K}_m(A, R_0)$, while $Y_m$ is determined according to different criteria that depend on the specific method. $\mathcal{K}_m(A, R_0)$ includes the Krylov spaces $\mathcal{K}_m(A, R_0(:, i))$ generated by the $p$ separate initial residual vectors $R_0(:, i)$, $i = 1, \ldots, p$, and linear combinations of the vectors therein. Owing to the large size of $\mathcal{K}_m(A, R_0)$, the $p$ linear systems can be solved simultaneously. An additional interest in the use of block Krylov methods is that they can be implemented using level-3 BLAS (matrix-matrix) operations instead of level-2 BLAS (matrix-vector) operations, maximizing computational efficiency on today's cache-based computer hardware.
In this article, we consider block variants of the GMRES method (BGMRES for short) [6], [8], [9]. In BGMRES, the block Krylov basis $\mathcal{V}_m \in \mathbb{C}^{n\times mp}$ is computed by performing one cycle of the block Arnoldi procedure presented in Algorithm 1. First, an $n \times p$ matrix $V_1$ with orthonormal columns is computed as the result of a QR factorization $R_0 = V_1 R$, where $R$ is a $p \times p$ upper triangular and full-rank matrix. At the end of the cycle, the block Arnoldi factorization
$$ A\mathcal{V}_m = \mathcal{V}_{m+1}\bar{H}_m \qquad (4) $$
is obtained, where $\mathcal{V}_m = [V_1, \ldots, V_m]$ and $\bar{H}_m \in \mathbb{C}^{(m+1)p \times mp}$ is a block Hessenberg matrix consisting of blocks $H_{i,j}$ of size $p \times p$; its leading $mp \times mp$ part is denoted by $H_m$, while $E_m$ is the matrix containing the last $p$ columns of $I_{mp}$, so that $\bar{H}_m = \bigl[\begin{smallmatrix} H_m \\ H_{m+1,m} E_m^H \end{smallmatrix}\bigr]$. In BGMRES, the matrix $Y_m$ appearing in (3) is determined by minimizing the norm of the block residual $R_m = B - AX_m$. Because of (4), the minimization can be written as the reduced least-squares problem
$$ \min_{Y_m} \bigl\| \mathcal{E}_1 R - \bar{H}_m Y_m \bigr\|_F $$
with $\mathcal{E}_1$ denoting the first $p$ columns of the identity matrix $I_{(m+1)p}$ and $R$ the triangular factor computed at step 1 of Algorithm 1. In Fig. 1(a), we show the convergence histories of standard BGMRES for solving a sequence of dense linear systems arising from the MoM discretization of the electric field integral equation (EFIE) with an increasing number of RHS vectors. The set of Rao-Wilton-Glisson basis functions [10] is utilized both for the discretization and for the testing. The underlying problem is a monostatic RCS scattering analysis of a sphere geometry with a radius of 1 m, illuminated at 223 MHz. The pertinent linear system has size 2430 × 2430. Each RHS corresponds to an incident wave at a given orientation angle. We see that the convergence is rather slow even using a large block Krylov space of dimension m = 200. For ten RHSs, the initial residual matrix norm is reduced by less than five orders of magnitude after 2000 iterations. The number of iterations rapidly increases with the number of RHSs.
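As a concrete illustration of one cycle of the block Arnoldi procedure of Algorithm 1, the following NumPy sketch (ours; the function and variable names are not from the article) builds the orthonormal basis $\mathcal{V}_{m+1}$ and the block Hessenberg factor $\bar{H}_m$ with a modified block Gram-Schmidt loop:

```python
import numpy as np

def block_arnoldi(matvec, R0, m):
    """Sketch of one cycle of the block Arnoldi procedure (modified block
    Gram-Schmidt).  matvec(X) applies the (possibly preconditioned)
    operator to a block of vectors.  Returns V, an n x (m+1)p basis with
    orthonormal columns, and the (m+1)p x mp block Hessenberg matrix H
    such that matvec(V[:, :m*p]) = V @ H, plus the triangular factor R of
    the initial QR decomposition R0 = V1 R."""
    n, p = R0.shape
    V = np.zeros((n, (m + 1) * p), dtype=R0.dtype)
    H = np.zeros(((m + 1) * p, m * p), dtype=R0.dtype)

    # Step 1: reduced QR of the initial block residual, R0 = V1 R.
    V[:, :p], R = np.linalg.qr(R0)

    for j in range(m):
        W = matvec(V[:, j * p:(j + 1) * p])
        # Modified block Gram-Schmidt against the previous blocks V1..Vj.
        for i in range(j + 1):
            Hij = V[:, i * p:(i + 1) * p].conj().T @ W
            H[i * p:(i + 1) * p, j * p:(j + 1) * p] = Hij
            W = W - V[:, i * p:(i + 1) * p] @ Hij
        # QR factorization of the remainder gives the next basis block.
        Qn, Hj1 = np.linalg.qr(W)
        V[:, (j + 1) * p:(j + 2) * p] = Qn
        H[(j + 1) * p:(j + 2) * p, j * p:(j + 1) * p] = Hj1

    return V, H, R
```

In BGMRES, the returned factors are then used to solve the small reduced least-squares problem for $Y_m$ instead of the full $n$-dimensional minimization.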
Convergence can be accelerated by preconditioning, which transforms problem (1) into a new one that has the same solution but exhibits better spectral properties, i.e., with the majority of the eigenvalues well clustered close to one. The transformed system has the form $MAX = MB$ (left preconditioning) or $AMY = B$ with $X = MY$ (right preconditioning); the matrix $M \approx A^{-1}$ is called the preconditioner.
It has been shown that the iteration count of Krylov solvers may scale as $\mathcal{O}(n^{0.5})$ for the EFIE [11]. Thus, preconditioning is a critical component of the iterative solution. The design of robust preconditioners for the EFIE is a research issue in its own right. We refer the reader to [12] for a discussion of trends and problems on this topic. For the sphere mesh, in Fig. 1(b) we show the convergence histories of BGMRES preconditioned by a sparse approximate inverse with a pattern prescribed in advance. We compute the preconditioner $M$ by minimizing the Frobenius norm of the error matrix,
$$ \|I - M\tilde{A}\|_F^2 = \sum_{j=1}^{n} \bigl\| e_j^{\top} - m_{j\bullet}\tilde{A} \bigr\|_2^2 $$
where $\tilde{A}$ is a sparse approximation of $A$, $e_j$ is the $j$th vector of the standard basis of $\mathbb{C}^n$, and $m_{j\bullet}$ is the $j$th row of $M$. In the case of right preconditioning, an analogous relation holds. The sparsity pattern of the approximation $\tilde{A}$ is prescribed in advance by selecting the $k$ largest entries in each column of $A$ [14]. We observe that preconditioning significantly accelerates the BGMRES convergence. However, the number of iterations still tends to increase rapidly with the number of RHSs. In Section III, we present a variant of BGMRES that handles the case of approximate linear dependence of the initial residuals, in an effort to reduce the heavy algorithmic and memory costs of the BGMRES solution.
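To fix ideas, the sketch below (ours, not the authors' code) assembles such a Frobenius-norm minimization approximate inverse row by row with NumPy/SciPy; the pattern-selection rule and all names are illustrative assumptions, and a production implementation would solve the small least-squares problems in parallel (e.g., with ZGELS, as discussed later in Section VII-B):

```python
import numpy as np
import scipy.sparse as sp

def frobenius_spai(A_sparse, k):
    """Illustrative Frobenius-norm minimization approximate inverse.

    Builds M row by row so that ||I - M*A_tilde||_F is minimized over a
    prescribed sparsity pattern (here: the k largest-magnitude entries of
    each row of the sparsified operator A_tilde).  Each row is an
    independent small least-squares problem, which is what makes the
    construction embarrassingly parallel on CPUs and GPUs."""
    A = sp.csr_matrix(A_sparse)
    n = A.shape[0]
    rows, cols, vals = [], [], []
    for j in range(n):
        # Prescribed pattern: indices of the k largest entries in row j.
        row = A.getrow(j).toarray().ravel()
        J = np.argsort(-np.abs(row))[:k]
        # Restrict to the columns actually touched by the selected rows of A.
        sub = A[J, :]                        # |J| x n sparse block
        cols_used = np.unique(sub.indices)
        if cols_used.size == 0:
            continue
        C = sub[:, cols_used].toarray()      # small dense |J| x |cols_used| block
        e = np.zeros(len(cols_used), dtype=A.dtype)
        if j in cols_used:
            e[np.searchsorted(cols_used, j)] = 1.0
        # Solve min || C^T m - e_j || for the nonzeros of row j of M.
        mj = np.linalg.lstsq(C.T, e, rcond=None)[0]
        rows.extend([j] * len(J)); cols.extend(J); vals.extend(mj)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
```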

III. INITIAL DEFLATION STRATEGY
In Fig. 2, we plot the distribution of the singular values of the matrix $B$ of the 360 RHSs obtained by illuminating the object with the same incident wave at 360 angles, each differing by 1°, for the sphere problem introduced in Section II. Only 31 out of 360 singular values have a magnitude greater than $10^{-6}$. Starting the iterative solution from $X_0 = 0$, the presence of small singular values indicates that the initial block residual matrix $R_0 = B - AX_0 = B$ has an approximate rank deficiency.
Assuming that a preconditioner $M \approx A^{-1}$ is available, we solve the left-preconditioned linear systems $MAX = MB$, which are much better conditioned than (1). At step 1 of the block Arnoldi process, we compute a singular value decomposition $T = USV^H$ of the upper triangular matrix $T$ resulting from the QR factorization $MR_0 = QT$, where the matrix $Q \in \mathbb{C}^{n\times p}$ has orthonormal columns, the matrix $T \in \mathbb{C}^{p\times p}$ is upper triangular and full rank, the matrices $U, V \in \mathbb{C}^{p\times p}$ are unitary, and $S \in \mathbb{C}^{p\times p}$ is diagonal. We select the $p_d$ dominant singular values $\sigma_\ell(T)$ of the matrix $T$ that satisfy the condition $\sigma_\ell(T) > \epsilon_d \cdot \mathrm{tol}$ for $\ell = 1, 2, \ldots, p_d$, where $0 < \epsilon_d \le 1$ is the deflation threshold and tol is the convergence tolerance used in the BGMRES stopping criterion. At this stage, we decompose the block true residual as the sum of two contributions, one associated with the $p_d$ most linearly independent directions of $MR_0$ and the other associated with the remaining $p - p_d$ directions. More precisely, on decomposing $S$ as $S = \operatorname{diag}(S_+, S_-)$, with $S_+$ holding the $p_d$ selected dominant singular values and $U = [U_+, U_-]$, $V = [V_+, V_-]$ partitioned accordingly, we have $MR_0 = QU_+S_+V_+^H + QU_-S_-V_-^H$. We apply the standard BGMRES method presented in Section II to solve simultaneously the linear systems with $p_d$ RHSs corresponding to the most linearly independent columns of the preconditioned initial residual matrix $MR_0$. Beginning with the matrix $V_1 = QU_+$, the block Arnoldi procedure computes at each step $j$ an $n \times (j+1)p_d$ matrix $\mathcal{V}_{j+1}$ with orthonormal columns and a $(j+1)p_d \times jp_d$ block Hessenberg matrix $\bar{H}_j$ satisfying the block Arnoldi-like factorization $MA\mathcal{V}_j = \mathcal{V}_{j+1}\bar{H}_j$. At the end of the cycle, after $m$ steps, the initial guess $X_0$ is updated as $X_m = X_0 + \mathcal{V}_m Y_m S_+ V_+^H$, where $Y_m$ minimizes the block residual norm over the space $X_0 + \operatorname{range}(\mathcal{V}_m Y S_+ V_+^H)$. In this article, we take a similar approach to that proposed in [15], but instead of using the block quasi-residual to detect convergence, we compute the block true residual norm, making our stopping criterion completely preconditioner-independent.
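The block-size reduction itself amounts to a few dense factorizations of small matrices. A minimal NumPy sketch of this step (ours; the names and return signature are illustrative) looks as follows:

```python
import numpy as np

def deflate_initial_block(MR0, eps_d, tol):
    """Illustrative initial block-size reduction (Section III).

    QR-factorize the preconditioned initial residual M*R0 = Q*T, take the
    SVD of the small p x p triangular factor T = U*S*V^H, and keep only the
    p_d directions whose singular values exceed eps_d * tol.  The block
    Arnoldi process is then started from V1 = Q*U(:, 1:p_d)."""
    Q, T = np.linalg.qr(MR0)             # Q: n x p, T: p x p upper triangular
    U, S, Vh = np.linalg.svd(T)
    p_d = int(np.sum(S > eps_d * tol))   # number of retained directions
    V1 = Q @ U[:, :p_d]                  # starting block for block Arnoldi
    return V1, U, S, Vh, p_d
```

For the 360-RHS sphere example of Fig. 2, a threshold of this kind retains far fewer than 360 directions, so the block Arnoldi process operates on blocks of width $p_d \ll p$.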

IV. SPECTRAL PRECONDITIONING
When the maximum dimension of the block Krylov subspace is reached during the iterative process, the deflated BGMRES algorithm presented in Section III is restarted using the most recent approximation as the new initial guess, until convergence is achieved. Restarting, though, may prevent an accurate approximation of the eigenvalues of $MA$ close to zero in the Arnoldi process, undermining the superlinear convergence of Krylov methods and sometimes causing stagnation [16]. We address this convergence problem by updating the user-supplied preconditioner $M$ with an invariant subspace associated with the eigenvalues of $MA$ closest to zero, adapting a formulation proposed by Carpentieri et al. [17].
We introduce some preliminary notation. We assume that a preconditioner $M$ is provided, so we solve the equivalent left-preconditioned linear systems $MAX = MB$, which are better conditioned than (1). We denote by $\{\lambda_1, \ldots, \lambda_n\}$ the spectrum of the preconditioned matrix $MA$. We define a matrix $V_k \in \mathbb{C}^{n\times k}$ that spans a $k$-dimensional right invariant subspace of $MA$ associated with its $k$ smallest eigenvalues $\{\lambda_1, \ldots, \lambda_k\}$, and we complete $V_k$ with its orthogonal complement $V_k^{\perp}$. At this stage, we can write the spectral decomposition
$$ MA \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} = \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} \begin{bmatrix} J_k & C \\ 0 & F \end{bmatrix} $$
where the blocks $J_k$ and $F$ have eigenvalues $\{\lambda_1, \ldots, \lambda_k\}$ and $\{\lambda_{k+1}, \ldots, \lambda_n\}$, respectively.

A. Additive Spectral Preconditioning Formulation
In the formulation for additive spectral preconditioning, $M$ is updated in the additive form
$$ M_{\mathrm{add}} = M + V_k A_c^{-1} W^H \qquad (13) $$
where $A_c = W^H A V_k$ and $W$ is any matrix chosen such that the projection $A_c$ is nonsingular. Thus, we have $M_{\mathrm{add}} A V_k = V_k (J_k + I_k)$ and $M_{\mathrm{add}} A V_k^{\perp} = V_k (*) + V_k^{\perp} F$. The last two equations can be written as
$$ M_{\mathrm{add}} A \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} = \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} \begin{bmatrix} J_k + I_k & * \\ 0 & F \end{bmatrix} $$
where the expression denoted by the asterisk is irrelevant. The following conclusion is proven.

Proposition 1: The eigenvalues of $M_{\mathrm{add}} A$ have the following distribution: $\{1+\lambda_1, \ldots, 1+\lambda_k\} \cup \{\lambda_{k+1}, \ldots, \lambda_n\}$. The additive spectral preconditioning update has the effect of translating the $k$ selected smallest eigenvalues close to point one in the spectrum of the spectrally preconditioned matrix $M_{\mathrm{add}} A$.
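As an illustration, the additive update can be applied without ever forming $M_{\mathrm{add}}$ explicitly: the small $k \times k$ coarse matrix $A_c$ is built once and reused. The following NumPy sketch (ours; the function names and the matrix-free calling convention are assumptions) shows the operator form:

```python
import numpy as np

def make_additive_spectral_preconditioner(apply_M, A, Vk, Wk):
    """Illustrative additive spectral update (Section IV-A):

        M_add = M + Vk * Ac^{-1} * Wk^H,   Ac = Wk^H * A * Vk,

    which shifts the k targeted smallest eigenvalues of M*A close to one.
    apply_M(x) applies the user-supplied preconditioner; Vk spans the
    invariant (or harmonic Ritz) subspace and Wk is the left projection
    basis (W = Vk in this article)."""
    Ac = Wk.conj().T @ (A @ Vk)                  # k x k coarse operator
    def apply_M_add(x):
        return apply_M(x) + Vk @ np.linalg.solve(Ac, Wk.conj().T @ x)
    return apply_M_add
```

Here apply_M can wrap, e.g., the sparse approximate inverse of Section II, and Vk the harmonic Ritz vectors recycled in Section V.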

B. Multiplicative Spectral Preconditioning Formulation
It is simple to present the formulation for multiplicative spectral preconditioning by referencing standard multigrid methods. Typically, in a two-grid algorithm, a few iterations are performed to effectively dampen the high-frequency components of the error (presmoothing phase). Low-frequency components that are not effectively damped are then represented on an appropriate small-dimensional space where they
are captured by solving a small-dimensional error equation, and then they are interpolated back into the initial space. Finally, some extra iterations are applied to filter any spurious high-frequency components that could have been reintroduced into the error during the coarse grid correction operation (postsmoothing phase).
By analogy, in the multiplicative spectral preconditioning formulation the coarse subspace is defined as the eigenspace associated with the $k$ smallest eigenvalues of $MA$. After a few presmoothing steps, the residual is projected onto the coarse space using the projection operator $W^H$. The coarse-space error equation is defined by the $k \times k$ matrix $A_c = W^H A V_k$ using a Petrov-Galerkin formula and is solved exactly. The correction is interpolated back into the initial space using the interpolation operator $V_k$, and finally, it is postsmoothed again. A sketch of the preconditioning operation is presented in Algorithm 2, where the preconditioner is denoted for simplicity as $M_{\mathrm{mul}}(A, M)$. The term multiplicative spectral formulation is used since the preconditioner matrix $M_{\mathrm{mul}}$ satisfies the following expression, derived from established results of multigrid theory:
$$ I - M_{\mathrm{mul}} A = (I - MA)^{\beta}\,(I - M_c A)\,(I - MA)^{\alpha}, \qquad M_c = V_k A_c^{-1} W^H \qquad (14) $$
where, as in Section IV-A, $A_c = W^H A V_k$, and $\alpha$ and $\beta$ denote the numbers of presmoothing and postsmoothing steps. The spectral decomposition of the preconditioned matrix $M_{\mathrm{mul}} A$ can be obtained by observing that, by induction on $j$,
$$ (I - MA)^{j} \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} = \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} \begin{bmatrix} (I_k - J_k)^{j} & P_c \\ 0 & (I - F)^{j} \end{bmatrix} $$
with a $k \times (n-k)$ matrix $P_c$. By combining the previous expressions, we get
$$ M_{\mathrm{mul}} A \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} = \begin{bmatrix} V_k & V_k^{\perp} \end{bmatrix} \begin{bmatrix} I_k & * \\ 0 & I - (I - F)^{\alpha+\beta} \end{bmatrix} $$
The precise value of the block denoted by an asterisk is irrelevant. The conclusion is summarized in the proposition below.
Proposition 2: The eigenvalues of $M_{\mathrm{mul}} A$ have the following distribution: the eigenvalue one with multiplicity at least $k$, together with the eigenvalues $1 - (1-\lambda_i)^{\alpha+\beta}$ for $i = k+1, \ldots, n$. The action of $M_{\mathrm{mul}}$ on the spectrum is twofold: it shifts the smallest eigenvalues of $MA$ exactly to point one of the spectrum of $M_{\mathrm{mul}} A$, and it clusters more tightly those eigenvalues of $MA$ that were already clustered near one. Therefore, we can anticipate a further decrease in the number of BGMRES iterations.
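A matrix-free rendering of the operation in Algorithm 2 is straightforward; the NumPy sketch below (ours, with illustrative names and default $\alpha = \beta = 1$) applies the presmoothing, coarse correction, and postsmoothing steps to a residual block:

```python
import numpy as np

def make_multiplicative_spectral_preconditioner(apply_M, A, Vk, Wk,
                                                alpha=1, beta=1):
    """Illustrative multiplicative (two-level) spectral update, mirroring
    Algorithm 2: presmoothing with M, coarse correction on the space
    spanned by Vk, and postsmoothing with M."""
    Ac = Wk.conj().T @ (A @ Vk)                  # coarse-space operator
    def apply_M_mul(r):
        e = np.zeros_like(r)
        for _ in range(alpha):                   # presmoothing
            e = e + apply_M(r - A @ e)
        # coarse correction of the low eigenmodes
        e = e + Vk @ np.linalg.solve(Ac, Wk.conj().T @ (r - A @ e))
        for _ in range(beta):                    # postsmoothing
            e = e + apply_M(r - A @ e)
        return e
    return apply_M_mul
```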

V. EIGENVALUE RECYCLING
The columns of the matrix $V_k$ that appears in the definition of the spectral preconditioning updates formulated in Section IV are chosen to be harmonic Ritz vectors. We recall that, given a subspace $\mathcal{W}$ of $\mathbb{C}^n$ and a matrix $B \in \mathbb{C}^{n\times n}$, a harmonic Ritz pair of $B$ with respect to $\mathcal{W}$ is defined as any pair $(\lambda, y) \in \mathbb{C} \times \mathcal{W}$ such that
$$ By - \lambda y \perp B\mathcal{W} \qquad (16) $$
or, equivalently, $w^H (By - \lambda y) = 0$ for all $w \in \operatorname{Range}(B\mathcal{W})$. The scalar $\lambda$ is referred to as a harmonic Ritz value and the vector $y$ is the corresponding harmonic Ritz vector [18]. The harmonic Ritz vectors can be obtained from the block Arnoldi process at low complexity using the following result.
Lemma 1: If we define $\mathcal{W} = \operatorname{span}\{\mathcal{V}_m\}$, where $\mathcal{V}_m$ is the orthonormal basis of the block Krylov subspace computed by the block Arnoldi process at the end of a cycle, the harmonic Ritz pairs $(\theta_i, g_i)$ associated with $\mathcal{V}_m$ satisfy the relation
$$ \bar{H}_m^H \bar{H}_m\, g_i = \theta_i\, H_m^H g_i \qquad (18) $$
where $\mathcal{V}_m g_i$ are the harmonic Ritz vectors associated with the corresponding harmonic Ritz values $\theta_i$. Proof: See [19].
From the structure of $\bar{H}_m$ in terms of $H_m$ and $H_{m+1,m}$, we derive from (18) the relation
$$ \bigl( H_m^H H_m + E_m H_{m+1,m}^H H_{m+1,m} E_m^H \bigr)\, g_i = \theta_i\, H_m^H g_i \qquad (19) $$
that can be used to compute the $k$ targeted harmonic Ritz pairs. In the definition of the additive and multiplicative spectral preconditioners $M_{\mathrm{add}}$ and $M_{\mathrm{mul}}$, we set $V_k = \mathcal{V}_m G_k$, where $G_k$ collects the $k$ selected eigenvectors $g_i$ as its columns, and $W = V_k$. Although the harmonic Ritz vector computation adds some extra cost to the preconditioner setup, this cost can be offset by the reduced number of iterations, as shown by the numerical experiments presented in Section VII. The complete spectrally preconditioned and initially deflated BGMRES method (hereafter referred to as SPID-BGMRES) is sketched in Algorithm 3.
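In practice, (19) is a small generalized eigenvalue problem built from the block Hessenberg factors, so the recycling step is cheap relative to the matrix-matrix products. The SciPy sketch below (ours; it reuses the basis V and the block Hessenberg factor, called Hbar here, produced by a block Arnoldi cycle such as the sketch in Section II, and all names are illustrative) extracts the $k$ harmonic Ritz vectors associated with the harmonic Ritz values closest to the origin:

```python
import numpy as np
import scipy.linalg as sla

def harmonic_ritz_vectors(V, Hbar, k):
    """Illustrative recycling of k harmonic Ritz pairs (Section V) from a
    block Arnoldi factorization  M*A*Vm = V_{m+1}*Hbar,  where Hbar is the
    (m+1)p x mp block Hessenberg matrix and V the orthonormal basis."""
    mp = Hbar.shape[1]
    Hm = Hbar[:mp, :]                                 # square top part
    # Generalized eigenproblem  (Hbar^H Hbar) g = theta (Hm^H) g
    theta, G = sla.eig(Hbar.conj().T @ Hbar, Hm.conj().T)
    idx = np.argsort(np.abs(theta))[:k]               # k smallest in modulus
    Vk = V[:, :mp] @ G[:, idx]                        # harmonic Ritz vectors
    return theta[idx], Vk
```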

VI. IMPLEMENTATION ASPECTS
Algorithm 3 Pseudocode of the SPID-BGMRES Method for Solving Multiple RHSs Linear Systems
Require: The coefficient matrix A ∈ C^{n×n}, the RHS matrix B ∈ C^{n×p}, the starting matrix X_0 ∈ C^{n×p}, the deflation threshold ε_d, the dimension m of the block Krylov subspace, a user-supplied initial preconditioner M, a convergence tolerance tol, and the maximum number of iterations maxiter for the stopping criterion.
Ensure: The n × p solution matrix X.
1: Compute the initial block residual R_0 = B − AX_0.
2: for i = 1, . . ., maxiter do
3: Compute the QR factorization M R_0 = QT of the preconditioned residual matrix M R_0, where Q is n × p with orthonormal columns and T is p × p upper triangular.
4: Compute the singular value decomposition of the upper triangular factor T as T = U S V^H.
5: Select the p_d singular values of T that satisfy the criterion σ_ℓ(T) > ε_d · tol for all 1 ≤ ℓ ≤ p_d.
6: Set V_1 = QU(:, 1 : p_d) ∈ C^{n×p_d} as the initial orthonormal matrix to start the block Arnoldi process.
7: Apply m steps of the block Arnoldi procedure (Algorithm 1) starting from matrix V_1. At step j, block Arnoldi generates matrices V_{j+1} with orthonormal columns (i.e., V_{j+1}^H V_{j+1} = I_{(j+1)p_d}) and matrices H̄_j satisfying the block Arnoldi factorization M A V_j = V_{j+1} H̄_j.
8: Solve the least-squares minimization problem (12) to compute the matrix Y_m.
9: Update the solution as in (11), and compute the block residual R_m = B − AX_m.
10: Check for convergence: if the convergence test is not satisfied, the method needs to be restarted; set R_0 = R_m and X_0 = X_m.
11: Compute the k harmonic Ritz pairs (θ_i, g_i) from the matrix H_m^H H_m + E_m H_{m+1,m}^H H_{m+1,m} E_m^H, and store the vectors g_i as columns of the matrix G_k.
12: Update the preconditioner M using the additive (13) or the multiplicative (14) spectral preconditioning formulation.
13: end for

The most time-consuming part of the SPID-BGMRES algorithm is the matrix-matrix multiplication involved at every iteration. Although this operation is very demanding from the computational point of view, it also has the potential to enhance computational efficiency because level-3 BLAS implementations can effectively exploit block operations and data locality on modern cache-based computer hardware. This becomes especially critical when a GPU device is involved in the computation, since data transfer is very slow compared with the computational power of the device. This problem appears at two different levels: when the data are moved from the host to the device's main memory and when the data are moved from the device's main memory to the streaming-multiprocessor (SM) cache memory. Block Krylov algorithms are straightforward to combine with fast integral equation solvers, such as MLFMA, which carry out approximate matrix-matrix multiplications with low complexity. In order to evaluate the numerical performance of SPID-BGMRES, however, we perform exact products in our experiments, and this section describes how we efficiently manage the iterative solution on both CPU and GPU computers.
In the first part, we assume that the entire matrix, the RHSs, and the resulting vectors can be stored in the main device memory, so only cache transfers need to be managed in order to increase the arithmetic intensity of the operation. Then, we assume that the matrix is too big to be stored in the device memory; in this case, a multiblock and multistream approach is presented. Both techniques can then be combined to improve the arithmetic intensity and, hence, the performance. In the following examples, the matrix is assumed to be a dense complex matrix with double-precision number representation.

A. Cache Optimization on Graphics Processing Units
We recall that a GPU device is composed of a group of streaming multiprocessors (SMs). Every SM has its own registers and a very fast cache with memory shared among all the threads in the SM. In our matrix-matrix operation, say A · b = x, let us assume that the entire matrix A can be stored in the device memory, as well as the matrices b and x. A subportion of A and b can be loaded into the SM cache. The idea is then to perform all the operations on that data portion before loading a new subportion. In Fig. 3, we show an example of a four-block subpartition of A and b. Instead of a classical row-column multiplication, we can load small blocks of dimension B_s × B_s and perform all the operations on them. This allows us to perform B_s^3 operations before loading a new block. The product is then performed in parallel using the threads in the SM. Note that only a local synchronization is needed at the SM level in order to guarantee the consistency of the cache data (i.e., a new load requires that all the threads have finished the product).
More precisely, let us write the matrix-matrix operation componentwise, adopting the Einstein convention on repeated indexes: $x_{ij} = A_{ik}\, b_{kj}$. The cache-local value of A can be extracted for every subblock as $A^{lr}_{\alpha\beta} = A_{B_s(l-1)+\alpha,\, B_s(r-1)+\beta}$, where $\alpha, \beta = 1, \ldots, B_s$ and $l, r = 1, \ldots, N/B_s$. The same extraction can be used for the matrix b. Then, the matrix-matrix multiplication can be written as the sum of local blocks, $x^{lq}_{\alpha\gamma} = A^{lr}_{\alpha\beta}\, b^{rq}_{\beta\gamma}$, see Fig. 4. Note that we have essentially written A and b as two-level hierarchical matrices. The standard row-column product is applied by every SM, but the basic operation now involves a dense matrix-matrix product instead of a scalar product. By choosing a proper block dimension so as to fit the cache, we can maximize the reuse of the data. In our practical examples, we will use B_s = 32 or B_s = 16, but different parameters can be used according to the available architecture. Note that if we consider only one RHS (i.e., N_b = 1), then b and x become column vectors of dimension N. In this case, we can still load a subpartition of A and b, but this allows us to compute only B_s^2 multiplications before a new cache load is needed, resulting in a lower arithmetic intensity.
We further exploit the multidimensional structure of the thread blocks on the device, so that every thread (α, β) computes the product involving $A_{\alpha\beta}$ with only a local synchronization before and after the cache loading.
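The CPU-side NumPy sketch below (ours; the names and the choice of B_s are illustrative) mimics this two-level blocking: every B_s × B_s tile of A is "loaded" once and combined with the matching tile of b, which is the access pattern that a CUDA kernel would reproduce with shared-memory tiles:

```python
import numpy as np

def tiled_matmat(A, b, Bs=32):
    """Illustrative CPU model of the cache-blocked product of Section VI-A.

    A (N x N) and b (N x Nb) are traversed in Bs x Bs tiles, so that every
    tile brought into fast memory (the SM shared memory on a GPU, the cache
    on a CPU) is reused for roughly Bs^3 operations before the next load.
    NumPy slicing handles ragged edge tiles automatically."""
    N, Nb = b.shape
    x = np.zeros((N, Nb), dtype=np.result_type(A, b))
    for l in range(0, N, Bs):                  # row tiles of A and x
        for q in range(0, Nb, Bs):             # column tiles of b and x
            acc = np.zeros((min(Bs, N - l), min(Bs, Nb - q)), dtype=x.dtype)
            for r in range(0, N, Bs):          # inner tiles: A^{lr} * b^{rq}
                acc += A[l:l + Bs, r:r + Bs] @ b[r:r + Bs, q:q + Bs]
            x[l:l + Bs, q:q + Bs] = acc
    return x
```

Comparing the result against a plain A @ b verifies that the tiled decomposition reproduces the exact product.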
In Section VI-B, we consider the case when the matrix A is large and cannot be stored entirely in the device memory. Note that in this case the problem is similar to the one considered here, since we first need to optimize the use of the data in the device memory before a new load from host to device. We can rewrite the matrices A, b, and x as a three-level hierarchical matrix, so that the first-level block moves from the host to the main device memory, the second-level block moves from the device memory to the SM cache, and the third level contains the dense operation performed at the SM level. As we will see, when this approach is used in a synchronous way we are still limited by the transfer time of data from host to device, so we combine this strategy with an asynchronous multistream approach.

B. Three-Level Hierarchical Matrix
Since data transfer between host and device is very time-consuming, it is always better to store the data in the device's memory and avoid data transfer. In our application, three objects are needed for the matrix-matrix multiplication, namely the following.
1) The matrix A, that is of dimension N x × N x .
2) A second rectangular matrix b of dimension N x × N b .
3) The resulting product can be stored in a matrix x of dimension N_x × N_b.
Usually, the dimension of the matrix is much larger than the number of considered RHSs, i.e., N_b ≪ N_x, so we can assume that the matrices b and x can always be stored in the device memory. On the contrary, it is easy to generate matrices A that do not fit the device memory, which is very fast but limited in capacity. The proposed SPID-BGMRES algorithm involves a matrix-matrix multiplication and a matrix-vector multiplication inside each inner iteration; these are the most frequently called and time-consuming operations of the solver. A naive implementation of data transfer from the host to the device would easily cancel the gain obtained using GPUs, since the time needed to copy data from the host to the device can be very large due to the low-speed connection between the two. However, some computations can be done during the data transfer if some data are already loaded in the device memory. In this section, we provide a strategy to mitigate data-transfer issues using a multistream approach. A CUDA stream contains a set of instructions that have to be executed in a certain order; however, different streams run asynchronously in a concurrent way. In our case, the required data have to be loaded before performing operations on them (ordered operation queue), but one stream can load data while a second stream performs computation on a different portion of the matrix (concurrently). In order to do that, we take a partition of the matrix rows into m streams and of the columns into n blocks, with 1 < n, m < N_x, and we assume for simplicity that mod(N_x, n) = mod(N_x, m) = 0. In this way, we can define N_x^b = N_x/n and N_x^s = N_x/m as the constant number of elements per block and per stream, respectively. Let us call A^{αβ}_{ij} the matrix portion associated with stream α ∈ [1, m] and block β ∈ [1, n] (see the blue blocks in Fig. 5). Note that A^{αβ}_{ij} represents the first-level partition of A, which is the constitutive matrix for the cache-level optimization described in Section VI-A. Since b fits the device memory, we assume that all the streams can access those data, as well as x, which resides in the device memory. This partition is shown in Fig. 5. Due to the chosen partition, every stream has to manage a certain range of row indexes [i_L(α), i_R(α)], with i_L = (α − 1)N_x^s + 1 and i_R = α N_x^s, and every block manages a range of column indexes [j_L(β), j_R(β)], with j_L = (β − 1)N_x^b + 1 and j_R = β N_x^b. The vector b is read-only during the operation, while x is updated at every block operation. Device streams are able to run concurrently and allow asynchronous data transfer between host and device. We define a three-rank tensor M = M^s_{ij}, where i = 1, . . ., N_x^s, j = 1, . . ., N_x^b, and s = 1, . . ., m, that represents a single block column (i.e., B_1 × {S_1, . . ., S_m}). In other words, M^s_{ij} can be seen as a sliding representation of the matrix on the device. Note that |M| = |A|/n, so we can increase n in order to reduce the memory usage on the device.
For a given stream α, the ordered operations to be computed in the stream are as follows.
1) For β = 1, . . ., n, the following holds.
   a) Load M^α = A^{α,β} (host-to-device memory operation).
   b) Compute x([i_L(α), i_R(α)], [1, N_b]) += M^α · b([j_L(β), j_R(β)], [1, N_b]) (operation on the device).
Since every stream simply updates the values of x in the range [i_L(α), i_R(α)] × [1, N_b], no synchronization between streams is required. In this way, loading and computation are pipelined, and computation can start as soon as a small block is loaded into the device memory, since streams run concurrently and do not require the entire tensor M but only a portion of it, i.e., M^α. This procedure describes the first level of loading. If we move one level down, exactly the same problem appears when we try to perform a single-block operation [i.e., step b) above]. In this case, the matrices M^α and b([j_L(β), j_R(β)], [1, N_b]) are too large to fit the cache memory of an SM, so the partition strategy can be used again: we further exploit the multidimensional structure of the threads and partition both M and b into 2-D blocks using the procedure explained in Section VI-A. This strategy is potentially able to mitigate the data transfer time and is consistent by construction. However, some assumptions must hold in order to obtain a computational advantage. First, we assume that the data transfer time is comparable with the computation time: if the computation time is too small (i.e., low arithmetic intensity), we will only see the transfer time. Second, at the beginning of the algorithm we need to wait for the loading of the first block into the device memory. In the SPID-BGMRES algorithm, we can easily initialize M^α = A^{α,1} at the beginning of the solver. Later, after the completion of each matrix-matrix operation, the first block can be loaded in an asynchronous way in order to hide this first loading time.
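A sequential NumPy model of this first-level partition (ours; stream concurrency and the host-to-device copies are only indicated in comments, and all names are illustrative) makes the index bookkeeping explicit:

```python
import numpy as np

def streamed_blocked_matmat(A, b, n_blocks=4, m_streams=4):
    """Illustrative (sequential) model of the multistream partition of
    Section VI-B.  Rows of A are split across m_streams, columns across
    n_blocks; on the GPU each stream loads its next A block while another
    stream computes, which is what hides the host-to-device transfer time.
    Here the 'load' is just a slice copy, and N_x is assumed to be a
    multiple of both n_blocks and m_streams."""
    Nx, Nb = A.shape[0], b.shape[1]
    x = np.zeros((Nx, Nb), dtype=np.result_type(A, b))
    Ns, Nblk = Nx // m_streams, Nx // n_blocks       # rows per stream / block
    for alpha in range(m_streams):                   # streams (concurrent on GPU)
        iL, iR = alpha * Ns, (alpha + 1) * Ns
        for beta in range(n_blocks):                 # ordered queue inside a stream
            jL, jR = beta * Nblk, (beta + 1) * Nblk
            M_block = np.array(A[iL:iR, jL:jR])      # host-to-device load (modeled)
            x[iL:iR, :] += M_block @ b[jL:jR, :]     # compute on the device
    return x
```

On the device, the loop over alpha runs as m concurrent CUDA streams, so the slice copy of one block overlaps with the product computed by another stream whenever the arithmetic intensity is high enough.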
Next, we evaluate the performance of the proposed block Krylov method for solving multiple-RHS MoM-discretized linear systems using the implementation described in this section.

VII. NUMERICAL EXPERIMENTS
In order to facilitate rapid prototyping and evaluation of the numerical properties of the method, the initial set of experiments is carried out on two small MoM linear systems of size n = 1701 and n = 2430, respectively, arising in scattering analysis. The geometries are illustrated in Fig. 6; the underlying problem is the monostatic RCS calculation. These first experiments are performed in MATLAB (version R2022b) on a laptop equipped with an 11th Gen Intel i7-1165G7 @2.80 GHz and 16 GB of memory. The initial guess for the iterative solution is set to X_0 = 0 ∈ C^{n×p}, and iterations are stopped when the convergence criterion on the block residual norm is satisfied or when the number of iterations reaches maxiter = 2000. In Figs. 7-10 and Tables I-IV, we compare the following solvers.
1) BGMRES is the standard block variant of the GMRES method described in Section II.
2) ID-BGMRES is the BGMRES method enhanced with the initial residual deflation strategy discussed in Section III.
3) SPID-BGMRES-M_add is the BGMRES variant (Algorithm 3) enhanced with initial residual deflation and spectral preconditioning based on the additive formulation presented in Section IV-A.
4) SPID-BGMRES-M_mul is the BGMRES variant (Algorithm 3) enhanced with initial residual deflation and spectral preconditioning based on the multiplicative formulation presented in Section IV-B.
The built-in MATLAB function eigs is used in our experiments with SPID-BGMRES-M_add and SPID-BGMRES-M_mul to compute the harmonic Ritz vectors using the recycling technique presented in Section V, namely, from (19).
The convergence histories of the standard BGMRES method against the spectrally preconditioned and initially deflated variants are shown in Fig. 7 for the sphere problem and in Fig. 8 for the satellite problem, respectively. The numerical results for these experiments are reported in Tables I and II. We solve for 20 and 50 RHSs, we set the dimension of the block Krylov subspace equal to 50, and we recycle 15 small eigenvalues from the block Arnoldi factorization. The deflation parameter ε_d is set equal to 0.1, and the BGMRES convergence tolerance for the stopping criterion is 1e-6. The initial preconditioner M is the Frobenius-norm minimization approximate inverse [13] described in Section II. The advantages of using deflation and eigenvalue recycling on the convergence of BGMRES are evident in terms of fewer iterations and shorter solution times. As clarified in Section VI, in our runs we perform exact products with the dense matrix A, and these are particularly expensive in MATLAB. The forward errors reported in Table II for the experiments illustrated in Figs. 7(b) and 8(b), for the case of 50 RHSs, demonstrate the remarkable robustness of the SPID-BGMRES method. These results compare iterative solutions (denoted as X_approx in Table II) achieved with a convergence tolerance of 1e-6 to accurate direct solver solutions (X). When the convergence tolerance is reduced to 1e-12, the forward errors for SPID-BGMRES-M_mul in our experiments decrease to 2.5545e-12 for the sphere problem and to 1.295e-12 for the satellite problem.
In our experiments reported in Fig. 9(a) and Table III, we deteriorate the quality of the initial preconditioner M on the satellite problem, so that a significant number of outlier eigenvalues close to the origin appear in the spectrum of MA. As we can see in Fig. 9(a), in this case the spectral preconditioning technique is crucial for convergence. The new variants enable us to achieve highly accurate solutions, in contrast with the standard BGMRES method. This aspect can be observed in Fig. 10, where a tight convergence tolerance equal to 1e-12 is used for the iterative solution on both the sphere and the satellite problems. Finally, in the experiments reported in Table IV on the satellite, we compute the eigenvector matrix V_k for the spectral preconditioning updates in preprocessing, that is, before starting the iterative solution, using the built-in MATLAB function eigs that calls the ARPACK package applied to MA, instead of recycling the harmonic Ritz vectors from the block Arnoldi procedure as described in Section V. Although this eigencomputation is slightly more expensive, it enables us to correct and enrich the preconditioner immediately from the beginning of the iterative solution, and thus it can have a beneficial effect on convergence, to some extent, as shown in Table IV.

A. Numerical Experiments on GPUs
In this section, we use the implementation described in Section VI to solve some larger test cases on a computer system equipped with an Intel i9 12900KS and an NVIDIA RTX A4000 with 16 GB of dedicated memory. We consider a sphere geometry discretized with N = 12 000 degrees of freedom. The problem is small enough to be loaded entirely into the device memory, but it represents a suitable setup for performance testing.
1) Performance of the Single Operation: In the first test, we evaluate the performance of a single matrix-vector and matrix-matrix multiplication. The preload of the initial block M is activated, so it is not measured. We fix the number of RHSs to N_b = 40 for the matrix-matrix operation and the number of streams to m = 32 for all the cases. In Table V, we report the times obtained for a single operation in µs. The time obtained using a single stream/single block is reported only to show the stability of the system and can be considered constant through all the simulations. As we can see, the matrix-vector multiplication is dominated by the transfer time of 94 µs, which is 87.57% of the overall operation time. On the contrary, the matrix-matrix multiplication with multiple RHSs results in a higher arithmetic intensity, as expected: 284 µs are spent in computation, which is 74.27% of the total operation time. If the matrix can be stored in the device memory (i.e., n = 1), we recover the computation time (with no transfer time), as expected. However, if we increase the number of blocks, we see that the multistream time increases by a factor that is proportional to the loading time of the remaining matrix, that is, 94·(n − 1)/n µs. On the contrary, if the computational time is sufficiently large, this approach is able to hide the transfer time, since this time is mostly spent during computation. Furthermore, the use of the multistream/multiblock approach seems to speed up the computation when a larger number of blocks is used. In order to reduce the bottleneck observed for matrix-vector multiplications, this operation is piped into the matrix-matrix stream when necessary, so that it increases the computational work for the same data loading, resulting in a higher arithmetic intensity of the merged operation.
2) SPID-BGMRES Implementation With Pipes: We test here the multistream approach for a complete run of the SPID-BGMRES algorithm. In this case, the data transfer of the first block at every inner iteration is included and contributes to the overall measured time. We run the simulation using N_b = 10 and N_b = 40. We use ten cores and several blocks on the device. The simulation using the entire preloaded matrix (i.e., single stream/single block) is also performed for comparison. In Table VI, we can see the difference between CPU and GPU in terms of computational time with several pipes. It can be observed that, by using concurrent multiple streams, we can operate with data loading at essentially no penalty if the computational effort is large enough. This approach therefore seems well suited for the multi-RHS framework assumed in this article. We underline that in the case n = 96, only a fraction 1/n of the matrix is stored on the device. This means that we can run on a single machine a matrix that is 96 times larger using the same resources.
3) Experiments on the Sphere Problem: Here, we consider a sphere geometry whose associated linear system has dimension N_x = N = 30 720. The system itself requires 15.3 GB of free space and, since the operating system requires 1 GB of graphics memory, the effective available memory on the device is not sufficient to store the matrix and the result vectors. For this test, we first build a preconditioner M with a sparsity ratio of 3.9% based on Ã, which is again a sparse approximation of A with a ratio of 19.5%. For this test, we run with 32 streams and 32 blocks and with several numbers of RHSs, from N_b = 50 to N_b = 360. In the case of N_b = 200, the memory needed to store the result and data matrices x and b is 2 · N_x · N_b · 16/10^9 ≈ 0.197 GB. The memory allocated on the device using this block partition becomes N_x^2 · 16/(10^9 · 32) ≈ 0.471 GB. In order to allow asynchronous data transfer between the host and the device, the matrix needs to be stored in a pinned memory region, which is limited to 32 GB on the system used but can be increased according to the RAM available at the host level. Finally, we set a restart of 50 iterations and a tolerance of ε = 1e-8. The resulting number of iterations as well as the measured computational time is reported in Fig. 11. The number of iterations and the time required for CPU (ten cores) and GPU computing are also reported in Table VII. As we can see, we obtain a gain factor that ranges from 2.63 to 2.99 in terms of computational time even though data transfer is active; the high arithmetic intensity of the matrix-matrix operation allows us to mitigate this effect and still maintain the speedup. It is worth mentioning that, for the test with N_b = 200 on the GPU, the total time spent in matrix-matrix multiplications is around 3177.9 s, that is, 97% of the overall computational time. This confirms that the matrix-matrix and matrix-vector operations are the critical part where parallelization and optimization become very important. This is confirmed even when the complete problem is solved (i.e., N_b = 360): in this case, the matrix-matrix/matrix-vector products require 7017.23 s on the GPU, which corresponds to 97.44% of the overall computational time.
4) Experiments on the Aircraft Problem: In Fig. 12, we plot the convergence histories of the SPID-BGMRES method using CPU and GPU computers for solving sequences of dense MoM linear systems with a varying number of RHSs (N_b = 10, . . ., 80) on an Airbus aircraft prototype mesh (N = 23 676). The mesh is depicted in Fig. 13. This problem is tough to solve because of the irregular mesh, which contains many fine geometric details; as a result, slow convergence of Krylov subspace methods is observed compared with more regular geometries [13]. Hence, we set the restart to 200. The preconditioner M used in our experiments is a sparse approximate inverse with density 10.1%, computed from a sparse approximation Ã to the dense coefficient matrix A of density 84.5%. We can see that SPID-BGMRES is remarkably robust on this difficult problem, as it converges using 80 RHSs, whereas the standard BGMRES method using ten RHSs reduces the initial residual norm by less than two orders of magnitude after 2000 iterations.

B. Characteristics of the Preconditioner Implementation
As shown in [14], the construction of a sparse preconditioner M for a matrix A can be very expensive when the matrix A is full. Therefore, the computation of M is based not on A, but on a sparse approximation of A, namely Ã. The sparsity ratios for the approximation Ã and for M can be the same for very simple geometries. This choice can lead to a deterioration of the preconditioner for a general matrix, due to a lack of information in the approximation of A. One advantage of the Frobenius-norm minimization strategy is that it results in a series of independent least-squares problems that can be solved in parallel for every row of M. In this process, two time-consuming parts are involved: first, the columns involved in the least-squares problem have to be extracted from Ã. This process is based only on the sparse indexes of M and can be done independently for every row i = 1, . . ., N. The solution of the least-squares problem is then based on the extracted matrix and is also independent for every i, see [13]. In order to solve the linear least-squares problem, we can use the well-optimized double complex general least squares (ZGELS) routine that is part of the Intel Math Kernel Library (MKL) package. The column extraction process, on the other hand, can easily be performed on GPUs. Since both operations are independent with respect to i, the parallelization on both CPUs and GPUs is trivial.
In Table VIII, we report the gain of using GPU computing with respect to CPU computing when the number of retained entries per row in Ã and M is equal to 200 and 600, which for the sphere with N = 12 000 corresponds to densities of 1.67% and 5%, respectively.

C. Experiments on Sparse Matrices
The SPID-BGMRES method can also be applied to the solution of sparse linear systems. We consider a sparse matrix from quantum chromodynamics, namely, the conf5-8×8-05 matrix available at https://math.nist.gov/MatrixMarket/. The matrix has dimension N × N with N = 49 152 and only 1,916,928 nonzero elements. The structure of the matrix is depicted in Fig. 14. The matrix is represented using the compressed sparse row (CSR) storage format. As a preconditioner, we use a simple incomplete LU decomposition. We perform the simulation both on CPU (ten cores) and GPU, using in both cases N_b = 360 RHSs. We set the restart to 50 and the tolerance to ε = 1e-6. Both simulations converge in 99 iterations; the CPU run requires 33.74 s, while the GPU simulation takes 27.60 s, a speedup of 1.22.
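For readers who want to reproduce a comparable single-RHS baseline with off-the-shelf tools, the SciPy sketch below (ours; the local filename, the default ILU fill settings, and the use of SciPy's restarted GMRES instead of the block solver of this article are all assumptions) loads a Matrix Market file in CSR format, builds an incomplete LU preconditioner, and runs a restarted GMRES iteration:

```python
import numpy as np
import scipy.io
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Download the Matrix Market QCD matrix referenced in the text and adjust
# the local filename accordingly (assumed here for illustration).
A = sp.csr_matrix(scipy.io.mmread("conf5-8x8-05.mtx"))
n = A.shape[0]
b = np.ones(n, dtype=A.dtype)                       # illustrative single RHS

ilu = spla.spilu(sp.csc_matrix(A))                  # incomplete LU preconditioner
M = spla.LinearOperator(A.shape, matvec=ilu.solve, dtype=A.dtype)

# restart=50 mirrors the setting used in the article; rtol requires SciPy >= 1.12.
x, info = spla.gmres(A, b, M=M, restart=50, rtol=1e-6)
print("converged" if info == 0 else f"info = {info}",
      "relative residual =", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```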

VIII. CONCLUSION
We have presented numerical results with a block iterative Krylov subspace method for solving dense linear systems arising from the MoM discretization of boundary integral equations with many RHSs given at once. The proposed method combines a block size reduction of the initial residual vectors, spectral preconditioning, and eigenvalue recycling techniques to mitigate memory costs and reduce the number of iterations of conventional block Krylov solvers. This article introduces the theoretical foundation of the block iterative method and provides supporting evidence through numerical performance and parameter setting analyses. The size of the boundary element matrices examined in this study is small in comparison with practical engineering test cases, in order to facilitate rapid prototyping and evaluation of the numerical properties of the method. However, consider that on the sphere discretized with N = 30 720 degrees of freedom, a direct solver would take around 15 GB of memory to store the factorization using double-precision complex arithmetic. Porting a dense direct solver to GPU processors has its own set of challenges due to frequent synchronizations and irregular memory accesses that are difficult to optimize on a GPU [20]. On the other hand, as a block iterative solver, our method is based on matrix-matrix operations that can be efficiently implemented on both multi-CPU and GPU-accelerated hardware. This enabled us to develop an efficient GPU implementation that, for the 30 720 × 30 720 problem, requires an extra memory cost allocated on the device of only 0.471 GB in the case of 200 RHSs using 32 streams and 32 blocks. For solving large-scale boundary integral equation systems of the order of O(10^6) unknowns or more, our method can be easily integrated with fast integral equation solvers, such as, e.g., MLFMA, hierarchical matrices, or panel clustering, that perform fast approximate matrix-vector operations in O(N log N) memory and arithmetic complexity per RHS. This is a fundamental advantage over a direct method. This combination requires some implementation effort, and thus it will be investigated further in a separate study. The results of our experiments show the good performance of the new method on both CPU and GPU computer systems. Hence, we believe that it can be useful for solving problems in integral equation-based engineering applications, such as those arising, e.g., in computational electromagnetics and acoustics.

Manuscript received 9 September 2023; revised 27 November 2023; accepted 7 December 2023. Date of publication 29 December 2023; date of current version 7 August 2024. This work was supported in part by the Provincia autonoma di Bolzano/Alto Adige-Ripartizione Innovazione, Ricerca, Università e Musei, under Contract 19/34; and in part by the Gruppo Nazionale per il Calcolo Scientifico (GNCS), Istituto Nazionale di Alta Matematica (INdAM), under Project CUP_E53C22001930001. The work of Bruno Carpentieri was supported by the Free University of Bozen-Bolzano under Grant IN200Z SmartPrint. The work of Dong-Lin Sun was supported in part by the National Natural Science Foundation of China under Grant 12001059. The work of Yan-Fei Jing was supported in part by the National Natural Science Foundation of China under Grant 12071062 and in part by the Science Fund for Distinguished Young Scholars of Sichuan Province under Grant 2023NSFSC1920. This article was presented in part at the IEEE MTT-S Numerical Electromagnetic and Multiphysics Modeling and Optimization (NEMO) 2023 Conference, Winnipeg, MB, Canada, June 28-30, 2023 [DOI: 10.1109/NEMO56117.2023.10202284]. (Corresponding author: Bruno Carpentieri.)

Fig. 2. Distribution of singular values for the RHS matrix B on the sphere problem (n = 2430).

Algorithm 2 Multiplicative Spectral Preconditioning Operation p = M_mul · r
1: Eliminate the residual components associated with the large eigenvalues of M A (presmoothing phase).
2: e_0 = 0
3: for i = 1, . . ., α do
4: e_i = e_{i−1} + M(r − A e_{i−1})
5: end for
6: Correct the residual components associated with the k smallest eigenvalues of M A.
7: e_α = e_α + V_k A_c^{−1} W^H (r − A e_α), with A_c = W^H A V_k.
8: Filter again the residual components associated with the large eigenvalues of M A (postsmoothing phase).
9: for i = 1, . . ., β do
10: e_{α+i} = e_{α+i−1} + M(r − A e_{α+i−1})
11: end for
12: p = e_{α+β}

Fig. 6. Geometries used for the MATLAB experiments. (a) Case I: a sphere with a radius of 1 m illuminated at 223 MHz. (b) Case II: a satellite illuminated at 57 MHz.

Fig. 9. Performance of the BGMRES method and its variants ID-BGMRES and SPID-BGMRES-M_add on the satellite problem using a low-accuracy sparse approximate inverse preconditioner M. (a) Case I: convergence histories. (b) Case II: distribution of eigenvalues of MA.

Fig. 10. Convergence histories of the BGMRES method and its variants ID-BGMRES and SPID-BGMRES using a low convergence tolerance of 1e-12. (a) Case I: on the sphere problem. (b) Case II: on the satellite problem.

Fig. 12. Convergence histories of the SPID-BGMRES method using CPU and GPU computers on the Airbus aircraft problem (N = 23 676).

Fig. 13. Geometry used for the experiments on the aircraft problem. Courtesy of the Airbus company.
Algorithm 1 Block Arnoldi Procedure Based on the Modified Block Gram-Schmidt Orthonormalization
1: Compute a reduced QR decomposition R_0 = V_1 R, where R ∈ C^{p×p} is a full-rank upper triangular matrix and V_1 ∈ C^{n×p} is a matrix with orthonormal columns.
2: for j = 1, 2, . . ., m do
3: Compute W_j = A V_j
4: for i = 1, 2, . . ., j do
5: H_{i,j} = V_i^H W_j
6: W_j = W_j − V_i H_{i,j}
7: end for
8: Compute the QR decomposition W_j = V_{j+1} H_{j+1,j}
9: end for
10: Denote by V_m = [V_1, . . ., V_m] the orthonormal basis of the block Krylov subspace and by H̄_m the (m+1)p × mp block Hessenberg matrix with blocks H_{i,j}.

TABLE I. NUMERICAL RESULTS FOR THE EXPERIMENTS REPORTED IN FIGS. 7 AND 8

TABLE II. FORWARD ERRORS FOR THE SPID-BGMRES METHOD FOR THE EXPERIMENTS REPORTED IN FIGS. 7 AND 8

TABLE III. NUMERICAL RESULTS FOR THE EXPERIMENTS REPORTED IN FIG. 9(a)

TABLE IV. NUMERICAL EXPERIMENTS ON THE SATELLITE USING TWO DIFFERENT STRATEGIES FOR COMPUTING THE SPECTRAL INFORMATION (MATRIX V_k) FOR THE PRECONDITIONING UPDATES M_add AND M_mul

TABLE V. MULTISTREAM PERFORMANCE COMPARISON

TABLE VII. TIME COMPARISON BETWEEN CPU AND GPU ON THE SPHERE PROBLEM (N = 30 720). THE NUMBER OF ITERATIONS FOR THE GPU AND CPU EXPERIMENTS IS EXACTLY THE SAME

TABLE VIII. COMPARISON BETWEEN CPU AND GPU TIME FOR THE COMPUTATION OF THE INDEXES