An Efficient and Error-Controllable Angular Sweep Algorithm for Electromagnetic Wave Scattering and its Analysis

The angular sweep of electromagnetic wave scattering is formulated as a matrix equation with multiple right-hand sides (RHSs). Although the low-rank approximation of an RHS matrix is a popular choice for reducing the computational costs of multiple RHSs, only a small amount of research has been conducted to explore how this approximation impacts the solution quality. Furthermore, there has not been sufficient research on the quality of the solution as a function of the accuracy of the iterative solver. We present an error analysis of the approximated solution considering both the reduced number of RHSs and the tolerance of the iterative solver. Based on the error analysis, a new angular sweep algorithm is proposed with fine-tuned tolerances of the iterative solver for individual singular vectors. The different tolerances for each singular vector increase the efficiency of the proposed algorithm. Another benefit of the proposed algorithm is that the error can be bounded by a user-defined global tolerance. In addition, a variant of the generalized conjugate residual method for multiple RHSs is introduced to accelerate iterative solvers. Finally, numerical validation is conducted with three examples in which the discontinuous Galerkin surface integral equation method is applied. The experiments support two conclusions: tight upper and lower bounds of the solution error exist, and fine-tuning the tolerances reduces the computational costs.


I. INTRODUCTION
Full-wave electromagnetic simulation has been used in a variety of applications [1], [2], [3] due to the development of successful algorithms and the expansion of processing capability over the past few decades. For these simulation methods, computing multiple angular responses of electromagnetic wave scattering is a classic challenge in computational electromagnetics. A right-hand side (RHS) matrix can be used to construct multiple incident fields, with each column vector representing an excitation vector for a particular incident angle. Standard direct solvers, such as lower-upper (LU) decomposition [4], are favored for solving multiple RHSs; in general, however, their complexity makes advanced methods necessary. The hierarchical matrices approach [5] is one of the direct solvers that addresses the abovementioned problems, even for electrically large targets. Zhou and Jiao [6] and Guo et al. [7] applied hierarchical matrix methods in combination with the finite-element method (FEM) and the integral equation, respectively, although the computational complexity of the work in the latter article is expected to be O(N^1.5 log N). Alternatively, block Krylov subspace methods are a viable option because several RHSs can be solved simultaneously. Various examples have been introduced, such as the block generalized minimal residual (BGMRES) method [8] and the block generalized conjugate residual with optimal truncation [BGCROT(m,k)] [9]. Another effort to speed up iterative solvers for multiple RHSs is the generalized conjugate residual with inner orthogonality and deflation restarting [GCRO-DR(m,k)] [10], [11]. In addition to block iterative solvers, other efforts are underway that reduce the dimension of the multiple RHSs themselves. Low-rank approximation with singular value decomposition (SVD) [12], asymptotic waveform evaluation (AWE) [13], [14], and model-based parameter estimation (MBPE) [15], [16] are well-known methods. In addition, Peng et al. [11] introduced a novel method that uses Fourier harmonics to construct excitation matrices. This method has two advantages: the accuracy of the excitation matrix can be estimated, and the sampling size can be adaptively increased. However, the approach is limited because the tolerance of the SVD is not systematically chosen and the SVD is computationally expensive. Adaptive cross approximation (ACA) is one remedy, used in [17] and [18]. In particular, Kazempour and Gürel [17] applied a recompressed ACA (RACA) to further compress the suboptimal factorization of ACA, although this approach requires an a priori choice of both the ACA tolerance and the iterative solver tolerance. In [19], interpolative decomposition (ID) was used to shrink the dimension of the incident vectors with a given tolerance. Importantly, that work discusses the error control of the algorithm, considering both the accuracy of the ID and an indirect analysis of the relationship between surface currents and radar cross sections. Nonetheless, the error analysis was not rigorous because the numerical error of the iterative solver for each skeleton vector was overlooked. Recently, experimental approaches [20], [21] have been proposed with compressive sensing [22].
These articles show that compressive sensing methods can reduce the dimensionality of multiple incident field vectors, although the practical advantages of these methods are debatable due to a lack of both error analysis and systematic selection of the initial number of incident angles.
Based on the proper selection of angular samples and the low-rank approximation of the RHS matrix, this article introduces an efficient and error-controllable algorithm for fast angular sweeps. To minimize the computational costs, distinct tolerances are chosen automatically by the algorithm for the reduced RHS vectors rather than using the same tolerance for all vectors. One major benefit of the algorithm is that the error of the recovered solution is bounded by the user-defined tolerance, as demonstrated later. Also, the proposed algorithm can be applied to both dense and sparse matrices, although surface integral equation (SIE) methods are used as an example in this article. In addition, a simple variant of the block generalized conjugate residual (GCR) method is introduced to speed up the iterative solution procedure.
There are three major contributions of this article: 1) distinct iterative-solver tolerances for individual singular vectors; 2) a theoretical error bound that accounts for the compression; and 3) precise error controllability based on the error analysis. The remainder of this article is organized as follows. In Section II, the preliminaries are presented, followed by a detailed explanation of the algorithm. The main results are presented in Section II-D (Theorem 1) with the error analysis. In Section III, we present the numerical validation of the error bounds as well as the efficiency of our method with three different targets, one of which is electrically large. Finally, the conclusions are presented in Section IV.

A. Preliminaries
In this article, a boldface capital letter and a boldface small letter denote a matrix and a vector, respectively. I_j is an index set defined by I_j := {1, . . . , j}. In addition, a matrix norm can be induced by the vector norm [23], [24], i.e., for an arbitrary matrix W ∈ C^{N×M}, ||W|| := max_{||y||=1} ||Wy||, where y ∈ C^M and ||·|| denotes the Euclidean vector norm. Unless otherwise specified, the matrix norm refers to this vector-induced matrix norm.

B. Problem Statements
To begin, consider the following definition.
Definition 1: The matrix equation
    AX = B    (1)
is said to be solved with a given tolerance δ if X̃ is a numerical solution of X such that the relative error
    ϵ_r := ||B − AX̃|| / ||B||    (2)
satisfies
    ϵ_r ≤ δ.    (3)
In Definition 1, A ∈ C^{N×N} is a system matrix expanded with N basis functions. An excitation matrix B is defined by B = [b_1, . . . , b_M] ∈ C^{N×M}, and each b_i is an N-dimensional column vector of the plane-wave excitation at a particular incident angle. Both A and B can be regarded as matrices after applying preconditioners. In addition, X = [x_1, . . . , x_M] ∈ C^{N×M} is the unknown solution matrix whose column vectors x_i are the N-dimensional unknown surface current vectors for the corresponding incident vectors b_i. ϵ_r is the relative error used to evaluate the accuracy of the solution, and δ is a predetermined global tolerance. The iterative solver uses (3) as the convergence criterion.
The objective of our proposed algorithm is to find a numerical solution X̃ = [x̃_1, . . . , x̃_M] ∈ C^{N×M} such that (3) is satisfied. Ideally, the best way to estimate the error of the solution ϵ_r would be to compute the vector-induced matrix norm at every iteration; however, the computational cost is excessive because it requires an SVD of the residual. As a result, we later propose a method for estimating the matrix norm with respect to the tolerances of the iterative solver for each singular vector.
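As a concrete illustration (ours, not code from the article), the relative error in (2) can be evaluated with NumPy as follows; note that the vector-induced (spectral) norm of the residual matrix requires an SVD, which is why computing it at every iteration is expensive.

    import numpy as np

    def relative_error(A, B, X_tilde):
        """Relative error eps_r = ||B - A X_tilde|| / ||B|| in the vector-induced
        (spectral) matrix norm, cf. (2).  np.linalg.norm(., 2) returns the largest
        singular value, i.e., it performs an SVD of the residual matrix."""
        residual = B - A @ X_tilde
        return np.linalg.norm(residual, 2) / np.linalg.norm(B, 2)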

C. Proposed Algorithm
The algorithm is made up of the following four steps. 1) Computation of Incident Vectors: For a given target, the number of samples shall be appropriately determined to capture the desired angular responses while minimizing the computation. According to various articles [19], [25], [26], the angular steps Δϕ and Δθ can be limited by (4), where d_0 is the number of digits of accuracy, k_0 is the free-space wavenumber, and D is the edge length of a cube that encloses the target. The unit of both Δϕ and Δθ is radians. Then, the number of incident vectors M is determined by (5), where ⌈·⌉ is the ceiling function and ϕ_start and ϕ_end are the start and stop azimuthal angles, respectively. Similarly, θ_start and θ_end denote the start and stop elevation angles, respectively. We assume that all incident vectors are normalized by their vector norms; therefore, ||b_i|| = 1, ∀i ∈ I_M.
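For illustration only, the following sketch counts the angular samples from given start/stop angles and the steps Δϕ and Δθ obtained from (4); since (5) is not reproduced above, the ceiling-plus-endpoint convention used here is an assumption (it reproduces M = 181, 341, and 201 for the first three examples in Section III).

    import math

    def count_incident_vectors(phi_start, phi_end, dphi, theta_start, theta_end, dtheta):
        """Count plane-wave excitations on a uniform angular grid (angles in radians).
        The per-axis count 'ceil(span/step) + 1' (both endpoints included) is an
        assumption; the article's exact expression (5) should be used instead."""
        n_phi = math.ceil((phi_end - phi_start) / dphi) + 1 if dphi > 0 else 1
        n_theta = math.ceil((theta_end - theta_start) / dtheta) + 1 if dtheta > 0 else 1
        return n_phi * n_theta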
2) Low-Rank Approximation: Even with the optimal sampling number, the incident vectors can easily be rank-deficient, as shown in Section III-A. Therefore, the second step is the compression of the given RHS matrix to exploit the linear dependence among the incident vectors. The proposed algorithm can be used with various low-rank approximations. One common approach is the SVD, which produces a low-rank approximation of the excitation matrix. Namely, B is represented by B = UΣV^H (6). Here, U = [u_1, . . . , u_M] ∈ C^{N×M} is the matrix of left singular vectors, Σ = diag{σ_1, . . . , σ_M} ∈ R^{M×M} is a diagonal matrix with the singular values in descending order, and V is the matrix of right singular vectors. The superscript "H" denotes the conjugate transpose. With an appropriate value of k, one can compress the original excitation matrix into a rank-k matrix having k singular vectors with a preferred accuracy. In the proposed algorithm, a relative singular value is used to choose k. Namely, a numerical rank k [27] is identified such that (7) is satisfied. As shown later in Theorem 1, the relative singular value σ_{k+1}/σ_1 is a lower bound of ϵ_r in (3).
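A minimal sketch of this compression step, assuming the truncation rule in (7) is the relative-singular-value test σ_{k+1}/σ_1 ≤ δ (our reading of the criterion), is given below.

    import numpy as np

    def truncated_svd(B, delta):
        """Rank-k approximation of the excitation matrix B (N x M), cf. (6)-(7).
        Choosing k so that sigma_{k+1}/sigma_1 <= delta is our reading of the
        relative-singular-value criterion; the article's exact rule (7) applies."""
        U, s, Vh = np.linalg.svd(B, full_matrices=False)   # B = U @ diag(s) @ Vh
        k = max(1, int(np.sum(s / s[0] > delta)))          # numerical rank
        return U[:, :k], s[:k], Vh[:k, :]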
To minimize the computational cost, a randomized algorithm, e.g., principal component analysis (PCA) [28], [29], [30], can be utilized. Although the conventional SVD subroutine is used to validate the theoretical error bounds in Section III-C, the PCA process is applied in Section III-E using the subsampled randomized Fourier transform (SRFT) presented in [30] due to its favorable complexity of O(NM log L). As mentioned before, the ACA-related approach [17], [18] is an alternative choice.
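For reference, a randomized factorization in the spirit of [28], [29], [30] can be sketched as follows; a Gaussian test matrix is used here as a simple stand-in for the SRFT sampling of [30], so this is an illustration rather than the exact procedure used in Section III-E.

    import numpy as np

    def randomized_svd(B, k, oversample=10, rng=None):
        """Randomized low-rank factorization of B (N x M), in the spirit of the
        randomized PCA of [28]-[30].  A Gaussian test matrix replaces the SRFT
        sampling of [30] for simplicity; k is the target rank."""
        rng = np.random.default_rng() if rng is None else rng
        L = min(B.shape[1], k + oversample)                  # sketch dimension
        Omega = rng.standard_normal((B.shape[1], L)) + 1j * rng.standard_normal((B.shape[1], L))
        Q, _ = np.linalg.qr(B @ Omega)                       # orthonormal range basis
        Uy, s, Vh = np.linalg.svd(Q.conj().T @ B, full_matrices=False)
        return (Q @ Uy)[:, :k], s[:k], Vh[:k, :]             # approximate U, sigma, V^H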
3) Solve With Singular Vectors: The next step is to solve for Z_{1:k} = [z_1, . . . , z_k] ∈ C^{N×k} using the k retained singular vectors and to zero out all other solution vectors. Namely, z_i is solved from A z_i = u_i, which is (8), and the solution vectors for the remaining indices i ∈ I_M − I_k are set to zero, which is (9). Nothing needs to be done for (9) when implementing the algorithm because solving only (8) implies that (9) is already satisfied. A critical parameter for solving (8) is the stopping criterion if an iterative solver is chosen. This tolerance is essential because it determines the solution quality without incurring excessive computational cost.
To solve (8), we set the tolerance of each singular vector as in (10) for i ∈ I_k. The key idea of the proposed algorithm is that each singular vector is solved with a different tolerance δ_i from (10). Instead of using the same tolerance for all singular vectors [11], the convergence criterion can be relaxed by assigning larger tolerances to the vectors with smaller singular values. In other words, the computational cost is reduced compared to solving all singular vectors with the same tolerance. The advantage of choosing different tolerances is shown in Section III-D.
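A minimal sketch of this step is given below (our illustration). The per-vector tolerances δ_i from (10), whose exact expression is not reproduced above, are assumed to be precomputed and passed in; iterative_solve is a placeholder for whatever Krylov solver (e.g., GCR) is used.

    import numpy as np

    def solve_singular_vectors(A, U_k, tolerances, iterative_solve):
        """Step 3: solve A z_i = u_i, cf. (8), for each retained left singular vector
        with its own stopping tolerance delta_i from (10) (tolerances passed in).
        iterative_solve(A, b, tol) is a placeholder for any Krylov solver."""
        Z = np.zeros((A.shape[0], U_k.shape[1]), dtype=complex)
        for i in range(U_k.shape[1]):
            Z[:, i] = iterative_solve(A, U_k[:, i], tol=tolerances[i])
        return Z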

4) Recover Original Solution:
Finally, the solution is recovered by (11), X̃ = Z_{1:k} Σ_{1:k} V_{1:k}^H, where Σ_{1:k} is a diagonal matrix with the first k singular values of Σ, and V_{1:k} is an M × k matrix with the first k right singular vectors of V. The numerical solution in (11) guarantees (3) according to the error-bound derivation in Section II-D.
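The recovery step itself is a single product; a minimal sketch consistent with (11) and the truncated SVD sketch above is:

    def recover_solution(Z_k, s_k, Vh_k):
        """Step 4, cf. (11): X_tilde = Z_{1:k} Sigma_{1:k} V_{1:k}^H.
        Vh_k is the k x M matrix of the first k right singular vectors (V_{1:k}^H),
        as returned by the truncated SVD sketch above."""
        return Z_k @ (s_k[:, None] * Vh_k)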
The proposed algorithm is summarized in Algorithm 1.

Algorithm 1 Efficient and Error-Controllable Algorithm
Given: A ∈ C^{N×N} and δ
Result: X̃
Step 1: Determine M with (5)
Step 2: B ← normalized incident vectors
Step 3: [U, Σ, V^H] ← apply SVD or PCA to B
Step 4: Determine k with (7)
Step 5: Solve with k singular vectors
  for i ← 1 to k do
    Solve A z_i = u_i for z_i with the tolerance δ_i in (10)
  end for
Step 6: X̃ ← Z_{1:k} Σ_{1:k} V_{1:k}^H

D. Error Bounds
In this section, the analysis of the error bounds of ϵ_r is presented. With Lemma 1, the error bounds are stated and proved in Theorem 1: the maximum relative error lies between the predetermined tolerance δ and the truncation criterion in (7). In addition, Corollary 1 provides insight into the numerical error for the individual incident vectors with the given tolerance. The derivation starts from the two notations in (12) and (13) for i ∈ I_M. Note that B_i ∈ C^{N×M} and X_i ∈ C^{N×M} are rank-one matrices.

Lemma 1: Let (8) be solved with δ_i defined in (10). Then, the residual associated with the k retained singular vectors is bounded as follows.

Proof: For any index i ∈ I_k, let ξ_i := ||u_i − A z̃_i|| as in (16). Consider solving the matrix equation A z_i = u_i with the tolerance δ_i defined in (10). When any iterative solver converges to the target tolerance δ_i, it is trivial that ξ_i ≤ δ_i, which is (17). With (16), one can write (18), where Σ_{1:k} and V_{1:k} are defined in (11). Then, the norm of (18) is bounded as in (19), where ||·||_F is the Frobenius norm. The second inequality is valid because ||W|| ≤ ||W||_F for an arbitrary matrix W ∈ C^{N×M} [23], [24]. Finally, the last equality of (19) can be proven by substituting δ_i from (10).

Theorem 1: Let (8) be solved with δ_i defined in (10) and set (9). Then, the bounds in (20) hold.

Proof: Multiplying the normalization factor ||B|| = σ_1 on both sides, (20) can be rewritten as (21). For the upper bound, (22) provides a starting point. In addition, X_i for i ∈ I_M − I_k can be regarded as a zero matrix because of (9). Thus, the second term on the RHS of (22) equals (23). Then, with Lemma 1 and (23), (22) can be rewritten as (24). The lower bound is a simple consequence of the Eckart-Young-Mirsky theorem [31], [32]. According to the theorem, ||B − W|| ≥ σ_{k+1} for an arbitrary matrix W ∈ C^{N×M} with rank at most k, and the summation of the B_i over i ∈ I_k is exactly such a rank-k approximation of B. The lower bound is therefore valid because AX̃ has rank at most k.

One thing to note is that the RHS of (22) can be divided into two parts. The first part is controllable with the tolerances δ_i. The second term is fixed once k is determined in (7). With this view, the lower bound in (21) can also be justified. Also, the relative error ϵ_r approaches the lower bound whenever (8) is solved more accurately. In other words, it should be emphasized that no matter how precisely the singular vectors are solved, even with a direct solver, the accuracy of the solution is limited by the lower bound. Thus, for engineering applications, the tolerance needs to be optimized. Consequently, Algorithm 1 proposes an optimized error tolerance in (10) to minimize the computational resources while also guaranteeing the bounds on the relative error of the target.
Even if the error of the matrix equation, ϵ_r, is bounded, one can argue that ϵ_r does not represent the error for each individual incident vector. One possible upper bound for all incident vectors follows from the definition in (1). This fact can be proven with (24) and a canonical basis vector, e.g., ê_2 = [0, 1, . . . , 0]^T. Note that the ith column vector of any matrix W ∈ C^{N×M} can be extracted by multiplying W by ê_i ∈ C^M. However, this bound is not tight enough for practical usage, especially when the compression rate is high (i.e., when k/M is small). Alternatively, the trend can be captured with the root-mean-square (rms) error over all incident vectors. Here, upper and lower bounds of the rms error are introduced in Corollary 1.

Corollary 1: Let (8) be solved with δ_i defined in (10) and set (9). Then, the bounds in (27) hold, where ϵ_rms is the quadratic mean of the relative errors of the individual incident vectors.

Proof: For an arbitrary matrix W ∈ C^{N×M}, it is easy to show that the rms of the norms of its column vectors equals its Frobenius norm divided by √M. The lower bound is easy to prove from the properties of the matrix norm [23], [24]; namely, using the lower bound of (21), (30) can be rewritten accordingly. The upper bound can be derived with an idea similar to the proof of Lemma 1, as in (32). The second equality of (32) is valid because the Frobenius norm is invariant under unitary transformations; the relevant factor is assembled from ξ_1, . . . , ξ_k and Σ_{1:k}. The last inequality follows from ξ_i ≤ δ_i in (17). Finally, the upper bound of (27) is proven by dividing (32) by √M.
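To make the quantities in Corollary 1 concrete, the following sketch (our illustration) computes the per-vector relative errors and their quadratic mean; since every incident vector is normalized, the result equals the Frobenius norm of the residual divided by √M, which is the identity used in the proof.

    import numpy as np

    def rms_relative_error(A, B, X_tilde):
        """Quadratic mean of the per-incident-vector relative errors.
        Since each incident vector is normalized (||b_i|| = 1), this equals
        ||B - A X_tilde||_F / sqrt(M)."""
        R = B - A @ X_tilde
        per_vector = np.linalg.norm(R, axis=0) / np.linalg.norm(B, axis=0)
        return np.sqrt(np.mean(per_vector ** 2))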

E. Acceleration With a Variant of GCR for Multiple RHSs
Even with the optimal choice of the number of RHSs, further speedup can be achieved with a block version of the Krylov subspace methods. A simple variant of block GCR for multiple RHSs is suggested to provide this extra speedup, as shown in Algorithm 2. In the algorithm, α, β, γ ∈ C, and all vectors z, r, and s are in C^N. The superscripts and subscripts represent the indices of the singular vectors and the iterations, respectively. The notation ⟨x, y⟩ represents the inner product. Basically, Algorithm 2 constructs the search vectors (s_j and v_j) to update the numerical solutions (z^i_j) and residual vectors (r^i_j). A common choice [33], [34] is to use zero vectors as the initial guesses (so that r^i_0 = u_i) and the previous residual as the next search vector (s_j = r_{j−1}). The difference here is that the single residual vector having the largest norm becomes the next search vector. Thus, the algorithm needs only one matrix-vector multiplication (MVM) per iteration (v_j ← A s_j). Because all solution and residual vectors are updated within the same Krylov subspace, the proposed block GCR can reduce the number of required MVMs.
Truncation of the Krylov subspace is needed for the block GCR due to limited computational resources. We propose a similarity-based selection of the stored search vector to be truncated: with v_j denoting the current search vector and v_l the stored previous search vectors, the vector v_q is the stored vector most similar to the new search vector. The effect of the block GCR with truncation is shown in Section III-D. A sketch of the procedure is given after Algorithm 2.

Algorithm 2 Variant Block GCR for Multiple RHSs
Given: A ∈ C^{N×N}, u_i, and δ_i, ∀i ∈ I_k, in (8) and (10)
Result: z̃_i, ∀i ∈ I_k
Initialize z^i_0 = 0 and r^i_0 = u_i, ∀i ∈ I_k
Initialize v_j = 0 and s_j = 0, ∀j ∈ I_{j_max}
Set s_1
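Because the listing above is only partially reproduced, the following Python sketch, which is our reconstruction from the textual description rather than the authors' code, illustrates the main ingredients: one MVM per iteration, the largest-norm residual chosen as the next search direction, GCR-style orthogonalization of the search directions, and, once j_max directions are stored, replacement of the stored direction most similar to the new one. The similarity measure and other details are assumptions.

    import numpy as np

    def block_gcr_variant(matvec, U, tols, j_max=100, max_iter=2000):
        """Sketch of a variant block GCR for multiple RHSs (cf. Algorithm 2).
        matvec(x) applies the (preconditioned) system matrix A; U holds the k
        right-hand sides u_i as columns; tols[i] is the tolerance delta_i.
        With ||u_i|| = 1, the stopping test ||r_i|| <= delta_i matches (3)."""
        N, k = U.shape
        Z = np.zeros((N, k), dtype=complex)          # solutions z_i (initial guess 0)
        R = U.astype(complex)                        # residuals r_i = u_i - A z_i
        S, V = [], []                                # search directions s_j and v_j = A s_j
        active = np.linalg.norm(R, axis=0) > tols
        for _ in range(max_iter):
            if not active.any():
                break
            # the residual with the largest norm becomes the next search vector
            s = R[:, np.argmax(np.linalg.norm(R, axis=0) * active)].copy()
            v = matvec(s)                            # the single MVM of this iteration
            for s_l, v_l in zip(S, V):               # orthogonalize v against stored v_l
                beta = np.vdot(v_l, v)
                v -= beta * v_l
                s -= beta * s_l
            nv = np.linalg.norm(v)
            if nv == 0.0:
                break
            s /= nv
            v /= nv
            # update every active solution and residual with the same direction
            for i in np.where(active)[0]:
                alpha = np.vdot(v, R[:, i])
                Z[:, i] += alpha * s
                R[:, i] -= alpha * v
            active = np.linalg.norm(R, axis=0) > tols
            # truncation: replace the stored direction most similar to the new one
            if len(V) < j_max:
                S.append(s); V.append(v)
            else:
                q = int(np.argmax([abs(np.vdot(v_l, v)) for v_l in V]))  # assumed similarity measure
                S[q], V[q] = s, v
        return Z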

III. NUMERICAL VALIDATION
This section gives the numerical evidence used to validate Theorem 1 and Corollary 1. We computed and compared the upper and lower bounds and the actual relative error ϵ_r for different tolerances. In addition, the numerical examples indicate the efficiency of the computation of angular responses with the proposed algorithm. Throughout the experiments, we applied the algorithm to the discontinuous Galerkin integral equation (IEDG) method [35] with the multilevel fast multipole method (MLFMM) [36]. The targets were discretized by nonconformal and mixed triangles and quadrilaterals. All computations were conducted using workstations in the Owens Cluster at the Ohio Supercomputer Center (OSC) [37]. The first two targets were computed on nodes with two 14-core Intel Xeon E5-2680 v4 (2.40 GHz) processors and 128 GB of memory. The last fighter jet example was computed on nodes with four 12-core Intel Xeon E5-4830 v3 (2.10 GHz) processors and 1536 GB of memory. The three targets defined in this section have different electrical sizes, geometrical complexities, and materials.

1) Perfect Electric Conductor Cuboid Cavity: In Fig. 1, the first example is plotted. The target is a perfect electric conductor (PEC) cuboid cavity with an open left side. The frequency is 600 MHz, and the size of the cube that surrounds the cavity is 6λ_0, where λ_0 is the free-space wavelength. The mesh is prepared with size λ_0/6, and the number of unknowns is 30 704. The desired angular responses are −180° ≤ ϕ ≤ 180° and θ = 90°, with a fixed polarization angle of 0°. We set the number of incident angles M to 181 with Δϕ = 2° and Δθ = 0, which satisfies (4). The tolerance δ is set to values ranging from 10^-7 to 10^-3.
2) Finite-Conductivity Cylindrical Cavity: The next example is more difficult to solve; it is a cylindrical cavity made of real metal, as described in Fig. 2. This target is a highly resonant structure with thin aperture slots (0.508 × 50.8 mm) located on the front of the cylinder. The height is 609.6 mm, and the inner radius is 101.6 mm. The thickness of the metal is 6.35 mm. The metal has a finite conductivity, σ = 2.6 × 10^7 S/m, so we apply the impedance boundary condition (IBC) [38], [39], [40], [41], [42] to approximate the imperfect electric conductivity. The first theoretical resonance frequency is 1129.391 MHz for the transverse magnetic (TM) mode. In this experiment, the operating frequency is 1132.4207 MHz, which is the experimental resonance frequency with the given mesh and simulation setup. The cylinder is discretized with a maximum mesh size of approximately λ_0/15, and the number of unknowns is 28 944. In addition, the size of the cube enclosing the cylinder is approximately 2.35λ_0. For this example, the 2-D angular responses are computed. The angular sector of interest is −60° ≤ ϕ ≤ 60° and 70° ≤ θ ≤ 110° with a 0° polarization angle. With (4), we set Δϕ = 4° and Δθ = 4° for the azimuthal and elevation angles, respectively. Thus, the total number of incident vectors is M = 341. The tolerance δ again ranges upward from 10^-7.

3) PEC F-16 Fighter Jet:
To show the robustness of the error control of the proposed algorithm, we choose an electrically large target, the PEC F-16 fighter jet (Figs. 3 and 4), which has a large electrical size and a complicated geometry. Taking advantage of IEDG, the target is divided into 45 parts that are meshed independently. As a result, a nonconformal mesh is prepared, and the total number of unknowns is 7 700 888, with an average mesh size of λ_0/5.
Note that the intake is closed with a flat surface in this example. In this study, the operating frequency is 5 GHz; accordingly, the longest dimension of the target is approximately 250.33λ_0. In addition, the elevation angle of the incident fields is fixed at θ = 90°, and −10° ≤ ϕ ≤ 10° is swept, with a 0° polarization angle. The optimal angular step we choose is 0.1° according to (4); therefore, a total of 201 incident vectors are used. The tolerance δ varies from 10^-4 to 10^-2. Table I presents the number of singular vectors k, which is determined by (7), as a function of the given tolerance δ.

C. Error Bounds
As shown in Fig. 6, the error bounds of Theorem 1 are numerically validated. In Fig. 6, the actual error ϵ_r and the theoretical bounds are plotted for the given tolerances. From the results, it is numerically confirmed that the actual errors reside between the error bounds.
In addition, in Fig. 7, the semilogarithmic plots provide the individual errors of all incident vectors and their rms values in terms of the global tolerance δ. Fig. 8 shows the rms values with the theoretical upper and lower bounds as a function of the global tolerance δ on logarithmic scales. All errors and rms values are obtained by actually computing the error from the recovered solution X̃. The actual rms error and the upper and lower bounds are shown in a manner similar to that of Fig. 6. As shown in Fig. 8, the numerical examples satisfy Corollary 1.
As shown in this section, both the actual ϵ_r and ϵ_rms are bounded, as proven in Theorem 1 and Corollary 1. Hence, our suggested approach can readily estimate and control the error of the solutions with a predetermined global tolerance. Section III-D numerically validates the benefits of the proposed algorithm, i.e., the different tolerances for different singular vectors, as well as the block version of the GCR.

D. Computational Costs
In this section, the computational costs are compared in terms of the wall time, the number of MVMs, and the memory required for the iterative solver. The experiment is designed with the following four cases. The first case (Case 1) is regarded as the baseline, i.e., the conventional way to solve with multiple RHSs: after assembling the system matrix, each incident vector is solved individually. To make a fair comparison, the tolerances for each incident vector are taken from the actual data of ||b_i − A x̃_i|| in Fig. 7. For the PEC F-16 target, we estimate the number of MVMs and the wall time for the solution process from 20 sampled angles because the number of iterations is observed to be nearly identical for all incident angles. The second case (Case 2) uses the rank-k approximation of B from (7), but the tolerance of each singular vector is set to a common value for all i ∈ I_k. This is a reasonable choice to ensure that the accuracy of the second and third cases is similar, as follows from (19) and (22). In addition, we use Algorithm 1 for the third (Case 3) and fourth (Case 4) cases. The conventional GCR and the modified block GCR introduced in Algorithm 2 are used for the third and fourth cases, respectively. Note that the data discussed in Sections III-B and III-C were collected from the third case. Also, the truncation length of the Krylov subspace (j_max) for both the conventional GCR and the modified block GCR for multiple RHSs is fixed at 100. Fig. 9 shows the number of MVMs with respect to the target tolerances for each case. By comparing Cases 1 and 2 in Fig. 9, we observe a significant reduction due to the reduced number of RHSs with the low-rank approximation, as expected. According to Case 3, the recommended tolerance (δ_i) for each singular vector in (10) further reduces the cost during the iterative solution process. Note that the corresponding wall times are summarized in Table II.

E. Application: Monostatic Radar Cross Section of F-16
In this section, a practical but challenging example is considered: the computation of the 2-D monostatic radar cross section (RCS) of an F-16 with a deep cavity intake. The problem geometry is similar to that of the PEC F-16 in Section III-A3; however, there are three differences. First, the operating frequency is 8 GHz, which is the typical starting frequency of the X-band [43]. With the higher frequency, the target size is approximately 400.53λ; hence, the number of basis functions is 23 194 584 with λ/5 meshes. Second, unlike the closed intake in Section III-A3, the engine intake is modeled as a deep concave structure, which is the resonant structure shown in Fig. 10. One can expect the convergence behavior to deteriorate due to this resonant structure. Finally, the required number of angular responses is significantly larger according to (4). In our experiment, Δϕ and Δθ are both 0.05°. The angular responses of interest are −2° ≤ ϕ ≤ 2° and 88° ≤ θ ≤ 91° with a 0° polarization angle. As a result, the total number of RHSs is M = 5001.

Because of the batch limits at the OSC, the 5001 RHSs were divided into 12 blocks; hence, each block typically consisted of 420 RHSs. Since we observed that each block behaves similarly, the data of the first block (a typical block) are presented as follows. The incident angles for the first block are −2.0° ≤ ϕ ≤ −1.05° and 88.0° ≤ θ ≤ 89°. Fig. 11 shows the distribution of the first 20 singular values of the block. According to Algorithm 1, the first 18 singular vectors were retained to solve for the 420 RHS vectors in this block. In addition, the singular values obtained by the conventional SVD (σ_i) and by PCA (σ̃_i) are plotted in this figure. Furthermore, the relative difference, e_i := (σ_i − σ̃_i)/σ_i, is presented in Fig. 11 to validate that the PCA approach is as accurate as the conventional SVD.

Fig. 12 shows the monostatic RCS of the PEC F-16 with a deep cavity intake. As a self-consistency check, the RCS plot is symmetric with respect to ϕ = 0°. One interesting point is that the peak points are not located at the head-on incident direction (ϕ = 0° and θ = 90°) but slightly lower and to the side of it (ϕ = 21.7° and θ = 90.75°).

Finally, Table IV summarizes the wall time for the entire 12 blocks (5001 RHS vectors). In this table, the "naive approach" is similar to Case 1 in Section III-D, but the tolerance is fixed to 10^-2; its total solution time is estimated from the solution time for the first 20 RHS vectors. On the other hand, the "approach in this article" uses Algorithms 1 and 2, the same as Case 4 in Section III-D. As shown in Table IV, the speedup is almost 73.6 times if PCA is utilized. For both approaches, the peak memory is the same, 1110.35 GB, because the majority of the memory is required for the system matrix (1041.22 GB) and 100 Krylov vectors (69.13 GB). The comparison between the proposed factorization using the conventional SVD and using PCA is also included in Table IV. The wall time of PCA for the entire 5001 RHSs is about 7.86 h, which is a significant improvement compared to 98.4 h for the conventional SVD; therefore, the speedup of the factorization process is about 12.51. Note that the factorization time for the entire 12 blocks with the conventional SVD is estimated from the 8.2 h required for the first (typical) block of 420 RHS vectors.

IV. CONCLUSION
In this article, an efficient and error-controllable angular sweep algorithm has been proposed. It was also proved and numerically validated that the numerical error of the algorithm is bounded. In addition, the block version of GCR can effectively accelerate the solution with a limited increase in memory. Although the proposed method was applied to electromagnetic scattering via the IEDG framework, the algorithm can be extended to any other method with multiple RHSs, for example, FEM-based algorithms [44], [45], [46], [47].
ACKNOWLEDGMENT
This article describes objective technical results and analysis. Any subjective views or opinions that might be expressed in this article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. The authors would like to thank their colleague, Dr. Haobo Yuan, National Key Laboratory of Antennas and Microwave Technology, Xi'an, China, for useful discussions.