A Waveform Relaxation Solver for Transient Simulation of Large-Scale Nonlinearly Loaded Shielding Structures

This article introduces an algorithm for transient simulation of electromagnetic structures loaded by lumped nonlinear devices. The reference application is energy-selective shielding, which adopts clipping devices uniformly spread along shield apertures to achieve a shielding effectiveness that increases with the power of the incident field, thereby blocking high-power interference while allowing low-power communication. Transient simulation of such structures poses a number of challenges, related to their large-scale and low-loss nature. In this work, we propose a waveform relaxation (WR) scheme based on decoupling the linear electromagnetic structure from its nonlinear terminations. In a preprocessing stage, the electromagnetic subsystem is characterized in the frequency domain and converted into a behavioral rational macromodel. Transient simulation is performed by refining estimates of the port signals through iterations. The proposed scheme combines a time partitioning approach with an inexact Newton–Krylov solver. This combination provides fast convergence also in those cases where standard WR schemes fail due to a strong mismatch at the decoupling sections. Numerical results on several test cases of increasing complexity with up to 1024 ports show that the proposed approach proves as reliable as HSPICE in terms of accuracy, with a speedup ranging from one to three orders of magnitude.


I. INTRODUCTION
T HIS work focuses on the transient simulation of energyselective shielding structures [1], [2], [3]. In contrast to regular shields [4], [5], [6], such structures have the ability of protecting sensitive equipment or devices from high-power electromagnetic fields, while still allowing low-power communication. This power-dependent shielding characteristics are usually achieved by connecting nonlinear devices to shielding enclosures at a number of lumped ports, which are spread across the shield apertures [3], [7], [8]; see Fig. 1 for a schematic illustration.
The electrodynamic behavior of energy-selective shields must be analyzed in time domain due to the presence of nonlinear terminations. To this end, transient full-wave electromagnetic solvers can be adopted, such as time-domain integral equation [9], finite-difference time-domain method [10], timedomain finite-element method [11], and partial element equivalent circuit [12]. Nonetheless, nonlinear interactions are only lumped at the ports where discrete nonlinear devices are connected. Removing such components leaves an unloaded enclosure whose behavior is governed by linear Maxwell's equations, and which can be characterized both in time and frequency domain. This consideration enables hybrid simulation approaches [13], [14] based on partitioning the system into the linear electromagnetic (yellow box and dark strips in Fig. 1) and nonlinear (red blocks) subdomains, which are characterized separately. Indeed, the unloaded box can be regarded as a (passive) linear multiport network, which can be generally represented by a multiport transfer matrix H(s). This frequency response can be approximated with a lumped behavioral macromodel computed through a rational approximation using a vector fitting engine [15], [16]. Such macromodel can be synthesized as This approach is particularly appealing for the design of energy-selective structures, which may require repeated transient analyses aimed at selecting the best types of lumped terminations for specific shielding applications [3], to be solved in time domain for various different incident field excitation configurations. For these problems, the electromagnetic characterization does not need to be repeated and can be reused in subsequent circuit simulations.
One of the objectives of this work is to show that much better alternatives are available for such repeated transient simulations with respect to general-purpose circuit solvers such as SPICE. Building on the decoupling between linear electromagnetic and lumped nonlinear blocks [13], a natural approach is to apply a waveform relaxation (WR) algorithm [17], [18], that starting from an initial estimate of the transient port signals to be computed over a given time interval, iteratively refines such estimates until convergence. The procedure stabilizes if the iteration operator is a contraction, regardless of the selected formulation [18], [19], [20], [21]; the number of iterations can be reduced by optimizing the WR decoupling parameters [22], [23], [24]. A general procedure that allows such optimizations is the definition of some reference impedance level, used to build the decoupling network so that exchange of signals between iterations is performed through equivalent scattering waves referenced to this optimal impedance. Impedance matching, constant (real) or time-varying, for the considered application is nearly impossible. On one hand, the considered shielding enclosures are nearly lossless, with highly reactive driving-point impedances and very pronounced resonance peaks. On the other hand, nonlinear terminations such as clipping diodes or diode pairs behave at different time instants almost as short circuits (during conduction) or open circuits (during cutoff). For this reasons, any standard WR scheme would require an enormous number of iterations. An example is provided in Fig. 2, where decoupling through a constant resistance R 0 = 50 Ω is used to solve for port signals of a nonlinearly loaded enclosure, as in Fig. 1 but with P = 100 ports and excited by an incident plane wave pulse (see Section V-A for a full description). The number of WR iterations that are required to reach convergence exceeds several hundreds, making the overall scheme highly inefficient.
In this article, we present an iterative solver that is able to overcome the above WR convergence issues. The proposed solver draws the main idea of WR, which is however achieved through an inexact Newton-Krylov iteration. The approach is similar to [25]. This iteration is combined with a second (outer) iteration that processes sequentially different time subintervals, with an appropriate reinitialization of any required initial conditions. We show that this time partitioning process is able to drastically reduce number of iterations per window and overall runtime. Numerical results on several test cases of increasing complexity with up to 1024 nonlinearly loaded ports show that the proposed approach proves at least as reliable as HSPICE in terms of accuracy, with a speedup in overall runtime ranging from one to three orders of magnitude.

II. NOVEL CONTRIBUTIONS AND MANUSCRIPT ORGANIZATION
This work finalizes a series of papers that analyzed separately various aspects of this challenging numerical simulation problem. In particular, the proposed transient solver heavily depends on a set of prerequisite results and formulations that form a comprehensive and self-consistent method. One of the novel aspects of this work is indeed the combination of all these ingredients into a complete framework: if one ingredient is removed, the effectiveness of the entire approach is hindered. Section III itemizes all this required background material. Based on all these prerequisites, Section IV presents the proposed transient simulation algorithm, whereas numerical results, validation, and performance assessment are presented in Section V. Finally, Section VI concludes this article.
The presented WR-based solver starts from the results of [13], which provides a macromodeling-based decoupling scheme to solve the electromagnetic problem, and extends the stateof-the-art circuit solver [25] by addressing the problem of a poor matching condition at the interface ports. With respect to [25], a time-partitioning strategy [18] is embedded in the solver formulation to improve the convergence property of the WR-NGMRES scheme and to handle the large number of ports of the presented application. A different approach, still within the adopted decoupling and WR framework, was introduced in [14]. Nevertheless, the presented solver significantly differs from [14] for several reasons.
1) The basic WR scheme was modified in [14] with an iteration-dependent strategy that changes the macromodel reference impedance (i.e., R0) to match a time-varying impedance level at the decoupling ports. This choice requires the generation of a possibly large number of scattering macromodels referenced to different port impedances, increasing the runtime requirements of the preprocessing phase, whereas the proposed approach is based on a single (compressed) macromodel and does not include any iteration-dependent parameter.
2) The proposed approach takes advantage of the WR-based iteration of the WR-NGMRES scheme [25], while Wendt et al. [14] solve for the basic WR-LP iteration.
3) The presented strategy splits the simulation interval in different chunks (time windowing) to improve the solver convergence rate. The strategy of [14] is applied to the entire simulation window. 4) Even if the same application is used in both strategies (i.e., energy-selective shielding enclosures), the presented strategy demonstrates its efficiency on a significantly larger size examples (maximum number of ports 1024) with respect to [14]. We remark that the focus of this work is the transient numerical simulation problem rather than the engineering application and the design of energy-selective enclosures, which are comprehensively developed in [3]. Therefore, the target readers for this article are R&D experts in numerical simulation both in the academia and in the private sector, specifically field solver and CAD tool developers.

III. BACKGROUND FRAMEWORK
We start setting the basic notation used in this manuscript. We denote scalars with normal font x, vectors with lower case bold fonts x, and matrices with upper case bold fonts X, with I n being the identity matrix of size n. The transpose of the matrix X will be indicated with X T , while Re{X} and Im{X} extract its real and imaginary part, respectively.
With reference to Fig. 1, the objective of this work is to compute the transient electromagnetic fields scattered by a metallic shielding structure loaded with lumped nonlinear loads. This goal is achieved by first computing the transient port voltages and currents at each of the lumped terminations. The framework discussed in [13] is then applied to exploit knowledge of these signals to evaluate the transient electromagnetic field at any desired location, through a postprocessing stage. This formulation is not repeated here, the reader is referred to [13].

A. Decoupling and Characterization
Assuming an incident electric field excitation e inc (t) and defining the incident scattering waves a(t) ∈ R P at the P ports, we can define the reflected waves b(t) at these ports as

b(t) = h(t) * a(t) + ϑ(t)
(1) where v oc (t) ∈ R P collects the open-circuit voltages induced by the external incident field, L{h oc (t)} = H oc (s) ∈ C P is a suitable set of linear transfer functions, where L is the Laplace transform operator, ϑ(t) ∈ R P represents the contribution of e inc (t) under port matching conditions, and L{h(t)} = H(s) ∈ C P ×P is the scattering matrix of the unloaded box.
A frequency-domain characterization of the unloaded shielding structure is performed through a full wave solver based on the method of moments (MoM) [26]. The result is a set of tabulated frequency data of the scattering matrixH(jω ) and the open-circuit transfer function vectorH oc (jω ) for = 1, . . . , L. It should be noted that samplesH(jω ) are invariant once the geometry is fixed, whereasH oc (jω ) should be recomputed if the orientation and polarization of the incident field e inc (t) are varied.

B. Data Conditioning
A necessary condition to obtain a reliable model of the system is a well-defined set of training samples covering the full bandwidth of interest, including low frequency and dc. Well-known MoM or FEM field solver limitations may provide incorrect, inaccurate, or incomplete characterizations. For this reason, a fundamental step in the overall proposed flow is the data conditioning phase presented in [27], which involves enriching the original datasetH(jω ) with a set of self-consistent low-frequency samples obtained through an extrapolation (and regularization) procedure.

C. Compressed Macromodeling
The next step is to approximate this data through behavioral models obtained by a rational fitting procedure [15], [28], [29], where and similarly for H oc (s). For present application, this step is particularly challenging due to the large number of input-output ports P (which can reach several hundreds or even thousands) and a large expected dynamic ordern to properly capture all resonances within the required modeling bandwidth. For this reason, we exploit the framework introduced in [27] and [30], which computes a low-complexity macromodel by rational fitting a compressed dataset obtained by a modified (causality-preserving) singular value decomposition (SVD). This entire procedure is well-documented in [27] and [30], including application to the very same shielding enclosures considered in this work, and is not repeated here. This optimized macromodel generation flow includes of course a dedicated passivity verification and enforcement, based on an iterative perturbation algorithm [31] of passivity violations identified through the adaptive sampling process [32].

D. Transient Analysis and Waveform Relaxation
Several alternatives are available for time-domain simulation of a rational macromodel in form (5), possibly loaded with nonlinear devices. A popular approach realizes the model as a state-space system (a system of coupled ordinary differential equations (ODEs), which is then converted into an equivalent (behavioral) circuit. Transient simulation is thus performed using a standard circuit solver of the SPICE class. Modern circuit solvers (e.g., HSPICE [33]) allow direct specification of behavioral blocks in the pole-residue format (5), usually denoted as Foster form, making the state-space realization and circuit synthesis unnecessary. The Foster format is particularly convenient since the numerical integration of pole-residue models can be carried out via recursive convolutions in significantly reduced runtime. This aspect will be discussed in detail in the following.
The proposed approach for the evaluation of the transient port voltages stems from a scattering-based WR [17], [18] formulation. In particular, we exploit the longitudinal partitioning (LP) decoupling scheme [20] depicted in Fig. 3, which splits linear and nonlinear parts through decoupling networks including relaxation sources, the latter defined in terms of voltage-normalized scattering waves. Splitting and relaxation are discussed in more detail below.
The WR-LP scheme solves the nonlinear system associated with the simulation problem depicted in Fig. 1 where ν is the iteration index, the operator H denotes convolution with the macromodel impulse response matrix, and G is a purely algebraic operator collecting the nonlinear characteristics of the loads expressed in terms of scattering variables. Upon a suitable initialization (usually a 0 = 0), the two equations are evaluated alternatively using the estimate of the port signals computed from previous iteration. Iterations are stopped when a suitable convergence threshold is reached where the worst-case ∞-norm is used, corresponding to the maximum magnitude among all signal components of its vector argument. The aforementioned scheme is effective if 1) the scheme converges; 2) both operators H and G can be efficiently evaluated; 3) few iterations are sufficient to reach convergence. The first two aspects will be discussed as follows. The rest of this work is devoted to attain the third goal.
1) Convergence: A fixed-point iteration such as WR-LP converges if the operator that describes the map between two successive iterations is a contraction. This condition is easily established in case of linear terminations, that fit the aforementioned WR framework by replacing the operator G with a linear scattering convolution operator. Using a frequency-domain description, convergence holds if, where Γ = Γ(jω) ∈ C P ×P are the scattering matrix of the linear terminations, and where ρ max is the spectral radius. A sufficient condition for convergence [20] is the strict passivity of both shielding enclosure and terminations Intuitively, this condition states that the more dissipative are the two systems at a specific frequency, the faster will be the convergence of the WR scheme. In our framework, H(jω) is passive by construction, and also, the nonlinear elements (diodes or diode pairs) to be used as terminations are passive. However, 1) in case of a low-loss and highly resonant shielding structure, we usually have H(jω) 2 1 so that dissipation in the shield and radiation losses are minimal; 2) diodes or diode pairs nearly behave as short or open circuits in their conduction or cutoff state, with a slightly dissipative resistive behavior during switching. Replacing such elements with equivalent linear circuits in conduction or cutoff states would lead to a worst-case scenario where Γ(jω) 2 = 1, so that almost no dissipation occurs in the termination. Under these conditions, the spectral radius of the iteration operator ρ max 1 and the associated convergence rate is so poor not to justify any sort of WR approach. We will see later that, even in this challenging situation, convergence can be attained by a suitable reformulation.
2) Operators Evaluations: To obtain a fast evaluation of the linear operator H, we adopt the same approach of [13] and [14]. Since both h(t) and h oc (t k ) are the impulse responses of systems modeled in a pole-residue form (5), the evaluation of their convolution can be efficiently computed with an infinite impules response filter (equivalently, as recursive convolutions); see [34]. Details will be provided next. Note that the source term ϑ(t) can be precomputed before starting iterations. Regarding the nonlinear operator G, in this work, we consider only nonlinear, static, identical, and uncoupled terminations at all ports, for which G can be represented as a set of identical input-output maps to be applied componentwise. In our implementation, this evaluation is performed by a single lookup table obtained as a fine sampling of the load characteristic in the scattering representation as in [14]. The extension to more complex terminations requires a suitable definition of the operator G and a procedure for its fast evaluation. For instance, more sophisticated diode models including dynamic terms and parasitics would make G a linear and dynamic operator, requiring a dedicated (small-scale) ODE solver, including SPICE as a standard alternative. The proposed decoupling framework is thus applicable regardless The aforementioned discussion pointed that a direct application of a WR scheme to our problem is likely to fail due to poor convergence properties. This motivates the proposed numerical approach, which is based on the following two main ingredients: 1) a time-partitioning approach is exploited, to improve convergence of the basic WR-LP scheme; 2) instead of a pure WR scheme, we adopt a more general Newton-Krylov solver [25]. As in basic WR iterations, a concurrent estimate of chuncks of time samples of the port signals is computed at each iteration. A SPICE-like solver would apply Newton iterations to evaluate all signals at a single time step, within an outer time-stepping loop. The proposed approach reverses the loops and applies an approximate Newton iteration to refine the estimate of a vector of time samples within a given time interval. An outer loop processes disjoint time intervals until the entire transient simulation window is complete.

A. Time Windowing
Time partitioning or windowing approaches are well known in the WR context [18], [35], with proven effectiveness in algorithm parallelization [19]. The most relevant advantage of time windowing that we exploit in this work is a drastic improvement in convergence rate and consequent reduction in overall runtime. Fig. 4 demonstrates the impact of windowing on convergence through a simple example. A shielding enclosure with P = 100 ports is terminated by antiparallel diode pairs at each port and excited by an incident plane wave. Details of the simulation setup are provided in Section V-A. The time interval over which the transient port signals are required is t max ≈ 15.5 ns. The basic WR-LP scheme of (6) is executed, and the convergence through iterations is observed by restricting the evaluation of the solution error to different portions of this time interval [0, σt max ] with σ ∈ {0.25, 0.5, 0.75, 1}. The evolution of the error through iterations ν is depicted in Fig. 4, both referred to the exact solution (top panel) and to the solution estimate available at the previous iteration ν − 1 (bottom panel). Both panels clearly show that the number of iterations required to attain an accuracy below a given threshold drops when reducing the size of the observation window.
These results indicate the opportunity of breaking a long simulation interval into chunks to be solved individually within an outer iteration. In this work, we adopt a uniform subdivision induced by a set of M control points such that each subinterval includes a fixed numberk of uniformly spaced time samples, with T m − T m−1 =kΔt. This is not a restriction but is convenient for implementation purposes. In the limitk = 1 (equivalently, M = K), only one time sample is evaluated at each outer iteration, similarly to common timestepping ODE solvers and SPICE. In the latter case, Newton iterations are used to evaluate the unknowns at the considered time step. On the other end, the limitk = K or equivalently M = 1 considers a single time window for which all time samples are computed concurrently through WR iterations. This corresponds to the red dots in Fig. 4. The proposed approach attempts to find a compromise between these two opposite situations, by solving concurrently fork > 1 time samples within a WR loop, and at the same time, limiting this number of concurrent samples to enhance the convergence speed. Since different time windows are to be solved iteratively, initial conditions are required in order to properly restart WR iterations on each subinterval (except the first, for which zero initial conditions can be safely adopted). The proposed strategy for handling initial conditions will be discussed later, after introducing the iterative solver that we proposed to use instead of plain WR.

B. Inexact Newton-Krylov Iterations
Applying a uniform time discretization t k = kΔt with k = 0, . . . ,k, the continuous-time WR-LP system of Section III can be conveniently rewritten in terms of discrete algebraic operators. Defining a vector a = vec (a(t 1 ), . . . , a(tk)) ∈ Rk P (11) collecting all time samples of the port signals a(t), and similarly, for b(t) and ϑ(t), leads to the following discretized system of equations: where operator H denotes discrete convolution, and G evaluates the nonlinear characteristic of the terminations component-wise, i.e., at each port and for each independent time step. The evaluation of the discrete convolution operator is performed as follows. Individual contributions of model poles in (5) are accounted for by defining auxiliary state vectors x n (t) ∈ R P , whose discrete-time samples evaluations are approximated as time-domain recursive convolutions, ∀k = 1, . . . ,k : where α n = e p n Δt and β j,n defined according to the adopted discretization strategy; see [34]. Time samples of the output signals are then evaluated by superposition of these state variables Therefore, although the first row in (12) formally states discrete convolution as a multiplication by a (sparse) matrix H, the actual numerical evaluation applies (13) to compute the auxiliary state variables at all time steps, and then, uses the results to evaluate the output samples based on (14). These samples are finally assembled in the output vector b. Both (13) and (14) need to be initialized by suitable initial conditions at t = 0 for the input signal a 0 = a(t = 0) and auxiliary state vector x 0 collecting as components all x n (t = 0). These initial conditions are essential parts for the characterization of the discrete convolution operator so that they are specified in the operator symbol H( x 0 , a 0 ) in (12). Instead of applying the WR fixed-point iteration to (12), we reformulate the problem in terms of computing the solution of a nonlinear multivariate system. Elimination of the output variables from (12) leads to the equivalent formulation F( a) = 0, F( a) = a − G( H( x 0 , a 0 ) · a + ϑ) (15) where F : R Pk → R Pk is a nonlinear multivariate map providing the discrete time samples of the residual. Applying the classical Newton method to find the solution of (15) provides the following update equation: where J( a η ) is the Jacobian of the system and a η is the solution estimate at the iteration η. Since the evaluation of the large-scale Jacobian is a very expensive operation, we propose as in [25] to use an inexact Newton condition by replacing the Jacobian solve with an approximation constructed through a suitable Krylov subspace. We closely follow the implementation of [36], which is based on the generalized minimal residual (GMRES) method. The resulting scheme is further enhanced by precomputing an initial condition to setup the inexact Newton iterations through a (small) number ν i of WR-LP (fixed-point) iterations. The resulting scheme is code-named WR-NGMRES as in [25], to which we refer the reader for additional details. The condition for stopping iterations is associated to the nonlinear residual error. Iterations are stopped when where τ r and τ a are two control parameters accounting for the relative and the absolute accuracy, respectively. The main advantages of this formulation are as follows.
1) The convergence of the scheme is always guaranteed if a suitable initial condition is provided to the GMRES. This is the main reason for running ν i iterations of WR; we have verified that ν i = 1 is adequate in all tested examples. 2) The NGMRES approach requires a much smaller number of iterations with respect to the classical WR scheme. This will be demonstrated on several test cases in Section V. On the other hand, one should consider the following: 1) many function evaluations may be needed to build an adequate basis of the Krylov subspace; 2) the nonlinear problem (15) does not scale favorably with the size of a. For both these reasons, we suggest a combination with a time windowing and restart process, discussed next.

C. Time-Windowed WR-NGMRES Iterations
In this section, the proposed simulation scheme is presented by combining the inexact Newton iteration presented in Section IV-B with the time partitioning technique of Section IV-A. The result is a flexible and general framework for extending WR solvers to large-scale simulation problems for which convergence cannot be ensured with proper matching conditions at the decoupling sections. The proposed approach is summarized in Algorithm 1 and discussed as follows.
Considering the time partition induced by the control points (10) withk samples in each subinterval I m = [T m−1 , T m ], we introduce the following double-subscript notation for all discretized signals: where ν denotes the inner (WR or inexact Newton) iteration index and m denotes the outer iteration index referring to the mth time interval I m . The time samples in (18) are to ensure that all subintervals are adjacent so that the last time sample within I m−1 coincides with the initial time step of I m . Algorithm 1 requires on input the description of all model coefficients that are necessary to evaluate the discrete convolutions and the nonlinear termination operators, as well as all algorithm control parameters related to time partitioning and convergence thresholds. Initial states and initial solution vector are initialized to zero-valued vectors (lines 1 and 2). The outer loop over time windows is then started (line 3). Before starting iterations on the current mth time window, allk samples of the solution vector are set to the same constant value a m−1 , which at the first iteration is identically zero (line 4). The recursive convolution operator is configured with the proper initial conditions (line 5). Note that we use the shorthand notation H m = H( x m−1 , a m−1 ) to Algorithm 1: Pseudocode of the Proposed Scheme.
Require: ϑ, {α n , β 0,n , β 1,n }, {R n }, G Require: M ,k, ν i , , ν max , τ r and τ a 1: Initialize states x 0 = 0 ∈ R Pn 2: Initialize solution a 0 = 0 ∈ R P 3: for m = 1 to M do 4: Initialize a 0,m = (1k ⊗ a m−1 ) 5: Initialize operator H m = H( x m−1 , a m−1 ) 6: for ν = 1 to ν i do 7: Apply convolution b ν,m = H m · a ν−1,m + ϑ 8: Apply terminations a ν,m = G( b ν,m ) 9: if a ν,m − a ν−1,m ∞ < then 10: Break 11: end if 12: end for 13: while ν < ν max do 14: ν ← ν + 1 15: The inner loop over WR-NGMRES iterations is then started in line 6. For the first ν i iterations, a basic WR scheme is used, by alternating the evaluation of the discrete convolution operator and the termination characteristics Iterations are stopped if there is no significant update (line 10). A maximum number ν max NGMRES iterations are performed by solving  Fig. 1) is a cubic enclosure (perfectly conducting, edge length 500 mm), with one square aperture (side length 250 mm) on one face. The aperture is covered by a regular grid of P = p × p ports, grouped in p series-connected blocks through 2-mm wide metal strips. Port numbering is consistent with the notation of Fig. 1. Each port is loaded by two antiparallel diodes, each described by the standard characteristic i D = I s (e v D /V T − 1) with saturation current I s = 1 nA and thermal voltage V T ≈ 25 mV. Therefore, the voltage-current characteristic of each load is i = The structure is excited by an incident plane wave (normal incidence, electric field polarized along the same direction of the diodes), with a time-domain electric field waveform defined by a Gaussian modulated pulse where f c is the center frequency of the signal, with standard deviation (pulsewidth) σ, delay t 0 , and amplitude E. Different testing situations are obtained by tuning the shape of the excitation waveform, in particular, amplitude (to stress the influence of the nonlinearities of the loads and assess the energy selective shielding effectiveness of the loaded enclosure) and center frequency.
Four types of numerical investigations are presented. First, a comparison of several implementations of WR iterations is reported in Section V-A, confirming that the proposed approach provides the best performance with respect to reference schemes. Then, the proposed scheme is compared to the reference SPICE results solver in terms of accuracy and efficiency in Section V-B. This section will show that a proper WR iteration may outperform the state of the art SPICE solvers. Section V-C summarizes the results of 50 simulations obtained changing the frequency and amplitude of the excitation field e inc (t), completing the comparison of the proposed iteration with respect to SPICE through an extensive numerical simulation campaign. To conclude, Section V-D provides scalability results for large-scale models, up to P = 1024 ports and 72 704 macromodel states.
All test cases in this section were executed by setting only ν i = 1 iteration of standard WR-LP and ν max = 100 maximum inexact Newton iterations (an upper bound that was never reached; see in the following). All numerical results were obtained using a prototypal MATLAB code on a workstation based on Core i9-7900X CPU running at 3.3 GHz with 64 GB of RAM, using only one computing thread.

A. Comparison of WR Schemes
In this section, a comparison of several WR schemes is presented using a 100-port (with a 10 × 10 arrangement) shielding enclosure as running example; see Fig. 1. The initial characterization of the structure was performed as in [13] and Section III: Compressed macromodels were then computed to set up the discrete convolution operator H. For this investigation, the incident field waveform e inc (t) in (22) is defined with center frequency f c = 450 MHz, pulsewidth σ ≈ 0.415 ns, and unit amplitude E = 1. All transient simulations were run up to t max ≈ 15.5 ns with a time step Δt = 2.22 ps.
A common MATLAB-based simulation setup was developed to validate the proposed scheme of Section IV-C with respect to the standard WR as implemented in [20], the WR scheme equipped with a quasi-Newton iteration solver (the WR-NGMRES presented in [25], without time windowing), and the basic WR scheme with the time windowing discussed in Section IV-A. All schemes have been run by setting the convergence thresholds ε = τ r = τ a = 10 −6 . This setup enables us to determine the relative importance of the individual building blocks in the proposed algorithm.
The performance for all four implementations in terms of number of iterations required for convergence and overall runtime is summarized in Table I. From this table, we can draw the  following conclusions. 1) The largest number of iterations occurs for the basic WR scheme, as expected, with a long runtime (650 s) required by the many sequential evaluations of discrete convolution and termination operators. 2) Including quasi-Newton iterations in the WR-NGMRES implementation drastically reduces iterations, but with no advantage in runtime, which is even larger (779 s). This is due to the large number of evaluations required for the construction of the Krylov subspace basis used to solve the quasi-Newton iterations. Indeed, about 99.6% of the overall elapsed time is required for the 23 iterations of NGMRES. required on average for each time window, and the overall runtime is the smallest (15.6 s). 5) Reducing the number of windows to M = 10 results in a reduced overall efficiency and increased runtime. For this reason, the choice of M = 100 will be used as reference in the following investigations. As a confirmation of the accuracy of all methods, one of the resulting termination voltages is reported in Fig. 5 and compared to a reference SPICE simulation.

B. Benchmarking Against SPICE
This section provides a systematic comparison between the proposed iterative solver and SPICE, both in terms of accuracy and runtime. The structure that we use for this assessment is a 400-port box, excited by a Gaussian incident field e inc (t) = e g (t), with standard deviation of σ ≈ 0.47 ns and field amplitude E = 10 4 to exacerbate the nonlinear characteristic of the loads and further stress the limits of both simulation engines. All simulations have been run until t max = 20 ns with a time step Δt = 25 ps for the proposed scheme.
We used a recent version of an industry standard SPICE solver (HSPICE-L-2016.06-SP2-1 win64), and we performed a set of transient simulations with different configuration options, in order to investigate their effect on the solution. In particular, we remark the following.
1) Advanced solvers like HSPICE allow embedding macromodels both as synthesized equivalent circuits [37] and in pole-residue (Foster) forms (5). Both representations were considered in this investigation. 2) Numerical results of HSPICE were detected to be highly sensitive to a number of control parameters of the transient solver. In particular, the parameter OPTION.ACCURATE, which, when set, allows to increase the accuracy, and the parameter OPTION.DELMAX, which imposes a cap in the time step. 3) HPSICE automatically selects the integration time step, while our implementation of the WR scheme relies on a recursive convolution approach that requires a fixed time step Δt; this partly affects the comparison of the two methods in terms of accuracy, since a postprocessing interpolation is required to align the time samples of the signals for evaluating the errors. An additional error contribution induced by this interpolation should be considered. The results of this investigation are summarized in Fig. 6, where HSPICE results are compared to proposed scheme in terms of accuracy (vertical axis), runtime (text annotations), and equivalent relative speedup of the proposed scheme with respect to HSPICE (size of circles). HSPICE was run in four different configurations, represented by the four vertical bars in the figure as follows: 1) simulating the model in a state-space representation, synthesized as a standard equivalent circuit [34], [37], with default transient simulation options (red bar); 2) with default simulation options, but using a pole-residue (Foster) form to embed the model (green bar, case a); 3) as in case (a) but enabling OPTION.ACCURATE for improving HSPICE accuracy (cyan bar, case b); 4) as in case (b) and capping internal HSPICE time step to OPTION.DELMAX=10 −11 (purple bar, case c). The proposed scheme was applied by setting all convergence thresholds to 10 −4 . The conclusions supported by these results are the following.
1) Proposed solver is always faster than HSPICE, with a relative speedup ranging from 21× up to 185×. 2) HSPICE results are extremely sensitive on how the transient simulation is set up. In general, the more aggressive is the HSPICE accuracy control, the closer are HSPICE results to proposed scheme. This suggests the surprising conclusion that, at least for this specific case, the proposed WR implementation should be considered as a reference for HSPICE rather than the opposite. 3) HSPICE simulations based on Foster macromodels are always more accurate and faster than with an equivalent circuit model synthesis. The two panels of Fig. 7 report a selected transient response (voltage at port 399) by comparing the proposed approach to HSPICE Foster results with two different settings (corresponding to (a) and (b) in Fig. 6). These results confirm some localized waveform differences affecting HSPICE, which depend on the transient solver configuration. Based on these results, all HSPICE results that follow in next sections are set up as in case (b), e.g., using a Foster realization and setting OP-TION.ACCURATE.

C. Systematic Performance Comparison
In this section, a systematic comparison of the proposed transient solver with HSPICE is performed on the same 100-port enclosure of Section V-A, by changing amplitude and center frequency of the incident field pulse (22). A total of 25 simulations are performed on a 5 × 5 grid, choosing five linearly spaced values of the center frequency f c ∈ [100, 800] MHz and five logarithmically spaced values of the amplitude E ∈ [1, 10 4 ]. For each individual frequency, the pulsewidth σ in (22) is adapted so that the pulse shape is invariant except for a time-domain stretching/compression. As a result, also the pulse spectrum is compressed or stretch, remaining centered at f c .
The proposed WR solver is configured with a convergence threshold ε = 10 −6 . The integration time step has been defined as Δt = 0.001/f c to guarantee a proper sampling of the open circuit voltages v oc induced by the incident field, according to the selected center frequency. All simulations are run for K ≈ 7000 time steps. Results are compared with HSPICE using the two available macromodel representations (equivalent circuit and Foster), leading to an overall number of 50 transient results.
A summary of the results is reported in Fig. 8. Each test case is represented by a circle in terms of runtime for the proposed WR algorithm (horizontal axis) and HSPICE (vertical axis), with the size of each circle representing the worst-case rms error between the results of HSPICE and proposed solver among all port voltages. The figure also includes dashed lines corresponding to constant speedup of proposed solved with respect to HSPICE.
The figure confirms that proposed solver and HSPICE are generally in good agreement, with some cases where some differences are visible, especially for equivalent circuit macromodel embedding in HSPICE. Selected port voltages corresponding to the two worst-case rms error (largest circles in Fig. 8) are reported in Fig. 9. The proposed solver and HSPICE-Foster are always superimposed, where the HSPICE simulations based on equivalent circuit realizations exhibit some notable differences. These results confirm the observations of Section V-B.
About runtime, Fig. 8 clearly shows that the proposed method always provided a speedup factor of at least 10× with respect to HSPICE-Foster simulations, reaching almost 1000×  for HSPICE simulations based on equivalent circuits. In most cases, the proposed method provided a speedup factor ranging from 10× to 25× with respect to HSPICE-Foster and from 100× to 250× with respect to the HSPICE with equivalent circuits.

D. Assessing Scalability
In this section, some additional examples are reported to further demonstrate the effectiveness of the proposed iterative solver, in terms of scalability with the number of ports where lumped nonlinear terminations are connected. In particular, three shielding enclosures with P = 100, 400, and 1024 ports are considered, and both macromodel generation and transient simulation steps are discussed.
1) Macromodel Generation: A model of each unloaded enclosure has been built starting from tabulated frequency data from the MoM solver. Fig. 10(a) shows the accuracy of a passive model with P = 1024 ports built following the proposed macromodeling procedure. The complete model extraction flow, fully documented in [27] involves a number of steps in order to properly handle all challenges posed by this structure: a large-scale system with more than 10 6 responses to be fitted concurrently, bandlimited data from the MoM solver, and especially very low losses making the port transfer function only marginally passive. We recall that in order to guarantee numerically stable transient simulations, model passivity is a necessary condition, requiring that σ max {H(jω)} ≤ 1 must hold ∀ω ∈ [0, ∞). For this case, the maximum singular values of the model responses σ max {H(jω)} before and after passivity enforcement are reported in Fig. 10(c). To obtain these results, about 119 s were required for the model generation (including data compression) and about 44.3 h for the passivity enforcement, confirming that the latter still remains the most challenging step in any macromodeling flow. Obtaining a model with this complexity (1024 ports with common 71 poles, corresponding to an overall number of 72 704 states) with a standard tool (e.g., [38]) is practically unfeasible, unless complexity (hence, accuracy) is reduced by decreasing the number of poles. A passive model for Fig. 11. Simulation results of a 400-port shielding enclosure. The panels depict selected sets of voltages and currents on interface ports (top two panels); bottom panels report voltage and current on port 336, corresponding to the worst-case rms error (3.67 · 10 −3 ). the 400-port enclosure with 79 poles (31 600 states) has been obtained after a total of ≈4.3 h, considering all procedure steps (data compression, model fitting, and passivity enforcement), while a 100-port passive model with 79 poles has been generated in ≈ 10 min. A discussion on large-scale macromodel generation and related challenges is available in [27].
2) Transient Simulation: We finally document the transient simulation results on the two largest-scale nonlinearly loaded enclosures with P = 400 and P = 1024 ports, in all cases excited by an incident field pulse (22) with center frequency f c = 400 MHz and amplitude E = 10 4 . Fig. 11 depicts a set of selected responses of the 400-port structure, demonstrating a small error between SPICE and the proposed WR scheme. The worst-case rms error computed after interpolation of the SPICE results is 3.67 · 10 −3 . The same level of accuracy (rms= 1.88 · 10 −3 ) was attained in the 1024-port case, whose results are depicted in Fig. 12.
In all cases, the proposed approach provided a major speedup with respect to HSPICE, as summarized in Table II. The speedup factor increases from 37× to 89× when increasing from P = 100 to P = 1024, confirming better scalability of proposed WR iteration. The table also confirms the general agreement between proposed solver and HSPICE with a worst-case rms deviation always in the order of 10 −3 .
We further remark the following:  1) HSPICE has been run only in its best possible configuration, using a pole-residue representation of the model; 2) the proposed scheme could be executed on a normal office notebook (16 GB of RAM, Core i5-10210 U CPU) to solve the most demanding case with P = 1024, requiring about 3 min using four computational threads in MATLAB. The same simulation could not run using HSPICE due to memory limitations.

VI. CONCLUSION
This article presented an iterative transient simulation approach based on the WR framework, specifically designed to solve possibly large-scale linear electromagnetic structures loaded at several ports with nonlinear devices. The proposed solver was based on the availability of a rational macromodel representing the port behavior of the electromagnetic structure, including a nonhomogeneous contribution due to internal sources or incident field excitation. The main advantages of proposed approach arise from a combination of inexact Newton-Krylov iterations for guaranteeing convergence, and time-windowing in order to further speedup the convergence rate. We had demonstrated that the resulting scheme outperforms several other iterative WR approaches. More accurate results than industry-standard circuit solvers (HSPICE) were obtained, with speedup factors always exceeding 10× and reaching in some cases almost 1000×. Results were documented for shielding enclosures with up to P = 1024 nonlinearly loaded ports.
Several open issues and directions for future investigations remain. On the algorithmic side, parallelization for multicore or GPU architecture according to [19] or [39] is expected to provide a further advantage in terms of runtime, given the wide availability of low-cost multiprocessors. Moreover, the possibly automated and dynamic (at runtime) optimization of number and duration of time windows in proposed approach remains an open issue. On the formulation side, this work needs to be extended to more general types of nonlinear devices, including dynamic elements/contributions, overcoming the current limitation of static nonlinearities.